What Functions Do Transformers Prefer to Represent?

Published on 2022-10-07 ● Video Link: https://www.youtube.com/watch?v=HWLF7FO0HW4



Duration: 31:15
868 views


Surbhi Goel (Microsoft Research and University of Pennsylvania)
https://simons.berkeley.edu/talks/stochastic-optimization-under-distributional-drift
Joint IFML/Data-Driven Decision Processes Workshop

Over the past few years, Transformers have revolutionized deep learning, leading to advances in natural language processing and beyond. These models discard recurrence and convolutions in favor of "self-attention," which directly and globally models interactions within the input context. Despite their success, there is currently limited understanding of why they work. In this talk, I will present our recent results on rigorously quantifying the statistical and representational properties of Transformers, which shed light on their ability to capture long-range dependencies efficiently. First, I will show how bounded-norm self-attention layers can represent arbitrary sparse functions of the input sequence, with sample complexity scaling only logarithmically with the context length, akin to sparse regression. Subsequently, I will briefly show how this ability of self-attention to compute sparse functions, along with its ability to compute averages, can be used to construct Transformers that exactly replicate the dynamics of a recurrent model of computation of depth $T$ using only $o(T)$ depth. I will conclude the talk with experimental results on synthetic tasks based on learning Boolean functions and automata. Based on joint works with Jordan T. Ash, Ben L. Edelman, Sham M. Kakade, Akshay Krishnamurthy, Bingbin Liu, and Cyril Zhang.
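To make the sparse-function setting concrete, here is a minimal illustrative sketch in PyTorch (not the talk's actual construction or experimental setup): a single self-attention layer with a small MLP readout is trained on a synthetic target that depends on only k of the T positions in the context (a k-sparse parity). All names and hyperparameters below (T, k, d, OneLayerAttention, the training loop) are assumptions made for illustration, not details taken from the talk or the referenced papers.

import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes (assumptions, not the talk's settings):
T, k, d = 64, 3, 32              # context length, sparsity of the target, embedding width
S = torch.randperm(T)[:k]        # hidden support: the k positions the label depends on

def sample_batch(n):
    x = torch.randint(0, 2, (n, T)).float() * 2 - 1   # i.i.d. +/-1 tokens
    y = x[:, S].prod(dim=1)                           # k-sparse parity label in {-1, +1}
    return x, y

class OneLayerAttention(nn.Module):
    # One self-attention layer pools the context into a single vector,
    # and a small MLP readout computes the Boolean target from it.
    def __init__(self):
        super().__init__()
        self.embed = nn.Linear(1, d)                          # token value -> embedding
        self.pos = nn.Parameter(0.02 * torch.randn(T, d))     # learned positional embeddings
        self.attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        self.query = nn.Parameter(0.02 * torch.randn(1, 1, d))
        self.readout = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, 1))

    def forward(self, x):
        h = self.embed(x.unsqueeze(-1)) + self.pos            # (n, T, d)
        q = self.query.expand(x.shape[0], -1, -1)             # one global query per sequence
        pooled, _ = self.attn(q, h, h)                        # attention selects/averages positions
        return self.readout(pooled).squeeze(-1).squeeze(-1)   # scalar prediction per sequence

model = OneLayerAttention()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(2000):
    x, y = sample_batch(256)
    loss = nn.functional.mse_loss(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()

x_test, y_test = sample_batch(4096)
acc = (model(x_test).sign() == y_test).float().mean().item()
print(f"test accuracy on {k}-sparse parity with context length {T}: {acc:.3f}")

Varying T while holding k fixed in a setup like this is one way to probe whether the sample complexity grows only logarithmically with the context length, as the abstract suggests; the actual constructions and experiments are in the joint works cited above.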




Other Videos By Simons Institute for the Theory of Computing


2022-10-11 The Statistical Complexity of Interactive Decision Making
2022-10-11 A Tutorial on Finite-Sample Guarantees of Contractive Stochastic Approximation With...
2022-10-11 A Tutorial on Finite-Sample Guarantees of Contractive Stochastic Approximation With...
2022-10-11 Stochastic Bin Packing with Time-Varying Item Sizes
2022-10-10 Constant Regret in Exchangeable Action Models: Overbooking, Bin Packing, and Beyond
2022-10-08 On The Exploration In Load-Balancing Under Unknown Service Rates
2022-10-08 Sample Complexity Of Policy-Based Methods Under Off-Policy Sampling And ...
2022-10-08 The Compensated Coupling (or Why the Future is the Best Guide for the Present)
2022-10-08 Higher-Dimensional Expansion of Random Geometric Complexes
2022-10-08 On the Power of Preconditioning in Sparse Linear Regression
2022-10-07 What Functions Do Transformers Prefer to Represent?
2022-10-01 Optimality of Variational Inference for Stochastic Block Model
2022-10-01 Machine Learning on Large-Scale Graphs
2022-10-01 Survey on Sparse Graph Limits + A Toy Example
2022-10-01 Long Range Dependence in Evolving Networks
2022-09-30 Stochastic Processes on Sparse Graphs: Hydrodynamic Limits and Markov Approximations
2022-09-30 Large Deviation Principle for the Norm of the Adjacency Matrix and the Laplacian Matrix of...
2022-09-30 Longitudinal Network Models, Log-Linear Multigraph Models, and Implications to Estimation and...
2022-09-30 Graphon Games
2022-09-30 Vertexons and Embedded Graphon Mean Field Games
2022-09-30 Motif Counting via Subgraph Sampling



Tags:
Simons Institute
theoretical computer science
UC Berkeley
Computer Science
Theory of Computation
Theory of Computing
Joint IFML/Data-Driven Decision Processes Workshop
Surbhi Goel