Theoretical Limitations of Multi-Layer Transformers

Video Link: https://www.youtube.com/watch?v=Pfn-wbyamcU

A Google TechTalk, presented by Binghui Peng, 2025-02-06, as part of the Algorithms Seminar.

ABSTRACT: Transformers, especially the decoder-only variants, are the backbone of most modern large language models; yet we do not have much understanding of their expressive power except for the simple 1-layer case. Due to the difficulty of analyzing multi-layer models, all previous work relies on unproven complexity conjectures to show limitations for multi-layer transformers. In this work, we prove the first unconditional lower bound against multi-layer decoder-only transformers: for any constant $L$, any $L$-layer decoder-only transformer needs a polynomial model dimension ($n^{\Omega(1)}$) to perform the sequential composition of $L$ functions over an input of $n$ tokens.
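For intuition, here is one hedged way an $L$-step sequential composition instance could be laid out (the encoding below is an illustrative assumption, not necessarily the paper's exact construction): the input lists $L$ functions over a small domain together with a start element, and the answer is the result of applying the functions one after another.

```python
# Illustrative (assumed) instance of L-step sequential function composition.
# The input encodes L functions f_1, ..., f_L over a domain of size m, plus a
# start element x; the target output is f_L(...f_2(f_1(x))...).
# This encoding is a sketch for intuition only.

import random

def make_instance(L, m, seed=0):
    rng = random.Random(seed)
    funcs = [[rng.randrange(m) for _ in range(m)] for _ in range(L)]  # each f_i: [m] -> [m]
    start = rng.randrange(m)
    # Flatten into a token sequence of length n = L*m + 1 (one token per table entry, plus the start).
    tokens = [v for table in funcs for v in table] + [start]
    return tokens, funcs, start

def compose(funcs, start):
    x = start
    for f in funcs:          # apply f_1, then f_2, ..., then f_L
        x = f[x]
    return x

tokens, funcs, start = make_instance(L=3, m=8)
print(len(tokens), compose(funcs, start))
```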

As a consequence, our results give: (1) the first (unconditional) depth-size trade-off for multi-layer transformers, exhibiting that the $L$-step composition task is exponentially harder for $L$-layer models compared to $(L+1)$-layer ones; (2) an unconditional separation between encoder and decoder, exhibiting a hard task for decoders that can be solved by an exponentially shallower and smaller encoder; (3) a provable advantage of chain-of-thought, exhibiting a task that becomes exponentially easier with chain-of-thought.
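As a rough intuition for point (3), chain-of-thought lets the model emit intermediate values as extra output tokens, so the $L$-fold composition unrolls into $L$ single function applications, each of which is much easier than producing the final answer in one shot. The sketch below is an assumption-level illustration of that unrolling, not the paper's construction.

```python
# Sketch of why chain-of-thought helps on the composition task (illustrative assumption):
# instead of producing f_L(...f_1(x)...) directly, each intermediate value
# x_i = f_i(x_{i-1}) is written out as an extra "thought" token, so every
# generation step only needs a single table lookup rather than a deep composition.

def compose_with_cot(funcs, start):
    chain = [start]
    for f in funcs:
        chain.append(f[chain[-1]])   # one easy lookup per emitted chain-of-thought token
    return chain                     # chain[-1] is the final answer; chain[1:-1] are the CoT tokens

funcs = [[1, 2, 0], [2, 0, 1]]       # two toy functions over the domain {0, 1, 2}
print(compose_with_cot(funcs, 0))    # -> [0, 1, 0]
```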

On the technical side, we propose the multi-party autoregressive communication model, which captures the computation of a decoder-only transformer. We also introduce a new proof technique that iteratively finds a certain indistinguishable decomposition of all possible inputs, which we use to prove lower bounds in this model. We believe our new communication model and proof technique will be helpful for further understanding the computational power of transformers.
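The communication model is only described at a high level here; the toy simulator below is an assumption-based sketch of the kind of causally-masked, round-based message passing that an "autoregressive" communication model suggests, where party $i$ holds the $i$-th token and its messages may depend only on messages already produced by parties $1, \dots, i$. It is not the paper's formal definition.

```python
# Assumed toy sketch of a causally-masked, round-based message-passing setup.
# This is NOT the paper's formal model; it only illustrates the constraint that
# party i (holding token i) never sees information from parties j > i, mirroring
# the causal attention mask of a decoder-only transformer, with one round per layer.

def run_protocol(tokens, num_rounds, message_fn):
    n = len(tokens)
    transcript = [[] for _ in range(n)]        # transcript[i] = messages sent by party i so far
    for _ in range(num_rounds):                # one synchronous round per "layer"
        new_msgs = []
        for i in range(n):
            visible = [m for j in range(i + 1) for m in transcript[j]]  # parties 1..i only
            new_msgs.append(message_fn(tokens[i], visible))
        for i, m in enumerate(new_msgs):
            transcript[i].append(m)
    return transcript[-1][-1]                  # last party's final message plays the role of the output

# Example: each message is just a running sum of everything visible (toy behavior).
out = run_protocol([3, 1, 4, 1, 5], num_rounds=2, message_fn=lambda tok, seen: tok + sum(seen))
print(out)
```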

Based on joint work with Lijie Chen and Hongxun Wu.

ABOUT THE SPEAKER: Binghui Peng is a Motwani Postdoctoral Fellow at Stanford University, working with Aviad Rubinstein and Amin Saberi. Previously, he was a research fellow at the Simons Institute during the "Large Language Models and Transformers" program. He obtained his Ph.D. from Columbia University, advised by Xi Chen and Christos Papadimitriou. He has worked on learning theory and game theory, and, most recently, large language models.