How Transformers Learn Causal Structure with Gradient Descent

Video link: https://www.youtube.com/watch?v=xlWBsISnaRA





Jason Lee (Princeton University)
https://simons.berkeley.edu/talks/jason-lee-princeton-university-2024-11-12
Domain Adaptation and Related Areas

The remarkable success of transformers on sequence-modeling tasks can be largely attributed to the self-attention mechanism, which allows information to be transferred between different parts of a sequence. Self-attention lets transformers encode causal structure, which makes them particularly suitable for sequence modeling. However, the process by which transformers learn such causal structure via gradient-based training algorithms remains poorly understood. To better understand this process, we introduce an in-context learning task that requires learning latent causal structure. We prove that gradient descent on a simplified two-layer transformer learns to solve this task by encoding the latent causal graph in the first attention layer. The key insight of our proof is that the gradient of the attention matrix encodes the mutual information between tokens; as a consequence of the data processing inequality, the largest entries of this gradient correspond to edges in the latent causal graph. As a special case, when the sequences are generated from in-context Markov chains, we prove that transformers learn an induction head (Olsson et al., 2022). We confirm our theoretical findings by showing that transformers trained on our in-context learning task are able to recover a wide variety of causal structures.
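The data-processing-inequality argument can be illustrated in the Markov-chain special case with a small sketch. The vocabulary size, sequence length, and transition matrix below are illustrative assumptions, not taken from the talk: we estimate the mutual information between the last token and each earlier position in sequences sampled from a fixed Markov chain, and the largest value falls on the immediate predecessor, i.e., the causal parent.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup (assumed, not from the talk): sequences over a small
# vocabulary, sampled from a fixed Markov chain.
V, T, N = 4, 6, 20000  # vocab size, sequence length, number of sequences

# A mixing transition matrix: stay with probability 0.7, move uniformly otherwise.
P = np.full((V, V), 0.1)
np.fill_diagonal(P, 0.7)

# Sample N sequences from the chain.
X = np.zeros((N, T), dtype=int)
X[:, 0] = rng.integers(V, size=N)
for t in range(1, T):
    for v in range(V):
        mask = X[:, t - 1] == v
        X[mask, t] = rng.choice(V, size=int(mask.sum()), p=P[v])

def mutual_info(a, b, V):
    """Plug-in estimate of the mutual information (in nats) between two positions."""
    joint = np.zeros((V, V))
    np.add.at(joint, (a, b), 1.0)
    joint /= joint.sum()
    pa = joint.sum(axis=1, keepdims=True)
    pb = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / (pa @ pb)[nz])).sum())

# Mutual information between the last position and each earlier one. By the
# data processing inequality, it is maximized at the causal parent, T-2.
mi = [mutual_info(X[:, t], X[:, T - 1], V) for t in range(T - 1)]
parent = int(np.argmax(mi))
print(parent)  # 4, i.e. T-2: the immediate predecessor carries the most information
```

The same logic is what the proof exploits: gradient entries of the attention matrix track these mutual-information values, so the largest ones pick out edges of the latent causal graph.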




Other Videos By Simons Institute for the Theory of Computing


2024-11-14  Open-Source and Science in the Era of Foundation Models
2024-11-13  Toward Understanding the Extrapolation of Nonlinear Models to Unseen Domains or the Whole Domain
2024-11-13  Language-guided Adaptation
2024-11-13  On Spurious Associations and LLM Alignment
2024-11-13  Causally motivated robustness to shortcut learning
2024-11-13  Talk by Zachary Lipton
2024-11-12  Distribution shift in ecological data: generalization vs. specialization
2024-11-12  Transfer learning via local convergence rates of the nonparametric least squares estimator
2024-11-12  Transfer learning for weak-to-strong generalization
2024-11-12  User-level and federated local differential privacy
2024-11-11  How Transformers Learn Causal Structure with Gradient Descent
2024-10-16  The Enigma of LLMs: on Creativity, Compositionality, Pluralism, and Paradoxes
2024-10-02  Let’s Try and Be More Tolerant: On Tolerant Property Testing and Distance Approximation
2024-10-02  A Strong Separation for Adversarially Robust L_0 Estimation for Linear Sketches
2024-10-02  Towards Practical Distribution Testing
2024-10-02  Toward Optimal Semi-streaming Algorithm for (1+ε)-approximate Maximum Matching
2024-10-02  Plenary Talk: Privately Evaluating Untrusted Black-Box Functions
2024-10-02  The long path to \sqrt{d} monotonicity testers
2024-10-02  O(log log n) Passes is Optimal for Semi-Streaming Maximal Independent Set
2024-10-02  Distribution Learning Meets Graph Structure Sampling
2024-10-02  On the instance optimality of detecting collisions and subgraphs