The Data Addition Dilemma

Video: https://www.youtube.com/watch?v=bEdFK9FZdyM

Irene Y Chen (UC Berkeley)
https://simons.berkeley.edu/talks/irene-y-chen-uc-berkeley-2024-11-12
Domain Adaptation and Related Areas

When training machine learning models, combining data from different sources isn't always beneficial. While more data generally helps, mixing data from dissimilar sources can reduce overall accuracy, create unpredictable fairness issues, and worsen performance for underrepresented groups. We call this the "Data Addition Dilemma": an empirically observed trade-off between the performance gains from data scaling and the deterioration caused by distribution shift. We establish baseline strategies for navigating this dilemma, introducing distribution shift heuristics that guide which data sources to add when scaling, so that the expected model performance improvements are actually realized. We conclude with a discussion of the considerations required for data collection and suggestions for studying data composition and scale in the age of increasingly larger models.
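The idea of a distribution shift heuristic for choosing data sources can be illustrated with a minimal sketch. The talk's actual heuristics are not specified in this abstract; the example below is an assumption-laden stand-in that scores each candidate source by a symmetric KL divergence between histograms of a 1-D feature, preferring sources closest to the target distribution.

```python
import numpy as np

def shift_score(target, candidate, bins=20):
    """Symmetric KL divergence between 1-D histograms of two samples.
    Lower scores suggest the candidate source is closer to the target
    distribution, making it a safer data addition. Illustrative only;
    not the heuristic from the talk."""
    lo = min(target.min(), candidate.min())
    hi = max(target.max(), candidate.max())
    edges = np.linspace(lo, hi, bins + 1)
    p, _ = np.histogram(target, bins=edges)
    q, _ = np.histogram(candidate, bins=edges)
    # Laplace smoothing avoids division by zero / log(0) in empty bins.
    p = (p + 1) / (p.sum() + bins)
    q = (q + 1) / (q.sum() + bins)
    return np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p))

rng = np.random.default_rng(0)
target = rng.normal(0.0, 1.0, 1000)       # data from the deployment domain
similar = rng.normal(0.1, 1.0, 1000)      # candidate source, small shift
dissimilar = rng.normal(3.0, 2.0, 1000)   # candidate source, large shift

# Rank candidate sources; add the lowest-scoring source first.
print(shift_score(target, similar) < shift_score(target, dissimilar))
```

In practice one would apply such a score to model representations or multivariate features rather than a single raw feature, but the decision rule is the same: prefer adding sources whose estimated shift from the target distribution is smallest.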



