Pretrainer's Guide to Training Data: Measuring Effects of Age, Domain Coverage, Quality, & Toxicity

Channel: Microsoft Research (344,000 subscribers)
Published: 2024-09-27
Video link: https://www.youtube.com/watch?v=4-tV3vLYBOg
Views: 656

Pretraining is the preliminary and fundamental step in developing capable language models (LM). Despite this, pretraining data design is critically under-documented and often guided by empirically unsupported intuitions. To address this, we pretrain 28 1.5B parameter decoder-only models, training on data curated (1) at different times, (2) with varying toxicity and quality filters, and (3) with different domain compositions. First, we quantify the effect of pretraining data age. A temporal shift between evaluation data and pretraining data leads to performance degradation, which is not overcome by finetuning. Second, we explore the effect of quality and toxicity filters, showing a trade-off between performance on standard benchmarks and risk of toxic generations. Our findings indicate there does not exist a one-size-fits-all solution to filtering training data. We also find that the effects of different types of filtering are not predictable from text domain characteristics. Lastly, we empirically validate that the inclusion of heterogeneous data sources, like books and web, is broadly beneficial and warrants greater prioritization. These findings constitute the largest set of experiments to validate, quantify, and expose many undocumented intuitions about text pretraining, which we hope will help support more informed data-centric decisions in LM development.
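To make the kind of curation decisions the abstract studies more concrete, below is a minimal, hypothetical Python sketch of a filter-then-mix step over pretraining documents. The quality_score and toxicity_score fields, the thresholds, and the domain weights are all illustrative placeholders (e.g., outputs of unspecified external classifiers); this is not the paper's actual pipeline.

```python
# Hypothetical sketch of pretraining-data curation in the spirit of the abstract:
# apply quality and toxicity filters at tunable thresholds, then sample a domain
# mixture. All names, scores, and thresholds are illustrative assumptions.
from dataclasses import dataclass
from typing import Dict, List
import random


@dataclass
class Document:
    text: str
    domain: str            # e.g. "web", "books", "code"
    quality_score: float   # higher = more "quality-like" (assumed classifier output)
    toxicity_score: float  # higher = more toxic (assumed classifier output)


def filter_documents(docs: List[Document],
                     min_quality: float,
                     max_toxicity: float) -> List[Document]:
    """Keep documents that pass both the quality and toxicity thresholds."""
    return [d for d in docs
            if d.quality_score >= min_quality and d.toxicity_score <= max_toxicity]


def sample_domain_mixture(docs_by_domain: Dict[str, List[Document]],
                          weights: Dict[str, float],
                          n_samples: int,
                          seed: int = 0) -> List[Document]:
    """Draw a pretraining sample whose domains follow the given mixture weights."""
    rng = random.Random(seed)
    domains = list(weights)
    probs = [weights[d] for d in domains]
    sample = []
    for _ in range(n_samples):
        domain = rng.choices(domains, weights=probs, k=1)[0]
        sample.append(rng.choice(docs_by_domain[domain]))
    return sample
```

Tightening max_toxicity or raising min_quality shrinks the retained pool, which mirrors the trade-off the abstract reports between standard-benchmark performance and the risk of toxic generations; the mixture weights are where a books-versus-web domain composition would be expressed.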




Other Videos By Microsoft Research


2024-12-09  Quantum Lattice Enumeration in Limited Depth, Fernando Virdia
2024-12-09  Enhancing Security of Bluetooth Secure Connections via Deferrable Authentication
2024-12-09  Improving the Security of United States Elections with Robust Optimization
2024-11-18  Introducing BiomedParse, a groundbreaking foundation model for biomedical image analysis
2024-11-11  Low latency carbon budget 2023
2024-10-31  Future Directions for XR Interactions with Advanced Sensing Techniques and Haptic Design Frameworks
2024-10-31  Estimating mental workload in a simulated flight task using optical f-NIRS signals
2024-10-17  Look Ma, no markers: holistic performance capture without the hassle
2024-10-17  Hairmony: Fairness-aware hairstyle classification
2024-10-01  Data Formulator: Create Rich Visualization with AI iteratively
2024-09-27  Pretrainer's Guide to Training Data: Measuring Effects of Age, Domain Coverage, Quality, & Toxicity
2024-09-18  AI for Business Transformation: Lessons from Healthcare
2024-09-18  AI for Business Transformation: Multimodal Models
2024-09-18  AI for Business Transformation: The Business of Data
2024-09-18  Ludic Design for Accessibility
2024-09-16  At the Foothills of an AI Era in Science | Gilbert S. Omenn Grand Challenges Address
2024-09-03  Fostering appropriate reliance on AI
2024-08-27  ML for High-Performance Climate and Earth Virtualization Engines
2024-08-27  Final intern talk: Distilling Self-Supervised-Learning-Based Speech Quality Assessment into Compact
2024-08-26  Decoding the Human Brain – A Neurosurgeon’s Experience
2024-08-09  Mapping the World: Creating a Global and Temporal High-Resolution Building Density Map