Making Sentence Embeddings Robust to User-Generated Content

Video Link: https://www.youtube.com/watch?v=myTk7LKf7Zs



Duration: 1:02:49


This seminar was hosted by Microsoft Research Africa, Nairobi together with the Microsoft AI for Good team in May 2024.

User-generated content (UGC), such as social media posts written in "Internet language", exhibits many lexical variations and deviates from standard language. As a result, NLP models, which are mostly trained on standard text, are known to perform poorly on UGC, and sentence embedding models like LASER are no exception.

In this talk, we focus on the robustness of LASER to UGC data. We measure this robustness as LASER’s ability to place non-standard sentences and their standard counterparts close to each other in the embedding space. Inspired by previous work extending LASER to other languages and modalities, we propose RoLASER, a robust English encoder trained with a teacher-student approach to reduce the distance between the representations of standard and UGC sentences. We also use data augmentation to generate synthetic UGC-like training data.
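The teacher-student idea above can be illustrated with a minimal sketch. It assumes a mean squared error objective that pulls the student's embeddings of both the standard sentence and its UGC variant toward the frozen teacher's embedding of the standard sentence, and it uses cosine distance as a stand-in for "closeness in the embedding space". The abstract specifies neither the loss nor the distance, so both choices, and all function names, are illustrative assumptions. Embeddings are plain lists of floats rather than real LASER outputs.

```python
import math


def mean_squared_error(u, v):
    """Average squared difference between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)


def distillation_loss(teacher_std, student_std, student_ugc):
    """Assumed objective: both student embeddings should match the
    teacher's embedding of the standard sentence."""
    return (mean_squared_error(student_std, teacher_std)
            + mean_squared_error(student_ugc, teacher_std))


def cosine_distance(u, v):
    """1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# Toy 2-dimensional "embeddings" (hypothetical values).
teacher_std = [1.0, 0.0]   # frozen teacher, standard sentence
student_std = [1.0, 0.0]   # student, standard sentence
student_ugc = [0.0, 0.0]   # student, UGC variant (still far from the target)

loss = distillation_loss(teacher_std, student_std, student_ugc)
```

Under this assumed objective, the loss is zero only when the student maps a UGC sentence and its standard counterpart to the same point the teacher assigns to the standard sentence, which is the property the robustness evaluation looks for.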

We show that RoLASER significantly improves LASER’s robustness to both natural and artificial UGC data, achieving up to 2× and 11× better alignment scores, respectively. We also perform a fine-grained analysis on artificial UGC data and find that our model greatly outperforms LASER on the UGC phenomena that LASER finds most challenging, such as keyboard typos and social media abbreviations. Evaluation on downstream tasks shows that RoLASER performs comparably to or better than LASER on standard data, while consistently outperforming it on UGC data.
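To make the two phenomena named above concrete, here is a small sketch of how artificial UGC might be produced from standard text. The lookup tables, function names, and substitution rates are hypothetical and far smaller than anything usable in practice. This is not the augmentation pipeline used in the talk, only an illustration of the kind of perturbation involved.

```python
import random

# Hypothetical, deliberately tiny tables for illustration only.
ABBREVIATIONS = {"you": "u", "are": "r", "please": "pls",
                 "tomorrow": "tmrw", "see": "c"}
KEYBOARD_NEIGHBORS = {"a": "qsz", "e": "wrd", "o": "ipl",
                      "t": "ryg", "n": "bm"}


def abbreviate(sentence):
    """Replace whole words with social-media abbreviations.
    Splits on whitespace only, so punctuation attached to a word
    prevents a match."""
    return " ".join(ABBREVIATIONS.get(word.lower(), word)
                    for word in sentence.split())


def keyboard_typos(sentence, rate, rng):
    """With probability `rate`, swap a character for a neighboring key."""
    chars = []
    for ch in sentence:
        neighbors = KEYBOARD_NEIGHBORS.get(ch)
        if neighbors and rng.random() < rate:
            chars.append(rng.choice(neighbors))
        else:
            chars.append(ch)
    return "".join(chars)


rng = random.Random(0)  # seeded so runs are repeatable
standard = "see you tomorrow"
abbreviated = abbreviate(standard)
noisy = keyboard_typos(standard, rate=0.2, rng=rng)
```

Pairing each standard sentence with perturbed variants like these yields the (standard, UGC-like) pairs that a robustness evaluation or a teacher-student training setup would consume.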

Speaker: Lydia Nishimwe

Learn more about Microsoft Research Lab – Africa, Nairobi: https://www.microsoft.com/en-us/research/lab/microsoft-research-lab-africa-nairobi/seminars/



