How Do You Validate LLM Systems Beyond Benchmarks?
Check out my essays: https://aisc.substack.com/
OR book me to talk: https://calendly.com/amirfzpr
OR subscribe to our event calendar: https://lu.ma/aisc-llm-school
AF: We were talking about user experience and the difference between evaluation, as in "is the system good at math?", versus validation, which is "is the system good at the type of math that my client cares about?"
One thing that can become tricky is that you might end up fooling yourself into thinking your system is good based on benchmarks that are not really related to what you're doing. They might resemble your task, but they're not the same.
You need a very specific dataset that is directly relevant to your user's journey and workflow. Collecting that data within the workflow itself is probably one of the best and most reliable ways to do it. If users get bored or distracted while providing it, that data won't be reliable. The quality of the data is going to be the make-or-break of this story.
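As a rough illustration of what "collecting eval data within the workflow" could look like, here is a minimal Python sketch. The names (`log_interaction`, `evaluate`, `EVAL_PATH`, `run_model`) and the exact-match scoring are assumptions made for illustration, not anything AF or DL describe: each real interaction is stored together with the user's in-workflow accept/reject signal, and candidate models are later scored only against those cases rather than a generic benchmark.

```python
import json
from dataclasses import dataclass, asdict
from pathlib import Path

# Hypothetical file path for the accumulated, workflow-derived eval set.
EVAL_PATH = Path("workflow_eval_set.jsonl")


@dataclass
class EvalCase:
    user_input: str      # what the user actually asked inside their workflow
    model_output: str    # what the system answered at the time
    user_accepted: bool  # in-workflow signal, e.g. thumbs up or "kept without edits"


def log_interaction(user_input: str, model_output: str, user_accepted: bool) -> None:
    """Capture each real interaction as a labeled eval case at the moment it happens."""
    case = EvalCase(user_input, model_output, user_accepted)
    with EVAL_PATH.open("a") as f:
        f.write(json.dumps(asdict(case)) + "\n")


def evaluate(run_model) -> float:
    """Score a candidate model only on cases drawn from the user's own workflow.

    Exact match against previously accepted answers is a placeholder metric;
    a real system would use whatever task-specific check the client cares about.
    """
    cases = [EvalCase(**json.loads(line)) for line in EVAL_PATH.open()]
    accepted = [c for c in cases if c.user_accepted]
    if not accepted:
        return 0.0
    hits = sum(run_model(c.user_input).strip() == c.model_output.strip() for c in accepted)
    return hits / len(accepted)
```

The point of the sketch is only the source of the data: the eval set grows out of the user's actual workflow, so the score reflects the type of problems the client cares about rather than a benchmark that merely resembles them.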
DL: That's the challenge of product development.
We see it with a lot of users: they'll spend hours in VoiceFlow, which is amazing, and there are always gaps in our product where you can optimize, where you can improve that workflow. So it's just that prioritization decision, right? Where is labeling helpful? Where is UX helpful? Where is a whole new feature helpful? How do you prioritize front-end improvements, ML improvements, integration improvements, these UX features?
Even assuming you have all the developers and designers in the world, how do you create an app for doing all this that's workflow-centric but not overwhelming?