LLM Evaluation, Validation, and Verification
Check out my essays: https://aisc.substack.com/
OR book me to talk: https://calendly.com/amirfzpr
OR subscribe to our event calendar: https://lu.ma/aisc-llm-school
AF: As a person who has done quite a bit of work in model risk assessment, model validation, and evaluation, and who has created libraries for these kinds of things in the past, tell us a little about your point of view and why this is important to talk about.
MH: As engineers working in larger organizations, we run into many blind spots and silos. PMs don't talk to engineers. Engineers don't talk to customers. There are always barriers between different teams. When it comes to a use case, it's very easy to fall into the trap of only thinking about the performance of the model in a silo.
What we ultimately care about is what the customer uses and what the customer sees. The focus shouldn't only be on how performant the search is or how performant the LLM is; the combination, and the user experience, is the most important thing. That's why I like the distinction between validation and evaluation. We focus a lot on evaluation and then lose sight of what you call validation.
AF: Makes a lot of sense. We've been seeing a lot of LLM guardrails coming out. Where do guardrails fit into this discussion?
IY: If there are outputs that are not up to a certain standard, you shouldn't return them. For example, in our case we have an internal knowledge base, so we know certain things to be true. If the output of the LLM is not going to be correct, then the guardrails catch it.
Because the LLM keeps changing, we have seen cases where the same prompt, the same temperature, the same everything still produces a different output. So having some sort of verification at usage time, whether logic-based, statistical, lexical, or even another LLM, is what guardrails are.
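To make that layering concrete, here is a minimal sketch of a usage-time guardrail in Python. The knowledge-base entries, banned-phrase list, and the `llm_verify` hook are hypothetical placeholders, not the speakers' actual implementation; the point is only how cheap lexical and logic checks can run before an optional LLM-based verifier.

```python
import re
from typing import Callable, Optional

# Hypothetical facts treated as ground truth (stand-in for an internal knowledge base).
KNOWN_FACTS = {"support_email": "help@example.com"}
BANNED_PATTERNS = [re.compile(r"\bguaranteed returns\b", re.IGNORECASE)]

def lexical_check(answer: str) -> bool:
    """Cheap lexical guardrail: reject answers containing banned phrases."""
    return not any(p.search(answer) for p in BANNED_PATTERNS)

def logic_check(answer: str) -> bool:
    """Logic guardrail: if the answer gives a contact address, it must match the knowledge base."""
    if "@" in answer and KNOWN_FACTS["support_email"] not in answer:
        return False
    return True

def guarded_answer(
    answer: str,
    llm_verify: Optional[Callable[[str], bool]] = None,
    fallback: str = "Sorry, I can't answer that reliably.",
) -> str:
    """Run guardrails at usage time; return a safe fallback if any check fails."""
    if not (lexical_check(answer) and logic_check(answer)):
        return fallback
    if llm_verify is not None and not llm_verify(answer):  # optional LLM-as-verifier
        return fallback
    return answer
```

In practice, `guarded_answer(model_output)` would simply wrap whatever the generation step returns before it reaches the user.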
AF: One of the comments you made that I really liked was that you should think about metric-driven development when it comes to systems like this: essentially, start with the simplest system in terms of implementation, measure how it performs, and then slowly add more complexity to it. How does that idea relate to what we are discussing here?
NV: That's a great question, actually. The good and bad thing here is that metrics are hard to define and they're never going to be perfect. But that shouldn't stop you from trying. Even in earlier NLP work, when people started working on machine translation, there were metrics like BLEU and ROUGE, which are not perfect but help you determine the extent to which the system works. The great thing now is that because you have new tools to do evaluation, you have LLMs, you can write programs, and you have access to all these different types of data, the power is back in your hands as a developer to think about the users and craft innovative metrics that apply just to you.
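As an illustration of crafting a metric for your own use case, here is a minimal sketch in Python: a unigram-overlap F1 score (a rough stand-in for BLEU/ROUGE-style lexical metrics) combined with a simple domain rule. The weighting and the "must mention" rule are made-up assumptions, not something the panelists prescribed.

```python
from collections import Counter

def unigram_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1, a rough stand-in for BLEU/ROUGE-style lexical metrics."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

def custom_metric(prediction: str, reference: str, must_mention: list[str]) -> float:
    """Hypothetical application-specific metric: lexical overlap plus a domain rule
    (the answer must mention certain required terms)."""
    coverage = sum(term.lower() in prediction.lower() for term in must_mention) / max(len(must_mention), 1)
    return 0.7 * unigram_f1(prediction, reference) + 0.3 * coverage

# Example: score one model answer against a reference for a refund-policy question.
score = custom_metric(
    prediction="You can request a refund within 30 days of purchase.",
    reference="Refunds are available within 30 days of purchase.",
    must_mention=["refund", "30 days"],
)
print(f"{score:.2f}")
```

The design choice here is the same one NV describes: each signal is cheap and imperfect on its own, but combining them gives a metric that tracks what matters for your specific users.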
AF: Coming back to Percy, your PhD research is on formal verification of software systems, and obviously LLM-based systems are going to be part of that. For those of us who are closer to the cutting edge of what is happening, what scenarios can you imagine where I would have to start thinking about formal verification?
PC: Formal verification sounds scary, but it's actually broader than what people think it is. When we talk about formal verification, most of the time I think of it as applying all those mathematical tricks to establish some property of the neural network.
If the model is too complex, you can try to use a smaller model to approximate the larger one. Then you verify properties of the smaller model and hope they transfer to the larger model.
In the real world, most of the time we don't need it, which is actually a good thing, because it's very expensive and takes a lot of time to develop. But if we absolutely care about one aspect of the system, then maybe we want to start thinking about it: say you have legal requirements about the safety or bias of the model, and you absolutely have to get that right.
It can also be a way to showcase what your model is capable of. For example, if you're fine-tuning a language model, you can showcase its robustness. We have different ways to do formal verification of robustness, and you can show that your model not only performs well on a given input, but also performs consistently on a diverse set of inputs.
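To give a flavor of what robustness verification can look like, here is a minimal sketch of interval bound propagation for a tiny ReLU classifier in NumPy. This is one standard certification technique chosen as an illustration, not something Percy specifically described, and the weights are random placeholders. The idea: if the certified lower bound of the target class logit exceeds the certified upper bound of every other class logit under an L-infinity perturbation, the prediction is provably unchanged for all inputs in that ball.

```python
import numpy as np

def interval_linear(l, u, W, b):
    """Propagate an input box [l, u] through a linear layer y = W @ x + b."""
    W_pos, W_neg = np.maximum(W, 0.0), np.minimum(W, 0.0)
    return W_pos @ l + W_neg @ u + b, W_pos @ u + W_neg @ l + b

def certify_robustness(x, epsilon, layers, target):
    """Interval bound propagation: returns True if the target class provably keeps
    the highest logit for every input within an L-inf ball of radius epsilon."""
    l, u = x - epsilon, x + epsilon
    for i, (W, b) in enumerate(layers):
        l, u = interval_linear(l, u, W, b)
        if i < len(layers) - 1:  # ReLU on hidden layers only
            l, u = np.maximum(l, 0.0), np.maximum(u, 0.0)
    # Certified if the worst-case target logit beats the best case of every other logit.
    others = [j for j in range(len(l)) if j != target]
    return bool(l[target] > max(u[j] for j in others))

# Toy example with random placeholder weights (2 inputs -> 4 hidden -> 3 classes).
rng = np.random.default_rng(0)
layers = [(rng.normal(size=(4, 2)), rng.normal(size=4)),
          (rng.normal(size=(3, 4)), rng.normal(size=3))]
x = np.array([0.5, -0.2])
print(certify_robustness(x, epsilon=0.05, layers=layers, target=0))
```

The bounds are sound but conservative: a `False` result means the method could not certify robustness, not that an adversarial input necessarily exists.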
What's also really valuable about formal verification is that, a lot of the time, it helps us understand different properties in a restricted environment. For example, it can help analyze the internal behavior of the neural network and find that removing certain neurons from the large model actually improves performance.