LLM Evaluation, Validation, and Verification
Check out my essays: https://aisc.substack.com/
OR book me to talk: https://calendly.com/amirfzpr
OR subscribe to our event calendar: https://lu.ma/aisc-llm-school
AF: As a person who has done quite a bit of work in model risk assessment, model validation, and evaluation, and who has created libraries for these kinds of things in the past, tell us a little about your point of view and why this is important to talk about.
MH: As engineers working in larger organizations, we run into many blind spots and silos. PMs don't talk to engineers. Engineers don't talk to customers. There are always barriers between different teams. When it comes to a use case, it's very easy to fall into the trap of only thinking about the performance of the model in a silo.
What we ultimately care about is what the customer uses and what the customer sees. The focus shouldn't only be on how performant the search is or how performant the LLM is; the combination, and the user experience, is the most important thing. That's why I like the distinction between validation and evaluation. We focus a lot on evaluation and then lose sight of what you call validation.
AF: Makes a lot of sense. We've been seeing a lot of LLM guardrails coming out. Where do guardrails fit into this discussion?
IY: If there are outputs that are not up to a certain standard, you shouldn't return them. For example, in our case we have an internal knowledge base, so we know certain things to be true. If the output of the LLM is not going to be correct, then the guardrails catch it.
Because the LLM keeps changing, we have seen cases where the same prompt, the same temperature, the same everything still produces a different output. So having some sort of verification at usage time, whether logic-based, statistical, lexical, or even another LLM, is what guardrails are.
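To make that layering concrete, here is a minimal sketch of a usage-time guardrail in Python. The knowledge-base entries, banned-phrase list, and the `llm_verify` hook are hypothetical placeholders, not the speakers' actual implementation; the point is only how cheap lexical and logic checks can run before an optional LLM-based verifier.

```python
import re
from typing import Callable, Optional

# Hypothetical facts treated as ground truth (stand-in for an internal knowledge base).
KNOWN_FACTS = {"support_email": "help@example.com"}
BANNED_PATTERNS = [re.compile(r"\bguaranteed returns\b", re.IGNORECASE)]

def lexical_check(answer: str) -> bool:
    """Cheap lexical guardrail: reject answers containing banned phrases."""
    return not any(p.search(answer) for p in BANNED_PATTERNS)

def logic_check(answer: str) -> bool:
    """Logic guardrail: if the answer gives a contact address, it must match the knowledge base."""
    if "@" in answer and KNOWN_FACTS["support_email"] not in answer:
        return False
    return True

def guarded_answer(
    answer: str,
    llm_verify: Optional[Callable[[str], bool]] = None,
    fallback: str = "Sorry, I can't answer that reliably.",
) -> str:
    """Run guardrails at usage time; return a safe fallback if any check fails."""
    if not (lexical_check(answer) and logic_check(answer)):
        return fallback
    if llm_verify is not None and not llm_verify(answer):  # optional LLM-as-verifier
        return fallback
    return answer
```

In practice, `guarded_answer(model_output)` would simply wrap whatever the generation step returns before it reaches the user.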
AF: One of the comments you made that I really liked was that you should think about metric-driven development when it comes to systems like this: essentially, start with the simplest system in terms of implementation, measure how it performs, and then slowly add more complexity to it. How does that idea relate to what we are discussing here?
NV: That's a great question, actually. The good and bad thing here is that metrics are hard to define and they're never going to be perfect. But that shouldn't stop you from trying. Even in earlier NLP work, when people started working on machine translation, there were metrics like BLEU and ROUGE, which are not perfect but help you determine the extent to which the system works. The great thing now is that because you have new tools to do evaluation, you have LLMs, you can write programs, and you have access to all these different types of data, the power is back in your hands as a developer to think about the users and craft innovative metrics that apply just to you.
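As an illustration of crafting a metric for your own use case, here is a minimal sketch in Python: a unigram-overlap F1 score (a rough stand-in for BLEU/ROUGE-style lexical metrics) combined with a simple domain rule. The weighting and the "must mention" rule are made-up assumptions, not something the panelists prescribed.

```python
from collections import Counter

def unigram_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1, a rough stand-in for BLEU/ROUGE-style lexical metrics."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

def custom_metric(prediction: str, reference: str, must_mention: list[str]) -> float:
    """Hypothetical application-specific metric: lexical overlap plus a domain rule
    (the answer must mention certain required terms)."""
    coverage = sum(term.lower() in prediction.lower() for term in must_mention) / max(len(must_mention), 1)
    return 0.7 * unigram_f1(prediction, reference) + 0.3 * coverage

# Example: score one model answer against a reference for a refund-policy question.
score = custom_metric(
    prediction="You can request a refund within 30 days of purchase.",
    reference="Refunds are available within 30 days of purchase.",
    must_mention=["refund", "30 days"],
)
print(f"{score:.2f}")
```

The design choice here is the same one NV describes: each signal is cheap and imperfect on its own, but combining them gives a metric that tracks what matters for your specific users.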
AF: Coming back to Percy, your PhD research is on formal verification of software systems, and obviously LLM-based systems are going to be part of that. For those of us who are closer to the cutting edge of what is happening, what scenarios can you imagine where I would have to start thinking about formal verification?
PC: Formal verification sounds scary, but it's actually broader than what people think it is. When we talk about formal verification, most of the time I think of it as applying all those mathematical tricks to establish some property of the neural network.
If the model is too complex, you can try to use a smaller model to approximate the larger one. Then you verify properties of the smaller model and hope they transfer to the larger model.
In the real world, most of the time we don't need it, which is actually a good thing, because it's very expensive and takes a lot of time to develop. But if we absolutely care about one aspect of the system, then maybe we want to start thinking about it: say you have legal requirements about the safety or bias of the model, and you absolutely have to get that right.
It can also be a way to showcase what your model is capable of. For example, if you're fine-tuning a language model, you can showcase its robustness. We have different ways to do formal verification of robustness, and you can show that your model not only performs well on a given input, but also performs consistently on a diverse set of inputs.
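To give a flavor of what robustness verification can look like, here is a minimal sketch of interval bound propagation for a tiny ReLU classifier in NumPy. This is one standard certification technique chosen as an illustration, not something Percy specifically described, and the weights are random placeholders. The idea: if the certified lower bound of the target class logit exceeds the certified upper bound of every other class logit under an L-infinity perturbation, the prediction is provably unchanged for all inputs in that ball.

```python
import numpy as np

def interval_linear(l, u, W, b):
    """Propagate an input box [l, u] through a linear layer y = W @ x + b."""
    W_pos, W_neg = np.maximum(W, 0.0), np.minimum(W, 0.0)
    return W_pos @ l + W_neg @ u + b, W_pos @ u + W_neg @ l + b

def certify_robustness(x, epsilon, layers, target):
    """Interval bound propagation: returns True if the target class provably keeps
    the highest logit for every input within an L-inf ball of radius epsilon."""
    l, u = x - epsilon, x + epsilon
    for i, (W, b) in enumerate(layers):
        l, u = interval_linear(l, u, W, b)
        if i < len(layers) - 1:  # ReLU on hidden layers only
            l, u = np.maximum(l, 0.0), np.maximum(u, 0.0)
    # Certified if the worst-case target logit beats the best case of every other logit.
    others = [j for j in range(len(l)) if j != target]
    return bool(l[target] > max(u[j] for j in others))

# Toy example with random placeholder weights (2 inputs -> 4 hidden -> 3 classes).
rng = np.random.default_rng(0)
layers = [(rng.normal(size=(4, 2)), rng.normal(size=4)),
          (rng.normal(size=(3, 4)), rng.normal(size=3))]
x = np.array([0.5, -0.2])
print(certify_robustness(x, epsilon=0.05, layers=layers, target=0))
```

The bounds are sound but conservative: a `False` result means the method could not certify robustness, not that an adversarial input necessarily exists.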
What's also really valuable about formal verification is that, a lot of the time, it helps us understand different properties in a restricted environment. For example, it can help analyze the internal behavior of the neural network and find that removing certain neurons from the large model actually improves performance.