How Do You Validate LLM Systems Beyond Benchmarks?
Check out my essays: https://aisc.substack.com/
OR book me to talk: https://calendly.com/amirfzpr
OR subscribe to our event calendar: https://lu.ma/aisc-llm-school
AF: We were talking about user experience and the difference between evaluation, as in "is the system good at math?", versus validation, which is "is the system good at the type of math that my client cares about?"
One thing that can become tricky is that you might end up fooling yourself into thinking your system is good based on benchmarks that are not really related to what you're doing. They might resemble your task, but they're not the same.
You need a very specific dataset that is directly relevant to your user's journey and workflow. Collecting that data within the workflow itself is probably one of the best and most reliable ways to do it. If users get bored or distracted while providing it, that data won't be reliable. The quality of the data is going to be the make-or-break of this story.
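As a rough illustration of what "collecting eval data within the workflow" could look like, here is a minimal Python sketch. The names (`log_interaction`, `evaluate`, `EVAL_PATH`, `run_model`) and the exact-match scoring are assumptions made for illustration, not anything AF or DL describe: each real interaction is stored together with the user's in-workflow accept/reject signal, and candidate models are later scored only against those cases rather than a generic benchmark.

```python
import json
from dataclasses import dataclass, asdict
from pathlib import Path

# Hypothetical file path for the accumulated, workflow-derived eval set.
EVAL_PATH = Path("workflow_eval_set.jsonl")


@dataclass
class EvalCase:
    user_input: str      # what the user actually asked inside their workflow
    model_output: str    # what the system answered at the time
    user_accepted: bool  # in-workflow signal, e.g. thumbs up or "kept without edits"


def log_interaction(user_input: str, model_output: str, user_accepted: bool) -> None:
    """Capture each real interaction as a labeled eval case at the moment it happens."""
    case = EvalCase(user_input, model_output, user_accepted)
    with EVAL_PATH.open("a") as f:
        f.write(json.dumps(asdict(case)) + "\n")


def evaluate(run_model) -> float:
    """Score a candidate model only on cases drawn from the user's own workflow.

    Exact match against previously accepted answers is a placeholder metric;
    a real system would use whatever task-specific check the client cares about.
    """
    cases = [EvalCase(**json.loads(line)) for line in EVAL_PATH.open()]
    accepted = [c for c in cases if c.user_accepted]
    if not accepted:
        return 0.0
    hits = sum(run_model(c.user_input).strip() == c.model_output.strip() for c in accepted)
    return hits / len(accepted)
```

The point of the sketch is only the source of the data: the eval set grows out of the user's actual workflow, so the score reflects the type of problems the client cares about rather than a benchmark that merely resembles them.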
DL: That's the challenge of product development.
We see it with a lot of users: they'll spend hours in VoiceFlow, which is amazing, and there are always gaps in our product where you can optimize, where you can improve that workflow. So it's just that prioritization decision, right? Where is labeling helpful? Where is UX helpful? Where is a whole new feature helpful? How do you prioritize front-end improvements, ML improvements, integration improvements, these UX features?
Even assuming you have all the developers and designers in the world, how do you create an app for doing all this that's workflow-centric but not overwhelming?