Building an LLM Testing API
Check out my essays: https://aisc.substack.com/
OR book me to talk: https://calendly.com/amirfzpr
OR subscribe to our event calendar: https://lu.ma/aisc-llm-school
OR sign up for our LLM course: https://maven.com/aggregate-intellect/llm-systems
Challenges of testing Conversational AI systems:
- There's no single agreed-upon approach for unit testing or regression testing in the world of chatbots.
- Traditional metrics (accuracy, precision, recall) might not capture user-facing issues like prompt leakage or language drift.
- Annotating data for testing is expensive and time-consuming.
Framework for automated testing:
- Leverages generative models to automatically create question-answer pairs for testing a knowledge base system.
- Users define prompts, and the system generates questions and checks the responses for accuracy against the knowledge base (see the sketch after this list).
- The framework can be used for integration testing as well as for evaluating responses from large language models.
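A minimal sketch of what such a test loop might look like, assuming an OpenAI-compatible chat client. The model name, the `knowledge_base` list of passages, and the `chatbot_answer` wrapper around the system under test are hypothetical placeholders, not the speaker's actual framework.

```python
# Sketch of the automated QA-pair generation and checking loop described above.
# Assumptions (not from the talk): an OpenAI-compatible client, a list of
# knowledge-base passages, and a `chatbot_answer(question)` callable that
# queries the system under test.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"  # placeholder model name


def generate_qa_pair(passage: str) -> tuple[str, str]:
    """Ask a generative model to write one question/answer pair from a passage."""
    prompt = (
        "Write one factual question answerable from the passage below, "
        "then the answer, separated by a line containing only '---'.\n\n"
        f"Passage:\n{passage}"
    )
    reply = client.chat.completions.create(
        model=MODEL, messages=[{"role": "user", "content": prompt}]
    ).choices[0].message.content
    question, _, answer = reply.partition("---")
    return question.strip(), answer.strip()


def judge(question: str, expected: str, actual: str) -> bool:
    """Use the model as a grader: does the chatbot's answer match the expected one?"""
    verdict = client.chat.completions.create(
        model=MODEL,
        messages=[{
            "role": "user",
            "content": (
                f"Question: {question}\nExpected answer: {expected}\n"
                f"Chatbot answer: {actual}\nReply with PASS or FAIL only."
            ),
        }],
    ).choices[0].message.content
    return verdict.strip().upper().startswith("PASS")


def run_regression(knowledge_base: list[str], chatbot_answer) -> float:
    """Generate one test case per passage and report the overall pass rate."""
    results = [
        judge(q, expected, chatbot_answer(q))
        for q, expected in (generate_qa_pair(p) for p in knowledge_base)
    ]
    return sum(results) / len(results) if results else 0.0
```

The same loop works as an integration test (run it against the deployed chatbot endpoint) or as an offline evaluation of candidate LLMs against the same generated question set.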
Key Learnings:
- Most users still test chatbots manually.
- It's important to focus on testing that reflects real-world use cases and business goals.
- Start with a Minimum Viable Product for testing internally and iterate based on user feedback.
- Consider a human-in-the-loop approach to data annotation, where humans curate outputs from generative models (a minimal sketch follows this list).
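One way the curation step could look, as a hedged sketch: reviewers keep, edit, or drop each model-generated question/answer pair before it enters the test suite. The `curate` function and the output file name are illustrative, and the candidate pairs are assumed to come from a generator like the one sketched above.

```python
# Human-in-the-loop curation sketch: a reviewer vets each generated test case
# before it is persisted. All names here are illustrative, not the talk's API.
import json


def curate(candidates: list[tuple[str, str]], out_path: str = "curated_tests.jsonl") -> None:
    """Interactively review generated test cases and persist the accepted ones."""
    with open(out_path, "w", encoding="utf-8") as f:
        for question, answer in candidates:
            print(f"\nQ: {question}\nA: {answer}")
            choice = input("[k]eep / [e]dit / [d]rop? ").strip().lower()
            if choice == "d":
                continue  # reviewer rejects this pair
            if choice == "e":
                question = input("Revised question: ") or question
                answer = input("Revised answer: ") or answer
            f.write(json.dumps({"question": question, "answer": answer}) + "\n")
```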
Open questions and future directions:
- How to effectively incorporate human feedback into the testing process, considering factors like cultural norms and brand voice.
- How to balance the trade-offs between different large language models (e.g., conversational fluency vs. factual accuracy).