Testing Strategies for LLMs - SHERPA - Open Source Project Update, 2023-12-08

Video Link: https://www.youtube.com/watch?v=g1CHPoFKGt4



Category: Vlog
Duration: 30:24


Presenter: Percy Chen

Summary
========
In this session, we explored the unique challenges and methods of testing Large Language Model (LLM)-based systems, contrasting them with traditional software testing approaches. Using our open-source Sherpa project as a case study, we demonstrated practical implementations of these testing strategies, highlighting how they ensure the robustness and reliability of the components interacting with LLMs. The session also included the latest updates to the Sherpa project, ending with some open questions and challenges we will tackle next.

Topics
=====
⃝ Overview of the Sherpa Project
* The Sherpa project aims to create a collaborative multi-agent framework that allows human interaction.
* The project includes a question-answering agent in the Slack workspace and integrates various tools via APIs.
* The goal of the current sprint is to test the functionality of these components and ensure the system’s robustness.

⃝ Testing Roadmap
* The testing roadmap progresses from limited functionality to a more comprehensive system-level testing framework.
* Efforts have been made to set up a suitable testing environment using GitHub Actions.
* The roadmap aims to establish a reasonable testing approach that enables continuous integration and early intervention when issues arise.

⃝ Testing Objectives
* The system should be able to handle different types of responses and provide relevant links and references to the user (citation verification; a sketch follows this list).
* Even in cases of incorrect results, the system should notify the user and suggest alternative actions.
* The impact of updates to OpenAI APIs on Sherpa’s performance needs to be assessed and mitigated.
* The interactions between agents in the multi-agent framework need to be tested and their sequential dependencies maintained.
* The cost of calling external APIs must be considered to avoid unnecessary expenses during testing.
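
As a rough illustration of the citation-verification objective, a check like the one below can assert that an answer contains at least one link the user can follow. The `has_citation` helper and the hard-coded answer are illustrative placeholders, not Sherpa's actual interface:

    import re

    URL_PATTERN = re.compile(r"https?://\S+")

    def has_citation(answer: str) -> bool:
        """Return True if the agent's answer contains at least one reference link."""
        return bool(URL_PATTERN.search(answer))

    def test_answer_includes_reference_link():
        # In a real test the answer would come from the Slack QA agent;
        # it is hard-coded here only to keep the sketch self-contained.
        answer = "Sherpa is a multi-agent framework. Docs: https://example.com/sherpa-docs"
        assert has_citation(answer), "expected at least one reference link in the answer"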

⃝ Separation of Testing Concerns
* Tests have to handle the diversity and variability of the model’s responses.
* Frequent, more deterministic software tests should ensure that the rest of the system functions as expected.
* Given that most prompts are crafted on the fly by the system, the structure and correctness of those prompts also have to be tested (see the sketch after this list).
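
Because prompts are assembled at runtime, their structure can be checked deterministically without calling the LLM at all. The sketch below assumes a hypothetical `build_qa_prompt` helper; Sherpa's real prompt templates may differ:

    def build_qa_prompt(question: str, context_snippets: list[str]) -> str:
        """Assemble the prompt the QA agent would send to the LLM (hypothetical builder)."""
        context = "\n".join(f"- {s}" for s in context_snippets)
        return (
            "Answer the question using only the context below.\n"
            f"Context:\n{context}\n"
            f"Question: {question}\n"
            "Answer:"
        )

    def test_prompt_contains_question_and_context():
        prompt = build_qa_prompt(
            "What does Sherpa do?",
            ["Sherpa is a collaborative multi-agent framework."],
        )
        # Deterministic structural checks: no LLM call is needed.
        assert "What does Sherpa do?" in prompt
        assert "Sherpa is a collaborative multi-agent framework." in prompt
        assert prompt.rstrip().endswith("Answer:")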

⃝ Testing Strategies for Large Models
* Unit testing focuses on specific functionalities within the system to ensure their correctness, for example checking that a model response has the correct format.
* Integration testing examines the interactions between different components, for example checking that the output of one model, when fed into another, constructs the right prompt.
* A robust testing setup ensures that issues can be identified and resolved early, which matters given the experimental, in-progress nature of the project.
* Use case testing ensures that the model performs correctly in real-world scenarios.
* Acceptance criteria for tests require a certain level of determinism, which is a challenge for LLMs. This is addressed by splitting test runs into frequent and infrequent categories.
* For frequent runs, “mock testing” simulates the behavior of the LLM components without actually calling the (costly) external APIs.
* Infrequent runs call the third-party APIs directly, and the system caches the input and response of each test case so they can be replayed during mock testing (see the sketch after this list).
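
A minimal sketch of this frequent/infrequent split, assuming a JSON file cache and a `call_openai` wrapper that stands in for the real API call; none of these names reflect Sherpa's actual implementation:

    import json
    from pathlib import Path

    CACHE_DIR = Path("tests/data/llm_cache")  # assumed location for cached responses

    def call_openai(prompt: str) -> str:
        """Stand-in for the real OpenAI API call used only on infrequent runs."""
        raise NotImplementedError("wire this to the real API for infrequent, full test runs")

    def cached_llm_call(prompt: str, cache_name: str, use_real_api: bool = False) -> str:
        """Return the LLM response, refreshing the cache when the real API is used."""
        cache_file = CACHE_DIR / f"{cache_name}.json"
        if use_real_api:
            # Infrequent runs: call the third-party API and store the input/response pair.
            response = call_openai(prompt)
            CACHE_DIR.mkdir(parents=True, exist_ok=True)
            cache_file.write_text(json.dumps({"prompt": prompt, "response": response}))
            return response
        # Frequent runs: replay the cached response, so tests stay cheap and deterministic.
        return json.loads(cache_file.read_text())["response"]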

⃝ Unified Testing Approach
* A unified testing approach combines mock testing and real testing to save time and effort.
* During frequent test runs, the cached responses are loaded, simulating the behavior of the large model.
* This approach keeps testing consistent and efficient (a fixture-style sketch follows below).
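
One way to unify the two modes is a single pytest fixture that replays cached responses by default and switches to the real API when an environment variable is set. The `USE_REAL_LLM` flag, the `llm_call` fixture, and the import path are assumptions layered on the caching sketch above:

    import os

    import pytest

    # Builds on the hypothetical `cached_llm_call` helper sketched above.
    from tests.llm_cache import cached_llm_call  # assumed module path

    @pytest.fixture
    def llm_call():
        """Use the real API when USE_REAL_LLM=1, cached responses otherwise."""
        use_real_api = os.environ.get("USE_REAL_LLM") == "1"

        def _call(prompt: str, cache_name: str) -> str:
            return cached_llm_call(prompt, cache_name, use_real_api=use_real_api)

        return _call

    def test_summary_response_is_nonempty(llm_call):
        # The same test body covers both the frequent (mocked) and infrequent (real) runs.
        answer = llm_call("Summarize the Sherpa project in one sentence.", "summary_sentence")
        assert isinstance(answer, str) and answer.strip()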







Tags:
deep learning
machine learning