How do you improve your RAG pipeline?

Video Link: https://www.youtube.com/watch?v=MxTaJskT84E



Duration: 7:13


AF: Practically, if you have a bunch of modules that are non-deterministic in how they interact with each other, it can get out of hand and chaotic very quickly.

IY: I think eventually the term "modular RAG", or even RAG itself as a term, might change later on. What you're designing here is not just a model anymore, and not just a specific workflow anymore. What you're building here is an actual system.

Each of the modules can be seen as a primitive; the vector database itself can become a primitive, and you can create multiple primitives. There are so many possible primitives, and some of them may not even be generative.

There are many ways of designing these primitives, and I see two paradigms. Currently, a lot of people are just creating RAG within their stand-alone service. But if you start to work with other teams, like application developers and DevOps, you might want to start thinking about the larger picture in terms of your primitives. Sometimes, as scientists or engineers, you can expose these different primitives to application developers, and it's very interesting when you start seeing how other people design those orchestrations.
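The idea of exposing RAG stages as composable primitives that other teams can orchestrate could be sketched like this. This is an illustrative toy, not a real library: the names `Primitive`, `retrieve`, `rerank`, and `run_pipeline` are all assumptions, and the retriever and re-ranker are deliberately trivial stand-ins.

```python
from typing import Callable, Dict, List

# A "primitive" here is just a function that takes the shared pipeline
# state and returns an updated copy (hypothetical convention).
Primitive = Callable[[dict], dict]

def retrieve(state: dict) -> dict:
    # Toy retriever: keep documents that contain the query term.
    docs = [d for d in state["corpus"] if state["query"].lower() in d.lower()]
    return {**state, "docs": docs}

def rerank(state: dict) -> dict:
    # Toy re-ranker: shorter documents first (stand-in for a real scorer).
    return {**state, "docs": sorted(state["docs"], key=len)}

def run_pipeline(primitives: List[Primitive], state: dict) -> dict:
    # Application developers orchestrate by choosing and ordering primitives.
    for p in primitives:
        state = p(state)
    return state

state = run_pipeline(
    [retrieve, rerank],
    {
        "query": "vector",
        "corpus": [
            "Vector databases store embeddings.",
            "A vector index speeds search.",
            "LLMs generate text.",
        ],
    },
)
```

Because each stage shares one small interface, swapping a primitive (say, a different re-ranker) doesn't require touching the rest of the pipeline.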

AF: Nikhil spoke about post-processing of retrieved results in an advanced RAG setup: the re-ranking you can do, the summarization you can do. I'm seeing scenarios where the retrieved chunks have so much content, or they're very sparse, let's say there are a bunch of tables, to the point that the model gets confused when it is generating the end result. There's so much or so little, in a weird structure, that summarization is not useful and re-ranking is not useful. What else can be used to process the information after retrieval if that type of confusion is the problem the RAG system has?

SP: That's a scenario people run into quite often. It depends on what kind of queries are being issued. In RAG 101 scenarios, maybe those are FAQs where the retrieval results contain the direct responses to the question, and that's all okay. But as you go up the complexity ladder, the queries become more ambiguous, and the retrieved context becomes more ambiguous too.

I call this the "refine" stage of the RAG pipeline: you have the retrieved results and you refine them. "Refining" can include many things, such as summarization or creating notes. Summarization actually can help if it is targeted: not "hey, just generate a summary out of these retrieval results", but with more specific instructions. In the end, you want the result to convey certain information: whether the answer to the question is directly included in the retrieval results, whether they just contain additional context the LLM can use to generate the answer, or whether they do not contain anything relevant at all.

I think there needs to be an intermediate step where you have that spelled out using a smaller LLM and then you generate that refined text and that goes to your larger LLM.
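A targeted refine prompt of the kind described above might look like the sketch below. The prompt wording, the three-way classification, and the `build_refine_prompt` helper are all illustrative assumptions; the actual call to a smaller LLM is left out since it depends on your provider.

```python
# Hypothetical "refine" stage prompt: instead of "just summarize",
# ask a smaller LLM to say what the retrieved chunks actually contain
# before the larger LLM generates the final answer.
REFINE_PROMPT = """Given the question and the retrieved chunks below, state which holds:
(a) the chunks directly answer the question,
(b) the chunks only give supporting context,
(c) the chunks contain nothing relevant.
Then write a short note containing only the information relevant to the question.

Question: {question}

Chunks:
{chunks}
"""

def build_refine_prompt(question: str, chunks: list) -> str:
    # Number the chunks so the refine model can reference them.
    numbered = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return REFINE_PROMPT.format(question=question, chunks=numbered)

prompt = build_refine_prompt(
    "What is the refund window?",
    ["Refunds are accepted within 30 days.", "Shipping takes 5 days."],
)
# `prompt` would be sent to the smaller LLM; its refined note then
# replaces the raw chunks in the context of the larger LLM.
```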

AF: Quickly with the last question, I have a RAG system, it doesn't work. How do I go about figuring out what's wrong? What are the pitfalls that I might face and what's the secret sauce of dealing with this problem?

NV: It's the million dollar question!

The key is evaluation. Whenever you have some version of a RAG setup, you need to know what works and what doesn't. If you have established that, then you have a clear, transparent view into what is and isn't working. Then you can easily figure out: do I need to make a component change? Do I need to use a different LLM, make some prompt changes, or do I need to fundamentally change a pattern of operation within the pipeline? That's really the first decision you need to make.

The trick actually is to be disciplined and not make changes haphazardly, because when you do that, you have no idea what works. There are so many different moving pieces. You can keep building on all of these versions of RAG over time, and you can always fall back to one of them later if you feel there is something new you can try. The secret sauce is evaluation.
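The disciplined-evaluation point could be made concrete with a tiny harness: score each pipeline version against a fixed eval set so changes are compared, not guessed at. Everything here is a minimal sketch under assumed names (`hit_rate`, `pipeline_v1`); the metric is a simple retrieval hit rate, and a real setup would add answer-quality metrics on top.

```python
def hit_rate(pipeline, eval_set):
    # Fraction of eval queries for which some retrieved document
    # contains the expected string (a deliberately crude metric).
    hits = 0
    for example in eval_set:
        retrieved = pipeline(example["query"])
        if any(example["expected"] in doc for doc in retrieved):
            hits += 1
    return hits / len(eval_set)

CORPUS = [
    "Paris is the capital of France.",
    "The Nile is in Africa.",
]

def pipeline_v1(query):
    # Version 1: naive keyword-overlap retrieval over the toy corpus.
    words = query.lower().split()
    return [d for d in CORPUS if any(w in d.lower() for w in words)]

eval_set = [
    {"query": "capital of France", "expected": "Paris"},
    {"query": "Where is the Nile?", "expected": "Nile"},
]

score = hit_rate(pipeline_v1, eval_set)
```

With a fixed eval set like this, a `pipeline_v2` can be scored with the same call, and the comparison tells you whether a component change actually helped before you commit to it.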

The pitfall is you might not be able to find the right pattern. There are fundamentally two different patterns that you could use. Both of them could be good. Or not. So which one do I invest in? At some point you need to take a leap of faith and see, okay, there are infinite patterns that I could use, but this is the one that I can move fast on, so I picked that. That's your greedy search strategy.

One of the things I explained in the talk is the reason RAG exploded: the barrier to entry was low and you didn't need to fine-tune. But if you're moving in a direction where those advantages really become kind of shackles on your process then it really defeats the purpose. So you want to focus on agility, evaluation, and quick decision making.







Tags:
deep learning
machine learning