"A deep dive into why RAG doesn’t always work".
It should be easy: "install a popular LLM orchestrator like LangChain or LlamaIndex, turn your data into vectors, index those in a vector database, and quickly set up a pipeline with a default prompt."
However, while "quick-and-dirty demos are great for understanding the basics... [it's] more than just stringing together some code. It’s about navigating the realities of messy data, unforeseen user queries, and the ever-present pressure to deliver tangible business value".
Ahmed Besbes first sets out the "business imperatives ... [then] dive[s] into the common technical hurdles... and discuss[es] strategies to overcome them".
Clarify business value
- "Know your users and what they do... then discuss how RAGs and LLMs can help. Take sample questions and simulate how the system would handle them...
- Educate non-technical users on the capabilities and limitations ...
- Understand the user journey: What types of questions ... How will the generated answers be used ... how your RAG will integrate into an existing workflow ...
- Anticipate the data to be indexed... understand why and how it will be used to form a good answer ...
- Define success criteria... a clear way to compute the ROI... iterate continuously"
Understand what you’re indexing & improve chunk quality
Besbes sets out how to handle multimodal data (text, images, tables & code) and explores different chunking techniques: "in general ... analyze the data precisely and come up with a chunking rule that makes sense from a business perspective...:
- Leverage document metadata... to provide contextually relevant chunks"
- chunk length depends on content: verbose documents require longer chunks than those "written in bullet points"
- some data - e.g. Jira tickets - doesn’t need chunking at all: "it’s relatively short and self-contained"
- "semantic chunking... generates chunks that are semantically relevant... time consuming since it relies on embedding models underneath" (sketched below)
Improve pre-retrieval
Pre-retrieval: refine the query before it reaches the retriever so the system "retrieves the most relevant and high-quality documents...:
- "use an LLM to rephrase" users' badly worded queries, perhaps using a chat to-and-fro.
- Hypothetical Document Embedding (HyDE) uses a GPT to create an answer (and so may be inaccurate), that is then embedded and used to query the RAG system
- "query augmentation combine the original query and the preliminary generated outputs as a new query... inspire the language model to rethink"
Improve retrieval
Given a query, basic retrieval "fetch[es] the most relevant documents from the vector database... [and] can be enhanced with additional techniques...:
- Hybrid search ... combines vector search and keyword search... maintaining control over the semantics while matching the exact terms of the query... [useful] when you search for specific keywords that the embedding model isn’t trained on or is incapable of matching with vector similarity (e.g. product names, technical terms...) (see the fusion sketch after this list)
- Filter on metadata: "Each vector ... [has] metadata. When querying ... use this metadata to pre-filter the space of vectors". This lowers compute cost and increases result quality... if the metadata is consistently applied, of course.
- "Test multiple embedding models...
- Fine-tune an embedding model... If you have positive pairs of related sentences like (query, context) or (query, answer)"
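As a sketch of hybrid search, the snippet below fuses BM25 keyword ranking with vector similarity using reciprocal rank fusion; the rank_bm25 and sentence-transformers libraries, the toy corpus and the fusion constant are illustrative assumptions, not the article's code.

```python
# Hybrid search sketch: rank documents by BM25 and by embedding
# similarity, then fuse the two rankings with reciprocal rank fusion.
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer
import numpy as np

docs = [
    "Reset the ACME-X200 router by holding the button for ten seconds.",
    "Vector databases store embeddings for similarity search.",
    "Hybrid search mixes keyword matching and semantic retrieval.",
]
model = SentenceTransformer("all-MiniLM-L6-v2")
doc_vecs = model.encode(docs, normalize_embeddings=True)
bm25 = BM25Okapi([d.lower().split() for d in docs])

def hybrid_search(query: str, k: int = 3, rrf_k: int = 60):
    kw_rank = np.argsort(-bm25.get_scores(query.lower().split()))
    q_vec = model.encode(query, normalize_embeddings=True)
    vec_rank = np.argsort(-(doc_vecs @ q_vec))
    # Reciprocal rank fusion: each document scores 1/(rrf_k + rank)
    # in each ranked list; sum the two contributions.
    scores = {}
    for ranks in (kw_rank, vec_rank):
        for rank, idx in enumerate(ranks):
            scores[idx] = scores.get(idx, 0.0) + 1.0 / (rrf_k + rank)
    best = sorted(scores, key=scores.get, reverse=True)[:k]
    return [docs[i] for i in best]

print(hybrid_search("ACME-X200 reset"))  # exact product name favours BM25
```

The keyword leg is what catches the "product names, technical terms" an embedding model has never seen.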
Improve post-retrieval
"increase the relevancy of the documents after they are extracted from the database...:
- reranking ... re-ordering the documents based on their alignment/relatedness with the query... (see the cross-encoder sketch after this list)
- instruct an LLM to post-process ... rerank the documents and filter out unimportant sections ... [can] induce hallucinations"
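A common way to implement the reranking bullet is a cross-encoder, which scores each (query, document) pair jointly instead of comparing independent embeddings; a minimal sketch, assuming sentence-transformers and a public checkpoint chosen for illustration.

```python
# Reranking sketch: score (query, document) pairs with a cross-encoder
# and keep the highest-scoring documents.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, docs: list[str], top_k: int = 3) -> list[str]:
    scores = reranker.predict([(query, d) for d in docs])
    ranked = sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]
```

Unlike the LLM-as-postprocessor option, a cross-encoder cannot hallucinate: it only reorders what it is given.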
Generation
"Once the documents are retrieved and processed ... they are passed to the LLM as a context to generate the answer... enhance answer generation:
- "Define a system prompt": define how you want the system to behave - e.g. "write in a particular style... generate structured outputs" when that matches the use case (see the sketch after this list)
- "Include a few shot examples in your system prompt... [eg] add some input/output example pairs"
- use Chain-of-Thought prompting to invoke additional steps: "reasoning, summing up results or even taking a step back"
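Putting the system prompt and few-shot advice together, a minimal sketch assuming the openai client; the prompt wording, the example pair and the model name are all illustrative.

```python
# Generation sketch: a system prompt defines behaviour and output format,
# one few-shot pair anchors the expected style, then the real query is
# sent together with the retrieved context.
from openai import OpenAI

client = OpenAI()

def generate_answer(query: str, context: str) -> str:
    messages = [
        {"role": "system", "content": (
            "You are a support assistant. Answer only from the provided "
            "context; if it is insufficient, say so. Reply in short bullet points."
        )},
        # Few-shot example pair (illustrative).
        {"role": "user", "content": "Context: Password resets live under "
                                    "Settings > Account.\nQuestion: How do I reset my password?"},
        {"role": "assistant", "content": "- Open Settings > Account\n- Choose Reset password"},
        # The real query with its retrieved context.
        {"role": "user", "content": f"Context: {context}\nQuestion: {query}"},
    ]
    resp = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
    return resp.choices[0].message.content
```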
"Multiple frameworks perform these tasks... DSPY. It covers all these functionalities and provides an interesting paradigm to optimize prompt engineering."