"A deep dive into why RAG doesn’t always work".
It should be easy: "install a popular LLM orchestrator like LangChain or LlamaIndex, turn your data into vectors, index those in a vector database, and quickly set up a pipeline with a default prompt."
However, while "quick-and-dirty demos are great for understanding the basics... [it's] more than just stringing together some code. It’s about navigating the realities of messy data, unforeseen user queries, and the ever-present pressure to deliver tangible business value".
Ahmed Besbes first sets out the "business imperatives ... [then] dive[s] into the common technical hurdles... and discuss[es] strategies to overcome them".
Clarify business value
- "Know your users and what they do... then discuss how RAGs and LLMs can help. Take sample questions and simulate how the system would handle them...
- Educate non-technical users on the capabilities and limitations ...
- Understand the user journey: What types of questions ... How will the generated answers be used ... how your RAG will integrate into an existing workflow ...
- Anticipate the data to be indexed... understand why and how it will be used to form a good answer ...
- Define success criteria... a clear way to compute the ROI... iterate continuously"
Understand what you’re indexing & improve chunk quality
Besbes sets out how to handle multimodal data (text, images, tables & code) and explores different chunking techniques: "in general ... analyze the data precisely and come up with a chunking rule that makes sense from a business perspective...:
- Leverage document metadata... to provide contextually relevant chunks"
- chunk length depends on content: verbose documents require longer chunks than those "written in bullet points"
- some data - e.g. Jira tickets - doesn’t need chunking at all: "it’s relatively short and self-contained"
- "semantic chunking... generates chunks that are semantically relevant... time consuming since it relies on embedding models underneath" (sketched below)
Improve pre-retrieval
Pre-retrieval: refine the query before it reaches the retriever so the system "retrieves the most relevant and high-quality documents...:
- "use an LLM to rephrase" users' badly worded queries, perhaps using a chat to-and-fro.
- Hypothetical Document Embedding (HyDE) uses a GPT to create an answer (and so may be inaccurate), that is then embedded and used to query the RAG system
- "query augmentation combine the original query and the preliminary generated outputs as a new query... inspire the language model to rethink"
Improve retrieval
Given a query, basic retrieval "fetch[es] the most relevant documents from the vector database... [and] can be enhanced with additional techniques...:
- Hybrid search ... combines vector search and keyword search... maintaining control over the semantics while matching the exact terms of the query... [useful] when you search for specific keywords that the embedding model isn’t trained on or is incapable of matching with vector similarity (e.g. product names, technical terms...) (see the fusion sketch after this list)
- Filter on metadata: "Each vector ... [has] metadata. When querying ... use this metadata to pre-filter the space of vectors". This lowers compute cost and increases result quality... if the metadata is consistently applied, of course.
- "Test multiple embedding models...
- Fine-tune an embedding model... If you have positive pairs of related sentences like (query, context) or (query, answer)"
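As a sketch of hybrid search, the snippet below fuses BM25 keyword ranking with vector similarity using reciprocal rank fusion; the rank_bm25 and sentence-transformers libraries, the toy corpus and the fusion constant are illustrative assumptions, not the article's code.

```python
# Hybrid search sketch: rank documents by BM25 and by embedding
# similarity, then fuse the two rankings with reciprocal rank fusion.
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer
import numpy as np

docs = [
    "Reset the ACME-X200 router by holding the button for ten seconds.",
    "Vector databases store embeddings for similarity search.",
    "Hybrid search mixes keyword matching and semantic retrieval.",
]
model = SentenceTransformer("all-MiniLM-L6-v2")
doc_vecs = model.encode(docs, normalize_embeddings=True)
bm25 = BM25Okapi([d.lower().split() for d in docs])

def hybrid_search(query: str, k: int = 3, rrf_k: int = 60):
    kw_rank = np.argsort(-bm25.get_scores(query.lower().split()))
    q_vec = model.encode(query, normalize_embeddings=True)
    vec_rank = np.argsort(-(doc_vecs @ q_vec))
    # Reciprocal rank fusion: each document scores 1/(rrf_k + rank)
    # in each ranked list; sum the two contributions.
    scores = {}
    for ranks in (kw_rank, vec_rank):
        for rank, idx in enumerate(ranks):
            scores[idx] = scores.get(idx, 0.0) + 1.0 / (rrf_k + rank)
    best = sorted(scores, key=scores.get, reverse=True)[:k]
    return [docs[i] for i in best]

print(hybrid_search("ACME-X200 reset"))  # exact product name favours BM25
```

The keyword leg is what catches the "product names, technical terms" an embedding model has never seen.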
Improve post-retrieval
"increase the relevancy of the documents after they are extracted from the database...:
- reranking ... re-ordering the documents based on their alignment/relatedness with the query... (see the cross-encoder sketch after this list)
- instruct an LLM to post-process ... rerank the documents and filter out unimportant sections ... [can] induce hallucinations"
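A common way to implement the reranking bullet is a cross-encoder, which scores each (query, document) pair jointly instead of comparing independent embeddings; a minimal sketch, assuming sentence-transformers and a public checkpoint chosen for illustration.

```python
# Reranking sketch: score (query, document) pairs with a cross-encoder
# and keep the highest-scoring documents.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, docs: list[str], top_k: int = 3) -> list[str]:
    scores = reranker.predict([(query, d) for d in docs])
    ranked = sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]
```

Unlike the LLM-as-postprocessor option, a cross-encoder cannot hallucinate: it only reorders what it is given.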
Generation
"Once the documents are retrieved and processed ... they are passed to the LLM as a context to generate the answer... enhance answer generation:
- "Define a system prompt": define how you want the system to behave - e.g. "write in a particular style... generate structured outputs" when that matches the use case (see the sketch after this list)
- "Include a few shot examples in your system prompt... [eg] add some input/output example pairs"
- use Chain-of-Thought prompting to invoke additional steps: "reasoning, summing up results or even taking a step back"
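Putting the system prompt and few-shot advice together, a minimal sketch assuming the openai client; the prompt wording, the example pair and the model name are all illustrative.

```python
# Generation sketch: a system prompt defines behaviour and output format,
# one few-shot pair anchors the expected style, then the real query is
# sent together with the retrieved context.
from openai import OpenAI

client = OpenAI()

def generate_answer(query: str, context: str) -> str:
    messages = [
        {"role": "system", "content": (
            "You are a support assistant. Answer only from the provided "
            "context; if it is insufficient, say so. Reply in short bullet points."
        )},
        # Few-shot example pair (illustrative).
        {"role": "user", "content": "Context: Password resets live under "
                                    "Settings > Account.\nQuestion: How do I reset my password?"},
        {"role": "assistant", "content": "- Open Settings > Account\n- Choose Reset password"},
        # The real query with its retrieved context.
        {"role": "user", "content": f"Context: {context}\nQuestion: {query}"},
    ]
    resp = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
    return resp.choices[0].message.content
```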
"Multiple frameworks perform these tasks... DSPY. It covers all these functionalities and provides an interesting paradigm to optimize prompt engineering."