Introduction
RAG-based applications have so many components and moving parts that it can seem impossible to optimize them or even know where to start. Add to that how fast the field changes, and it becomes hard to keep up. So I’ve gathered a few ideas over time for improving RAG-based applications, drawn from research papers and from implementations I’ve deployed in the past.
Note
The list will keep changing as I learn/implement new things
Ideas
- Metadata filtering is key for good RAG apps (see the metadata-filtering sketch after this list)
- For an MVP, it is a good idea to use both bi-encoders and full-text search such as TF-IDF or BM25 and combine them (see the hybrid search sketch after this list)
- ColBERT reranker is great and less sensitive to chunking
- Keep at least 50 characters of overlap between chunks when splitting so context isn’t cut off (see the chunking sketch after this list)
- If you have the data and compute, fine-tune both encoders whenever you can
- Use `sentence-transformers` to fine-tune embedding models (see the triplet-loss sketch after this list)
- We typically use triplet loss, where for each query we have positive and negative examples. We want the negatives to be hard negatives -> very close to the positive examples, so the model can learn to differentiate between them
- Large/new LLMs are not necessarily good embedding models and may not be worth the latency. Models with ~1B parameters are enough for most cases
- Challenges with embedding models:
- They may not transfer to your domain
- The vocabulary is fixed to whatever the model was trained with
- Because a chunk/doc is represented by a single vector that combines all of its tokens, the output vector may dilute the meaning, especially for long texts -> be careful about your chunking strategy
- Always start with a baseline such as BM25 (Best Match 25)
- Build your own gold dataset and check its correlation with synthetic datasets generated by LLMs
- Chunking beyond 256 tokens can hurt high-precision search because it dilutes the vector representation; many embedding models, such as BERT-based encoders, were not trained on long contexts
- Feedback on how users like the app is key to guiding where we should focus our efforts to improve it
- Satisfaction ratings such as “How did we do today” or “Did we answer your question”
- Monitor the cosine similarity between query embeddings and retrieved-doc embeddings, as well as the reranking scores coming from the reranker (e.g. Cohere) (see the monitoring sketch after this list)
- Cluster questions into topics using tools such as LDA or BERTopic, and focus on the largest topics (by count) with the lowest mean cosine similarity and feedback scores
- We have two kinds of topics:
- Content topics: topics for which we don’t have an inventory of documents
- Capability topics: topics the reader will never be able to answer unless we capture them in our docs and doc metadata and include them in the prompt. For example, “Who last updated the pricing document” is asking about the last-modified date/person
- Build a classifier to classify questions in real time for better observability and to react faster to sudden changes in usage (see the classifier sketch after this list)
- Generate synthetic data (questions) for topics we’re not doing a great job on, and evaluate new improvements against the generated questions (see the question-generation sketch after this list)
- This can be done by giving a decent LLM a random chunk from docs that belong to the topics we’re trying to improve and asking it to generate questions
- We can use an LLM to extract metadata about docs/objects
- LanceDB is a good vector database for small-scale workloads
- BM25 (full-text search) outperforms similarity search when questions are just searching for file names; on other queries it may perform similarly to a similarity-search baseline
- It is always helpful to include BM25
- We can do citations through prompting, by attaching IDs to chunks (see the citation sketch after this list)
- Fine-tuning the embedding model is key for domain-specific RAG
- With the recent increase in context window sizes, a chunk size of around 800 tokens with 30% overlap is recommended
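Sketches
The bullets on metadata filtering and LanceDB can be combined into one small example: store metadata fields alongside each chunk’s vector and filter on them at query time. This is a minimal sketch, assuming a sentence-transformers bi-encoder; the table name, fields, and filter expression are made up for illustration.
```python
import lancedb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed bi-encoder

# Toy documents with metadata attached to each chunk
docs = [
    {"text": "Q3 pricing changes for the enterprise tier", "source": "pricing.md", "year": 2024},
    {"text": "Onboarding guide for new engineers", "source": "onboarding.md", "year": 2023},
]
data = [{**d, "vector": model.encode(d["text"]).tolist()} for d in docs]

db = lancedb.connect("./rag-demo")  # local, file-based database
table = db.create_table("chunks", data=data, mode="overwrite")

# Vector search restricted by metadata: filter on the column, then rank by similarity
query_vec = model.encode("What changed in pricing this year?").tolist()
hits = (
    table.search(query_vec)
    .where("year = 2024")  # SQL-like metadata filter
    .limit(3)
    .to_list()
)
for h in hits:
    print(h["source"], "->", h["text"])
```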
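For the hybrid MVP setup, one common way to combine BM25 and a bi-encoder is reciprocal rank fusion, which avoids having to tune score weights. A sketch, with an assumed model name and a toy corpus:
```python
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

corpus = [
    "How to rotate API keys in the admin console",
    "Pricing tiers and billing cycles explained",
    "Incident runbook for database failover",
]
query = "reset my api key"

# Lexical side: BM25 over whitespace-tokenized, lowercased text
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])
bm25_scores = bm25.get_scores(query.lower().split())
bm25_rank = sorted(range(len(corpus)), key=lambda i: -bm25_scores[i])

# Semantic side: bi-encoder cosine similarity
model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model
doc_emb = model.encode(corpus, convert_to_tensor=True)
query_emb = model.encode(query, convert_to_tensor=True)
sims = util.cos_sim(query_emb, doc_emb)[0]
dense_rank = sorted(range(len(corpus)), key=lambda i: -float(sims[i]))

# Reciprocal rank fusion: score = sum over rankings of 1 / (k + rank)
def rrf(rankings, k=60):
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

for doc_id in rrf([bm25_rank, dense_rank]):
    print(corpus[doc_id])
```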
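For chunking with overlap, here is a rough word-based splitter reflecting the numbers mentioned above (~800-unit chunks with 30% overlap, never less than ~50 characters of overlap). Word counts are only a proxy for tokens; swap in a real tokenizer if you need exact sizes.
```python
def chunk_text(text: str, chunk_size: int = 800, overlap_ratio: float = 0.3) -> list[str]:
    """Split text into word-based chunks where consecutive chunks share overlap_ratio of their words."""
    words = text.split()
    step = max(1, int(chunk_size * (1 - overlap_ratio)))  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the document
    return chunks

doc = " ".join(f"token{i}" for i in range(2000))  # stand-in for a long document
chunks = chunk_text(doc)
print(len(chunks), "chunks; each overlaps the previous by ~30%")
```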
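For fine-tuning the embedding model with triplet loss, a sketch using the classic sentence-transformers InputExample/model.fit API (newer releases also offer a Trainer-based path). The triplets and base model are placeholders; in practice you would mine hard negatives from your own retrieval logs or gold data.
```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed base model

# Each example: (anchor/query, positive chunk, hard negative chunk).
# Hard negatives should be close to the positive so the model learns fine distinctions.
train_examples = [
    InputExample(texts=[
        "How do I rotate my API key?",
        "API keys can be rotated from the security settings page.",
        "API usage limits are reset at the start of each billing cycle.",  # hard negative
    ]),
    # ... more triplets mined from your own data
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.TripletLoss(model=model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=100,
)
model.save("finetuned-bi-encoder")
```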
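For the monitoring and clustering idea, a sketch that logs query-to-retrieved-chunk cosine similarity, clusters the questions with BERTopic, and reports the biggest topics with the lowest mean similarity (feedback scores would be aggregated the same way). The log structure and model names are assumptions, and the toy data is only there to make the snippet self-contained.
```python
from collections import defaultdict
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

# Pretend traffic log: each entry is a user question plus its top retrieved chunk.
# In practice this comes from your app's telemetry and is much larger.
logs = (
    [{"question": f"how do I rotate api key number {i}",
      "top_chunk": "API keys can be rotated from the security settings page."} for i in range(20)]
    + [{"question": f"who updated the pricing doc in {2000 + i}",
        "top_chunk": "Pricing tiers and billing cycles explained."} for i in range(20)]
)

questions = [log["question"] for log in logs]
q_emb = model.encode(questions, convert_to_tensor=True)
c_emb = model.encode([log["top_chunk"] for log in logs], convert_to_tensor=True)
cosines = [float(util.cos_sim(q, c)) for q, c in zip(q_emb, c_emb)]

# Cluster the questions into topics
topic_model = BERTopic(min_topic_size=5)
topics, _ = topic_model.fit_transform(questions)

# Mean retrieval similarity per topic: large topics with low means are where to focus
per_topic = defaultdict(list)
for topic, cos in zip(topics, cosines):
    per_topic[topic].append(cos)
for topic, values in sorted(per_topic.items(), key=lambda kv: len(kv[1]), reverse=True):
    print(f"topic {topic}: count={len(values)} mean_cosine={sum(values) / len(values):.3f}")
```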
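For real-time question classification, a lightweight approach is to reuse the embedding model as a feature extractor and train a simple scikit-learn classifier on a labeled sample of questions. The labels and training examples below are made up.
```python
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

# Tiny labeled set; in practice, label a sample of real user questions per topic
train_questions = [
    "how much does the enterprise plan cost",
    "what is the price per seat",
    "how do I reset my password",
    "I can't log in to my account",
]
train_labels = ["pricing", "pricing", "auth", "auth"]

clf = LogisticRegression(max_iter=1000)
clf.fit(model.encode(train_questions), train_labels)

# At serving time: classify each incoming question and emit it as a metric/tag
def classify(question: str) -> str:
    return clf.predict(model.encode([question]))[0]

print(classify("is there a discount for annual billing"))  # expected: "pricing"
```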
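For synthetic question generation, a sketch of the “random chunk -> questions” loop. `call_llm` is a hypothetical helper standing in for whatever LLM client you use, and the prompt wording is illustrative.
```python
import random

def call_llm(prompt: str) -> str:
    """Hypothetical helper -- replace with a call to your actual LLM client."""
    raise NotImplementedError

def generate_questions(chunks_by_topic: dict[str, list[str]], topic: str, n: int = 5) -> list[str]:
    """Pick a random chunk from an underperforming topic and ask an LLM for questions about it."""
    chunk = random.choice(chunks_by_topic[topic])
    prompt = (
        "You are generating evaluation data for a retrieval system.\n"
        f"Write {n} realistic user questions that can be answered ONLY from this passage:\n\n"
        f"{chunk}\n\n"
        "Return one question per line."
    )
    return [line.strip() for line in call_llm(prompt).splitlines() if line.strip()]
```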
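For citations through prompting, a sketch that attaches an ID to each retrieved chunk and instructs the model to cite those IDs; the answer can then be parsed for `[doc-*]` markers and mapped back to sources. The IDs and prompt wording are illustrative.
```python
def build_cited_prompt(question: str, chunks: list[dict]) -> str:
    """Attach an ID to each chunk and ask the model to cite the IDs it used."""
    context = "\n\n".join(f"[{c['id']}] {c['text']}" for c in chunks)
    return (
        "Answer the question using ONLY the sources below. "
        "After every claim, cite the source ID in square brackets, e.g. [doc-2].\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

chunks = [
    {"id": "doc-1", "text": "The pricing document was last updated by Dana on 2024-03-02."},
    {"id": "doc-2", "text": "Enterprise plans include SSO and audit logs."},
]
prompt = build_cited_prompt("Who last updated the pricing document?", chunks)
print(prompt)  # feed this to your LLM; parse the [doc-*] IDs out of the answer for citations
```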