Building a RAG System with Local Vector DB
Enhance LLM responses with your own knowledge base
Learn how to implement a Retrieval-Augmented Generation (RAG) system using a local vector database to provide your AI with access to custom data.
Introduction to RAG Systems
Retrieval-Augmented Generation (RAG) is a powerful approach that combines the strengths of retrieval-based and generation-based AI systems. RAG enhances large language models by providing them with relevant information retrieved from a knowledge base before generating responses.
This approach offers several advantages:
- Provides access to specific knowledge not in the model's training data
- Reduces hallucinations by grounding responses in factual information
- Enables up-to-date responses without retraining the model
- Allows for domain-specific knowledge integration
RAG Architecture Overview
A typical RAG system consists of three main components; a minimal end-to-end sketch follows the list:
- Document Processing Pipeline: Ingests, chunks, and embeds documents
- Vector Database: Stores and enables semantic search of document embeddings
- Retrieval-Enhanced Generation: Combines retrieved context with user queries to generate responses
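To make that flow concrete before introducing any tooling, here is a dependency-free sketch of the three stages. The bag-of-words "embeddings" and in-memory list are toy stand-ins for the real embedding model and vector database set up later in this guide:

```python
from collections import Counter
from math import sqrt

# 1. Document processing: "embed" each chunk. A bag-of-words count stands in
#    for a real sentence-transformer embedding.
def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

chunks = [
    "Paris is the capital of France.",
    "The Eiffel Tower is located in Paris.",
    "Mount Everest is the tallest mountain on Earth.",
]

# 2. Vector database: here, just a list of (embedding, text) pairs.
index = [(embed(c), c) for c in chunks]

def retrieve(query: str, k: int = 2) -> list[str]:
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[0]), reverse=True)
    return [text for _, text in ranked[:k]]

# 3. Retrieval-enhanced generation: stuff the retrieved context into the prompt
#    that would be sent to the language model.
query = "What is the capital of France?"
context = "\n".join(retrieve(query))
prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
print(prompt)  # In the real system, this prompt goes to the LLM.
```

The rest of this guide replaces each toy stage with a production-grade component.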
Setting Up Your Environment
Let's start by setting up our development environment. We'll use Python with the following libraries:
```bash
# Install required packages
pip install langchain chromadb sentence-transformers llama-cpp-python pypdf
```
Document Processing Pipeline
Document Loading
First, we need to load documents from various sources. LangChain provides document loaders for many file types:
```python
from langchain.document_loaders import PyPDFLoader, DirectoryLoader, TextLoader

# Load PDF documents
pdf_loader = PyPDFLoader("path/to/your/document.pdf")
pdf_documents = pdf_loader.load()

# Load text documents from a directory
text_loader = DirectoryLoader("path/to/your/text/files", glob="**/*.txt", loader_cls=TextLoader)
text_documents = text_loader.load()

# Combine all documents
documents = pdf_documents + text_documents
```
Text Chunking
Next, we need to split the documents into manageable chunks for embedding:
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Create a text splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
)

# Split documents into chunks
chunks = text_splitter.split_documents(documents)
```
Embedding Generation
Now, we'll convert the text chunks into vector embeddings:
```python
from langchain.embeddings import HuggingFaceEmbeddings

# Initialize the embedding model
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    model_kwargs={"device": "cpu"}
)

# Note: For production use cases, you might want to use a more powerful model
# such as "sentence-transformers/all-mpnet-base-v2"
```
Setting Up a Local Vector Database
We'll use Chroma, a lightweight vector database that can run locally:
```python
from langchain.vectorstores import Chroma

# Define the directory to store the vector database
persist_directory = "chroma_db"

# Create and persist the vector database
vectordb = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory=persist_directory
)

# Persist the database to disk
vectordb.persist()
```
Building the Retrieval System
Now, let's create a retrieval system that can find relevant documents based on a query:
```python
# Load the persisted database
vectordb = Chroma(
    persist_directory=persist_directory,
    embedding_function=embeddings
)

# Create a retriever
retriever = vectordb.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 5}  # Return the top 5 most similar chunks
)

# Test the retriever
query = "What is the capital of France?"
retrieved_docs = retriever.get_relevant_documents(query)

for i, doc in enumerate(retrieved_docs):
    print(f"Document {i+1}:\n{doc.page_content}\n")
```
Integrating with a Language Model
Now, let's connect our retrieval system with a language model to create a complete RAG system:
```python
from langchain.llms import LlamaCpp
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

# Initialize the language model
llm = LlamaCpp(
    model_path="path/to/your/llama-2-7b-chat.gguf",
    temperature=0.2,
    max_tokens=2000,
    top_p=0.95,
    n_ctx=4096,
    verbose=False
)

# Create a custom prompt template
template = """
You are a helpful AI assistant. Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say you don't know. Don't try to make up an answer.

Context: {context}

Question: {question}

Answer:
"""

prompt = PromptTemplate(
    template=template,
    input_variables=["context", "question"]
)

# Create the RAG chain
rag_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    chain_type_kwargs={"prompt": prompt}
)

# Test the RAG system
query = "What is the capital of France?"
response = rag_chain.run(query)
print(response)
```
Building a Simple RAG API
Let's create a simple API using FastAPI to serve our RAG system:
```python
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import uvicorn

app = FastAPI(title="Local RAG System API")

class Query(BaseModel):
    text: str

@app.post("/ask")
async def ask(query: Query):
    try:
        response = rag_chain.run(query.text)
        return {"response": response}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

@app.get("/health")
async def health_check():
    return {"status": "healthy"}

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
```
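Once the server is running, you can query it from any HTTP client. Here is a small sketch using the requests library, assuming the API is reachable on localhost port 8000 and uses the /ask route defined above:

```python
import requests

# Ask the RAG API a question (assumes the server above is running locally on port 8000)
resp = requests.post(
    "http://localhost:8000/ask",
    json={"text": "What is the capital of France?"},
    timeout=120,  # local LLM inference can be slow
)
resp.raise_for_status()
print(resp.json()["response"])
```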
Advanced RAG Techniques
To improve your RAG system, consider implementing these advanced techniques:
1. Query Transformation
Rewrite user queries to make them more effective for retrieval:
```python
from langchain.retrievers.multi_query import MultiQueryRetriever

# Create a multi-query retriever
multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=retriever,
    llm=llm
)

# This will generate multiple queries from the original query
# and combine the results
```
2. Hybrid Search
Combine keyword-based (BM25) and semantic search for better results:
```python
# Many vector databases support hybrid search. Note that Chroma does not expose
# hybrid search through as_retriever, so this snippet is illustrative of a store
# such as Weaviate:
retriever = vectordb.as_retriever(
    search_type="hybrid",
    search_kwargs={
        "k": 5,
        "alpha": 0.5  # Balance between keyword and semantic search
    }
)
```
3. Re-ranking
Re-rank retrieved documents to improve relevance:
```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

# Create a compressor that uses the LLM to extract relevant information
compressor = LLMChainExtractor.from_llm(llm)

# Create a compression retriever
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=retriever
)
```
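To put the re-ranked results to work, you can rebuild the QA chain on top of the compression retriever. A minimal sketch reusing the llm and prompt defined earlier:

```python
# Rebuild the RAG chain on the compression retriever so answers are generated
# from the re-ranked, compressed context.
rag_chain_reranked = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=compression_retriever,
    chain_type_kwargs={"prompt": prompt}
)

print(rag_chain_reranked.run("What is the capital of France?"))
```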
Evaluating Your RAG System
It's important to evaluate your RAG system to ensure it's performing well.
Retrieval Evaluation
Measure how well your system retrieves relevant documents; a small sketch of these metrics follows the list:
- Precision@K: Percentage of retrieved documents that are relevant
- Recall@K: Percentage of relevant documents that are retrieved
- Mean Reciprocal Rank (MRR): Measures where the first relevant document appears
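Given a labeled set of queries with known relevant chunks, these metrics are straightforward to compute by hand. A minimal sketch, where the document IDs are hypothetical placeholders for however you identify chunks (for example, source file plus chunk index):

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    # Fraction of the top-k retrieved documents that are relevant
    return sum(1 for doc_id in retrieved[:k] if doc_id in relevant) / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    # Fraction of all relevant documents found in the top-k results
    return sum(1 for doc_id in retrieved[:k] if doc_id in relevant) / len(relevant)

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    # 1 / rank of the first relevant document; MRR averages this over all queries
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

# Example: one query with hypothetical retriever output and ground truth
retrieved_ids = ["doc3", "doc7", "doc1", "doc9", "doc2"]
relevant_ids = {"doc1", "doc2"}

print(precision_at_k(retrieved_ids, relevant_ids, k=5))  # 0.4
print(recall_at_k(retrieved_ids, relevant_ids, k=5))     # 1.0
print(reciprocal_rank(retrieved_ids, relevant_ids))      # 0.333... (first hit at rank 3)
```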
Generation Evaluation
Evaluate the quality of generated responses (a sketch of the automated-metrics approach follows this list):
- Human Evaluation: Have experts rate responses for accuracy and helpfulness
- Automated Metrics: Use metrics like ROUGE or BLEU to compare responses against reference answers
- Hallucination Detection: Check if responses contain information not in the retrieved documents
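For the automated-metrics approach, here is a quick sketch that compares a generated answer against a reference answer using the rouge-score package (an assumption; any ROUGE or BLEU implementation will do):

```python
from rouge_score import rouge_scorer  # pip install rouge-score

reference = "Paris is the capital of France."          # hand-written reference answer
generated = rag_chain.run("What is the capital of France?")

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, generated)

# F-measure of longest-common-subsequence overlap with the reference, 0.0 to 1.0
print(scores["rougeL"].fmeasure)
```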
Conclusion
Building a RAG system with a local vector database gives you a powerful tool for enhancing LLM responses with custom knowledge. This approach allows you to create AI applications that can access specific information not available in the model's training data, reducing hallucinations and improving accuracy.
As you continue to develop your RAG system, consider experimenting with different embedding models, chunking strategies, and retrieval techniques to optimize performance for your specific use case.