Building a RAG System with Local Vector DB
Enhance LLM responses with your own knowledge base
Learn how to implement a Retrieval-Augmented Generation (RAG) system using a local vector database to provide your AI with access to custom data.
Introduction to RAG Systems
Retrieval-Augmented Generation (RAG) is a powerful approach that combines the strengths of retrieval-based and generation-based AI systems. RAG enhances large language models by providing them with relevant information retrieved from a knowledge base before generating responses.
This approach offers several advantages:
- Provides access to specific knowledge not in the model's training data
- Reduces hallucinations by grounding responses in factual information
- Enables up-to-date responses without retraining the model
- Allows for domain-specific knowledge integration
RAG Architecture Overview
A typical RAG system consists of three main components; a minimal end-to-end sketch follows the list:
- Document Processing Pipeline: Ingests, chunks, and embeds documents
- Vector Database: Stores and enables semantic search of document embeddings
- Retrieval-Enhanced Generation: Combines retrieved context with user queries to generate responses
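To make that flow concrete before introducing any tooling, here is a dependency-free sketch of the three stages. The bag-of-words "embeddings" and in-memory list are toy stand-ins for the real embedding model and vector database set up later in this guide:

```python
from collections import Counter
from math import sqrt

# 1. Document processing: "embed" each chunk. A bag-of-words count stands in
#    for a real sentence-transformer embedding.
def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

chunks = [
    "Paris is the capital of France.",
    "The Eiffel Tower is located in Paris.",
    "Mount Everest is the tallest mountain on Earth.",
]

# 2. Vector database: here, just a list of (embedding, text) pairs.
index = [(embed(c), c) for c in chunks]

def retrieve(query: str, k: int = 2) -> list[str]:
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[0]), reverse=True)
    return [text for _, text in ranked[:k]]

# 3. Retrieval-enhanced generation: stuff the retrieved context into the prompt
#    that would be sent to the language model.
query = "What is the capital of France?"
context = "\n".join(retrieve(query))
prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
print(prompt)  # In the real system, this prompt goes to the LLM.
```

The rest of this guide replaces each toy stage with a production-grade component.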
Setting Up Your Environment
Let's start by setting up our development environment. We'll use Python with the following libraries:
```bash
# Install required packages
pip install langchain chromadb sentence-transformers llama-cpp-python pypdf
```
Document Processing Pipeline
Document Loading
First, we need to load documents from various sources. LangChain provides document loaders for many file types:
```python
from langchain.document_loaders import PyPDFLoader, DirectoryLoader, TextLoader

# Load PDF documents
pdf_loader = PyPDFLoader("path/to/your/document.pdf")
pdf_documents = pdf_loader.load()

# Load text documents from a directory
text_loader = DirectoryLoader("path/to/your/text/files", glob="**/*.txt", loader_cls=TextLoader)
text_documents = text_loader.load()

# Combine all documents
documents = pdf_documents + text_documents
```
Text Chunking
Next, we need to split the documents into manageable chunks for embedding:
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Create a text splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
)

# Split documents into chunks
chunks = text_splitter.split_documents(documents)
```
Embedding Generation
Now, we'll convert the text chunks into vector embeddings:
```python
from langchain.embeddings import HuggingFaceEmbeddings

# Initialize the embedding model
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    model_kwargs={"device": "cpu"}
)

# Note: For production use cases, you might want to use a more powerful model
# such as "sentence-transformers/all-mpnet-base-v2"
```
Setting Up a Local Vector Database
We'll use Chroma, a lightweight vector database that can run locally:
```python
from langchain.vectorstores import Chroma

# Define the directory to store the vector database
persist_directory = "chroma_db"

# Create and persist the vector database
vectordb = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory=persist_directory
)

# Persist the database to disk
vectordb.persist()
```
Building the Retrieval System
Now, let's create a retrieval system that can find relevant documents based on a query:
```python
# Load the persisted database
vectordb = Chroma(
    persist_directory=persist_directory,
    embedding_function=embeddings
)

# Create a retriever
retriever = vectordb.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 5}  # Return the top 5 most similar chunks
)

# Test the retriever
query = "What is the capital of France?"
retrieved_docs = retriever.get_relevant_documents(query)

for i, doc in enumerate(retrieved_docs):
    print(f"Document {i+1}:\n{doc.page_content}\n")
```
Integrating with a Language Model
Now, let's connect our retrieval system with a language model to create a complete RAG system:
```python
from langchain.llms import LlamaCpp
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

# Initialize the language model
llm = LlamaCpp(
    model_path="path/to/your/llama-2-7b-chat.gguf",
    temperature=0.2,
    max_tokens=2000,
    top_p=0.95,
    n_ctx=4096,
    verbose=False
)

# Create a custom prompt template
template = """
You are a helpful AI assistant. Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say you don't know. Don't try to make up an answer.

Context: {context}

Question: {question}

Answer:
"""

prompt = PromptTemplate(
    template=template,
    input_variables=["context", "question"]
)

# Create the RAG chain
rag_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    chain_type_kwargs={"prompt": prompt}
)

# Test the RAG system
query = "What is the capital of France?"
response = rag_chain.run(query)
print(response)
```
Building a Simple RAG API
Let's create a simple API using FastAPI to serve our RAG system:
```python
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import uvicorn

app = FastAPI(title="Local RAG System API")

class Query(BaseModel):
    text: str

@app.post("/ask")
async def ask(query: Query):
    try:
        response = rag_chain.run(query.text)
        return {"response": response}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

@app.get("/health")
async def health_check():
    return {"status": "healthy"}

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
```
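Once the server is running, you can query it from any HTTP client. Here is a small sketch using the requests library, assuming the API is reachable on localhost port 8000 and uses the /ask route defined above:

```python
import requests

# Ask the RAG API a question (assumes the server above is running locally on port 8000)
resp = requests.post(
    "http://localhost:8000/ask",
    json={"text": "What is the capital of France?"},
    timeout=120,  # local LLM inference can be slow
)
resp.raise_for_status()
print(resp.json()["response"])
```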
Advanced RAG Techniques
To improve your RAG system, consider implementing these advanced techniques:
1. Query Transformation
Rewrite user queries to make them more effective for retrieval:
```python
from langchain.retrievers.multi_query import MultiQueryRetriever

# Create a multi-query retriever
multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=retriever,
    llm=llm
)

# This will generate multiple queries from the original query
# and combine the results
```
2. Hybrid Search
Combine keyword-based (BM25) and semantic search for better results:
```python
# Many vector databases support hybrid search. Note that Chroma does not expose
# hybrid search through as_retriever, so this snippet is illustrative of a store
# such as Weaviate:
retriever = vectordb.as_retriever(
    search_type="hybrid",
    search_kwargs={
        "k": 5,
        "alpha": 0.5  # Balance between keyword and semantic search
    }
)
```
3. Re-ranking
Re-rank retrieved documents to improve relevance:
```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

# Create a compressor that uses the LLM to extract relevant information
compressor = LLMChainExtractor.from_llm(llm)

# Create a compression retriever
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=retriever
)
```
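To put the re-ranked results to work, you can rebuild the QA chain on top of the compression retriever. A minimal sketch reusing the llm and prompt defined earlier:

```python
# Rebuild the RAG chain on the compression retriever so answers are generated
# from the re-ranked, compressed context.
rag_chain_reranked = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=compression_retriever,
    chain_type_kwargs={"prompt": prompt}
)

print(rag_chain_reranked.run("What is the capital of France?"))
```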
Evaluating Your RAG System
It's important to evaluate your RAG system to ensure it's performing well.
Retrieval Evaluation
Measure how well your system retrieves relevant documents; a small sketch of these metrics follows the list:
- Precision@K: Percentage of retrieved documents that are relevant
- Recall@K: Percentage of relevant documents that are retrieved
- Mean Reciprocal Rank (MRR): Measures where the first relevant document appears
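Given a labeled set of queries with known relevant chunks, these metrics are straightforward to compute by hand. A minimal sketch, where the document IDs are hypothetical placeholders for however you identify chunks (for example, source file plus chunk index):

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    # Fraction of the top-k retrieved documents that are relevant
    return sum(1 for doc_id in retrieved[:k] if doc_id in relevant) / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    # Fraction of all relevant documents found in the top-k results
    return sum(1 for doc_id in retrieved[:k] if doc_id in relevant) / len(relevant)

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    # 1 / rank of the first relevant document; MRR averages this over all queries
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

# Example: one query with hypothetical retriever output and ground truth
retrieved_ids = ["doc3", "doc7", "doc1", "doc9", "doc2"]
relevant_ids = {"doc1", "doc2"}

print(precision_at_k(retrieved_ids, relevant_ids, k=5))  # 0.4
print(recall_at_k(retrieved_ids, relevant_ids, k=5))     # 1.0
print(reciprocal_rank(retrieved_ids, relevant_ids))      # 0.333... (first hit at rank 3)
```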
Generation Evaluation
Evaluate the quality of generated responses (a sketch of the automated-metrics approach follows this list):
- Human Evaluation: Have experts rate responses for accuracy and helpfulness
- Automated Metrics: Use metrics like ROUGE or BLEU to compare responses against reference answers
- Hallucination Detection: Check if responses contain information not in the retrieved documents
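For the automated-metrics approach, here is a quick sketch that compares a generated answer against a reference answer using the rouge-score package (an assumption; any ROUGE or BLEU implementation will do):

```python
from rouge_score import rouge_scorer  # pip install rouge-score

reference = "Paris is the capital of France."          # hand-written reference answer
generated = rag_chain.run("What is the capital of France?")

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, generated)

# F-measure of longest-common-subsequence overlap with the reference, 0.0 to 1.0
print(scores["rougeL"].fmeasure)
```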
Conclusion
Building a RAG system with a local vector database gives you a powerful tool for enhancing LLM responses with custom knowledge. This approach allows you to create AI applications that can access specific information not available in the model's training data, reducing hallucinations and improving accuracy.
As you continue to develop your RAG system, consider experimenting with different embedding models, chunking strategies, and retrieval techniques to optimize performance for your specific use case.