How to Build a RAG System in 10 Lines of Python without Frameworks

Unlock the power of RAG (Retrieval Augmented Generation) with this 10-line Python code tutorial. Dive into information retrieval, build your own document system, and leverage LLMs for robust responses - all without relying on complex frameworks.

February 24, 2025

Discover how to build a powerful retrieval-augmented generation (RAG) system from scratch in just 10 lines of Python code, without relying on any external frameworks. This concise yet comprehensive approach provides you with a deep understanding of the core components, empowering you to create robust and customizable AI-powered solutions for your document-based applications.

Understand the Concept of Retrieval Augmented Generation (RAG)

Retrieval Augmented Generation (RAG) is a technique for answering questions over your own documents: relevant passages are retrieved from those documents and handed to a language model to ground its answer. It involves a two-step process:

  1. Knowledge Base Creation: The documents are chunked into smaller sub-documents, and embeddings are computed for each chunk. These embeddings and the original chunks are stored, typically in a vector store.

  2. Retrieval and Generation: When a new user query comes in, an embedding is computed for the query. The most relevant chunks are retrieved by comparing the query embedding with the stored chunk embeddings. The retrieved chunks are then combined with the original query and fed into a large language model (LLM) to generate the final response.

This setup is called a "retrieval-augmented generation" pipeline, as the LLM's generation is augmented with the relevant information retrieved from the knowledge base.

While there are frameworks like Langchain and LlamaIndex that simplify the implementation of RAG pipelines, the core components can be built using just Python, an embedding model, and an LLM. This approach provides a better understanding of the underlying concepts and allows for more flexibility in customizing the pipeline.
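
To make the shape of the pipeline concrete before walking through each step, here is a condensed sketch of the whole flow. It assumes the sentence-transformers and openai packages are installed, an OPENAI_API_KEY environment variable is set, and the document has been saved locally as document.txt (an illustrative filename); each line is explained in the sections below.

import numpy as np
from openai import OpenAI
from sentence_transformers import SentenceTransformer

text = open("document.txt", encoding="utf-8").read()                     # load the raw document
chunks = [p.strip() for p in text.split('\n') if p.strip()]              # chunk into paragraphs
model = SentenceTransformer('all-mpnet-base-v2')                         # load the embedding model
chunk_embeddings = model.encode(chunks, normalize_embeddings=True)       # embed the chunks
query = "What is the capital of France?"
query_embedding = model.encode([query], normalize_embeddings=True)[0]    # embed the query
top_chunks = [chunks[i] for i in np.argsort(chunk_embeddings @ query_embedding)[::-1][:3]]  # retrieve top 3 chunks
prompt = "Use the following context to answer the question.\n\nContext:\n" + "\n\n".join(top_chunks) + f"\n\nQuestion: {query}"
response = OpenAI().chat.completions.create(model="gpt-4", messages=[{"role": "user", "content": prompt}])
print(response.choices[0].message.content)                               # generate and print the answer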

Learn to Chunk Documents into Paragraphs

When building a retrieval-augmented generation (RAG) system, one of the key steps is to chunk the input documents into smaller, more manageable pieces. In this example, we are chunking the input document (a Wikipedia article) into paragraphs.

The rationale behind this approach is that paragraphs often represent logical units of information that can be effectively retrieved and used to augment the user's query. By breaking down the document in this way, we can better identify the most relevant sections to include in the final response.

The code for this step is as follows:

# Chunk the document into paragraphs
chunks = [paragraph.strip() for paragraph in text.split('\n') if paragraph.strip()]

Here, we use a simple approach of splitting the input text on newline characters to extract the individual paragraphs. We then strip any leading or trailing whitespace from each paragraph and discard empty lines, so that each chunk is a clean paragraph of text.

The resulting chunks list contains the individual paragraphs, which can then be embedded and stored as our knowledge base (in this minimal example, simply an in-memory list and array rather than a dedicated vector store). This allows us to efficiently retrieve the most relevant paragraphs based on the user's query and include them in the final response generated by the language model.
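
For completeness, the text variable above holds the raw article content. A minimal way to obtain it, assuming the Wikipedia article has been saved locally as a plain-text file (the filename is illustrative):

# Read the raw document text from a local file
with open("document.txt", encoding="utf-8") as f:
    text = f.read()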

Embed Documents and User Queries Using Sentence Transformers

To embed the document chunks and the user query, we will use the Sentence Transformers library. This library provides pre-trained models that can generate high-quality embeddings for text.

First, we load the embedding model:

from sentence_transformers import SentenceTransformer
embedding_model = SentenceTransformer('all-mpnet-base-v2')

Next, we compute the embeddings for the document chunks:

chunk_embeddings = embedding_model.encode(chunks, normalize_embeddings=True)

Here, chunks is the list of text chunks from the document. The normalize_embeddings=True option ensures that the embeddings have unit length, so the dot product between two embeddings equals their cosine similarity, which is exactly what we need for the retrieval step.

To get the embedding for the user query, we simply run:

query_embedding = embedding_model.encode([user_query], normalize_embeddings=True)[0]

Now, we have the embeddings for the document chunks and the user query, which we can use for the retrieval step.
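
As a quick, optional sanity check (assuming numpy is available), you can inspect the shapes and confirm the normalization; all-mpnet-base-v2 produces 768-dimensional vectors:

import numpy as np

print(chunk_embeddings.shape)            # (number of chunks, 768)
print(query_embedding.shape)             # (768,)
print(np.linalg.norm(query_embedding))   # ~1.0, since the embeddings are normalized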

Retrieve Relevant Chunks Based on Cosine Similarity

To retrieve the most relevant chunks for a user query, we compute the cosine similarity between the query embedding and the embeddings of the document chunks:

  1. Compute the embedding of the user query using the same embedding model as the document chunks.
  2. Calculate the dot product between the query embedding and each of the document chunk embeddings. Because the embeddings are normalized, these dot products are the cosine similarity scores.
  3. Sort the document chunks by their similarity scores and select the top K most relevant chunks.
  4. Retrieve the text content of the selected chunks to use as the context for the language model.

This simple approach lets us efficiently pull the most relevant information from the document collection to augment the user's query before passing it to the language model for generation, as sketched below.
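
A minimal sketch of these four steps, reusing the chunk_embeddings and query_embedding computed earlier and assuming numpy is available (the choice of top_k = 3 is illustrative):

import numpy as np

# Because the embeddings are normalized, the dot product is the cosine similarity
similarities = chunk_embeddings @ query_embedding

# Indices of the top K chunks, highest similarity first
top_k = 3
top_chunk_ids = np.argsort(similarities)[::-1][:top_k]

# Text content of the selected chunks, used as context for the language model
top_chunks = [chunks[i] for i in top_chunk_ids]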

Augment User Query with Retrieved Chunks and Generate Response Using OpenAI GPT-4

To generate a response using the retrieved chunks and the user query, we first create a prompt that includes the relevant chunks and the user query. We then use the OpenAI GPT-4 model to generate a response based on this prompt.

Here's the code:

# Get the user query
user_query = "What is the capital of France?"

# Retrieve the most relevant chunks based on the user query
# (indices of the highest-scoring chunks, as determined in the retrieval step)
top_chunk_ids = [6, 8, 5]
top_chunks = [chunks[i] for i in top_chunk_ids]

# Create the prompt
prompt = "Use the following context to answer the question at the end. If you don't know the answer, say that you don't know and do not try to make up an answer.\n\nContext:\n"
for chunk in top_chunks:
    prompt += chunk + "\n\n"
prompt += f"\nQuestion: {user_query}"

# Generate the response using OpenAI GPT-4
from openai import OpenAI
openai_client = OpenAI()  # reads the OPENAI_API_KEY environment variable

response = openai_client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "user", "content": prompt}
    ],
    max_tokens=1024,
    n=1,
    stop=None,
    temperature=0.7,
).choices[0].message.content

print(response)

In this code, we take the three chunks with the highest similarity scores from the retrieval step, build a prompt that combines those chunks with the user query, and then ask the OpenAI GPT-4 model to generate a response based on that prompt.

The generated response will be a concise and relevant answer to the user's query, leveraging the information from the retrieved chunks.

Conclusion

In this tutorial, we have learned how to build a complete chat system with document retrieval in just 10 lines of Python code. We covered the key components of a Retrieval Augmented Generation (RAG) pipeline, including knowledge base creation, document chunking, embedding, retrieval, and generation using a large language model.

The key takeaways are:

  1. You can implement a basic RAG pipeline without the need for complex frameworks like Langchain or LlamaIndex. Pure Python and a few libraries are sufficient.
  2. Chunking your documents based on the structure (e.g., paragraphs) is a simple yet effective strategy for most use cases.
  3. Embedding the document chunks and the user query, then computing similarity scores, allows you to retrieve the most relevant information to augment the user's query.
  4. Integrating the retrieved chunks with the user's query and feeding it to a large language model enables the generation of a relevant and informative response.

While this example provides a solid foundation, there are many opportunities to build more robust and advanced RAG systems. Frameworks like Langchain and LlamaIndex can be helpful when integrating with various vector stores and language models. However, starting with a pure Python implementation can help you better understand the core concepts and components of a RAG pipeline.

If you're interested in exploring more advanced RAG techniques, I recommend checking out my course "RAG Beyond Basics," which delves deeper into building complex, production-ready RAG systems.

FAQ