Supercharge Your RAG Pipeline with DeepSeek's Reasoning Model

Discover how a reasoning model like DeepSeek R1 can supercharge your RAG pipeline. Learn how to build a simple RAG pipeline without external frameworks, leveraging the fast inference API from SambaNova Cloud. Explore techniques for knowledge base creation, retrieval, and generation powered by a powerful reasoning model.

April 15, 2025


Unlock the power of your RAG pipeline with DeepSeek's Reasoning Model. This blog post explores how to seamlessly integrate a reasoning model into your knowledge retrieval and generation workflow, delivering more relevant and coherent responses to user queries.

Supercharge Your RAG Pipeline with DeepSeek's Reasoning Model

In this section, we will explore how to leverage a reasoning model like DeepSeek R1 to supercharge your Retrieval-Augmented Generation (RAG) pipeline. We will build a simple RAG pipeline without any external frameworks, but powered by a reasoning LLM.

Our RAG pipeline will have two main components: knowledge base creation and generation. During the knowledge base creation step, we will:

  1. Convert PDF files into smaller chunks.
  2. Compute embeddings for the chunks using an embedding model.
  3. Store the embeddings and chunks in a vector store.

In the generation step, we will:

  1. Retrieve the relevant chunks based on the user's query.
  2. Pass the relevant chunks and the original query to a reasoning model to generate the final response.

We will use the open-source Faiss library as our vector store and the DeepSeek R1 model hosted by SambaNova Cloud for the reasoning step. SambaNova Cloud provides fast inference for the DeepSeek R1 model, allowing us to generate responses quickly.

The key advantage of using a reasoning model like DeepSeek R1 is its ability to perform chain-of-thought reasoning to determine the most relevant chunks for the user's query, eliminating the need for a separate reranking step.

We will demonstrate how the reasoning model can analyze the provided context, identify the most relevant documents, and generate a concise and accurate response to the user's query. This approach can be further extended to power more advanced agent-based systems that require reasoning and evaluation capabilities.

Building a Simple RAG Pipeline

In this section, we will build a simple Retrieval-Augmented Generation (RAG) pipeline without using any external frameworks like LangChain or LlamaIndex. Instead, we will leverage a reasoning large language model (LLM) to power our pipeline.

The pipeline will consist of two main components:

  1. Knowledge Base Creation: We will take a set of PDF files, extract the text, chunk the content, and store the chunks along with their embeddings in a vector store.

  2. Generation: When a user provides a query, we will retrieve the most relevant chunks from the vector store, pass them along with the original query to a reasoning LLM, and generate the final response.

To demonstrate this, we will use three research papers on diverse topics, including COVID-19 detection, bite weight estimation, and chewing bout detection. We will leverage an open-source embedding model and the Faiss vector store to create the knowledge base. For the generation step, we will use the powerful DeepSeek R1 reasoning model provided by SambaNova Cloud.

The key advantages of using a reasoning LLM in this pipeline are:

  1. Improved Relevance: The reasoning model can better determine the relevance of the retrieved chunks to the user's query, effectively acting as a re-ranking step.
  2. Coherent Responses: The reasoning model can generate more coherent and contextual responses by considering the relationships between the retrieved chunks.
  3. Faster Inference: SambaNova Cloud provides fast inference for the DeepSeek R1 model, allowing for efficient real-time responses.

By the end of this section, you will have a solid understanding of how to build a simple yet powerful RAG pipeline using a reasoning LLM, and how to leverage the capabilities of SambaNova Cloud to accelerate your development.

Knowledge Base Creation Step

The knowledge base creation step consists of the following key steps:

  1. Load Documents: The code first loads the documents (PDF files) from a Data folder. It extracts the text from each page, including the file name as metadata.

  2. Chunk Documents: The text from each document is then chunked into smaller segments of 500 characters, with a 50-character overlap between chunks. This is a simple chunking strategy, but more advanced techniques like semantic chunking can also be used.

  3. Compute Embeddings: The chunked text segments are then passed through an embedding model (in this case, an open-source model from Hugging Face) to compute embeddings for each chunk.

  4. Store in Vector Store: The computed embeddings and the original text chunks are then stored in a vector store, in this case, using the Faiss library. This allows for efficient retrieval of relevant chunks based on the query embeddings.

  5. Save to JSON: The original text chunks are also saved to a JSON file, which will be used in the generation step.

This knowledge base creation step sets up the necessary data structures and indexes to enable efficient retrieval and reasoning in the subsequent generation step.
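
The sketch below puts these five steps together. It is illustrative rather than the original code: the `Data` folder, chunk sizes, and output file names follow the description above, while the use of `pypdf`, `sentence-transformers` (with the `all-MiniLM-L6-v2` model), and `faiss-cpu` is an assumption about which open-source tools fill each role.

```python
# Minimal sketch of the knowledge base creation step (illustrative, not the original code).
# Assumed dependencies: pip install pypdf sentence-transformers faiss-cpu
import json
import os

import faiss
import numpy as np
from pypdf import PdfReader
from sentence_transformers import SentenceTransformer

DATA_DIR = "Data"
CHUNK_SIZE, CHUNK_OVERLAP = 500, 50

# 1. Load documents: extract text page by page, keeping the file name as metadata.
pages = []
for file_name in os.listdir(DATA_DIR):
    if file_name.endswith(".pdf"):
        reader = PdfReader(os.path.join(DATA_DIR, file_name))
        for page in reader.pages:
            pages.append({"text": page.extract_text() or "", "source": file_name})

# 2. Chunk documents: 500-character chunks with a 50-character overlap.
chunks = []
for page in pages:
    text = page["text"]
    for start in range(0, len(text), CHUNK_SIZE - CHUNK_OVERLAP):
        chunk = text[start:start + CHUNK_SIZE]
        if chunk.strip():
            chunks.append({"text": chunk, "source": page["source"]})

# 3. Compute embeddings for each chunk with an open-source Hugging Face model (assumed choice).
embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
embeddings = embedder.encode([c["text"] for c in chunks])

# 4. Store the embeddings in a Faiss index for similarity search.
index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(np.asarray(embeddings, dtype="float32"))
faiss.write_index(index, "index.faiss")

# 5. Save the original text chunks to JSON for the generation step.
with open("chunks.json", "w") as f:
    json.dump(chunks, f)
```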

Generation Step

Here is the body of the "Generation Step" section in Markdown format:

During the generation step, we take a query from the user, embed it using the same embedding model, retrieve the relevant chunks, and pass those chunks along with the original query to a reasoning model to generate the final response.

For the retrieval process, we need the embeddings that we stored, so we load the vector store along with the chunks saved in the JSON file. Since we also have to compute embeddings for the query, we load the same embedding model used during knowledge base creation.

We take the input query from the user and use it to generate the final response. Internally, we compute embeddings for the query and use them to run a vector search; in this case, we retrieve the top 20 chunks that the embedding model considers most similar to the user query.
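
Continuing the sketch from the knowledge base step (and reusing its hypothetical file names and embedding model), the retrieval might look like this:

```python
# Sketch of the retrieval step (illustrative; reuses the files from the previous sketch).
import json

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

TOP_K = 20

# Load the stored index, the chunks, and the same embedding model used earlier.
index = faiss.read_index("index.faiss")
with open("chunks.json") as f:
    chunks = json.load(f)
embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Embed the user query and retrieve the 20 most similar chunks.
query = input("Enter your question: ")
query_embedding = embedder.encode([query])
distances, indices = index.search(np.asarray(query_embedding, dtype="float32"), TOP_K)
```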

In general, you would retrieve a much larger number of chunks, say closer to 50 or 100, and then use a reranking model in the middle to re-rank the returned chunks based on their similarity to the user query. However, since we are using a reasoning model, it can use chain-of-thought reasoning to determine which chunks are most relevant to the user query.

We take the returned indices, use them to look up the corresponding text chunks, and join the chunks into a single text blob, which we print along with the user query. This blob is going to be used as the context.
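
Continuing the same sketch, the indices returned by the vector search are mapped back to the stored chunks and joined into one context string:

```python
# Map the Faiss indices back to the stored chunks and build the context blob.
retrieved = [chunks[i] for i in indices[0]]
context = "\n\n".join(
    f"Document (source: {c['source']}):\n{c['text']}" for c in retrieved
)
print(context)
print("User query:", query)
```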

In order for the LLM to generate a response, we also need to give it some instructions. The augmented prompt is "Please provide an answer to the following question based on the provided context. If the context does not contain the answer, please respond with 'I'm sorry, but the provided context does not have the information to answer your question.'" We then append the context along with the original user query.
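
Assembled in code, the augmented prompt might look like the following; the instruction wording follows the description above, while the exact layout of the context and question is an assumption:

```python
# Assemble the augmented prompt: instruction + retrieved context + original query.
augmented_prompt = (
    "Please provide an answer to the following question based on the provided context. "
    "If the context does not contain the answer, please respond with "
    "\"I'm sorry, but the provided context does not have the information "
    "to answer your question.\"\n\n"
    f"Context:\n{context}\n\n"
    f"Question: {query}"
)
```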

Using SambaNova Cloud for Fast Inference

SambaNova Cloud is a powerful platform that provides fast inference capabilities for the DeepSeek R1 model. Here are the key points:

  • SambaNova is the creator of RDUs (Reconfigurable Dataflow Units), which are an alternative to GPUs for accelerating deep learning inference.
  • SambaNova Cloud hosts the full DeepSeek R1 model, not just the distilled version, and offers incredibly fast inference speeds of up to 198 tokens per second.
  • You can access the DeepSeek R1 model on the SambaNova Cloud platform, both through their playground and their API.
  • The SambaNova Cloud API is easy to integrate with the OpenAI Python client, making it simple to switch between different models (see the sketch after this list).
  • SambaNova Cloud offers a free tier for developers to get started, and you can join the waitlist to access the API.
  • The fast inference speeds provided by SambaNova Cloud can be crucial when using a reasoning model like DeepSeek R1 in your application, as these models tend to generate more tokens compared to non-reasoning models.
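
As a rough sketch, calling DeepSeek R1 on SambaNova Cloud through the OpenAI Python client could look like the following. The base URL and model identifier are assumptions, so verify them against SambaNova's current documentation; `augmented_prompt` is the prompt built in the generation step above.

```python
# Sketch: call DeepSeek R1 on SambaNova Cloud via the OpenAI Python client.
# The base_url and model name below are assumptions -- check SambaNova's docs.
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["SAMBANOVA_API_KEY"],  # your SambaNova Cloud API key
    base_url="https://api.sambanova.ai/v1",   # assumed SambaNova endpoint
)

response = client.chat.completions.create(
    model="DeepSeek-R1",  # assumed model identifier
    messages=[{"role": "user", "content": augmented_prompt}],
)
print(response.choices[0].message.content)
```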

Overall, SambaNova Cloud is a great option for developers looking to leverage the power of the DeepSeek R1 model with fast and reliable inference capabilities.

Analyzing the Impact of Bite Size on Intake

The provided context discusses the impact of bite rate (number of bites per minute) on energy intake, rather than the direct impact of bite size. However, the reasoning model is able to infer the relationship between bite size and intake based on the information presented.

The key points from the context are:

  • Reducing the bite rate (i.e., taking fewer bites per minute) may lead to reduced energy intake.
  • Increasing or decreasing the rate of intake (which is related to bite rate) can affect energy intake.
  • Studies have shown that increasing the number of chews per bite, which effectively reduces bite size, leads to reduced energy intake in both obese and normal individuals.

Based on this, the reasoning model concludes that a smaller bite size, achieved by increasing the number of chews per bite, can lead to reduced energy intake. The model is able to make this inference by connecting the information about bite rate and chews per bite, even though the context does not directly discuss the impact of bite size.

Using the Reasoning Model for Ranking

The reasoning model can be used as a powerful reranking component in the retrieval pipeline. Here's how it works:

  1. The original augmented prompt is used, but an additional instruction is added before the final answer generation (see the sketch after this list): "Analyze each document in the context and identify if it contains the answer to the question. Assign a relevance score (0-10) to each document based on how relevant it is to answering the question. List the most relevant documents first, and then answer the question based on those documents only."
  2. The reasoning model goes through each document in the context, evaluates its relevance to the question, and assigns a relevance score.
  3. The model then lists the most relevant documents first, along with their relevance scores.
  4. Finally, the model answers the question based on the most relevant documents in the context.
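
As a sketch, this ranking instruction can simply be prepended to the augmented prompt built earlier and sent through the same client; the wording mirrors step 1 above, while the prompt layout is an assumption.

```python
# Sketch: extend the augmented prompt with the document-ranking instruction.
ranking_instruction = (
    "Analyze each document in the context and identify if it contains the answer "
    "to the question. Assign a relevance score (0-10) to each document based on how "
    "relevant it is to answering the question. List the most relevant documents first, "
    "and then answer the question based on those documents only."
)
reranking_prompt = ranking_instruction + "\n\n" + augmented_prompt

# Reuse the SambaNova Cloud client from the previous sketch.
response = client.chat.completions.create(
    model="DeepSeek-R1",  # assumed model identifier
    messages=[{"role": "user", "content": reranking_prompt}],
)
print(response.choices[0].message.content)
```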

This approach allows the reasoning model to act as a dedicated reranking component, identifying the most relevant documents and using them to generate the final answer. This is a crucial step in building a robust retrieval pipeline, as it ensures that the most relevant information is used to answer the user's query.

By leveraging the reasoning capabilities of the DeepSeek R1 model, this approach can outperform traditional reranking methods, as the model can understand the semantic relevance of the documents to the question, rather than relying solely on surface-level features.

Conclusion

Here is the body of the "Conclusion" section in Markdown format:

In this tutorial, we have explored how to build a simple yet powerful Retrieval-Augmented Generation (RAG) pipeline using a reasoning language model like DeepSeek R1. By leveraging the capabilities of DeepSeek R1, we were able to create a knowledge base, retrieve relevant information, and generate responses to user queries in a concise and effective manner.

The key highlights of this approach include:

  1. Naive Chunking Strategy: Even with a simple chunking strategy, the reasoning model was able to identify the most relevant information to answer the user's query.
  2. Reasoning-based Ranking: The reasoning model can be used as a powerful re-ranking component, analyzing the relevance of each document and prioritizing the most useful information.
  3. Efficient Inference with SambaNova: By utilizing the fast inference capabilities of SambaNova Cloud's DeepSeek R1 model, we were able to achieve impressive response times, making this approach suitable for real-world applications.

This demonstration showcases the potential of reasoning models to enhance traditional RAG pipelines. By seamlessly integrating DeepSeek R1 into the workflow, we were able to create a more robust and intelligent system that can handle complex queries and provide accurate, contextual responses.

As you build upon this foundation, consider more advanced techniques, such as semantic chunking and custom ranking models, to further optimize your RAG pipeline. Additionally, the integration with SambaNova's API opens up opportunities to leverage the latest advancements in reasoning models and accelerate your development process.

FAQ