How Agentic RAG Reduces Hallucinations and Improves Retrieval
Discover how Agentic RAG can reduce hallucinations and improve retrieval by reformulating queries, refining results, and leveraging large language models. Unlock more comprehensive and relevant responses for your RAG applications.
February 24, 2025

Unlock the power of Retrieval Augmented Generation (RAG) with Agents. Discover how to enhance your RAG pipeline and reduce hallucinations by introducing intelligent agents that can analyze queries, refine searches, and generate more accurate and comprehensive responses. This blog post provides a practical guide to implementing agentic RAG, equipping you with the tools to elevate your natural language processing applications.
How Agents Can Fix Hallucinations in RAG
Building an Agentic RAG Pipeline
Creating a Retrieval Tool
Integrating the Language Model
Implementing the Agent
Comparing Standard RAG and Agentic RAG
Conclusion
How Agents Can Fix Hallucinations in RAG
The information retrieval step in retrieval-augmented generation (RAG) depends heavily on how the user phrases their question. If the query is not well-formulated, retrieval can fail even when the information the user is looking for is present in the knowledge base. Traditional RAG gives us a single retrieval pass, but we can fix this with agentic RAG.
To understand how agentic RAG can help, let's look at the traditional RAG setup for retrieval. The user query is run through a semantic similarity search that looks for the most relevant chunks in the knowledge base. But what happens if the question itself is poorly phrased? In that case, your RAG pipeline will most likely hallucinate, meaning it will start making up answers, or the language model will tell the user that it couldn't find the information, even though the information is actually present in the knowledge base.
We can fix this by introducing agents in the RAG pipeline and giving them the ability to not only analyze the initial query but also analyze the responses generated by the RAG pipeline. Here's how it usually looks:
- The initial query is passed through an agent.
- The agent will reformulate the initial query.
- The refined query is passed through the knowledge base, and a semantic-based similarity search is performed to retrieve the most relevant documents.
- Before passing the relevant documents to the language model, the agent analyzes those documents or chunks again and refines the query if it thinks the retrieved documents cannot answer the question.
- Based on the refined query, the process is repeated until the agent is satisfied with both the retrieved documents and the reformulated query.
- The final context is then passed to the language model to generate the answer.
In this loop, the agent can plan, analyze, and execute. To build such agents, you have several options, including frameworks like CrewAI, AutoGPT, or LangGraph from LangChain. In this post, we'll be using Transformers Agents, a lesser-known feature within the Transformers package that allows you to create your own agents.
Building an Agentic RAG Pipeline
With the agentic loop defined, let's build the pipeline step by step using Transformers Agents.
First, we need to install the required packages, including pandas, LangChain, the LangChain Community package, Sentence Transformers, and the Transformers package; a sample install command is shown below. Then we'll import the necessary modules and set up the data, including splitting the documents into chunks and creating embeddings.
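As a rough guide, the installation might look like this (exact package names can vary by version; faiss-cpu, datasets, and openai are assumptions based on the vector store, documentation corpus, and engine used later in this post):

```bash
pip install pandas langchain langchain-community sentence-transformers transformers datasets faiss-cpu openai
```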
Next, we'll create a retrieval tool that the agent can use to retrieve relevant documents from the knowledge base. We'll also set up the LLM, in this case, using the OpenAI engine.
Finally, we'll create the agent itself, which will have access to the retrieval tool and the LLM, as well as a system prompt that guides the agent's behavior. We'll then run the agent through the agentic loop to generate answers to sample questions, comparing the results to a standard RAG pipeline.
The agentic RAG approach allows for more robust and comprehensive answers by enabling the agent to refine the query and retrieve the most relevant information, leading to better-quality responses from the LLM.
Creating a Retrieval Tool
To create a retrieval tool for the agentic RAG pipeline, we define a `RetrievalTool` class with the following structure:
```python
from typing import List


class RetrievalTool:
    """Using semantic similarity, retrieves some documents from the knowledge base
    that have the closest embeddings to the input."""

    def __init__(self, vector_db):
        self.vector_db = vector_db

    def __call__(self, query: str) -> List[str]:
        """
        Retrieve up to 7 most similar chunks from the vector DB for the given query.

        Args:
            query (str): A query to perform retrieval on. This should be
                semantically close to the target documents.

        Returns:
            List[str]: A list of up to 7 most similar chunks from the vector DB.
        """
        results = self.vector_db.similarity_search(query, k=7)
        return [chunk.page_content for chunk in results]
```
The `RetrievalTool` class takes a `vector_db` object as input: the vector store (e.g., FAISS, Chroma, Pinecone) used to store the document embeddings. Its `__call__` method takes a `query` string as input and returns a list of the up to seven most similar chunks from the vector DB, using the vector store's `similarity_search` method to find the chunks whose embeddings are closest to the query.
This retrieval tool can then be used as part of the agentic RAG pipeline, where the agent can analyze the initial query, refine it, and pass it through the retrieval tool to get the most relevant chunks from the knowledge base.
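As a quick, self-contained illustration, here is how the tool might be exercised against a toy FAISS index built with LangChain. The two documents are stand-ins for a real knowledge base (the full one is built later in this post), and gte-small is the embedding model used there:

```python
from langchain_core.documents import Document
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

# Toy chunks standing in for a real knowledge base.
docs = [
    Document(page_content="Use push_to_hub() to upload a model to the Hugging Face Hub."),
    Document(page_content="Pipelines offer a simple API for common NLP tasks."),
]

# Embed the chunks and index them in a FAISS vector store.
embedding_model = HuggingFaceEmbeddings(model_name="thenlper/gte-small")
vector_db = FAISS.from_documents(docs, embedding_model)

retrieval_tool = RetrievalTool(vector_db)
print(retrieval_tool("How do I upload a model to the Hub?"))
```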
Integrating the Language Model
To integrate the language model into the agentic RAG pipeline, we need to set up the LLM that will be used both by the agent and to generate the final response. We have two options:
- Using the Hugging Face engine: This allows us to directly call the API endpoints of different LLMs available through Hugging Face's serverless inference. We can use models like Llama 3 8B or 70B, but these usually require a Hugging Face Pro subscription.
- Using OpenAI: For this example, we will be using the OpenAI engine. The process can be adapted to set up any other LLM.
To set up the OpenAI engine, we create a class called `OpenAIEngine` that uses the `MessageRole` and `get_clean_message_list` utilities from the Transformers Agents LLM engine module. This class handles cleaning the input messages and calls the OpenAI chat completions endpoint to generate responses.
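A minimal sketch of such an engine, assuming the `transformers.agents.llm_engine` module (present in the transformers releases that ship Transformers Agents) and an `OPENAI_API_KEY` set in the environment; the gpt-4o model name and temperature are illustrative choices, not prescribed here:

```python
from openai import OpenAI
from transformers.agents.llm_engine import MessageRole, get_clean_message_list

# Map agent-specific roles (e.g., tool responses) onto roles the OpenAI API accepts.
openai_role_conversions = {MessageRole.TOOL_RESPONSE: MessageRole.USER}


class OpenAIEngine:
    def __init__(self, model_name: str = "gpt-4o"):
        self.model_name = model_name
        self.client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def __call__(self, messages, stop_sequences=None):
        # Normalize roles and merge consecutive messages from the same role.
        messages = get_clean_message_list(messages, role_conversions=openai_role_conversions)
        response = self.client.chat.completions.create(
            model=self.model_name,
            messages=messages,
            stop=stop_sequences,
            temperature=0.5,
        )
        return response.choices[0].message.content
```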
Next, we create the actual agent. The agent has access to the retrieval tool we created earlier, the LLM we want to use (the `OpenAIEngine` in this case), and the maximum number of iterations we want the agent to perform before stopping the agentic loop.
We also provide a system prompt to the agent, which gives it instructions on how to use the information in the knowledge base to provide a comprehensive answer to the user's question. The prompt encourages the agent to retry the retrieval process with different queries if it cannot find the necessary information.
With the agent and LLM set up, we can now run the agentic loop to answer user questions. The agent will iteratively refine the query, retrieve relevant information from the knowledge base, and generate a final response using the LLM. This approach leads to more detailed and relevant answers compared to a standard RAG pipeline.
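Putting the pieces together, a minimal sketch of the agent setup might look like the following. One assumption to note: `ReactJsonAgent` expects tools that subclass `transformers.agents.Tool` and declare a name, description, inputs, and output type, so the retrieval tool is adapted accordingly (the exact type strings, "string" vs. "text", depend on your transformers version). The question and instruction wording are illustrative:

```python
from transformers.agents import ReactJsonAgent, Tool


class RetrieverTool(Tool):
    name = "retriever"
    description = (
        "Using semantic similarity, retrieves documents from the knowledge base "
        "that have the closest embeddings to the input query."
    )
    inputs = {
        "query": {
            "type": "string",  # use "text" on older transformers releases
            "description": "A query semantically close to the target documents.",
        }
    }
    output_type = "string"  # likewise "text" on older releases

    def __init__(self, vector_db, **kwargs):
        super().__init__(**kwargs)
        self.vector_db = vector_db

    def forward(self, query: str) -> str:
        results = self.vector_db.similarity_search(query, k=7)
        return "\n===Document===\n".join(doc.page_content for doc in results)


agent = ReactJsonAgent(
    tools=[RetrieverTool(vector_db)],
    llm_engine=OpenAIEngine(),
    max_iterations=4,  # cap on the reformulate-retrieve loop
)

# The instructions guiding the agent, including the nudge to retry with
# reformulated queries, are folded into the task prompt here.
question = "How can I push a model to the Hugging Face Hub?"
answer = agent.run(
    "Using the information in your knowledge base, accessible with the 'retriever' tool, "
    "give a comprehensive answer to the question below. If your search is unsuccessful, "
    f"try again with a semantically different query.\nQuestion: {question}"
)
print(answer)
```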
Implementing the Agent
To implement the agent, we will use the Transformers Agents feature within the Transformers package. This provides a modular and clear approach to creating custom agents.
First, we need to install the required packages, including pandas, LangChain, the LangChain Community package, Sentence Transformers, and Transformers.
Next, we import the necessary modules and packages. We will be using the `ReactJsonAgent`, building custom tools for the agent, and setting up an LLM engine for the language model.
To build the RAG pipeline, we start with a dataset containing the Hugging Face documentation. We split the documents into chunks and create embeddings using the GTE-small model. We then remove any duplicate chunks and store the unique chunks in a FAISS vector store.
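A sketch of this data-preparation step is below. The dataset name m-ric/huggingface_doc comes from Hugging Face's agentic RAG cookbook and is an assumption here, as are the "text" column name and the chunk sizes; any corpus of text documents works the same way:

```python
from datasets import load_dataset
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.documents import Document
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

# Load a corpus of Hugging Face documentation pages.
knowledge_base = load_dataset("m-ric/huggingface_doc", split="train")
source_docs = [Document(page_content=row["text"]) for row in knowledge_base]

# Split documents into overlapping chunks sized for the embedding model.
splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=20)
chunks = splitter.split_documents(source_docs)

# Remove duplicate chunks before indexing.
seen, unique_chunks = set(), []
for chunk in chunks:
    if chunk.page_content not in seen:
        seen.add(chunk.page_content)
        unique_chunks.append(chunk)

# Embed with GTE-small and store in a FAISS vector store.
embedding_model = HuggingFaceEmbeddings(model_name="thenlper/gte-small")
vector_db = FAISS.from_documents(unique_chunks, embedding_model)
```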
Now, we introduce the agent into the mix. We create a `RetrievalTool` that uses semantic similarity to retrieve the most relevant chunks from the knowledge base based on the user's query.
We also set up the language model, in this case, using the OpenAI engine with the GPT-4 model.
The agent is then created, with access to the retrieval tool and the language model. We also specify the maximum number of iterations the agent can perform to refine the query and the retrieved context.
The agent is provided with a system prompt that guides it to use the knowledge base to provide a comprehensive answer to the user's question. The agent then goes through an iterative process, reformulating the query and retrieving more relevant information until it is satisfied with the response.
The agentic RAG pipeline is then compared to a standard RAG pipeline, demonstrating how the agent-based approach can provide more detailed and relevant answers, especially when the user's initial query is not well-formulated.
Comparing Standard RAG and Agentic RAG
The key differences between standard RAG and agentic RAG are:
- Query Reformulation: In standard RAG, the user query is directly passed through the semantic-based similarity search to retrieve relevant chunks from the knowledge base. In agentic RAG, an agent analyzes the initial query and can reformulate it to improve the retrieval process.
- Iterative Refinement: Agentic RAG allows the agent to iteratively refine the query and the retrieved context. If the agent is not satisfied with the initial retrieval, it can repeat the process with a refined query to get better results.
- Concise and Relevant Responses: The agentic approach tends to generate more concise and relevant responses compared to standard RAG. The agent's ability to analyze the query and the retrieved context helps it provide a more comprehensive answer.
- Handling Poorly Formulated Queries: Agentic RAG is better equipped to handle cases where the user query is not well-formulated. The agent can recognize the limitations of the initial query and work to reformulate it, leading to better retrieval and more informative responses.
- Flexibility and Customization: Agentic RAG allows for more flexibility and customization, as the agent can be equipped with various tools and capabilities to suit the specific needs of the application.
In summary, agentic RAG introduces an additional layer of intelligence and control, enabling the system to better understand the user's intent, refine the retrieval process, and generate more targeted and informative responses, even when the initial query is not optimal.
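For reference, the standard RAG baseline in this comparison can be as simple as one retrieval pass followed by one generation. A sketch, reusing the `vector_db` and `OpenAIEngine` from the earlier sections (prompt wording is illustrative):

```python
def standard_rag(question: str, vector_db, llm_engine) -> str:
    # One retrieval pass: no query reformulation, no retries.
    docs = vector_db.similarity_search(question, k=7)
    context = "\n\n".join(doc.page_content for doc in docs)
    prompt = (
        "Using the context below, give a comprehensive answer to the question.\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return llm_engine([{"role": "user", "content": prompt}])


# Example: compare against the agentic pipeline on the same question.
# print(standard_rag("How can I push a model to the Hub?", vector_db, OpenAIEngine()))
```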
Conclusion
The introduction of agents into the Retrieval Augmented Generation (RAG) pipeline can significantly improve the quality and relevance of the generated responses. By allowing the agent to analyze the initial query, refine it, and iteratively retrieve and evaluate the most relevant information from the knowledge base, the agentic RAG approach can overcome the limitations of traditional RAG setups, where the quality of the output is highly dependent on the user's ability to formulate the query effectively.
The key benefits of the agentic RAG approach include:
- Improved query reformulation: The agent's ability to analyze the initial query and reformulate it based on the retrieved information ensures that the final query is more semantically aligned with the user's intent, leading to more relevant results.
- Iterative retrieval and evaluation: The agent's ability to repeatedly retrieve and evaluate the retrieved information allows it to refine the query and ensure that the final context provided to the language model is comprehensive and addresses the user's question.
- Increased robustness: By not relying solely on the initial user query, the agentic RAG approach is more robust to poorly formulated questions, as the agent can work to overcome these limitations through its iterative process.
- Detailed and informative responses: The agentic RAG approach, as demonstrated in the examples, can generate more detailed and informative responses compared to traditional RAG setups, providing users with a more comprehensive understanding of the topic.
Overall, the integration of agents into the RAG pipeline represents a significant advancement in the field of retrieval-augmented generation, and the techniques and tools presented in this post can serve as a foundation for building more powerful and user-friendly conversational AI systems.