Unleash the Power of OpenAI's Responses API: Building a Robust RAG System

March 21, 2025

Unlock the power of OpenAI's Responses API and build a robust Retrieval-Augmented Generation (RAG) system with ease. This blog post guides you through the process, from creating a vector store to implementing a comprehensive evaluation strategy, empowering you to deliver exceptional user experiences.

Build a RAG System with OpenAI's Responses API

To build a RAG (Retrieval-Augmented Generation) system using OpenAI's Responses API, follow these steps:

  1. Install Required Packages: Install the latest version of the OpenAI Python SDK to work with the Responses API.

  2. Create a Vector Store: Use the client.vector_stores.create() function to create a vector store on OpenAI's servers. This will store your documents and their embeddings.

  3. Upload Documents to the Vector Store: Use a helper function to upload your documents (e.g., blog posts) to the vector store. The API supports various file types, including PDFs.

  4. Retrieve Relevant Documents: Use the client.vector_stores.search() endpoint to retrieve the most relevant documents based on a user's query. This will return a list of documents with relevance scores.

  5. Generate Responses: Use the client.responses.create() endpoint to generate responses based on the retrieved documents. Provide the user's query and the model you want to use (e.g., GPT-4o mini). The API will handle the end-to-end retrieval and generation process.

  6. Evaluate the System: Create an evaluation dataset by generating questions based on the documents using an LLM. Then, test the retrieval and generation accuracy of your RAG system using metrics like recall, precision, and faithfulness.

By using the Responses API, you can quickly set up a RAG system without having to worry about the underlying infrastructure for document storage, chunking, and embedding. The API abstracts away these details, allowing you to focus on building your application. However, be mindful of the associated costs, as the vector store and tool usage have associated charges.
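
To make the workflow concrete before diving into each step, here is a minimal end-to-end sketch (the file name and model are illustrative, and the environment setup covered in the next section is assumed):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# 1. Create a vector store; OpenAI hosts it for you.
vector_store = client.vector_stores.create(name="openai_blog_store")

# 2. Upload a document; chunking and embedding happen server-side.
with open("deep_research.pdf", "rb") as f:
    client.vector_stores.files.upload_and_poll(
        vector_store_id=vector_store.id, file=f
    )

# 3. Ask a question; file_search retrieves relevant chunks and the
#    model answers in a single API call.
response = client.responses.create(
    model="gpt-4o-mini",
    input="What is deep research?",
    tools=[{"type": "file_search", "vector_store_ids": [vector_store.id]}],
)
print(response.output_text)
```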

Install Necessary Packages and Set Up the Environment

First, we need to install the necessary packages to work with the OpenAI Responses API. We'll be using the latest version of the OpenAI Python SDK for this:

```python
!pip install -U openai
```

Next, we'll import the required packages:

```python
import os
from openai import OpenAI
```

Since we're storing the OpenAI API key as a secret in Google Colab, we can read it from the Colab secrets manager and expose it to the SDK:

```python
from google.colab import userdata

os.environ["OPENAI_API_KEY"] = userdata.get("OPENAI_API_KEY")
client = OpenAI()
```

With the setup complete, we're ready to start working with the Responses API.

Create a Vector Store and Upload Documents

With the environment set up, we can create a vector store and upload our documents. We'll use tqdm to show upload progress:

```python
import os
from tqdm.auto import tqdm
```

Next, we'll read the PDF files from the openai_blog_pdfs folder and upload them to a new vector store.

```python
def upload_file_to_vector_store(file_path, vector_store_id):
    # Upload a single file and wait until OpenAI has finished
    # chunking and embedding it.
    with open(file_path, "rb") as f:
        return client.vector_stores.files.upload_and_poll(
            vector_store_id=vector_store_id,
            file=f,
        )

def upload_pdf_files_to_vector_store(pdf_dir, vector_store_id):
    # Upload every PDF in a folder to the given vector store.
    file_paths = [
        os.path.join(pdf_dir, f)
        for f in os.listdir(pdf_dir)
        if f.endswith(".pdf")
    ]
    for file_path in tqdm(file_paths):
        upload_file_to_vector_store(file_path, vector_store_id)
```

We'll create a new vector store and upload the PDF files to it.

```python
vector_store = client.vector_stores.create(name="openai_blog_store")
print(f"Vector store created: {vector_store.id}")

pdf_dir = "openai_blog_pdfs"
upload_pdf_files_to_vector_store(pdf_dir, vector_store.id)
```

The vector store is now ready to be used for document retrieval and response generation.
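
To sanity-check the upload, we can list the files attached to the store and confirm they finished processing:

```python
# Each file should report a status of "completed" once
# chunking and embedding are done.
for f in client.vector_stores.files.list(vector_store_id=vector_store.id):
    print(f.id, f.status)
```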

Retrieve Relevant Documents Using the Vector Store

To retrieve relevant documents using the vector store, we can follow these steps:

  1. Create a vector store on the OpenAI servers:

```python
vector_store = client.vector_stores.create(name="openai_blog_store")
```

This returns an object containing the ID, name, and number of files in the vector store.

  2. Upload the PDF files to the vector store:

```python
upload_pdf_files_to_vector_store("openai_blog_pdfs", vector_store.id)
```

This function reads the contents of the folder and uploads each PDF to the vector store.

  3. Retrieve relevant documents for a given query:

```python
query = "what is deep research"
results = client.vector_stores.search(
    vector_store_id=vector_store.id,
    query=query,
)
```

This returns a list of the documents most relevant to the query, along with their relevance scores.

  4. Inspect the first retrieved result:

```python
# Each search result carries the source filename, a relevance score,
# and one or more matching text chunks.
first = results.data[0]
print(first.filename, first.score)
print(first.content[0].text)
```

This prints the content of the first retrieved chunk.

  5. Generate a response using the retrieved documents:

```python
context = "\n\n".join(
    part.text for result in results.data for part in result.content
)
response = client.responses.create(
    model="gpt-4o-mini",
    input=(
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    ),
)
print(response.output_text)
```

This generates a response with the gpt-4o-mini model, based on the given query and the context assembled from the retrieved documents.

The key aspects of this workflow are:

  • Creating a vector store to store the documents
  • Uploading the documents to the vector store
  • Retrieving relevant documents for a given query
  • Generating a response using the retrieved documents

This provides a simple and efficient way to build a retrieval-augmented language model system using the OpenAI Responses API.

Generating Responses with the Responses API

To generate responses using the Responses API, we first need to create a vector store on the OpenAI servers and upload our documents to it. Here's how we can do that:

  1. Install the necessary packages, including the latest version of the OpenAI Python SDK.
  2. Create a vector store using the client.vector_stores.create() function, providing a name for the store.
  3. Use a helper function to upload PDF files to the vector store, reading the contents of a folder and uploading each file in turn.
  4. Once the files are uploaded, OpenAI will handle the chunking, embedding, and storage of the documents in the vector store.

Now, we can use the vector store to retrieve relevant documents for a given query. We can do this by calling the search endpoint on the vector store, providing the query and receiving a list of relevant documents with their relevance scores.

If we want to generate a response using a language model (LLM) like GPT-4o, we can use the responses endpoint. This endpoint takes the user query, the model to use, and the tools to apply (in this case, the file_search tool with the vector store we created). The API will then retrieve the relevant documents, pass them to the LLM, and return the generated response.
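
Here is what that call looks like (a sketch; the model name and result limit are illustrative):

```python
response = client.responses.create(
    model="gpt-4o-mini",
    input="What is deep research?",
    tools=[{
        "type": "file_search",
        "vector_store_ids": [vector_store.id],
        "max_num_results": 5,  # cap how many chunks are retrieved
    }],
)
print(response.output_text)
```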

The Responses API makes it easy to create a full retrieval-augmented generation pipeline without having to worry about the underlying infrastructure. You can also combine multiple tools, such as the file_search tool and a web_search tool, to enhance the response generation.
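
Combining tools is just a matter of listing more than one. A sketch, assuming the hosted web search tool (exposed as web_search_preview at the time of writing) is enabled for your account:

```python
response = client.responses.create(
    model="gpt-4o-mini",
    input="Summarize recent news about OpenAI's Deep Research feature.",
    tools=[
        {"type": "file_search", "vector_store_ids": [vector_store.id]},
        {"type": "web_search_preview"},
    ],
)
print(response.output_text)
```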

Finally, it's important to have a robust evaluation strategy in place to measure the performance of your Responses API-based system. This can involve creating a dataset of questions and answers, and then testing the retrieval accuracy and response quality using metrics like recall, precision, and faithfulness.

Evaluating the RAG System's Performance

To evaluate the performance of the RAG (Retrieval-Augmented Generation) system, we will follow a two-step approach:

  1. Retrieval Accuracy: Measure how well the system is able to retrieve the relevant documents for a given query.
  2. Generation Quality: Assess the quality of the responses generated by the language model using the retrieved documents.

Retrieval Accuracy

For the retrieval accuracy, we will create an evaluation dataset consisting of a set of questions and their corresponding source documents. We will use an LLM (Large Language Model) to generate these questions, ensuring that each question can only be answered using the information in the provided document.
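
A sketch of that generation step (the prompt, model, and truncation limit are illustrative):

```python
def generate_eval_question(document_text, title):
    # Ask the model for a question answerable only from this document.
    response = client.responses.create(
        model="gpt-4o-mini",
        input=(
            "Write exactly one question that can only be answered using "
            "the document below. Return only the question text.\n\n"
            f"Document ({title}):\n{document_text[:4000]}"
        ),
    )
    return response.output_text.strip()
```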

Here's an example of the evaluation dataset:

| Source Document | Question |
| --- | --- |
| The court rejects Elon's latest attempt to slow down OpenAI | What was the outcome of Elon Musk's request for a preliminary injunction against OpenAI, as mentioned in the document? |
| Introducing Deep Research | What percentage accuracy did the model powering Deep Research achieve on Humanity's Last Exam? |

To evaluate the retrieval accuracy, we will loop through the evaluation questions and use the RAG system to retrieve the relevant documents. We will then compare the retrieved documents with the original source documents and calculate the recall and precision at different thresholds (e.g., top 5, top 10 documents).

This will help us understand how well the RAG system is able to retrieve the necessary information to answer the given questions.
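
A sketch of that evaluation loop, assuming the evaluation set is a list of (question, source_filename) pairs:

```python
def recall_at_k(eval_set, vector_store_id, k=5):
    # Fraction of questions whose source document shows up
    # in the top-k search results.
    hits = 0
    for question, source_filename in eval_set:
        results = client.vector_stores.search(
            vector_store_id=vector_store_id,
            query=question,
        )
        top_files = [r.filename for r in results.data[:k]]
        hits += source_filename in top_files
    return hits / len(eval_set)
```

Running this at k=5 and k=10 gives the recall@5 and recall@10 figures mentioned above.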

Generation Quality

To assess the generation quality, we will use metrics like response relevancy and faithfulness. These metrics will evaluate how well the language model is able to generate responses that are relevant to the user's query and faithful to the information provided in the retrieved documents.

For this, you can use evaluation frameworks like RAGAS (Retrieval-Augmented Generation Assessment), which provides a comprehensive set of metrics for measuring the quality of the generated responses.
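
A minimal sketch with RAGAS, assuming the ragas and datasets packages are installed and reusing the query, retrieved results, and response from earlier (the exact API may vary between ragas versions):

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

# One evaluation row: the query, the generated answer, and the
# retrieved chunks that served as context.
data = Dataset.from_dict({
    "question": [query],
    "answer": [response.output_text],
    "contexts": [[part.text for r in results.data for part in r.content]],
})

print(evaluate(data, metrics=[faithfulness, answer_relevancy]))
```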

By combining the retrieval accuracy and generation quality evaluations, you can get a holistic understanding of the RAG system's performance and identify areas for improvement.

Remember, the evaluation process is crucial for ensuring the robustness and reliability of your RAG-based applications. Continuously monitoring and refining the evaluation strategy will help you deliver high-quality responses to your users.

Conclusion

In this video, we explored the new Responses API from OpenAI and how it can be used to build a robust retrieval-augmented generation (RAG) system. We covered the following key points:

  1. Responses API Overview: The Responses API provides a turnkey solution for creating a RAG system, handling the chunking, embedding, and retrieval of documents for you. It offers tools like file search, which allows you to upload documents and use them for question-answering.

  2. Implementing a RAG Pipeline: We walked through the steps of creating a vector store, uploading documents, and using the Responses API to retrieve relevant documents and generate responses to user queries. This demonstrated the ease of use and the abstraction provided by the Responses API.

  3. Evaluation and Testing: We discussed the importance of having a robust evaluation strategy for your RAG system, including measuring both the retrieval accuracy and the quality of the generated responses. We explored using an LLM-generated dataset and calculating metrics like recall and precision.

The Responses API simplifies the process of building a RAG system, allowing developers to focus on the application-specific aspects rather than the underlying infrastructure. While the pricing model may be more expensive than a custom implementation, the convenience and ease of use make it an attractive option, especially for rapid prototyping and development.

In the next video, we will explore the Agent SDK from OpenAI, which allows you to create multi-agent systems that can leverage various tools, including the Responses API, to provide more advanced conversational experiences.
