
A Step-by-Step Guide to Building and Distributing a Sleek RAG Pipeline

In this article, we build a Retrieval-Augmented Generation (RAG) pipeline using KitOps, integrating tools like ChromaDB for embeddings, Llama 3 for language models, and SentenceTransformer for embedding models.

Introduction

This tutorial walks you through the process of creating a sleek Retrieval-Augmented Generation (RAG) pipeline using KitOps. KitOps is an open source MLOps tool used to package the RAG application as a ModelKit for streamlined collaboration on evaluation, testing, and operations.

In addition to KitOps, we will also be using ChromaDB for the embedding database, Llama 3 for our large language model (LLM), SentenceTransformer for the embedding model, and LangChain for chunking.

Background

Natural language processing (NLP) has become a hot topic lately. Finding ways to enhance information retrieval and generate contextually accurate responses is now top of mind for enterprise organizations.

One approach is Retrieval-Augmented Generation, or simply RAG. A RAG pipeline combines the strengths of retrieval-based systems and generative models, empowering developers to build intelligent, scalable, and highly customizable applications.

From chatbots and virtual assistants to complex information extraction systems, learning to create a RAG pipeline has become a valuable skill set. We wrote this tutorial as an easy way to get started.

1. Prerequisites

Before we start, make sure you have the following:

  • Python 3.9+: You can find and install the most recent version on the official Python website.
  • KitOps CLI: You can find the most recent version of the KitOps CLI on GitHub, and follow the KitOps CLI setup guide.
  • A basic knowledge of Python: We will be using a bit of Python in this tutorial, so knowing your way around Python is required.

2. Install your tools

Our first step is to get all of our tools in place. To do this, we will install the following packages (an optional import check follows the list):

  1. Install ChromaDB: ChromaDB is our embedding database.
BAT
pip install chromadb
  2. Install LangChain: LangChain helps with chunking the text data.
BAT
pip install langchain
  3. Install Llama.cpp: This will allow us to interact with the Llama model.
BAT
pip install llama-cpp-python
  4. Install SentenceTransformers: This is our embedding model framework.
BAT
pip install sentence_transformers
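
To confirm the installs succeeded, it can help to import each package once. This optional check is a minimal sketch and not part of the pipeline itself.

Python
# Optional: verify that the installed packages import cleanly.
import chromadb
import langchain
import llama_cpp
import sentence_transformers

print("All packages imported successfully.")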

3. Load your Llama model

Next, we need to get our Llama model ready. For this, we can use KitOps, which helps speed things up:

  1. Pull the Llama 3 8B model: Use KitOps to download the Llama model.
BAT
kit unpack ghcr.io/jozu-ai/llama3:8B-instruct-q4_0 -d ./llama3
  2. Load the model: Once that’s done, we will load the model in Python.
Python
from llama_cpp import Llama

llm = Llama(
    model_path="./llama3/llama3-8b-instruct-q4_0.gguf", 
    seed=1337 # set a specific seed
    # n_gpu_layers=-1, # Uncomment to use GPU acceleration
    # n_ctx=2048, # Uncomment to increase the context window
)

Here, we initialize the Llama model, optionally enabling GPU acceleration and adjusting the context window for larger inputs.
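
Before moving on, it can be worth running a quick smoke test to confirm the model loads and responds. This is a minimal sketch using the same create_chat_completion call we rely on later; the prompt and token limit are arbitrary.

Python
# Optional smoke test: ask the model for a short reply to confirm it is working.
test_output = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Say hello in one short sentence."}],
    max_tokens=32,
)
print(test_output["choices"][0]["message"]["content"])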

4. Chunking

The next step is to break down the data into more manageable chunks using LangChain.

Chunking is the process of dividing a large text into smaller, more manageable "chunks." This is important because it allows the model to process the text more efficiently. With smaller chunks, we can handle larger datasets without overwhelming the model. It also helps maintain context and ensures each part of the text gets attention during processing. Chunking also makes it easier to manage memory usage and processing time.

  1. Read the dataset:
Python
from langchain.text_splitter import CharacterTextSplitter

try:
    with open("./dataset.txt", 'r') as file:
        content = file.read()
except FileNotFoundError:
    print("File not found.")
    raise  # stop here; the rest of the pipeline needs the dataset
  2. Split the text:
Python
text_splitter = CharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=0,
    length_function=len,
)

split_texts = text_splitter.split_text(content)

This code reads the content of dataset.txt and splits it into manageable chunks of up to 1,000 characters each.
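
It is worth sanity-checking the split before creating embeddings. This small sketch simply prints the chunk count and a preview of the first chunk.

Python
# Sanity check: how many chunks were produced, and what the first one looks like.
print(f"Number of chunks: {len(split_texts)}")
print(split_texts[0][:200])  # preview the first 200 characters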

5. Creating and storing embeddings

We will now use ChromaDB as a vector database to store our embeddings.

Embeddings are a way to convert text data into numerical vectors that the model can process. These vectors capture the semantic meaning of the text, allowing the model to understand and work with the data more efficiently. By creating and storing embeddings, we ensure that the text data is in a format that models can easily retrieve and compare. A vector database like ChromaDB stores these embeddings, making it easier to perform semantic searches.

  1. Pull the SentenceTransformer embedding model from the ModelKit registry using KitOps:
BAT
kit unpack ghcr.io/jozu-ai/all-minilm-l6-v2:safetensor -d minilm
  2. Initialize the embedding model:
Python
from sentence_transformers import SentenceTransformer
import chromadb

embedding_model = SentenceTransformer("./minilm")
  3. Encode the text chunks:
Python
embeddings = embedding_model.encode(split_texts)
  4. Store embeddings in ChromaDB:
Python
chroma_client = chromadb.Client()
collection = chroma_client.create_collection(name="dataset-collection")
collection.add(
    documents=split_texts,
    embeddings=embeddings,
    ids=[f"id{sno}" for sno in range(1, len(split_texts) + 1)]
)

This process will encode the text chunks into embeddings and store them in ChromaDB.
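
To confirm everything was stored as expected, you can check the collection size and the embedding dimensionality; a quick sketch:

Python
# Quick verification: number of stored documents and the size of each embedding vector.
print(f"Documents in collection: {collection.count()}")
print(f"Embedding dimensions: {len(embeddings[0])}")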

6. Semantic search

We can now perform semantic search using ChromaDB:

  1. Query embedding:
Python
query = "workflow engines"
query_embedding = embedding_model.encode([query])
  2. Retrieve results:
Python
results = collection.query(
    query_embeddings=query_embedding,
    n_results=3,
)

print(results)

Here, we’ve encoded the search query and used ChromaDB to find the most relevant text chunks.
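
To see how close each match is, you can print every retrieved chunk next to its distance score. This sketch assumes the default query settings, which include documents and distances in the result.

Python
# Inspect the matches alongside their distances (lower distance = closer match).
for doc, dist in zip(results["documents"][0], results["distances"][0]):
    print(f"{dist:.4f}  {doc[:100]}")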

7. Response generation with LLM

The final step before we can run our RAG pipeline is to pass the search results to the Llama model along with the user query and prompt for generating an answer from the given context. This demonstrates how the model synthesizes information from the retrieved context (semantic search results) and produces a relevant answer.

  1. Prepare the prompt:
Python
contexts = results["documents"][0]
query = "Give me 3 open source workflow engines"
user_prompt = '''You are given a Query and a context, your task is to analyze the context and answer the given Query based on the context.
                        'Statement': {}
                        'Context': {}'''.format(query, contexts)
  2. Generate the response:
Python
def generate_response(user_prompt):
    output = llm.create_chat_completion(
        messages=[{
            "role": "user",
            "content": user_prompt
        }],
        max_tokens=200
    )
    return output['choices'][0]['message']['content']

output = generate_response(user_prompt)
print(output)

8. Running the RAG Pipeline

Now, we can finally bring it all together and run our RAG pipeline.

  1. Define the RAG function:
Python
def rag(query):
    query_embedding = embedding_model.encode([query])

    results = collection.query(
        query_embeddings=query_embedding,
        n_results=3,
    )

    contexts = results["documents"][0]
    user_prompt = '''You are given a Query and a context, your task is to analyze the context and answer the given Query based on the context.
                        'Statement': {}
                        'Context': {}'''.format(query, contexts)

    # generate_response() already returns the message content as a string
    return generate_response(user_prompt)
  2. Run the pipeline:
Python
print(rag("Your Query here"))

This function ties everything together, enabling you to input a query and get a contextually relevant response.

If you’ve made it this far, congratulations! You now have a fully functioning RAG pipeline. This pipeline allows you to perform sophisticated semantic searches and generate responses using a powerful language model. But, we’re not done yet. Now that our pipeline is running, we want to package it for distribution.

To do this, we will use ModelKits which will help streamline collaboration on evaluation, testing, and operations.

ModelKits are an OCI-compliant packaging format that enables the seamless sharing of all necessary artifacts involved in the AI/ML model lifecycle. This includes datasets, code, configurations, and the models themselves. By standardizing the way these components are packaged, ModelKit facilitates a more streamlined and collaborative development process that is compatible with nearly any tool. You can learn more about ModelKits here.

1. Convert Notebook to Python Scripts

Before we get started, we need to convert our notebook code into Python scripts and add a requirements.txt file that lists all dependencies required by the pipeline. We haven't included that code here for brevity, but you can retrieve it from our existing ModelKit by issuing the following command (a minimal requirements.txt is also sketched below):

BAT
kit unpack ghcr.io/jozu-ai/modelkit-examples/rag_pipeline:latest --code -d ./myrag
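
If you prefer to write the requirements.txt yourself, a minimal version based on the packages installed earlier in this tutorial would list the following; the file shipped in the ModelKit may pin specific versions.

chromadb
langchain
llama-cpp-python
sentence_transformers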

2. Organize your artifacts as shown below

Next, we need to organize our RAG artifacts. Your project structure should look like the following:

Root
  | - rag_pipeline   # Includes the Python, configuration, and requirements.txt files
  | - dataset.txt
  | - Kitfile

3. Create a Kitfile that refers to your artifacts and your base model

Once we are organized, we need to create a Kitfile that lists all of our project artifacts. A Kitfile is a YAML-based manifest designed to streamline the encapsulation and sharing of project artifacts. From code and datasets to models and their metadata, the Kitfile serves as a comprehensive blueprint for your project, ensuring every component is meticulously organized and easily accessible. For more information, refer to the Kitfile reference.

Your Kitfile should look like the following:

manifestVersion: "1.0"
package:
  name: RAG with Llama3
  version: 1.0.0
  authors: ["Jozu AI"]
model:
  name: llama3-8B-instruct-q4_0
  path: ghcr.io/jozu-ai/llama3:8B-instruct-q4_0 
  description: Llama 3 8B instruct model
  license: Apache 2.0
datasets:
  - path: ./dataset.txt
code: 
  - path: ./rag_pipeline
  - path: tutorial.py

4. Use Kit to package and push the RAG pipeline to your registry

Next, we need to package and push our RAG pipeline to our registry of choice. We will do this from the root of the folder where we created the Kitfile. We suggest using Docker Hub or Artifactory.

  1. Pack the pipeline:
BAT
kit pack . -t my-oci-registry/my-rag-repo/rag-tutorial:latest
  2. Push to the registry to share:
BAT
kit push my-oci-registry/my-rag-repo/rag-tutorial:latest

5. Retrieve the pipeline and profit

Once we’ve pushed our RAG pipeline, we can retrieve and unpack the ModelKit:

BAT
kit unpack ghcr.io/jozu-ai/modelkit-examples/rag_pipeline_bot:latest -d /path/to/unpacked

6. To quickly try the RAG tutorial

  1. Unpack the RAG pipeline and base LLM:
BAT
kit unpack ghcr.io/jozu-ai/modelkit-examples/rag_pipeline:latest -d ./myrag
  2. Unpack the embedding model:
BAT
kit unpack ghcr.io/jozu-ai/all-minilm-l6-v2:safetensor -d minilm
  3. Run the pipeline:
BAT
python ./rag_pipeline/rag_user.py ./dataset.txt ./minilm llama3-8B-instruct-q4_0.gguf

Congrats! You have successfully created a fully functioning RAG pipeline and have packaged it for distribution. If you found this tutorial helpful and would like to learn more about KitOps or contribute to our project, please visit KitOps.ML or join our Discord group.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)