
A Step-by-Step Guide to Building and Distributing a Sleek RAG Pipeline

9 Jul 2024 | CPOL | 6 min read
In this article, we build a Retrieval-Augmented Generation (RAG) pipeline using KitOps, integrating tools like ChromaDB for embeddings, Llama 3 for language models, and SentenceTransformer for embedding models.

This article is a sponsored article. Articles such as these are intended to provide you with information on products and services that we consider useful and of value to developers.


This tutorial will walk you through the process of creating a sleek Retrieval-Augmented Generation (RAG) pipeline using KitOps. KitOps is an open source MLOps tool; here we use it to package the RAG application as a ModelKit for streamlined collaboration on evaluation, testing, and operations.

In addition to KitOps, we will also be using ChromaDB for the embedding database, Llama 3 for our large language model (LLM), SentenceTransformer for the embedding model, and LangChain for chunking.


Natural language processing (NLP) has become a hot topic lately. Finding ways to enhance information retrieval and generate contextually accurate responses is now top of mind for enterprise organizations.

One approach is the Retrieval-Augmented Generation or simply RAG pipeline. RAG combines the strengths of retrieval-based systems and generative models, empowering developers to build intelligent, scalable, and highly customizable applications.

From chatbots and virtual assistants to complex information extraction systems, learning to create a RAG pipeline has become a valuable skill set. We wrote this tutorial as an easy way to get started.

1. Prerequisites

Before we start, make sure you have the following:

  • Python 3.9+: You can find and install the most recent version on the official Python website.
  • KitOps CLI: You can find the most recent version of the KitOps CLI on GitHub, and follow the KitOps CLI setup guide.
  • A basic knowledge of Python: We will be using a bit of Python in this tutorial, so knowing your way around Python is required.

2. Install your tools

Our first step is to get all of our tools in place. To do this, we will do the following:

  1. Install ChromaDB: ChromaDB is our embedding database.
pip install chromadb
  2. Install LangChain: LangChain helps with chunking the text data.
pip install langchain
  3. Install llama-cpp-python: These Python bindings for llama.cpp will allow us to interact with the Llama model.
pip install llama-cpp-python
  4. Install SentenceTransformers: This is our embedding model framework.
pip install sentence-transformers
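
Before moving on, it can help to confirm that the four packages are importable. The sketch below uses only the standard library. The helper name missing_modules is hypothetical, and the module names are the import names used later in this tutorial, which can differ from the pip package names.

```python
from importlib.util import find_spec

# Import names used later in this tutorial. Note that the import name can
# differ from the pip package name (llama-cpp-python is imported as llama_cpp).
REQUIRED_MODULES = ["chromadb", "langchain", "llama_cpp", "sentence_transformers"]


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are installed.")
```

If anything is reported as missing, rerun the matching pip install command from the list above.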

3. Load your Llama model

Next, we need to get our Llama model ready. For this we can use KitOps, which helps speed things up.

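  1. Pull the Llama 3 8B model: Use KitOps to download the Llama model. Replace <llama3-modelkit-reference> with the registry reference of the ModelKit you are pulling.
kit unpack <llama3-modelkit-reference> -d ./llama3
  2. Load the model: Once that’s done, we will load the model in Python.
from llama_cpp import Llama

llm = Llama(
    model_path="./llama3/llama3-8B-instruct-q4_0.gguf",  # assumed location of the unpacked .gguf file
    seed=1337,  # set a specific seed
    # n_gpu_layers=-1,  # Uncomment to use GPU acceleration
    # n_ctx=2048,  # Uncomment to increase the context window
)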

Here, we initialize the Llama model, optionally enabling GPU acceleration and adjusting the context window for larger inputs.

4. Chunking

The next step is to break down the data into more manageable chunks using LangChain.

Chunking is the process of dividing a large text into smaller, more manageable "chunks." This is important because it allows the model to process the text more efficiently. With smaller chunks, we can handle larger datasets without overwhelming the model. It also helps in maintaining the context and ensuring each part of the text is given attention during processing. Chunking also makes it easier to manage memory usage and processing time.

  1. Read the dataset:
from langchain.text_splitter import CharacterTextSplitter

try:
    with open("./dataset.txt", "r") as file:
        content = file.read()
except FileNotFoundError:
    print("File not found.")
    raise  # nothing to split without the dataset
  2. Split the text:
text_splitter = CharacterTextSplitter(
    chunk_size=1000,  # target chunk size in characters
)

split_texts = text_splitter.split_text(content)

This code reads the content of dataset.txt and splits it into manageable chunks of roughly 1000 characters each.
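
To make the idea of chunking concrete, here is a minimal pure-Python sketch of fixed-size chunking with an optional overlap. It is not how LangChain's CharacterTextSplitter works internally (that class splits on a separator first), and the helper name chunk_text and its parameters are hypothetical.

```python
def chunk_text(text, chunk_size=1000, overlap=0):
    """Split text into pieces of at most chunk_size characters.

    Consecutive chunks share `overlap` characters, so a sentence cut at a
    chunk boundary still appears intact in one of the two chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the chunk just added already reaches the end of the text
        start += step
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, overlap=1))
# → ['abcd', 'defg', 'ghij']
```

A small overlap is a common design choice: it costs some extra storage but reduces the chance that the answer to a query is split across two chunks.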

5. Creating and storing embeddings

We will now use ChromaDB as a vector database to save our embeddings.

Embeddings are a way to convert text data into numerical vectors that the model can process. These vectors capture the semantic meaning of the text, allowing the model to understand and work with the data more efficiently. By creating and storing embeddings, we ensure that the text data is in a format that models can easily retrieve and compare. A vector database like ChromaDB stores the embeddings, making it easier to perform searches.

  1. Pull the SentenceTransformer embedding model from the ModelKit registry using KitOps. Replace <minilm-modelkit-reference> with the registry reference of the embedding ModelKit.
kit unpack <minilm-modelkit-reference> -d minilm
  2. Initialize the embedding model:
from sentence_transformers import SentenceTransformer
import chromadb

embedding_model = SentenceTransformer("./minilm")
  3. Encode the text chunks:
embeddings = embedding_model.encode(split_texts)
  4. Store embeddings in ChromaDB:
chroma_client = chromadb.Client()
collection = chroma_client.create_collection(name="dataset-collection")
collection.add(
    documents=split_texts,
    embeddings=embeddings.tolist(),  # convert the NumPy array to plain lists
    ids=[f"id{sno}" for sno in range(1, len(split_texts) + 1)],
)

This process will encode the text chunks into embeddings and store them in ChromaDB.
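
To illustrate what "text to numerical vectors" means without downloading a model, here is a toy stand-in that hashes words into a small fixed-length vector. It captures word overlap only, not semantic meaning the way a SentenceTransformer model does. The function toy_embed and the dictionary-based store are hypothetical and exist purely for illustration.

```python
import hashlib
import math


def toy_embed(text, dim=8):
    """Map text to a unit-length vector by hashing each word into a bucket."""
    vector = [0.0] * dim
    for word in text.lower().split():
        # hashlib gives a stable hash across runs (the built-in hash() does not).
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % dim
        vector[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return vector  # empty text stays the zero vector
    return [x / norm for x in vector]


# A minimal in-memory "collection": each id maps to (chunk, vector).
chunks = ["workflow engines run tasks", "vector databases store embeddings"]
store = {f"id{i}": (chunk, toy_embed(chunk)) for i, chunk in enumerate(chunks, start=1)}
```

The store mirrors the shape of the data kept in the vector database: an id, the original text, and its vector.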

6. Semantic search

We can now perform semantic search using ChromaDB:

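  1. Query embedding:
query = "workflow engines"
query_embedding = embedding_model.encode([query])
  2. Retrieve results:
results = collection.query(
    query_embeddings=query_embedding.tolist(),
    # n_results=<k> can be passed to limit how many chunks are returned
)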

Here, we’ve encoded the search query and used ChromaDB to find the most relevant text chunks.
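
Conceptually, the search step ranks stored vectors by how close they are to the query vector. The sketch below does this by brute force with cosine similarity over plain Python lists. The choice of cosine similarity is an assumption made for illustration (ChromaDB's distance metric is configurable, so its ranking may differ), and the helper names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, vectors, documents, k=3):
    """Return the k documents whose vectors are most similar to the query."""
    scored = sorted(
        zip(vectors, documents),
        key=lambda pair: cosine_similarity(query_vector, pair[0]),
        reverse=True,
    )
    return [document for _, document in scored[:k]]


documents = ["chunk about workflows", "chunk about cooking", "chunk about pipelines"]
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(top_k([1.0, 0.1], vectors, documents, k=2))
# → ['chunk about workflows', 'chunk about pipelines']
```

A vector database replaces this linear scan with an index so that queries stay fast as the number of chunks grows.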

7. Response generation with LLM

The final step before we can run our RAG pipeline is to pass the search results to the Llama model along with the user query and prompt for generating an answer from the given context. This demonstrates how the model synthesizes information from the retrieved context (semantic search results) and produces a relevant answer.

  1. Prepare the prompt:
contexts = results["documents"][0]
query = "Give me 3 open source workflow engines"
user_prompt = '''You are given a Query and a context, your task is to analyze the context and answer the given Query based on the context.
                        'Statement': {}
                        'Context': {}'''.format(query, contexts)
  2. Generate the response:
def generate_response(user_prompt):

    output = llm.create_chat_completion(
        messages=[
            {
                "role": "user",
                "content": user_prompt,
            }
        ],
    )

    # Return only the text of the first completion
    return output['choices'][0]['message']['content']

output = generate_response(user_prompt)
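
Passing the list of retrieved chunks straight to format() embeds Python's list syntax (brackets and quotes) in the prompt. One possible alternative, sketched below, joins the chunks into plain text and trims them to a character budget. The helper build_prompt, the prompt wording, and the 4000-character default are all assumptions, not part of the original pipeline.

```python
def build_prompt(query, contexts, max_context_chars=4000):
    """Join retrieved chunks into one prompt, trimming to a character budget."""
    # Separate chunks with blank lines, then cut off anything over the budget
    # so the prompt is less likely to exceed the model's context window.
    context_text = "\n\n".join(contexts)[:max_context_chars]
    return (
        "You are given a Query and a Context. "
        "Answer the Query using only the Context.\n"
        f"Query: {query}\n"
        f"Context:\n{context_text}"
    )


print(build_prompt("Give me 3 open source workflow engines", ["chunk one", "chunk two"]))
```

The returned string can be passed to generate_response in place of user_prompt.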

8. Running the RAG Pipeline

Now, we can finally bring it all together and run our RAG pipeline:

  1. Define the RAG function:
def rag(query):
    query_embedding = embedding_model.encode([query])

    results = collection.query(
        query_embeddings=query_embedding.tolist(),
    )

    contexts = results["documents"][0]
    user_prompt = '''You are given a Query and a context, your task is to analyze the context and answer the given Query based on the context.
                        'Statement': {query}
                        'Context': {contexts}'''.format(query=query, contexts=contexts)

    # generate_response already returns the message text
    return generate_response(user_prompt)
  2. Run the pipeline:
print(rag("Your Query here"))

This function ties everything together, enabling you to input a query and get a contextually relevant response.
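
One way to check the wiring of such a function without loading a model or a database is to pass the components in as callables and substitute simple stand-ins. The sketch below is an assumption about how one might structure this, not the tutorial's own code; make_rag and the fake_* functions are hypothetical.

```python
def make_rag(embed, search, generate, build_prompt):
    """Return a rag(query) function wired together from four callables."""
    def rag(query):
        query_vector = embed(query)        # text -> vector
        contexts = search(query_vector)    # vector -> list of relevant chunks
        return generate(build_prompt(query, contexts))
    return rag


# Stand-in components for a dry run (no model or database needed).
def fake_embed(text):
    return [float(len(text))]


def fake_search(vector):
    return ["Chunk about workflow engines."]


def fake_prompt(query, contexts):
    return f"Query: {query}\nContext: {' '.join(contexts)}"


def fake_generate(prompt):
    return f"(answer based on {len(prompt)} prompt characters)"


rag_dry_run = make_rag(fake_embed, fake_search, fake_generate, fake_prompt)
print(rag_dry_run("Give me 3 open source workflow engines"))
```

With the real components, embed would wrap embedding_model.encode, search would wrap collection.query, and generate would be generate_response.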

If you’ve made it this far, congratulations! You now have a fully functioning RAG pipeline. This pipeline allows you to perform sophisticated semantic searches and generate responses using a powerful language model. But, we’re not done yet. Now that our pipeline is running, we want to package it for distribution.

To do this, we will use ModelKits which will help streamline collaboration on evaluation, testing, and operations.

ModelKits are an OCI-compliant packaging format that enables the seamless sharing of all necessary artifacts involved in the AI/ML model lifecycle. This includes datasets, code, configurations, and the models themselves. By standardizing the way these components are packaged, ModelKits facilitate a more streamlined and collaborative development process that is compatible with nearly any tool. You can learn more about ModelKits in the KitOps documentation.

1. Convert Notebook to Python Scripts

Before we get started, we need to add a requirements.txt file that lists all dependencies required by the pipeline. I haven’t provided the code here for brevity, but don't worry: you can retrieve it from our existing ModelKit by issuing the following command (replace <rag-modelkit-reference> with that ModelKit's registry reference):

kit unpack <rag-modelkit-reference> --code -d ./myrag

2. Organize your artifacts as shown below

Next, we need to organize our RAG artifacts correctly. Yours should reflect the following:

  | - rag_pipeline #Includes the Python, configuration, and requirements.txt files.
  | - dataset.txt
  | - Kitfile
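
A quick way to confirm the layout before packaging is to check that each expected entry exists. The sketch below uses only the standard library. The helper name missing_entries is hypothetical, and the manifest is assumed to be named Kitfile, as it is called in the next section.

```python
from pathlib import Path

# Entries from the layout above; the manifest is assumed to be named "Kitfile".
EXPECTED_ENTRIES = ("rag_pipeline", "dataset.txt", "Kitfile")


def missing_entries(project_root, expected=EXPECTED_ENTRIES):
    """Return the expected entries that do not exist under project_root."""
    root = Path(project_root)
    return [name for name in expected if not (root / name).exists()]


if __name__ == "__main__":
    missing = missing_entries(".")
    print("Missing:", missing if missing else "nothing")
```

Run it from the project root; an empty result means the folder matches the expected layout.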

3. Create a Kitfile that refers to your artifacts and your base model

Once we are organized, we need to create a Kitfile that lists all of our project artifacts. A Kitfile is a YAML-based manifest designed to streamline the encapsulation and sharing of project artifacts. From code and datasets to models and their metadata, the Kitfile serves as a comprehensive blueprint for your project, ensuring every component is organized and easily accessible. For more information, refer to the Kitfile reference.

Your Kitfile should look like the following:

manifestVersion: "1.0"
package:
  name: RAG with LLMA3
  version: 1.0.0
  authors: ["Jozu AI"]
model:
  name: llama3-8B-instruct-q4_0
  description: Llama 3 8B instruct model
  license: Apache 2.0
datasets:
  - path: ./dataset.txt
code:
  - path: ./rag_pipeline
  - path:

4. Use Kit to package and push the RAG pipeline to your registry

Next, we need to package and push our RAG pipeline to our registry of choice. We will do this from the root of the folder where we created the Kitfile. We suggest using Docker Hub or Artifactory.

  1. Pack the pipeline
kit pack . -t my-oci-registry/my-rag-repo/rag-tutorial:latest
  2. Push to the registry to share
kit push my-oci-registry/my-rag-repo/rag-tutorial:latest
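
If you prefer to script these two steps, the commands can be driven from Python. The sketch below only uses the kit pack and kit push invocations shown above; the function names are hypothetical, and running pack_and_push assumes the kit CLI is on your PATH and that you are already logged in to the registry.

```python
import subprocess


def kit_pack_and_push_commands(tag, context_dir="."):
    """Build the argument lists for `kit pack` and `kit push`."""
    return [
        ["kit", "pack", context_dir, "-t", tag],
        ["kit", "push", tag],
    ]


def pack_and_push(tag, context_dir="."):
    """Run both commands in order, stopping at the first failure."""
    for command in kit_pack_and_push_commands(tag, context_dir):
        # check=True raises CalledProcessError if the command exits non-zero.
        subprocess.run(command, check=True)
```

Calling pack_and_push("my-oci-registry/my-rag-repo/rag-tutorial:latest") from the project root would then run the same two commands as above.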

5. Retrieve the pipeline and profit

Once we’ve pushed our RAG pipeline, we can retrieve it and unpack the ModelKit:

kit unpack my-oci-registry/my-rag-repo/rag-tutorial:latest -d /path/to/unpacked

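6. To quickly try the RAG tutorial

Replace each <...-reference> placeholder below with the registry reference of the corresponding ModelKit.

  1. Unpack the RAG pipeline and base LLM
kit unpack <rag-modelkit-reference> -d ./myrag
  2. Unpack the embedding model
kit unpack <minilm-modelkit-reference> -d minilm
  3. Run the pipeline
python ./rag_pipeline/ ./dataset.txt ./minilm llama3-8B-instruct-q4_0.gguf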

Congrats! You have successfully created a fully functioning RAG pipeline and have packaged it for distribution. If you found this tutorial helpful and would like to learn more about KitOps or contribute to our project, please visit KitOps.ML or join our Discord group.


This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Written By
United States
The MLOps collaboration platform to unite AI/ML and app development teams with shared tools and processes.

The AI/ML space is evolving daily, requiring ongoing innovation from the tools that support its development. At Jozu, we believe that the best solutions come from gathering diverse perspectives to engage in open collaboration, an outcome that open source is uniquely designed to foster.
