e2e-rag

This is an end-to-end CLI RAG system implemented to showcase the critical considerations for a successful deployment.

Vector database

For the vector database, I used Chroma as it is well-maintained and the code is clean and easy to understand.
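
Conceptually, the vector database stores one embedding per chunk and, at query time, returns the nearest embeddings to the query. A minimal pure-Python sketch of that nearest-neighbour step (illustrative only; this is not Chroma's actual API):

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query, vectors, k=3):
    """Return the indices of the k stored vectors most similar to the query."""
    order = sorted(range(len(vectors)),
                   key=lambda i: cosine(query, vectors[i]),
                   reverse=True)
    return order[:k]

# Three toy document embeddings; the first two point roughly the same way.
docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(top_k([1.0, 0.0], docs, k=2))  # [0, 1]
```

Chroma does the same thing at scale, with persistence and approximate-nearest-neighbour indexing on top.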

Inference engine

I used Ollama as the inference engine for this RAG system; in production, any cloud provider's Model-as-a-Service (MaaS) endpoint could be used instead. One should always be aware of the costs incurred by applications that rely on LLMs.
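
Because Ollama exposes an OpenAI-compatible endpoint, switching to a hosted provider should only require editing the LLM settings in models.json (the model name and provider URL below are placeholders, not real values):

```json
{
  "llm": {
    "model": "some-hosted-model",
    "base_url": "https://api.example-provider.com/v1",
    "temperature": 0
  }
}
```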

Env

The .env file contains:

API_KEY="2511c6862e9241c6ae5997751d5bcd33"
MODELS_CONFIG_PATH="configs/models.json"
DB_CONFIG_PATH="configs/db.json"

where:

  • API_KEY: The API key used to access the embedding model and LLM (a random key is shown here as an example).
  • MODELS_CONFIG_PATH: Path to the model configuration file.
  • DB_CONFIG_PATH: Path to the database configuration file.
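
A minimal sketch of loading such a .env file with only the standard library (many projects use python-dotenv instead; load_env here is a hypothetical helper, not the project's actual code):

```python
import os

def load_env(path=".env"):
    """Parse KEY="value" lines from a .env file into os.environ."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            # Skip blanks, comments, and lines without an assignment.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ[key.strip()] = value.strip().strip('"')

# Demo: create a small .env like the one above, then load it.
with open("demo.env", "w") as f:
    f.write('MODELS_CONFIG_PATH="configs/models.json"\n')
    f.write('DB_CONFIG_PATH="configs/db.json"\n')
load_env("demo.env")
print(os.environ["DB_CONFIG_PATH"])  # configs/db.json
```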

Configurations

This system can be configured by adjusting the vector database and model configurations, db.json and models.json respectively, in the configs folder. The content of db.json is as follows:

{
  "folder_path": "db",
  "splitter": {
    "chunk_size": 1000,
    "chunk_overlap": 200
  },
  "retriever": {
    "k": 3
  }
}

where:

  • folder_path: Path to the vector database.
  • splitter: Splitter configuration.
    • chunk_size: Maximum size of each text chunk.
    • chunk_overlap: Overlap in characters between consecutive chunks.
  • retriever: Retriever's configuration.
    • k: Number of vectors to return when querying the vector database.
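
The splitter settings above correspond to fixed-size character chunking with overlap. A simplified sketch (split_text is a hypothetical helper, not the project's actual splitter):

```python
def split_text(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size character chunks that overlap by chunk_overlap."""
    step = chunk_size - chunk_overlap  # advance 800 characters per chunk here
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # last chunk already reached the end of the text
    return chunks

# 2500 characters of varied text -> 3 chunks with 200-character overlaps.
text = "".join(str(i % 10) for i in range(2500))
chunks = split_text(text, chunk_size=1000, chunk_overlap=200)
print(len(chunks))  # 3
```

The overlap keeps a sentence that straddles a chunk boundary fully visible in at least one chunk, which helps retrieval quality.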

Here are the contents of models.json:

{
  "embeddings": {
    "model": "embeddinggemma",
    "base_url": "http://localhost:11434/v1"
  },
  "llm": {
    "model": "qwen3:0.6b",
    "base_url": "http://localhost:11434/v1",
    "temperature": 0
  },
  "prompts": {
    "system": "You have access to a retrieval tool that provides you factual context to answer user queries. Use it when you need it."
  }
}

where:

  • embeddings: Embeddings configuration.
    • model: Embedding model name.
    • base_url: Base URL, here pointing to a locally deployed model served by Ollama.
  • llm: LLM configuration.
    • model: LLM model name (here I used Qwen3 0.6B for its small size and acceptable quality on limited hardware).
    • base_url: Base URL, here pointing to a locally deployed model served by Ollama.
    • temperature: Controls the randomness of the LLM: smaller values yield more consistent outputs, larger values more creative ones.
  • prompts: Prompts configuration.
    • system: The system prompt.
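
Putting the pieces together, here is a sketch of how these values might feed an OpenAI-compatible chat request (build_payload is a hypothetical helper; only the request body is assembled, nothing is sent to the base_url):

```python
import json

# A trimmed models.json, inlined for the example.
config = json.loads("""{
  "llm": {"model": "qwen3:0.6b", "base_url": "http://localhost:11434/v1", "temperature": 0},
  "prompts": {"system": "You have access to a retrieval tool that provides you factual context to answer user queries. Use it when you need it."}
}""")

def build_payload(question, retrieved_chunks, cfg):
    """Assemble an OpenAI-style chat payload from models.json values."""
    context = "\n\n".join(retrieved_chunks)
    return {
        "model": cfg["llm"]["model"],
        "temperature": cfg["llm"]["temperature"],
        "messages": [
            {"role": "system", "content": cfg["prompts"]["system"]},
            {"role": "user",
             "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    }

payload = build_payload("What is RAG?", ["chunk one", "chunk two"], config)
print(payload["model"])  # qwen3:0.6b
```

In the real system this payload would be POSTed to base_url's chat-completions route, with API_KEY from the .env file as the bearer token.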
