e2e-rag

This is an end-to-end CLI RAG system implemented to showcase the critical considerations for a successful deployment.

Vector database

For the vector database, I used Chroma as it is well-maintained and the code is clean and easy to understand.
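
Conceptually, the vector database stores one embedding per chunk and, at query time, returns the nearest embeddings to the query. A minimal pure-Python sketch of that nearest-neighbour step (illustrative only; this is not Chroma's actual API):

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query, vectors, k=3):
    """Return the indices of the k stored vectors most similar to the query."""
    order = sorted(range(len(vectors)),
                   key=lambda i: cosine(query, vectors[i]),
                   reverse=True)
    return order[:k]

# Three toy document embeddings; the first two point roughly the same way.
docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(top_k([1.0, 0.0], docs, k=2))  # [0, 1]
```

Chroma does the same thing at scale, with persistence and approximate-nearest-neighbour indexing on top.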

Inference engine

I used Ollama as the inference engine for this RAG system; in production, any cloud provider's Model-as-a-Service (MaaS) endpoint could be used instead. One should always be aware of the costs incurred by applications that rely on LLMs.
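
Because Ollama exposes an OpenAI-compatible endpoint, switching to a hosted provider should only require editing the LLM settings in models.json (the model name and provider URL below are placeholders, not real values):

```json
{
  "llm": {
    "model": "some-hosted-model",
    "base_url": "https://api.example-provider.com/v1",
    "temperature": 0
  }
}
```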

Env

The .env file contains:

API_KEY="2511c6862e9241c6ae5997751d5bcd33"
MODELS_CONFIG_PATH="configs/models.json"
DB_CONFIG_PATH="configs/db.json"

where:

  • API_KEY: The API key used to access the embedding model and LLM (a random key is shown here as an example).
  • MODELS_CONFIG_PATH: Path to the model configuration file.
  • DB_CONFIG_PATH: Path to the database configuration file.
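
A minimal sketch of loading such a .env file with only the standard library (many projects use python-dotenv instead; load_env here is a hypothetical helper, not the project's actual code):

```python
import os

def load_env(path=".env"):
    """Parse KEY="value" lines from a .env file into os.environ."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            # Skip blanks, comments, and lines without an assignment.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ[key.strip()] = value.strip().strip('"')

# Demo: create a small .env like the one above, then load it.
with open("demo.env", "w") as f:
    f.write('MODELS_CONFIG_PATH="configs/models.json"\n')
    f.write('DB_CONFIG_PATH="configs/db.json"\n')
load_env("demo.env")
print(os.environ["DB_CONFIG_PATH"])  # configs/db.json
```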

Configurations

This system can be configured by adjusting the vector database and model configurations, db.json and models.json respectively, in the configs folder. The content of db.json is as follows:

{
  "folder_path": "db",
  "splitter": {
    "chunk_size": 1000,
    "chunk_overlap": 200
  },
  "retriever": {
    "k": 3
  }
}

where:

  • folder_path: Path to the vector database.
  • splitter: Splitter configuration.
    • chunk_size: Maximum size of each text chunk.
    • chunk_overlap: Overlap in characters between consecutive chunks.
  • retriever: Retriever's configuration.
    • k: Number of vectors to return when querying the vector database.
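
The splitter settings above correspond to fixed-size character chunking with overlap. A simplified sketch (split_text is a hypothetical helper, not the project's actual splitter):

```python
def split_text(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size character chunks that overlap by chunk_overlap."""
    step = chunk_size - chunk_overlap  # advance 800 characters per chunk here
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # last chunk already reached the end of the text
    return chunks

# 2500 characters of varied text -> 3 chunks with 200-character overlaps.
text = "".join(str(i % 10) for i in range(2500))
chunks = split_text(text, chunk_size=1000, chunk_overlap=200)
print(len(chunks))  # 3
```

The overlap keeps a sentence that straddles a chunk boundary fully visible in at least one chunk, which helps retrieval quality.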

Here are the contents of models.json:

{
  "embeddings": {
    "model": "embeddinggemma",
    "base_url": "http://localhost:11434/v1"
  },
  "llm": {
    "model": "qwen3:0.6b",
    "base_url": "http://localhost:11434/v1",
    "temperature": 0
  },
  "prompts": {
    "system": "You have access to a retrieval tool that provides you factual context to answer user queries. Use it when you need it."
  }
}

where:

  • embeddings: Embeddings configuration.
    • model: Embedding model name.
    • base_url: Base URL, here pointing to a locally deployed model served by Ollama.
  • llm: LLM configuration.
    • model: LLM model name (here I used Qwen3 0.6B for its small size and acceptable quality on limited hardware).
    • base_url: Base URL, here pointing to a locally deployed model served by Ollama.
    • temperature: Controls the randomness of the LLM: smaller values yield more consistent outputs, larger values more creative ones.
  • prompts: Prompts configuration.
    • system: The system prompt.
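
Putting the pieces together, here is a sketch of how these values might feed an OpenAI-compatible chat request (build_payload is a hypothetical helper; only the request body is assembled, nothing is sent to the base_url):

```python
import json

# A trimmed models.json, inlined for the example.
config = json.loads("""{
  "llm": {"model": "qwen3:0.6b", "base_url": "http://localhost:11434/v1", "temperature": 0},
  "prompts": {"system": "You have access to a retrieval tool that provides you factual context to answer user queries. Use it when you need it."}
}""")

def build_payload(question, retrieved_chunks, cfg):
    """Assemble an OpenAI-style chat payload from models.json values."""
    context = "\n\n".join(retrieved_chunks)
    return {
        "model": cfg["llm"]["model"],
        "temperature": cfg["llm"]["temperature"],
        "messages": [
            {"role": "system", "content": cfg["prompts"]["system"]},
            {"role": "user",
             "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    }

payload = build_payload("What is RAG?", ["chunk one", "chunk two"], config)
print(payload["model"])  # qwen3:0.6b
```

In the real system this payload would be POSTed to base_url's chat-completions route, with API_KEY from the .env file as the bearer token.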
