Local RAG Chat

A fully local Retrieval-Augmented Generation application that allows users to upload documents and chat with them using a local language model.

This project was designed as a production-style AI application that runs entirely on local infrastructure, without relying on external cloud APIs. It combines document ingestion, semantic search, vector storage, and local LLM inference into a clean full-stack experience.

Overview

Local RAG Chat enables users to:

upload PDF, TXT, and Markdown documents
automatically process and index those documents
ask natural language questions about indexed content
retrieve relevant document chunks through semantic search
generate answers using a local LLM running with Ollama
inspect source references used to produce each answer
interact through a modern chat interface inspired by ChatGPT

The system is designed for privacy, local-first workflows, and portfolio-level presentation of a complete AI product.

Key Features

Local-first architecture

All major components run locally:

local LLM via Ollama
local embedding generation
local vector database persistence with ChromaDB
local backend with FastAPI
local frontend with Next.js

No external AI APIs are required.

End-to-end RAG pipeline

The application implements a complete Retrieval-Augmented Generation workflow:

document upload
text extraction
chunking
embeddings generation
vector storage
semantic retrieval
prompt construction
local LLM answer generation
answer rendering with sources

Document support

Supported file types:

PDF
TXT
Markdown

Chat experience

The frontend includes:

ChatGPT-style layout
streaming assistant responses
markdown rendering
syntax highlighting for code blocks
copy response action
regenerate response action
typing indicator
source expansion panel
multi-document selection

Document management

Users can:

upload files
view indexed documents
select one or more documents for retrieval scope
delete indexed documents

Tech Stack

Frontend

Next.js 14
React
TypeScript
Tailwind CSS
Zustand
react-dropzone
react-markdown
react-syntax-highlighter
Lucide React

Backend

Python
FastAPI
LangChain
Ollama
ChromaDB
PyPDF

Infrastructure

Docker
Docker Compose
local environment configuration with .env

Architecture

High-level flow

Document Upload
↓
Text Extraction
↓
Chunking
↓
Embedding Generation
↓
Vector Storage (ChromaDB)
↓
User Question
↓
Query Embedding
↓
Similarity Search
↓
Top-K Retrieval
↓
Prompt Construction
↓
Local LLM Generation (Ollama)
↓
Answer + Sources

Project structure

ragChat/
├── backend/
│   ├── api/
│   ├── core/
│   ├── embeddings/
│   ├── ingestion/
│   ├── prompts/
│   ├── retrieval/
│   ├── services/
│   ├── storage/
│   ├── .env
│   ├── main.py
│   └── requirements.txt
├── frontend/
│   ├── app/
│   ├── components/
│   ├── hooks/
│   ├── services/
│   ├── styles/
│   ├── types/
│   ├── .env
│   ├── package.json
│   └── tailwind.config.ts
├── docker/
│   ├── backend.Dockerfile
│   ├── frontend.Dockerfile
│   └── docker-compose.yml
├── tests/
├── vector_db/
└── README.md

How It Works

Document ingestion

When a document is uploaded, the backend:

saves the file locally
extracts raw text
splits content into chunks
enriches metadata
generates embeddings
stores chunks in ChromaDB

Current chunking settings:

chunk_size = 800
chunk_overlap = 200

Retrieval

When a user sends a question, the backend:

embeds the question
performs semantic similarity search in ChromaDB
retrieves the top relevant chunks
formats the retrieved context
builds the final prompt
sends the prompt to the local LLM
returns the generated answer with source references

Prompting

The application uses a constrained RAG prompt so the model answers only from the indexed context.

Example behavior:

if the answer exists in the document set, the assistant answers based on those sources
if the answer is not present, the assistant should respond that it could not find the answer in the provided documents

API Endpoints

Document endpoints

`POST /documents/upload`

Uploads and indexes a document.

`GET /documents`

Lists indexed documents.

`DELETE /documents/{id}`

Removes a document and its indexed vectors.

Chat endpoints

`POST /chat`

Returns a full answer after retrieval and generation.

`POST /chat/stream`

Streams answer tokens progressively using SSE-style events.

Health endpoint

`GET /health`

Basic backend health check.

Local Development Requirements

Before running the project, make sure the following tools are installed.

Required software

Backend

Python 3.11 or newer
pip
virtual environment support

Frontend

Node.js 18 or newer
npm

Local AI runtime

Ollama installed and available in your system path

Optional

Docker
Docker Compose

Required Ollama Models

This project expects two local models:

a chat model
an embedding model

Recommended examples:

chat model: llama3.1
embedding model: mxbai-embed-large

Install them with:

ollama pull llama3.1
ollama pull mxbai-embed-large

Start Ollama:

ollama serve

Running the Project Locally

1. Clone the repository

git clone https://github.com/imsupeer/ragChat.git
cd ragChat

2. Configure the backend

Go to the backend folder:

cd backend

Setup the environment file:

APP_NAME=Local RAG Chat
APP_ENV=development
API_HOST=0.0.0.0
API_PORT=8000
CORS_ORIGINS=http://localhost:3000

OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_CHAT_MODEL=llama3.1
OLLAMA_EMBED_MODEL=mxbai-embed-large

CHROMA_PERSIST_DIRECTORY=./vector_db
DOCUMENTS_DIRECTORY=./storage/docs
REGISTRY_PATH=./storage/registry.json

CHUNK_SIZE=800
CHUNK_OVERLAP=200
TOP_K=5
MAX_CONTEXT_CHUNKS=5

Create and activate a virtual environment.

On Linux/macOS:

python -m venv venv
source venv/bin/activate

On Windows PowerShell:

python -m venv venv
venv\Scripts\Activate.ps1

Install dependencies:

pip install -r requirements.txt

Run the backend:

uvicorn main:app --reload

The backend will be available at:

http://localhost:8000

Swagger documentation:

http://localhost:8000/docs

3. Configure the frontend

Open a new terminal and go to the frontend folder:

cd frontend

Setup the frontend environment file:

NEXT_PUBLIC_API_URL=http://localhost:8000

Install dependencies:

npm install

Start the frontend:

npm run dev

The frontend will be available at:

http://localhost:3000

Running with Docker

If you prefer containerized execution, use the Docker setup included in the project.

From the docker/ directory:

cd docker
docker compose up --build

This will start:

the frontend container
the backend container

Note that Ollama must still be available and reachable from the backend container, depending on your host configuration.

Recommended Startup Order

For the smoothest local setup, use this order:

start Ollama
make sure the required models are installed
start the backend and frontend using docker
upload a document
begin chatting

Example Usage

Upload a document

Use the sidebar to upload a file such as:

mobydick.txt
notes.md
paper.pdf

Ask questions

Examples:

Who is Captain Ahab?
Summarize the main idea of this document.
What does the document say about system architecture?
List the most important technical decisions mentioned.
Which section discusses deployment?

Review source references

Each assistant message can display its sources in a collapsible references panel, allowing users to inspect which parts of the indexed documents were used.

Development Notes

Chroma metadata

Metadata stored with each chunk must use primitive values only. Avoid None, nested objects, or unsupported types when writing metadata into ChromaDB.

Multilingual behavior

If documents are in English, questions asked in English may produce more accurate retrieval results than questions in Portuguese, depending on the embedding model and local LLM used.

Persistence

The project stores:

uploaded files in backend/storage/docs/
registry data in backend/storage/registry.json
vector database files in vector_db/

Testing

Run tests from the project root:

pytest

Portfolio Value

This project demonstrates practical experience in:

full-stack AI product development
Retrieval-Augmented Generation architecture
local LLM integration
vector search systems
LangChain orchestration
document ingestion pipelines
modern React and Next.js frontend engineering
streaming UX patterns
production-style code organization
Docker-based local infrastructure

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
backend		backend
docker		docker
frontend		frontend
tests		tests
.gitignore		.gitignore
README.md		README.md

Folders and files

Latest commit

History

Repository files navigation

Local RAG Chat

Overview

Key Features

Local-first architecture

End-to-end RAG pipeline

Document support

Chat experience

Document management

Tech Stack

Frontend

Backend

Infrastructure

Architecture

High-level flow

Project structure

How It Works

Document ingestion

Retrieval

Prompting

API Endpoints

Document endpoints

POST /documents/upload

GET /documents

DELETE /documents/{id}

Chat endpoints

POST /chat

POST /chat/stream

Health endpoint

GET /health

Local Development Requirements

Required software

Backend

Frontend

Local AI runtime

Optional

Required Ollama Models

Running the Project Locally

1. Clone the repository

2. Configure the backend

3. Configure the frontend

Running with Docker

Recommended Startup Order

Example Usage

Upload a document

Ask questions

Review source references

Development Notes

Chroma metadata

Multilingual behavior

Persistence

Testing

Portfolio Value

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

`POST /documents/upload`

`GET /documents`

`DELETE /documents/{id}`

`POST /chat`

`POST /chat/stream`

`GET /health`

Packages