cv-processor

Backend REST API for a bilingual CV semantic matching system. Receives a pre-extracted CV text string from the frontend (OCR is handled on the FE side), matches it against a job posting using multilingual sentence embeddings, and returns the match result alongside a TF-IDF keyword baseline score for direct comparison.

Skripsi (undergraduate thesis): Perancangan Sistem Filtering Kandidat Berdasarkan Kualifikasi Menggunakan Analisis Semantik Berbasis Machine Learning pada Data Curriculum Vitae ("Design of a Qualification-Based Candidate Filtering System Using Machine-Learning-Based Semantic Analysis of Curriculum Vitae Data")


Tech Stack

| Layer | Library / Tool | Purpose |
| --- | --- | --- |
| Web Framework | Flask | REST API server |
| Auth | Static UUID token in .env | Simple Bearer token auth |
| NLP Preprocessing | langdetect, re, unicodedata | Text cleaning, language detection |
| Skill Extraction | spaCy + custom bilingual skill list | Rule-based NER for skills |
| Embedding Model | sentence-transformers (paraphrase-multilingual-mpnet-base-v2, fine-tuned) | Cross-lingual sentence embeddings |
| Similarity | scikit-learn (cosine_similarity) | Compute semantic match scores |
| Baseline | scikit-learn (TfidfVectorizer) | TF-IDF keyword matching, always returned alongside the semantic score |
| Data Store | SQLite (via SQLAlchemy) | Job descriptions, read-only from the API |
| Env/Config | python-dotenv | Token & config management |
| CORS | flask-cors | Allow cross-origin requests from the FE |
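
For orientation, here is a minimal sketch of the two scoring paths from the table, using the library calls named above (the actual wiring lives in backend/services/ and may differ):

from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

cv_text = "Backend engineer with three years of Flask and SQL experience"
jd_text = "Hiring a Python backend developer familiar with REST APIs"

# Semantic path: cross-lingual sentence embeddings + cosine similarity
model = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")
emb = model.encode([cv_text, jd_text])
semantic = float(cosine_similarity([emb[0]], [emb[1]])[0][0]) * 100

# Baseline path: TF-IDF bag-of-words + cosine similarity
tfidf = TfidfVectorizer().fit_transform([cv_text, jd_text])
keyword = float(cosine_similarity(tfidf[0], tfidf[1])[0][0]) * 100

print(f"semantic={semantic:.1f}  keyword={keyword:.1f}")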

Project Structure

backend/
├── app.py                      # Flask entry point
├── config.py                   # Env vars, token, model name
├── requirements.txt
├── train.py                    # Fine-tuning script (CosineSimilarityLoss)
├── evaluate.py                 # Batch evaluation — Precision/Recall/F1
├── compare.py                  # Side-by-side base vs fine-tuned comparison
├── integration_test.sh         # End-to-end API smoke test
├── seed/
│   ├── schema.sql              # Idempotent CREATE TABLE statement
│   └── jobs_seed.sql           # INSERT OR IGNORE seed data
│
├── middleware/
│   └── auth.py                 # Static Bearer token validation
│
├── routes/
│   ├── match.py                # POST /api/match
│   ├── jobs.py                 # GET /api/jobs, GET /api/jobs/<id>
│   └── health.py               # GET /api/health
│
├── services/
│   ├── preprocessing.py        # Text cleaning & normalization
│   ├── extraction.py           # Rule-based skill & section extraction
│   ├── embedding_service.py    # Sentence embeddings (loaded once at startup)
│   └── baseline_service.py     # TF-IDF keyword matching (paper baseline)
│
├── models/
│   └── db_models.py            # SQLAlchemy Job model
│
└── utils/
    └── response_utils.py       # Standardised JSON response helpers

data/
└── prepare_real_data.py        # Convert annotated CSV → train/val pair CSVs

models/
└── finetuned-v2/               # Fine-tuned model checkpoint (not in git)

Setup

1. Create a virtual environment

# Create and activate virtual environment
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Upgrade pip
pip install --upgrade pip

2. Install dependencies

pip install -r backend/requirements.txt
python -m spacy download xx_ent_wiki_sm

Note: If you encounter issues with CUDA dependencies during installation, you can install CPU-only PyTorch:

pip install torch --index-url https://download.pytorch.org/whl/cpu

requirements.txt includes accelerate and datasets, which are required only by the fine-tuning script (train.py), not by the API server.

3. Configure environment

cp backend/.env.example backend/.env
# Edit backend/.env — set APP_TOKEN to a real UUID
# Generate one with: uuidgen | tr -d '\n' (Linux/macOS)

4. Initialise the database

cd backend
sqlite3 app.db < seed/schema.sql
sqlite3 app.db < seed/jobs_seed.sql

5. Run

# Activate virtual environment (if not already activated)
source venv/bin/activate

# Development (run from the backend/ directory — app uses relative imports)
cd backend
python app.py

# Production
gunicorn --bind 0.0.0.0:5000 --workers 2 --timeout 120 app:app

Docker

cp backend/.env.example backend/.env
# Edit backend/.env — set APP_TOKEN to a real UUID
docker compose up --build

The entrypoint script automatically applies schema and seed data on first run. The SQLite database is stored on a named Docker volume and persists across container restarts.
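
A rough Python equivalent of what the entrypoint does on first run (the real entrypoint is a shell script baked into the image; paths here are illustrative):

import sqlite3

# Apply schema and seed data; both files are written to be idempotent
# (CREATE TABLE IF NOT EXISTS / INSERT OR IGNORE), so re-running is safe.
conn = sqlite3.connect("app.db")
for script in ("seed/schema.sql", "seed/jobs_seed.sql"):
    with open(script, encoding="utf-8") as f:
        conn.executescript(f.read())
conn.commit()
conn.close()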

Service available at: http://localhost:5000/api/health

Docker deployment quickstart (VM)

# 1) Prepare runtime env file
cp backend/.env.example backend/.env

# 2) Set APP_TOKEN in backend/.env
uuidgen

# 3) Build and run
docker compose up --build -d

# 4) Verify health
curl -f http://localhost:5000/api/health

Optional verification:

# Confirm seeded jobs exist (requires APP_TOKEN from backend/.env)
TOKEN="<your-app-token>"
curl -s -H "Authorization: Bearer $TOKEN" \
    "http://localhost:5000/api/jobs?page=1&limit=10"

# Stop services but keep DB data
docker compose down

# Bring back up (data persists in named volume)
docker compose up -d

# Remove services + DB volume (data reset)
docker compose down -v

Fine-Tuning

The embedding model is fine-tuned on a human-annotated dataset of 300 CV-JD pairs using CosineSimilarityLoss with continuous labels derived from 3-annotator aggregate scores.
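
In outline, the training loop follows the standard sentence-transformers recipe below (a sketch only; the CSV column names are assumptions, and backend/train.py is the authoritative script):

import csv
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")

# One InputExample per annotated CV-JD pair; label is a continuous 0-1 score
examples = []
with open("data/train_pairs.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        examples.append(InputExample(texts=[row["cv_text"], row["jd_text"]],
                                     label=float(row["label"])))

loader = DataLoader(examples, shuffle=True, batch_size=16)
loss = losses.CosineSimilarityLoss(model)
model.fit(train_objectives=[(loader, loss)], epochs=4,
          output_path="models/finetuned")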

1. Prepare training data

Obtain the annotated CSV (shared outside of git) and place it in tmp/. Then run from the project root:

python data/prepare_real_data.py

This produces data/train_pairs.csv (~240 rows) and data/val_pairs.csv (~60 rows) with a stratified 80/20 split.
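
A sketch of how such a split can be produced with scikit-learn (the input filename, label column, and binning scheme here are assumptions; see data/prepare_real_data.py for the actual logic):

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("tmp/annotated_pairs.csv")  # hypothetical filename
# Stratify on a binned version of the continuous label so both splits
# cover the full score range (the bin edges are an assumption).
bins = pd.cut(df["label"], bins=[0, 0.33, 0.66, 1.0], include_lowest=True)
train, val = train_test_split(df, test_size=0.2, stratify=bins, random_state=42)
train.to_csv("data/train_pairs.csv", index=False)
val.to_csv("data/val_pairs.csv", index=False)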

2. Fine-tune

cd backend
python train.py \
  --train ../data/train_pairs.csv \
  --val   ../data/val_pairs.csv \
  --output ../models/finetuned \
  --epochs 4 \
  --batch 16

3. Compare base vs fine-tuned

cd backend
python compare.py --dataset ../data/val_pairs.csv

Outputs a Precision / Recall / F1 table for the TF-IDF baseline, the base semantic model, and the fine-tuned model.
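
For reference, each row of that table can be computed by thresholding scores into binary predictions, e.g. (the labels, scores, and 60-point cutoff are illustrative):

from sklearn.metrics import precision_recall_fscore_support

y_true = [1, 0, 1, 1, 0]                 # annotator-derived match labels
scores = [72.0, 41.5, 88.2, 55.0, 63.3]  # model scores on a 0-100 scale
y_pred = [int(s >= 60) for s in scores]  # illustrative cutoff
p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="binary")
print(f"P={p:.2f}  R={r:.2f}  F1={f1:.2f}")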

4. Point the service at the fine-tuned model

Set in backend/.env:

MODEL_NAME=models/finetuned

API Endpoints

All endpoints except /api/health require: Authorization: Bearer <token>

| Method | Path | Description |
| --- | --- | --- |
| GET | /api/health | Liveness check, no auth required |
| GET | /api/jobs | List job postings (supports industry, page, limit query params) |
| GET | /api/jobs/<job_id> | Get full job detail |
| POST | /api/match | Single-candidate CV matching; returns semantic and keyword scores with breakdown |

POST /api/match

Accepts a single candidate per request. Required fields: job_id, candidate_id, candidate_name, cv_text (pre-extracted plain text). Returns semantic and TF-IDF keyword scores side by side with a per-section breakdown.
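
A minimal client call, assuming the server runs locally and the token comes from backend/.env (all field values are placeholders):

import requests

payload = {
    "job_id": 1,  # must match a seeded job
    "candidate_id": "cand-001",
    "candidate_name": "Jane Doe",
    "cv_text": "Plain-text CV extracted by the frontend OCR step...",
}
resp = requests.post(
    "http://localhost:5000/api/match",
    headers={"Authorization": "Bearer <your-app-token>"},
    json=payload,
    timeout=120,
)
resp.raise_for_status()
# Response includes semantic_score, keyword_score, and a per-section breakdown
print(resp.json())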

Grade thresholds (based on semantic_score):

| Score | Grade |
| --- | --- |
| ≥ 80 | High Match |
| 60–79 | Medium Match |
| < 60 | Low Match |
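
Expressed as code, the mapping is simply:

def grade(semantic_score: float) -> str:
    # Thresholds from the table above
    if semantic_score >= 80:
        return "High Match"
    if semantic_score >= 60:
        return "Medium Match"
    return "Low Match"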

Scoring

semantic_score = (skills_score × 0.4) + (experience_score × 0.4) + (education_score × 0.2)
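
A worked example of the weighting (the three section scores are made up, all on a 0–100 scale):

skills_score, experience_score, education_score = 85.0, 70.0, 60.0
semantic_score = skills_score * 0.4 + experience_score * 0.4 + education_score * 0.2
print(semantic_score)  # 74.0 -> "Medium Match" under the grade thresholds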

The score_comparison block in every result contains:

  • semantic_score: weighted cosine similarity from multilingual sentence embeddings (0–100)
  • keyword_score: TF-IDF bag-of-words cosine similarity score (0–100)
  • improvement: semantic_score − keyword_score (quantifies the gain from semantic over keyword matching)
