Backend REST API for a bilingual CV semantic matching system. Receives a pre-extracted CV text string from the frontend (OCR is handled on the FE side), matches it against a job posting using multilingual sentence embeddings, and returns the match result alongside a TF-IDF keyword baseline score for direct comparison.
Undergraduate thesis (skripsi): Design of a Candidate Filtering System Based on Qualifications Using Machine-Learning-Based Semantic Analysis of Curriculum Vitae Data
| Layer | Library / Tool | Purpose |
|---|---|---|
| Web Framework | Flask | REST API server |
| Auth | Static UUID token in .env | Simple Bearer token auth |
| NLP Preprocessing | langdetect, re, unicodedata | Text cleaning, language detection |
| Skill Extraction | spaCy + custom bilingual skill list | Rule-based NER for skills |
| Embedding Model | sentence-transformers (paraphrase-multilingual-mpnet-base-v2, fine-tuned) | Cross-lingual sentence embeddings |
| Similarity | scikit-learn (cosine_similarity) | Compute semantic match scores |
| Baseline | scikit-learn (TfidfVectorizer) | TF-IDF keyword matching — always returned alongside semantic score |
| Data Store | SQLite (via SQLAlchemy) | Job descriptions — read-only from API |
| Env/Config | python-dotenv | Token & config management |
| CORS | flask-cors | Allow cross-origin requests from the FE |
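The Similarity and Baseline rows are both cosine similarities, just over different representations. A minimal sketch of the contrast, assuming the services call sentence-transformers and scikit-learn directly (function names here are illustrative, not the actual service API):

```python
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")

def semantic_score(cv_text: str, jd_text: str) -> float:
    # Cross-lingual embeddings: an English CV and an Indonesian job posting
    # land in the same vector space, so cosine similarity stays meaningful
    emb = model.encode([cv_text, jd_text])
    return float(cosine_similarity([emb[0]], [emb[1]])[0][0]) * 100

def keyword_score(cv_text: str, jd_text: str) -> float:
    # Bag-of-words baseline: only exact term overlap counts,
    # so paraphrases and mixed-language pairs are penalized
    tfidf = TfidfVectorizer().fit_transform([cv_text, jd_text])
    return float(cosine_similarity(tfidf[0], tfidf[1])[0][0]) * 100
```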
```
backend/
├── app.py # Flask entry point
├── config.py # Env vars, token, model name
├── requirements.txt
├── train.py # Fine-tuning script (CosineSimilarityLoss)
├── evaluate.py # Batch evaluation — Precision/Recall/F1
├── compare.py # Side-by-side base vs fine-tuned comparison
├── integration_test.sh # End-to-end API smoke test
├── seed/
│ ├── schema.sql # Idempotent CREATE TABLE statement
│ └── jobs_seed.sql # INSERT OR IGNORE seed data
│
├── middleware/
│ └── auth.py # Static Bearer token validation
│
├── routes/
│ ├── match.py # POST /api/match
│ ├── jobs.py # GET /api/jobs, GET /api/jobs/<id>
│ └── health.py # GET /api/health
│
├── services/
│ ├── preprocessing.py # Text cleaning & normalization
│ ├── extraction.py # Rule-based skill & section extraction
│ ├── embedding_service.py # Sentence embeddings (loaded once at startup)
│ └── baseline_service.py # TF-IDF keyword matching (paper baseline)
│
├── models/
│ └── db_models.py # SQLAlchemy Job model
│
└── utils/
└── response_utils.py # Standardised JSON response helpers
data/
└── prepare_real_data.py # Convert annotated CSV → train/val pair CSVs
models/
└── finetuned-v2/           # Fine-tuned model checkpoint (not in git)
```
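As the comment on embedding_service.py notes, the model is loaded once at startup rather than per request. A sketch of that pattern (illustrative, not the file's exact code; it assumes config.py exposes MODEL_NAME):

```python
# services/embedding_service.py — load-once pattern (illustrative sketch)
from sentence_transformers import SentenceTransformer
from config import MODEL_NAME  # assumption: config.py exposes the model name

_model = None

def get_model() -> SentenceTransformer:
    """Load the large embedding model once and reuse it for every request."""
    global _model
    if _model is None:
        _model = SentenceTransformer(MODEL_NAME)
    return _model
```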
```bash
# Create and activate virtual environment
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Upgrade pip
pip install --upgrade pip

# Install dependencies
pip install -r backend/requirements.txt
python -m spacy download xx_ent_wiki_sm
```

Note: If you encounter issues with CUDA dependencies during installation, you can install CPU-only PyTorch:

```bash
pip install torch --index-url https://download.pytorch.org/whl/cpu
```

requirements.txt includes accelerate and datasets, which are required by the fine-tuning script (train.py). They are not needed to run the API server.
```bash
cp backend/.env.example backend/.env
# Edit backend/.env — set APP_TOKEN to a real UUID
# Example: uuidgen | tr -d '\n' (Linux/macOS) to generate a UUID
```

Seed the SQLite database:

```bash
cd backend
sqlite3 app.db < seed/schema.sql
sqlite3 app.db < seed/jobs_seed.sql
```

```bash
# Activate virtual environment (if not already activated)
source venv/bin/activate
# Development (run from the backend/ directory — app uses relative imports)
cd backend
python app.py
# Production
gunicorn --bind 0.0.0.0:5000 --workers 2 --timeout 120 app:app
```

```bash
cp backend/.env.example backend/.env
# Edit backend/.env — set APP_TOKEN to a real UUID
docker compose up --build
```

The entrypoint script automatically applies the schema and seed data on first run. The SQLite database is stored on a named Docker volume and persists across container restarts.
Service available at: http://localhost:5000/api/health
```bash
# 1) Prepare runtime env file
cp backend/.env.example backend/.env
# 2) Set APP_TOKEN in backend/.env
uuidgen
# 3) Build and run
docker compose up --build -d
# 4) Verify health
curl -f http://localhost:5000/api/health
```

Optional verification:
```bash
# Confirm seeded jobs exist (requires APP_TOKEN from backend/.env)
TOKEN="<your-app-token>"
curl -s -H "Authorization: Bearer $TOKEN" \
"http://localhost:5000/api/jobs?page=1&limit=10"
```bash
# Stop services but keep DB data
docker compose down
# Bring back up (data persists in named volume)
docker compose up -d
# Remove services + DB volume (data reset)
docker compose down -v
```

The embedding model is fine-tuned on a human-annotated dataset of 300 CV-JD pairs using CosineSimilarityLoss, with continuous labels derived from three-annotator aggregate scores.
Obtain the annotated CSV (shared outside of git) and place it in tmp/. Then run from the project root:
```bash
python data/prepare_real_data.py
```

This produces data/train_pairs.csv (~240 rows) and data/val_pairs.csv (~60 rows) with a stratified 80/20 split.
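Continuous labels cannot be stratified directly, so a split script of this kind typically bins the scores first. A sketch of the idea (file and column names are illustrative and not necessarily what prepare_real_data.py uses):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("tmp/annotated.csv")  # illustrative filename

# Bin the continuous labels (assumed normalized to 0-1) into three bands
# so that stratification is possible on a 300-row dataset
bins = pd.cut(df["score"], bins=[0.0, 0.33, 0.66, 1.0], include_lowest=True)

train, val = train_test_split(df, test_size=0.2, stratify=bins, random_state=42)
train.to_csv("data/train_pairs.csv", index=False)
val.to_csv("data/val_pairs.csv", index=False)
```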
```bash
cd backend
python train.py \
--train ../data/train_pairs.csv \
--val ../data/val_pairs.csv \
--output ../models/finetuned \
--epochs 4 \
    --batch 16
```
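In essence, train.py builds (CV text, JD text) pairs with continuous similarity labels and optimizes CosineSimilarityLoss. A condensed sketch using the classic sentence-transformers fit API (CSV column names are assumptions for illustration):

```python
import pandas as pd
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

df = pd.read_csv("../data/train_pairs.csv")  # assumed columns: cv_text, jd_text, score
examples = [
    InputExample(texts=[r.cv_text, r.jd_text], label=float(r.score))  # label in [0, 1]
    for r in df.itertuples()
]

model = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")
loader = DataLoader(examples, shuffle=True, batch_size=16)
loss = losses.CosineSimilarityLoss(model)  # regression against the annotated scores

model.fit(train_objectives=[(loader, loss)], epochs=4, output_path="../models/finetuned")
```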
```bash
cd backend
python compare.py --dataset ../data/val_pairs.csv
```

Outputs a Precision / Recall / F1 table for the TF-IDF baseline, the base semantic model, and the fine-tuned model.
Set in backend/.env:

```
MODEL_NAME=models/finetuned
```
All endpoints except /api/health require: Authorization: Bearer <token>
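The check in middleware/auth.py boils down to comparing that header against APP_TOKEN. A minimal sketch of such a decorator (illustrative, not the file's exact code):

```python
# middleware/auth.py — illustrative sketch of static Bearer token validation
from functools import wraps
import hmac
import os
from flask import request, jsonify

def require_token(view):
    @wraps(view)
    def wrapper(*args, **kwargs):
        supplied = request.headers.get("Authorization", "")
        expected = f"Bearer {os.environ.get('APP_TOKEN', '')}"
        # compare_digest avoids leaking token contents through timing differences
        if not hmac.compare_digest(supplied, expected):
            return jsonify({"error": "Unauthorized"}), 401
        return view(*args, **kwargs)
    return wrapper
```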
| Method | Path | Description |
|---|---|---|
| GET | /api/health | Liveness check — no auth required |
| GET | /api/jobs | List job postings (supports industry, page, limit query params) |
| GET | /api/jobs/<job_id> | Get full job detail |
| POST | /api/match | Single-candidate CV matching — returns semantic and keyword scores with breakdown |
Accepts a single candidate per request. Required fields: job_id, candidate_id, candidate_name, cv_text (pre-extracted plain text). Returns semantic and TF-IDF keyword scores side by side with a per-section breakdown.
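An example call (requests is not in requirements.txt, so install it separately or translate to curl; the payload values and anything beyond the documented response fields are illustrative):

```python
import requests  # not in requirements.txt; used here for illustration only

payload = {
    "job_id": 1,                     # e.g. taken from GET /api/jobs
    "candidate_id": "cand-001",
    "candidate_name": "Jane Doe",
    "cv_text": "Backend engineer, 5 years of Python/Flask...",  # pre-extracted by the FE
}
resp = requests.post(
    "http://localhost:5000/api/match",
    json=payload,
    headers={"Authorization": "Bearer <your-app-token>"},
    timeout=120,
)
resp.raise_for_status()
print(resp.json())  # semantic_score, keyword_score, per-section breakdown
```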
Grade thresholds (based on semantic_score):
| Score | Grade |
|---|---|
| ≥ 80 | High Match |
| 60–79 | Medium Match |
| < 60 | Low Match |
semantic_score = (skills_score × 0.4) + (experience_score × 0.4) + (education_score × 0.2)
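A worked example with hypothetical section scores:

```python
skills_score, experience_score, education_score = 85, 70, 90  # hypothetical (0-100)
semantic_score = skills_score * 0.4 + experience_score * 0.4 + education_score * 0.2
# 34.0 + 28.0 + 18.0 = 80.0, which crosses the >= 80 threshold: "High Match"
```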
The score_comparison block in every result contains:
- semantic_score — weighted cosine similarity from multilingual sentence embeddings (0–100)
- keyword_score — TF-IDF bag-of-words cosine similarity score (0–100)
- improvement — semantic_score − keyword_score (quantifies the gain from semantic over keyword matching)
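Illustrative values only (not actual output), showing how the three fields relate:

```python
score_comparison = {
    "semantic_score": 78.4,  # embedding-based, robust to paraphrase and language mix
    "keyword_score": 41.2,   # TF-IDF baseline, exact-term overlap only
    "improvement": 37.2,     # semantic_score minus keyword_score
}
```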