Training code and dataset for a small model that rewrites standard American English into Pittsburghese.
This repository contains the training data, fine-tuning scripts, local inference scripts, and browser export scripts for the project.
Base model: Qwen/Qwen2.5-0.5B-Instruct
data/
  pittsburghese_dataset_manual_edit.jsonl
  pittsburghese_dataset_expansion_batch1.jsonl
  pittsburghese_dataset_expansion_batch2_long.jsonl
  pittsburghese_dataset_expansion_batch4_literal_preservation.jsonl
  pittsburghese_dataset_expansion_batch5_grammar_preservation.jsonl
phase1_prep_dataset_prompt_completion.py
phase2_finetune_prompt_completion.py
phase3_inference_prompt_completion.py
phase4_export_web_prompt_completion.py
requirements.txt
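Before running the phase scripts, it can help to confirm that every dataset file parses cleanly. The sketch below is not part of the repository: `count_jsonl_records` is a hypothetical helper, and it assumes the `.jsonl` files live under `data/` with one JSON object per line. It makes no assumption about the field names inside each record.

```python
import json
from pathlib import Path


def count_jsonl_records(path):
    """Count the non-empty lines in a JSONL file, checking that each parses as JSON."""
    count = 0
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            try:
                json.loads(line)
            except json.JSONDecodeError as error:
                raise ValueError(f"{path}: line {line_number} is not valid JSON") from error
            count += 1
    return count


if __name__ == "__main__":
    # Assumption: the dataset files sit in a data/ directory next to this script.
    for jsonl_path in sorted(Path("data").glob("*.jsonl")):
        print(f"{jsonl_path.name}: {count_jsonl_records(jsonl_path)} records")
```

Run from the repository root, this prints one record count per file and raises on the first malformed line.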
Install PyTorch for your system first, then install the Python dependencies:
pip install -r requirements.txt
This repo is currently working with:
- transformers==4.57.6
- huggingface_hub==0.36.2
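To check whether the active environment matches the versions listed above, something like the following may be useful. It is a minimal sketch using only the standard library. `PINNED` and `check_versions` are illustrative names, not part of this repository.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions copied from the list above.
PINNED = {"transformers": "4.57.6", "huggingface_hub": "0.36.2"}


def check_versions(pinned, get_version=version):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for package, expected in pinned.items():
        try:
            installed = get_version(package)
        except PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != expected:
            mismatches[package] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    problems = check_versions(PINNED)
    for package, (expected, installed) in problems.items():
        print(f"{package}: expected {expected}, found {installed or 'not installed'}")
    if not problems:
        print("All pinned versions match.")
```

A mismatch does not necessarily mean the scripts will fail. It only flags that the environment differs from the one the repo is known to work with.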
python phase1_prep_dataset_prompt_completion.py
This creates:
pittsburghese_hf_prompt_completion_dataset/
python phase2_finetune_prompt_completion.py
This creates:
pittsburghese-lora-prompt-completion/
pittsburghese-merged-prompt-completion/
Interactive mode:
python phase3_inference_prompt_completion.py
Batch mode:
python phase3_inference_prompt_completion.py --batch
Validation set check:
python phase3_inference_prompt_completion.py --eval
Compare against the base model:
python phase3_inference_prompt_completion.py --compare "Please clean up the kitchen before the guests arrive."
Browser export:
python phase4_export_web_prompt_completion.py
This creates:
pittsburghese-web/
This repository is for training and export.
Generated artifacts such as merged model weights, LoRA outputs, dataset directories, and ONNX exports are better kept out of Git history and published separately where appropriate.
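One way to catch generated directories before they are committed is to compare them against `.gitignore`. The sketch below is illustrative and not part of the repository. `GENERATED_DIRS` lists the output directories named in this README, and `missing_ignore_entries` is a hypothetical helper. It only checks for literal entries and does not evaluate gitignore glob patterns, so a directory covered by a wildcard rule would still be reported.

```python
from pathlib import Path

# Output directories named in this README.
GENERATED_DIRS = [
    "pittsburghese_hf_prompt_completion_dataset/",
    "pittsburghese-lora-prompt-completion/",
    "pittsburghese-merged-prompt-completion/",
    "pittsburghese-web/",
]


def missing_ignore_entries(gitignore_text, entries=GENERATED_DIRS):
    """Return the entries that are not literally listed in the .gitignore text."""
    listed = set()
    for line in gitignore_text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            # Compare without leading or trailing slashes.
            listed.add(line.strip("/"))
    return [entry for entry in entries if entry.strip("/") not in listed]


if __name__ == "__main__":
    path = Path(".gitignore")
    text = path.read_text(encoding="utf-8") if path.exists() else ""
    for entry in missing_ignore_entries(text):
        print(f"not ignored: {entry}")
```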