Pose2Word is an end-to-end project for preparing sign-language video data, extracting compact motion-aware features, training classifiers, and running upload-and-predict inference in one workflow.
- Navid
- Musaddiq
- Mahbub
- Raiyan
- Sinha
- CSE-4544: ML Lab Final Project (IUT-SWE-22)
Build a practical word-level sign-language recognition system that can process raw videos and predict sign classes reliably.
Sign-language recognition supports accessibility and improves communication for Deaf and Hard-of-Hearing communities in digital systems.
- assistive communication tools
- educational sign-learning applications
- HCI interfaces and smart interpretation systems
- rapid prototyping for accessibility-focused products
- WLASL-based dataset organization
- local raw dataset root: `dataset/raw_video_data`
- 15 word classes in the core setup
- 187 total `.mp4` video samples (current snapshot)
- extracted feature format (current primary path): `(T, 168)`, where default `T = 15`
- signer-centered cropped and normalized videos
- semantically selected keyframes (8 to 15 per video)
- MediaPipe-based landmarks with Relative Quantization (RQ)
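As a quick sanity check on this feature format, a saved sample can be loaded directly with NumPy. The file path below is hypothetical and only follows the output layout shown later in this README:

```python
import numpy as np

# Hypothetical path; the real location depends on your landmarks output dir.
feat = np.load("outputs/landmarks/brother/sample_000.npy")

# Each sample is a fixed-length landmark sequence: T frames x 168 dims
# (56 selected landmarks x 3 coordinates, flattened per frame).
print(feat.shape)  # expected: (15, 168)
```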
- variable signer motion and framing
- idle segments before/after signing
- class-order and sequence-length mismatch risks during inference
- no tabular missing-value imputation is used
- for vision data, missing detections are handled by robust defaults plus sequence padding/truncation (see the sketch below)
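A minimal sketch of that padding/truncation policy, assuming `(T, 168)` arrays and a hypothetical `target_len` argument:

```python
import numpy as np

def pad_or_truncate(seq: np.ndarray, target_len: int = 15) -> np.ndarray:
    """Force a (T, 168) landmark sequence to exactly target_len frames.

    Frames with no detection are assumed to already hold a robust default
    (e.g. zeros), so only the sequence length is adjusted here.
    """
    if len(seq) >= target_len:
        return seq[:target_len]                    # truncate long clips
    pad = np.zeros((target_len - len(seq), seq.shape[1]), dtype=seq.dtype)
    return np.concatenate([seq, pad], axis=0)      # zero-pad short clips
```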
- frame-rate normalization via ffmpeg
- CLAHE-based contrast enhancement
- signer-focused cropping
- temporal idle trimming
- resize + pad to square output
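The CLAHE and resize-plus-pad steps follow standard OpenCV patterns; the per-frame sketch below is illustrative, with all parameter values (clip limit, tile grid, output size) being assumptions rather than the project's actual settings:

```python
import cv2
import numpy as np

def enhance_and_square(frame: np.ndarray, size: int = 256) -> np.ndarray:
    """Illustrative CLAHE contrast enhancement + letterbox to a square frame."""
    # Apply CLAHE on the luminance channel only, to avoid color shifts.
    lab = cv2.cvtColor(frame, cv2.COLOR_BGR2LAB)
    l, a, b = cv2.split(lab)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    frame = cv2.cvtColor(cv2.merge((clahe.apply(l), a, b)), cv2.COLOR_LAB2BGR)

    # Resize the longer side to `size`, then pad the shorter side to square.
    h, w = frame.shape[:2]
    scale = size / max(h, w)
    frame = cv2.resize(frame, (int(w * scale), int(h * scale)))
    h, w = frame.shape[:2]
    top, left = (size - h) // 2, (size - w) // 2
    return cv2.copyMakeBorder(frame, top, size - h - top, left, size - w - left,
                              cv2.BORDER_CONSTANT, value=(0, 0, 0))
```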
- keyframe extraction using fused motion/landmark signals
- landmark selection to 56 points per frame (168 dims after flattening)
- hand dominance correction and shoulder-based calibration
- relative quantization + feature scaling
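The repository implements Relative Quantization itself; the sketch below only illustrates the shoulder-based calibration idea (a body-centered origin and shoulder-width scale), with the function name, indices, and constants being assumptions:

```python
import numpy as np

def shoulder_calibrate(frame_pts: np.ndarray,
                       left_shoulder: int, right_shoulder: int) -> np.ndarray:
    """Illustrative shoulder-based calibration for one frame of landmarks.

    frame_pts: (56, 3) array of selected landmark coordinates. The shoulder
    indices are passed in; the real pipeline's indexing and scaling may differ.
    """
    ls, rs = frame_pts[left_shoulder], frame_pts[right_shoulder]
    center = (ls + rs) / 2.0                   # body-centered origin
    scale = np.linalg.norm(ls - rs) + 1e-6     # shoulder width as unit length
    return (frame_pts - center) / scale        # translation/scale invariant
```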
- model training scripts perform train/validation splitting during training workflow
- exact split configuration depends on selected trainer script and run config
- LSTM classifier
- Transformer classifier
- Hybrid architecture support
- sequential temporal modeling fits sign progression over frames
- LSTM provides a lightweight baseline
- Transformer/hybrid variants support richer temporal feature learning
- input is a fixed-length landmark sequence
- sequence encoder learns temporal representation
- final classifier head outputs class probabilities across sign words
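A minimal PyTorch sketch of this input → encoder → head flow, assuming the `(T = 15, 168)` feature format and 15 classes; layer sizes here are illustrative, not the repository's actual architecture:

```python
import torch
import torch.nn as nn

class LandmarkLSTMClassifier(nn.Module):
    """Illustrative LSTM baseline over (batch, T=15, 168) landmark sequences."""

    def __init__(self, in_dim: int = 168, hidden: int = 128, num_classes: int = 15):
        super().__init__()
        self.encoder = nn.LSTM(in_dim, hidden, num_layers=2,
                               batch_first=True, dropout=0.2)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        _, (h_n, _) = self.encoder(x)    # h_n: (num_layers, batch, hidden)
        return self.head(h_n[-1])        # logits from the last layer's state

logits = LandmarkLSTMClassifier()(torch.randn(8, 15, 168))
print(logits.shape)  # torch.Size([8, 15])
```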
- preprocess raw videos
- extract keyframes
- generate landmark tensors
- train classifier and save checkpoints/logs
- `model_type`: `lstm` / `transformer` / `hybrid`
- `batch_size` (example default in pipeline runner: 8)
- `num_epochs` (example default: 50)
- `learning_rate` (example default: 0.001)
- `weight_decay` (example default: 1e-4)
- `target_seq_len` (landmark default often 15)
- `device`: `cuda` / `mps` / `cpu`
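For reference, these example defaults can be collected into a single run config; the dict below is a hypothetical illustration, not the runner's actual schema:

```python
# Hypothetical run config mirroring the example defaults listed above.
config = {
    "model_type": "lstm",      # or "transformer" / "hybrid"
    "batch_size": 8,
    "num_epochs": 50,
    "learning_rate": 1e-3,
    "weight_decay": 1e-4,
    "target_seq_len": 15,
    "device": "cuda",          # or "mps" / "cpu"
}
```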
- Python
- PyTorch ecosystem
- Streamlit (for app interactions)
- MediaPipe Tasks
- OpenCV
- NumPy / SciPy / scikit-learn
- accuracy (tracked in available training history)
- training and validation loss curves
From `mini_brother_mother_quick5/training_history.json`:
- `val_acc` reached 1.0 by epoch 4 to 5
- `val_loss` improved from ~0.6933 to ~0.6895
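Such history files can be inspected with a few lines of Python; the metric names match the snapshot above, but the per-epoch list schema is an assumption:

```python
import json

with open("mini_brother_mother_quick5/training_history.json") as f:
    history = json.load(f)

# Assumes per-epoch lists keyed by metric name, e.g. {"val_acc": [...], ...}.
for epoch, (acc, loss) in enumerate(zip(history["val_acc"], history["val_loss"]), 1):
    print(f"epoch {epoch:2d}  val_acc={acc:.3f}  val_loss={loss:.4f}")
```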
- multiple model types are supported by the training API
- side-by-side benchmark tables are not yet consolidated in the repository docs
- Streamlit tabs provide visual workflow demos:
- preprocessing page
- keyframe preview and selected frame outputs
  - landmark extraction outputs (`.npy`)
  - prediction output with top-k probabilities and confidence
- TensorBoard logs are generated in checkpoint run folders for training visualization
- keeping training and inference sequence settings aligned
- handling variable-quality videos and inconsistent motion windows
- balancing extraction quality vs processing speed
- closed-vocabulary prediction (only trained classes)
- no full benchmark report table for all runs in one place
- no dedicated in-app training tab (training is CLI-driven)
- a unified preprocessing + landmark + training + inference pipeline improves reproducibility
- landmark-first representation works as a strong practical baseline
- end-to-end tooling in one repo reduces integration errors
- standardized experiment tracking and comparison dashboard
- expanded dataset scale and class coverage
- stronger model selection and calibration workflows
- robust signer/domain adaptation
- improved temporal attention variants
- multimodal feature fusion extensions
Based on WLASL, we kept 15 word classes:
| Word (Class) | Videos (.mp4) |
|---|---|
| brother | 11 |
| call | 12 |
| drink | 15 |
| go | 15 |
| help | 14 |
| man | 12 |
| mother | 11 |
| no | 11 |
| short | 13 |
| tall | 13 |
| what | 12 |
| who | 14 |
| why | 11 |
| woman | 11 |
| yes | 12 |
| TOTAL MP4s | 187 |
Install dependencies:

```bash
uv sync
```

Launch the Streamlit app:

```bash
uv run streamlit run main.py
```

Run the full pipeline:

```bash
uv run python util_scripts/run_full_pipeline.py \
  --dataset-dir dataset/raw_video_data \
  --preprocessed-dir outputs/preprocessed \
  --keyframes-dir outputs/keyframes \
  --landmarks-dir outputs/landmarks \
  --checkpoint-dir checkpoints/full_pipeline_run \
  --trainer-script model/trainer_current_config-gpu.py \
  --model-type lstm \
  --device cuda
```

- Prediction is closed-vocabulary.
- Class label order must match training order exactly.
- Sequence length in prediction must match training configuration.
- If ffmpeg is unavailable, normalization and prediction preprocessing will fail.
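A defensive load-time check can catch the class-order and sequence-length mismatches listed above; the checkpoint path and keys (`class_names`, `target_seq_len`) are assumptions about what a training run saves:

```python
import numpy as np
import torch

# Hypothetical checkpoint layout; adjust keys to what your trainer saves.
ckpt = torch.load("checkpoints/full_pipeline_run/best.pt", map_location="cpu")
class_names = ckpt["class_names"]        # must be in training order
target_seq_len = ckpt["target_seq_len"]  # e.g. 15

seq = np.load("outputs/landmarks/sample.npy")
assert seq.shape == (target_seq_len, 168), (
    f"sequence shape {seq.shape} does not match training config "
    f"({target_seq_len}, 168)"
)
```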