PDF/DOCX extraction, LLM classification and summarization, DuckDB persistence — built on the Loom framework.
Docman is a document processing pipeline that evaluates Loom's actor-based architecture with a real-world workload. It takes PDF and DOCX files through four stages:
| Stage | Worker | What it does |
|---|---|---|
| Extract | ProcessorWorker + DoclingBackend | Reads PDF/DOCX via Docling, extracts text/tables/figures, writes JSON to workspace |
| Classify | LLMWorker | LLM classifies document type from text preview and metadata |
| Summarize | LLMWorker | LLM produces structured summary adapted to document type |
| Ingest | ProcessorWorker + DuckDBIngestBackend | Persists metadata, classification, summary, full text, and optional embeddings to DuckDB |
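The four stages in the table above can be pictured as a chain of functions, each taking the previous stage's output. The sketch below is illustrative only and does not use Loom's or Docman's API: `Doc`, `STORE`, `run_pipeline`, and the stage functions are hypothetical names, the keyword rule stands in for the LLM workers, and a Python list stands in for DuckDB.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical record passed between stages (not Docman's real schema)."""
    path: str
    raw: str = ""        # stand-in for the file contents
    text: str = ""
    doc_type: str = ""
    summary: str = ""


# Stand-in for the DuckDB database; the real pipeline persists to DuckDB.
STORE: list[Doc] = []


def extract(doc: Doc) -> Doc:
    # Stand-in for Docling extraction: collapse whitespace in the raw content.
    doc.text = " ".join(doc.raw.split())
    return doc


def classify(doc: Doc) -> Doc:
    # Stand-in for the LLM classifier: a trivial keyword rule.
    doc.doc_type = "invoice" if "invoice" in doc.text.lower() else "other"
    return doc


def summarize(doc: Doc) -> Doc:
    # Stand-in for the LLM summarizer: prefix the type and truncate the text.
    doc.summary = f"[{doc.doc_type}] {doc.text[:40]}"
    return doc


def ingest(doc: Doc) -> Doc:
    # Stand-in for persistence: append the record to the in-memory store.
    STORE.append(doc)
    return doc


STAGES = [extract, classify, summarize, ingest]


def run_pipeline(doc: Doc) -> Doc:
    """Run the document through every stage in order."""
    for stage in STAGES:
        doc = stage(doc)
    return doc
```

In the real system each stage runs as a separate Loom worker rather than as an in-process function call; the sketch only shows the order of the stages and what each one adds to the record.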
A standalone doc_query worker provides full-text search, filtering, statistics,
and semantic vector search against the DuckDB database.
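To give a feel for what semantic vector search involves, here is a minimal pure-Python sketch that ranks stored embeddings against a query embedding by cosine similarity, one common similarity measure. It is not the DuckDBVectorTool implementation: the function names, the `(doc_id, embedding)` row shape, and the choice of cosine similarity are all assumptions made for illustration.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float],
    rows: Sequence[tuple[str, Sequence[float]]],
    k: int = 3,
) -> list[tuple[str, float]]:
    """Return the k (doc_id, score) pairs most similar to the query, best first."""
    scored = [(doc_id, cosine_similarity(query, emb)) for doc_id, emb in rows]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

A database-backed tool would normally compute the ranking inside the database rather than pulling every embedding into Python; the sketch only shows the ranking logic.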
The entire system can be exposed as an MCP (Model Context Protocol) server using Loom's built-in MCP gateway — a single YAML config, zero MCP-specific code.
- Developers evaluating Loom who want to see how the framework handles a multi-stage pipeline with mixed worker types (LLM and processor).
- Document processing engineers who need a pipeline for extracting, classifying, and searching document collections.
- Anyone building on the Loom framework who wants a reference implementation to study or fork.
| Component | Status |
|---|---|
| DoclingBackend (PDF/DOCX extraction) | Complete |
| DuckDBIngestBackend (persistence + FTS + embeddings) | Complete |
| DocmanQueryBackend (search, filter, stats, get) | Complete |
| DuckDBVectorTool (semantic similarity search) | Complete |
| Worker configs (5 workers) | Complete |
| Pipeline configs (standard + local tier) | Complete |
| MCP gateway config | Complete |
| Unit tests | 40 passing |
| Summarizer file-ref resolution | Config pending (Loom support exists) |
# Requires Python 3.11+
python3 -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
pytest tests/ -v  # 40 tests, no infrastructure needed
For full environment setup with Docling, Ollama, NATS, and the complete pipeline, see the platform-specific guides below.
- Architecture — Pipeline stages, data flow, DuckDB tools, design rules, Loom integration details
- macOS Setup — Full environment setup for Apple Silicon
- Windows Setup — Full environment setup for Windows 11 / WSL2
- Docling Configuration — Layout models, GPU acceleration, OCR engines, performance tuning
- Contributing — CLA, technical standards, PR process
Extend the pipeline. Add new backends for additional document formats, improve classification categories, or implement multi-language support.
Contribute. New processing backends, pipeline improvements, integration tests, and documentation improvements are all welcome. See Contributing.
Report issues. Bug reports with reproducible steps help the most.
MPL 2.0 — Mozilla Public License 2.0. Modified source files must remain open; unmodified files can be combined with proprietary code in a Larger Work.
Alternative licensing is available for organizations with copyleft constraints. Contact: hooman@mac.com
For governance, succession, and contributor rights, see GOVERNANCE.md.