This repository contains an API-driven metadata harvesting toolkit.
- Built on FastAPI for orchestration
- Implements a modular harvesting architecture
- Centralizes the metadata schema and distribution keys in YAML files
- Provides a lightweight admin interface for running jobs via a web browser
| Folder/File | Description |
|---|---|
| `main.py` | Entry point for running harvesting routines manually or via scripts |
| `harvesters/` | Contains source-specific harvester modules, each subclassing the base harvester class |
| `harvesters/base.py` | Defines the `BaseHarvester` class with the standard pipeline: fetch → parse → flatten |
| `utils/` | Shared utility functions used across harvesters (e.g., title formatting, spatial/temporal cleaning) |
| `routers/` | FastAPI endpoints for running harvesters via HTTP routes or background jobs |
| `schemas/` | YAML metadata schemas used for field validation and formatting |
| `reference_data/` | External controlled vocabularies, lookup tables, or enrichment data (e.g., spatial or organization info) |
| `inputs/` | Source-specific configuration or input files, such as CSVs or cached HTML pages |
| `outputs/` | Processed metadata outputs, typically saved as CSV or JSON |
| `config/` | Optional config files for customizing runtime parameters or deployment settings |
| `static/` | Static HTML pages or assets for lightweight documentation or interface testing |
| `pyproject.toml` | Project metadata and dependency definitions (managed with uv) |
| `uv.lock` | Locked dependency versions for reproducible installs |
| `requirements.txt` | Legacy dependency list (use `pyproject.toml` going forward) |
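
To make the fetch → parse → flatten pipeline described for `harvesters/base.py` more concrete, here is a minimal sketch of how such a base class and one subclass might look. It is an illustration only: the method signatures, the `run()` helper, the `ExampleHarvester` class, and the pipe-joined flattening rule are all assumptions, not the actual code in this repository.

```python
from abc import ABC, abstractmethod
from typing import Any


class BaseHarvester(ABC):
    """Hypothetical stand-in for the class defined in harvesters/base.py."""

    def __init__(self, config: dict[str, Any]) -> None:
        self.config = config

    @abstractmethod
    def fetch(self) -> Any:
        """Retrieve the raw payload from the source."""

    @abstractmethod
    def parse(self, raw: Any) -> list[dict[str, Any]]:
        """Turn the raw payload into a list of record dicts."""

    def flatten(self, records: list[dict[str, Any]]) -> list[dict[str, str]]:
        """Join list values with "|" so each record fits in one CSV row (assumed rule)."""
        flat = []
        for record in records:
            flat.append({
                key: "|".join(map(str, value)) if isinstance(value, list) else str(value)
                for key, value in record.items()
            })
        return flat

    def run(self) -> list[dict[str, str]]:
        """Run the pipeline in order: fetch, then parse, then flatten."""
        return self.flatten(self.parse(self.fetch()))


class ExampleHarvester(BaseHarvester):
    """Toy subclass that reads records from its config instead of a remote source."""

    def fetch(self) -> Any:
        return self.config["records"]

    def parse(self, raw: Any) -> list[dict[str, Any]]:
        return [
            {"title": item["name"].strip(), "keywords": item.get("tags", [])}
            for item in raw
        ]


rows = ExampleHarvester(
    {"records": [{"name": " Roads ", "tags": ["transport", "lines"]}]}
).run()
```

In this sketch a subclass only supplies `fetch` and `parse`; the shared `flatten` step stays in the base class so every source produces rows of the same shape.
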
- Clone the repository and change into this directory
- Create the local environment and install dependencies: `uv sync`
- Start the FastAPI server: `uv run uvicorn main:app --reload`
- Review the API documentation (Swagger UI) at http://localhost:8000/docs
- For a list of runnable jobs, go to http://localhost:8000/
Notes:
- The --reload flag automatically restarts the server when you edit code.
- Jobs are configured in YAML files inside the jobs/ directory.
- Outputs from harvests will be saved in the outputs/ folder.
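
If you want to confirm from a script that the server started correctly, a small check along these lines may help. It is a sketch that uses only the Python standard library; the function name `server_is_up` is hypothetical, and the default URL simply reuses the Swagger UI address listed above.

```python
import urllib.error
import urllib.request


def server_is_up(url: str = "http://localhost:8000/docs", timeout: float = 2.0) -> bool:
    """Return True if the URL answers with HTTP 200, False if it is unreachable or errors."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Covers connection refused, timeouts, and non-2xx HTTP responses.
        return False


if __name__ == "__main__":
    print("Server reachable:", server_is_up())
```
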
To create new harvesters, here are the basic steps:
- Add a new Python file in the `harvesters/` directory
- Create a job config YAML in `config/`
- In `routers/jobs.py`, update the run endpoint for the new harvester type
- Test the new harvester
More details TBD.
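
One way to picture the "update the run endpoint for the new harvester type" step is a lookup from a job's type to the code that handles it. The sketch below is a guess at that pattern, not the contents of `routers/jobs.py`: the `HARVESTERS` registry, the `register` decorator, the `run_job` function, and the `type` and `source` job keys are all hypothetical names chosen for illustration.

```python
from typing import Any, Callable

Harvester = Callable[[dict[str, Any]], list[dict[str, Any]]]

# Hypothetical registry: maps the "type" value of a job config to a harvester callable.
HARVESTERS: dict[str, Harvester] = {}


def register(harvester_type: str) -> Callable[[Harvester], Harvester]:
    """Decorator that records a harvester function under a type name."""
    def decorator(func: Harvester) -> Harvester:
        HARVESTERS[harvester_type] = func
        return func
    return decorator


@register("example")
def harvest_example(job: dict[str, Any]) -> list[dict[str, Any]]:
    """Placeholder harvester: returns one record that echoes the job's source."""
    return [{"source": job["source"], "title": "Example record"}]


def run_job(job: dict[str, Any]) -> list[dict[str, Any]]:
    """Look up the harvester registered for job["type"] and run it."""
    try:
        harvester = HARVESTERS[job["type"]]
    except KeyError:
        raise ValueError(f"Unknown harvester type: {job.get('type')!r}") from None
    return harvester(job)


result = run_job({"type": "example", "source": "https://example.org/data"})
```

With a registry like this, supporting a new harvester type would mean adding one decorated function rather than editing a chain of conditionals, though the real endpoint may be organized differently.
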