Metadata Harvester API Toolkit

This repository contains an API-driven metadata harvesting toolkit.

Built on FastAPI for orchestration
Implements a modular harvesting architecture
Centralizes the metadata schema and distribution keys in YAML files
Provides a lightweight admin interface for running jobs via a web browser

Directory Overview

Folder/File	Description
`main.py`	Entry point for running harvesting routines manually or via scripts
`harvesters/`	Contains source-specific harvester modules, each subclassing the base harvester class
`harvesters/base.py`	Defines the `BaseHarvester` class with the standard pipeline: fetch → parse → flatten
`utils/`	Shared utility functions used across harvesters (e.g., title formatting, spatial/temporal cleaning)
`routers/`	FastAPI endpoints for running harvesters via HTTP routes or background jobs
`schemas/`	YAML metadata schemas used for field validation and formatting
`reference_data/`	External controlled vocabularies, lookup tables, or enrichment data (e.g., spatial or organization info)
`inputs/`	Source-specific configuration or input files, such as CSVs or cached HTML pages
`outputs/`	Processed metadata outputs, typically saved as CSV or JSON
`config/`	Optional config files for customizing runtime parameters or deployment settings
`static/`	Static HTML pages or assets for lightweight documentation or interface testing
`pyproject.toml`	Project metadata and dependency definitions (managed with `uv`)
`uv.lock`	Locked dependency versions for reproducible installs
`requirements.txt`	Legacy dependency list (use `pyproject.toml` going forward)
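
The fetch → parse → flatten pipeline described for `harvesters/base.py` could look roughly like the sketch below. This is not the repository's actual code: the method signatures, the `run()` helper, the dotted-key flattening rule, and the `InMemoryHarvester` subclass are all assumptions made for illustration.

```python
from abc import ABC, abstractmethod
from typing import Any


class BaseHarvester(ABC):
    """Hypothetical stand-in for the class defined in harvesters/base.py."""

    def __init__(self, config: dict[str, Any]) -> None:
        self.config = config

    @abstractmethod
    def fetch(self) -> Any:
        """Retrieve raw source data (an HTTP response, file contents, ...)."""

    @abstractmethod
    def parse(self, raw: Any) -> list[dict[str, Any]]:
        """Turn raw data into a list of (possibly nested) records."""

    def flatten(self, records: list[dict[str, Any]]) -> list[dict[str, Any]]:
        """Collapse nested dicts into dotted keys (an assumed flattening rule)."""
        return [self._flatten_record(record) for record in records]

    @staticmethod
    def _flatten_record(record: dict[str, Any], prefix: str = "") -> dict[str, Any]:
        flat: dict[str, Any] = {}
        for key, value in record.items():
            name = f"{prefix}{key}"
            if isinstance(value, dict):
                # Recurse into nested dicts, extending the dotted prefix.
                flat.update(BaseHarvester._flatten_record(value, f"{name}."))
            else:
                flat[name] = value
        return flat

    def run(self) -> list[dict[str, Any]]:
        """Run the three pipeline stages in order."""
        return self.flatten(self.parse(self.fetch()))


class InMemoryHarvester(BaseHarvester):
    """Toy subclass that reads records straight from its config."""

    def fetch(self) -> Any:
        return self.config["records"]

    def parse(self, raw: Any) -> list[dict[str, Any]]:
        return list(raw)
```

A subclass only has to supply `fetch` and `parse`; the shared `flatten` and `run` steps stay in the base class, which is one plausible reason for keeping the pipeline there.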

Setup instructions

Clone the repository and change into its directory
Create the local environment and install dependencies: `uv sync`
Start the FastAPI server: `uv run uvicorn main:app --reload`
Review the API documentation (Swagger UI) at http://localhost:8000/docs
For a list of runnable jobs, go to http://localhost:8000/
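
Once the server is running, the available routes can also be listed from a script. The sketch below relies on the OpenAPI document that FastAPI serves at `/openapi.json` by default; it assumes this project has not changed that path. The helper names `fetch_openapi` and `list_routes` are made up for this example.

```python
import json
from urllib.request import urlopen

# HTTP methods that can appear as keys of an OpenAPI path item.
HTTP_METHODS = {"get", "put", "post", "delete", "options", "head", "patch", "trace"}


def fetch_openapi(base_url: str = "http://localhost:8000") -> dict:
    """Download the OpenAPI document from a running server."""
    with urlopen(f"{base_url}/openapi.json", timeout=10) as response:
        return json.load(response)


def list_routes(openapi: dict) -> list[tuple[str, str]]:
    """Return (METHOD, path) pairs from an OpenAPI document, sorted by path."""
    routes = []
    for path, path_item in openapi.get("paths", {}).items():
        for key in path_item:
            # Skip non-method keys such as "parameters" or "summary".
            if key in HTTP_METHODS:
                routes.append((key.upper(), path))
    return sorted(routes, key=lambda route: (route[1], route[0]))
```

With the server started as described above, `list_routes(fetch_openapi())` returns the method and path of every documented endpoint.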

Notes:

The `--reload` flag automatically restarts the server when you edit code.
Jobs are configured in YAML files inside the `config/` directory.
Harvest outputs are saved in the `outputs/` folder.

Adding jobs

To add a new harvester, follow these basic steps:

Add a new Python file in the `harvesters/` directory
Create a job config YAML in `config/`
In `routers/jobs.py`, update the run endpoint for the new harvester type
Test the new harvester
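
The steps above can be sketched as follows. Everything here is illustrative: `ExampleCsvHarvester`, the `HARVESTER_TYPES` registry, the `run_job` function, and the `type` and `rows` config keys are hypothetical names, and the small `BaseHarvester` stand-in only exists so the example runs on its own. The real run endpoint in `routers/jobs.py` may dispatch differently.

```python
from typing import Any


class BaseHarvester:
    """Minimal stand-in for harvesters/base.py (not the real class)."""

    def __init__(self, config: dict[str, Any]) -> None:
        self.config = config

    def fetch(self) -> Any:
        raise NotImplementedError

    def parse(self, raw: Any) -> list[dict[str, Any]]:
        raise NotImplementedError

    def flatten(self, records: list[dict[str, Any]]) -> list[dict[str, Any]]:
        return records

    def run(self) -> list[dict[str, Any]]:
        return self.flatten(self.parse(self.fetch()))


class ExampleCsvHarvester(BaseHarvester):
    """Step 1: a hypothetical new module, e.g. harvesters/example_csv.py."""

    def fetch(self) -> Any:
        # A real harvester would likely read a file or URL named in the config.
        return self.config["rows"]

    def parse(self, raw: Any) -> list[dict[str, Any]]:
        return [{"title": row["name"].strip()} for row in raw]


# Step 3: map the harvester type named in a job config to its class.
HARVESTER_TYPES: dict[str, type[BaseHarvester]] = {
    "example_csv": ExampleCsvHarvester,
}


def run_job(job_config: dict[str, Any]) -> list[dict[str, Any]]:
    """Look up the harvester for a job config (step 2) and run it."""
    harvester_type = job_config["type"]
    try:
        harvester_cls = HARVESTER_TYPES[harvester_type]
    except KeyError:
        raise ValueError(f"Unknown harvester type: {harvester_type!r}") from None
    return harvester_cls(job_config).run()
```

For step 4, calling `run_job` with a small hand-written config is a quick check that the new type is registered and that `fetch` and `parse` agree on the data shape.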

More details TBD.
