Tag2Pixel is a pipeline that automates the creation of machine learning datasets and models by combining OpenStreetMap (OSM) vector data with Sentinel-2 multispectral satellite imagery.
The pipeline is "class-agnostic": you can repurpose it to detect any object that has a distinct spectral signature and is mapped in OSM (e.g., solar panels, specific crop types, or water bodies). It extracts geometries, samples the underlying satellite pixels, and trains a Random Forest model with spatial validation.
Note: This is a pixel-based pipeline. It classifies the spectral signature of a single point rather than the shape or texture of an object.
- Automated Data Bridge: Connects OSM tags directly to Sentinel-2 spectral bands.
- Ultra-Fast OSM Extraction: Uses osmium-tool pre-filtering to process large-scale PBF files in seconds.
- Cloud-Native Imagery: Fetches multispectral data on-the-fly from the Microsoft Planetary Computer STAC API.
- Spatial Validation (kNNDM): Evaluates the model with spatial cross-validation using k-fold Nearest Neighbor Distance Matching.
- Dockerized: One command to set up the entire Python (data) and R (ML) environment.
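To illustrate the cloud-native imagery feature, here is a minimal sketch of a STAC API search request body for a single sample point. It assumes the `sentinel-2-l2a` collection on the Planetary Computer and a cloud-cover filter on the `eo:cloud_cover` property; the function name `build_search_body` and the example coordinates are hypothetical, and the pipeline's actual query may differ. No request is sent: the code only builds the JSON body that a client would POST to the API's search endpoint.

```python
import json


def build_search_body(lon, lat, date_range, max_cloud_cover, limit=50):
    """Build a STAC API search body for one point (illustrative only)."""
    return {
        "collections": ["sentinel-2-l2a"],  # assumed collection ID
        # GeoJSON points are ordered [longitude, latitude]
        "intersects": {"type": "Point", "coordinates": [lon, lat]},
        "datetime": date_range,  # "start/end" interval
        "query": {"eo:cloud_cover": {"lt": max_cloud_cover}},
        "limit": limit,
    }


# Hypothetical sample point, with the date range and cloud cover
# taken from the default values shown in this README's config
body = build_search_body(13.405, 52.52, "2024-01-01/2024-12-31", 20)
print(json.dumps(body, indent=2))
```

Each returned item would then expose one asset per band, from which the pixel value at the point can be read.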
- OSM Extraction: Scans a .osm.pbf file for specific tags and extracts polygon centroids.
- Spectral Sampling: Queries the STAC API for the configured Sentinel-2 bands (10 by default) at those coordinates.
- Median Compositing: Takes the per-band median across multiple time-steps to reduce noise.
- Spatial Training: Trains a Random Forest and validates it with spatial cross-validation (kNNDM).
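The median compositing step can be illustrated with a small sketch. It assumes each time-step yields one reflectance value per band for a given point; the function name `median_composite` and the sample values are hypothetical and are not taken from the pipeline's code.

```python
from statistics import median


def median_composite(observations):
    """Return the per-band median over a list of {band: value} dicts."""
    bands = observations[0].keys()
    return {band: median(obs[band] for obs in observations) for band in bands}


# Three hypothetical acquisitions of the same pixel;
# the second one is cloud-contaminated (unusually bright).
observations = [
    {"B02": 0.04, "B03": 0.06, "B04": 0.05},
    {"B02": 0.45, "B03": 0.47, "B04": 0.50},
    {"B02": 0.05, "B03": 0.07, "B04": 0.06},
]

composite = median_composite(observations)
print(composite)  # {'B02': 0.05, 'B03': 0.07, 'B04': 0.06}
```

Because the median ignores extreme values, the bright outlier does not leak into the composite, which a mean would not guarantee.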
The pipeline is controlled entirely via config.yaml. Here is how to define your task:
The task section defines the "Target" (the class you want to find) and the "Ratio" (how many negative samples to pick per target sample).
task:
  target_class: "pv"   # Must match a name in the 'classes' list below
  target_count: 2000   # How many samples to extract for the target
  negative_ratio: 1.0  # 1.0 means 2000 target vs 2000 background samples
You define classes using OSM tag filters. You can use AND logic (inside one filter set) and OR logic (by adding multiple filter sets).
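The AND/OR semantics described above can be sketched in Python. This is an assumption about how such filters could be evaluated, not the pipeline's actual implementation: the function names are hypothetical, and treating a list value as "any of these values" is an interpretation of the `building` filter used in this README's `classes` example.

```python
def matches_filter_set(tags, filter_set):
    """AND logic: every key in the filter set must match the feature's tags."""
    for key, expected in filter_set.items():
        # Assumption: a list means "any of these values" is acceptable
        allowed = expected if isinstance(expected, list) else [expected]
        if tags.get(key) not in allowed:
            return False
    return True


def matches_class(tags, filters):
    """OR logic: the feature matches if any one filter set matches."""
    return any(matches_filter_set(tags, fs) for fs in filters)


# Filter sets mirroring the 'pv' and 'background' classes in this README
pv_filters = [
    {"generator:source": "solar", "location": "roof"},
    {"generator:method": "photovoltaic", "location": "roof"},
]
background_filters = [{"building": ["retail", "supermarket", "commercial"]}]

# Hypothetical OSM tag dictionaries
roof_pv = {"generator:method": "photovoltaic", "location": "roof"}
ground_pv = {"generator:source": "solar", "location": "surface"}
shop = {"building": "retail"}

print(matches_class(roof_pv, pv_filters))       # True
print(matches_class(ground_pv, pv_filters))     # False
print(matches_class(shop, background_filters))  # True
```

The ground-mounted installation fails both filter sets because `location` must equal `roof` in each of them.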
classes:
  - name: "pv"
    label: 1        # Integer label for ML
    min_area: 200   # Minimum size in m² to consider a polygon valid
    filters:
      # Option A: match if source=solar AND location=roof
      - generator:source: "solar"
        location: "roof"
      # Option B: OR match if method=photovoltaic AND location=roof
      - generator:method: "photovoltaic"
        location: "roof"
  - name: "background"
    label: 0
    min_area: 500
    filters:
      # Match any building tagged as retail, supermarket, or commercial
      - building: ["retail", "supermarket", "commercial"]
Choose your bands and time range. The default uses all 10 m and 20 m Sentinel-2 bands.
stac:
  date_range: "2024-01-01/2024-12-31"
  cloud_cover: 20
  bands: ["B02", "B03", "B04", "B05", "B06", "B07", "B08", "B8A", "B11", "B12"]
- Docker and Docker Compose
- An OSM data file (e.g., germany-latest.osm.pbf) from Geofabrik, placed in data/raw_osm_data/.
# Build and run everything
docker-compose up --build
To verify everything is working without waiting for a full 4,000-point extraction, set test_mode: true in config.yaml. This uses smaller sample sizes and limits.
- data/training/training.csv: The extracted spectral dataset.
- data/artifacts/: The trained .rds R model.
- output/metrics.json: Model performance (AUC, Accuracy) using spatial CV.
MIT License.