Configurable, reproducible preprocessing pipeline for SCD41 and LGR CO2 sensor data collected at SMEAR Estonia (smear.emu.ee). Transforms raw JSON message logs into quality-controlled 10-minute aggregates in CSV and Parquet formats. Accompanies: "Fourteen-month co-located SCD41 low-cost IoT CO2 sensor dataset from hemiboreal Estonia", Scientific Data (Nature), [status: in review].
Status: v0.4.0 - Stable release with comprehensive validation suite and enhanced data quality reporting
# 1. Setup (one-time)
./setup.sh
# 2. Activate environment
source venv/bin/activate
# 3. Configure sensors (copy template, then edit paths)
cp config/sensor_config.yaml config/sensors.yaml
# edit config/sensors.yaml # Set your data paths
# 4. Run processing
jupyter notebook process_data.ipynb
Processed data appears in output/.
Processes multiple sensors (4x SCD41, 1x LGR reference) with:
- Time-synced resampling to 10-minute intervals (epoch-aligned)
- Quality control: uptime tracking, data quality flags, validation report
- Comprehensive validation suite with JSON/Markdown reports and coverage metrics
- Multi-format output: CSV and Parquet
- Visualization: uptime, validation, and QC plots
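
To illustrate what "epoch-aligned" means for the 10-minute resampling listed above, here is a minimal standard-library sketch. It is not the pipeline's own code: the names `bin_start` and `aggregate` and the sample values are hypothetical, and the pipeline itself presumably relies on pandas (comparable alignment is available there via `resample` with `origin="epoch"`). Each timestamp is floored to a multiple of 600 seconds since the Unix epoch, so bins always start at :00, :10, :20, and so on, regardless of when the first sample arrives.

```python
from collections import defaultdict
from datetime import datetime, timezone
from statistics import mean

BIN_SECONDS = 600  # 10-minute bins


def bin_start(ts: datetime, width_s: int = BIN_SECONDS) -> datetime:
    """Floor a timezone-aware timestamp to the start of its epoch-aligned bin."""
    epoch_s = int(ts.timestamp())
    return datetime.fromtimestamp(epoch_s - epoch_s % width_s, tz=timezone.utc)


def aggregate(samples, width_s: int = BIN_SECONDS):
    """Group (timestamp, value) pairs into bins; return {bin_start: (mean, count)}."""
    bins = defaultdict(list)
    for ts, value in samples:
        bins[bin_start(ts, width_s)].append(value)
    return {start: (mean(values), len(values)) for start, values in sorted(bins.items())}


# Hypothetical CO2 readings (ppm), for illustration only
samples = [
    (datetime(2024, 10, 1, 12, 3, 10, tzinfo=timezone.utc), 420.0),
    (datetime(2024, 10, 1, 12, 8, 40, tzinfo=timezone.utc), 424.0),
    (datetime(2024, 10, 1, 12, 11, 5, tzinfo=timezone.utc), 430.0),
]
result = aggregate(samples)
for start, (avg, n) in result.items():
    print(start.isoformat(), avg, n)
```

Because every sensor is floored to the same bin edges, the per-sensor aggregates can be joined on the bin start time without further alignment.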
| Output | Location |
|---|---|
| Sensors | output/sensors/ |
| Merged | output/merged/ |
| Validation | output/validation/ |
| Logs | output/logs/ |
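
After a run, a quick way to confirm that each location in the table was populated is to count the files under it. The sketch below assumes only the four directory names from the table; the helper name `summarize_outputs` is hypothetical and not part of the pipeline.

```python
from pathlib import Path

OUTPUT_SUBDIRS = ("sensors", "merged", "validation", "logs")  # from the table above


def summarize_outputs(root="output"):
    """Return {subdir: file count}, or None for a subdirectory that does not exist."""
    summary = {}
    for name in OUTPUT_SUBDIRS:
        folder = Path(root) / name
        if folder.is_dir():
            # Count regular files at any depth below the subdirectory
            summary[name] = sum(1 for p in folder.rglob("*") if p.is_file())
        else:
            summary[name] = None
    return summary


for name, count in summarize_outputs().items():
    status = "missing" if count is None else f"{count} file(s)"
    print(f"output/{name}/: {status}")
```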
Copy the template and edit config/sensors.yaml to set input data paths:
cp config/sensor_config.yaml config/sensors.yaml

sensors:
  - name: CO2_SCT1_2M
    input_dir: /path/to/your/data  # ← Change this
    sensor_type: scd41
    # Other settings auto-configured

- Quick Start: This README (you are here!)
- Setup Guide: SETUP.md - Installation, configuration, and troubleshooting
- Container Deployment: DOCKER.md - Docker/Podman setup, volume mounts, and configuration
- API Reference: API_REFERENCE.md - Complete API docs for all classes and functions
- Architecture: ARCHITECTURE.md - Technical design, data flow, and component architecture
- Output Schema: OUTPUT_SCHEMA.md - Detailed column definitions and data formats
- Application Summary: src/APPLICATION_SUMMARY.md - Overview of features and research context
- Data Merging: src/MERGING.md - Information on sensor data merging
- Visualization: src/visualization/PLOT_README.md - Plotting module docs
- Validation: src/validation/VALIDATION_README.md - Data validation docs
- Examples: process_data.ipynb - Jupyter notebook with usage patterns
Import errors? Ensure virtual environment is active: source venv/bin/activate
No data found? Check paths in config/sensors.yaml and logs: tail -f output/logs/<logfile>.log
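
A wrong input_dir is a common cause of "no data found", and it can be caught before running the pipeline. The sketch below assumes the configuration has already been parsed into a dictionary with a top-level sensors list of entries carrying name and input_dir keys, as in the template; the helper name `missing_input_dirs` is hypothetical. Loading the YAML itself is left out: one option would be `yaml.safe_load` from PyYAML, which is an assumption here since it is not among the listed dependencies.

```python
from pathlib import Path


def missing_input_dirs(config: dict) -> list[tuple[str, str]]:
    """Return (sensor name, input_dir) pairs whose directory does not exist."""
    missing = []
    for sensor in config.get("sensors", []):
        input_dir = sensor.get("input_dir", "")
        # An empty path would resolve to the current directory, so treat it as missing
        if not input_dir or not Path(input_dir).is_dir():
            missing.append((sensor.get("name", "<unnamed>"), input_dir))
    return missing


# Stand-in for the parsed contents of config/sensors.yaml (placeholder path from the template)
config = {
    "sensors": [
        {"name": "CO2_SCT1_2M", "input_dir": "/path/to/your/data", "sensor_type": "scd41"},
    ]
}

for name, path in missing_input_dirs(config):
    print(f"{name}: input_dir not found: {path!r}")
```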
Reprocess data:
pipeline = SensorPipeline(config, force=True)
pipeline.run_all(year=2024, month=10)

docker compose up  # or: podman-compose up
See DOCKER.md for full setup instructions: volume mounts, config path notes, Docker/Podman commands, and output structure.
Requires editing volume paths in docker-compose.yaml; see DOCKER.md for details.
- Python 3.10+
- 8 GB RAM minimum, 16 GB recommended (pandas loads large DataFrames day-by-day)
- Modern multi-core CPU (e.g. Intel Core i5/Ultra 5 or equivalent)
- ~500 MB disk space for dependencies, plus space for raw and processed data
- Dependencies installed via requirements.txt: pandas, numpy, pyarrow, matplotlib, seaborn
Zorec, M. et al. (2026). SMEAR Estonia CO2 Sensor Processing Pipeline.
Version 0.4.0. GitHub: SMEAR-EE/SCD41_DATA
Dataset available at: 10.5281/zenodo.18984845
Current Release: v0.4.0 (Stable)
- Software application licensed under the Apache License 2.0 (see LICENSE file).
- Dataset licensed under CC BY 4.0, available at https://doi.org/10.5281/zenodo.18984845.