Executive Order 14168: This repository is under review for potential modification in compliance with Administration directives.
📚 View Documentation & Examples · CLI Reference
A LinkML schema for the Datasheets for Datasets model, as published in Datasheets for Datasets. Inspired by datasheets as used in the electronics and other industries, Gebru et al. proposed that every dataset "be accompanied with a datasheet that documents its motivation, composition, collection process, recommended uses, and so on". To this end, the authors created a series of topics and over 50 questions addressing different aspects of datasets, which are also useful in an AI/ML context. An example of a completed datasheet for datasets can be found here: Structured dataset documentation: a datasheet for CheXpert
Google is working with a different model called Data Cards, which in practice is close to the original Datasheets for Datasets template.
This repository stores a LinkML schema representation for the original Datasheets for Datasets model, representing the topics, sets of questions, and expected entities and fields in the answers (work in progress). Beyond a less structured markdown template for this model (e.g. template for datasheet for dataset) we are not aware of any other structured form representing Datasheets for Datasets.
We are also tracking related developments, such as augmented Datasheets for Datasets models as in Augmented Datasheets for Speech Datasets and Ethical Decision-Making.
Curated, comprehensive datasheets for each Bridge2AI data-generating project:
- AI-READI - Retinal imaging and diabetes dataset
- CM4AI - Cell maps for AI dataset
- VOICE - Voice biomarker dataset
- CHORUS - Health data for underrepresented populations
Browse the source code repository:
- src/data/examples/ - example YAML data
- project/ - project files (do not edit these)
- src/ - source files (edit these)
- src/data_sheets_schema/schema/ - LinkML schema (edit this)
- src/data_sheets_schema/datamodel/ - generated Python datamodel
- tests/ - Python tests
This branch introduces a unified d4d CLI for the Datasheets for Datasets workflow. The command is exposed through Poetry:
poetry install
poetry run d4d --help

After installation you can also invoke it as `d4d`, but `poetry run d4d` is the safest form while developing in the repo.
Most subcommands currently expect a repository checkout because they import repo-local code from src/ and .claude/agents/scripts/.
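A quick way to see whether the current directory is likely to satisfy that requirement is to check for the two directories named above. The following is a minimal sketch, not part of the CLI: the helper names `missing_checkout_dirs` and `looks_like_checkout` are hypothetical, and the presence of the directories is only a heuristic for "this is a repository checkout".

```python
from pathlib import Path

# Directories this README says the subcommands import repo-local code from.
REQUIRED_DIRS = ("src", ".claude/agents/scripts")


def missing_checkout_dirs(root: Path) -> list[str]:
    """Return the expected repo-local directories that are absent under root."""
    return [d for d in REQUIRED_DIRS if not (root / d).is_dir()]


def looks_like_checkout(root: Path) -> bool:
    """True when every expected directory exists under root (heuristic only)."""
    return not missing_checkout_dirs(root)


if __name__ == "__main__":
    missing = missing_checkout_dirs(Path.cwd())
    if missing:
        print("Not a full checkout; missing:", ", ".join(missing))
    else:
        print("Looks like a repository checkout.")
```

Running the script from the repository root should report a checkout; running it elsewhere lists the directories it could not find.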
The CLI is organized into six top-level groups:
- `d4d download`: fetch, preprocess, and concatenate source materials
- `d4d evaluate`: run presence-based and LLM-based evaluations
- `d4d render`: render datasheets and evaluation outputs to HTML
- `d4d rocrate`: parse, merge, and transform RO-Crate metadata
- `d4d schema`: inspect schema metrics and validate D4D YAML
- `d4d utils`: inspect pipeline status and validate preprocessing output
Full option-by-option documentation is available in the docs site: CLI Reference.
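Because every command follows the same `d4d <group> <subcommand>` shape, command lines can be assembled programmatically. The sketch below builds (but does not run) the `sources`, `preprocess`, and `concatenate` steps of the `download` group as used in this README's examples. The helper name `download_commands` is hypothetical, and running the steps strictly in this order is an assumption based on the group's description.

```python
import shlex

# Subcommand order follows the README's description of `d4d download`
# (fetch, preprocess, concatenate); sequential execution is an assumption.
DOWNLOAD_STEPS = ("sources", "preprocess", "concatenate")


def download_commands(project: str) -> list[list[str]]:
    """Build argv lists for the three download steps of one project."""
    return [
        ["poetry", "run", "d4d", "download", step, "--project", project]
        for step in DOWNLOAD_STEPS
    ]


if __name__ == "__main__":
    for argv in download_commands("AI_READI"):
        # Print instead of executing. To run a step from a repository
        # checkout, pass argv to subprocess.run(argv, check=True).
        print(shlex.join(argv))
```

Keeping each command as an argument list rather than a single string avoids shell quoting issues if a project identifier ever contains unusual characters.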
Download, preprocess, and concatenate source documents for one project:
poetry run d4d download sources --project AI_READI
poetry run d4d download preprocess --project AI_READI
poetry run d4d download concatenate --project AI_READI

Evaluate generated datasheets:
poetry run d4d evaluate presence --project AI_READI --method gpt5
poetry run d4d evaluate llm \
--file data/d4d_concatenated/gpt5/AI_READI_d4d.yaml \
--project AI_READI \
--method gpt5 \
--rubric both

Render and validate outputs:
poetry run d4d render html \
docs/yaml_output/concatenated/gpt5/AI_READI_d4d.yaml \
-o /tmp/AI_READI_d4d.html
poetry run d4d render html \
docs/yaml_output/concatenated/gpt5/AI_READI_d4d.yaml \
--template linkml \
-o /tmp/AI_READI_d4d_linkml.html
poetry run d4d render evaluation \
data/evaluation_llm/rubric10/concatenated/AI_READI_claudecode_agent_evaluation.json \
-o /tmp/AI_READI_evaluation.html
poetry run d4d schema validate docs/yaml_output/concatenated/gpt5/AI_READI_d4d.yaml
poetry run d4d utils status --quick

- `d4d evaluate llm` requires `ANTHROPIC_API_KEY`.
- `d4d render html --template human-readable` renders a single datasheet YAML file to the exact `--output` path you provide and copies `datasheet-common.css` into the same directory so the HTML remains styled when opened directly.
- `d4d render html --template linkml` renders the same structured input into the more technical LinkML-style HTML view.
- `d4d render evaluation` renders evaluation JSON directly and auto-detects `rubric10` vs `rubric20` unless you specify `--rubric`.
- Evaluation naming is now consistent: if you omit `-o`, rubric10 outputs default to `*_evaluation.html`, while rubric20 outputs default to `*_evaluation_rubric20.html`.
- `d4d render generate-all` is a convenience command that points users to the bulk HTML generation workflow (`make gen-d4d-html`).
- `d4d schema` and `d4d rocrate` rely on helper scripts in `.claude/agents/scripts/`, so running from a repository checkout is important.
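The default naming rule for evaluation output can be written down as a small function. This is an illustrative sketch of the rule as stated in the notes above, not the CLI's actual code: the function name `default_evaluation_html` is hypothetical, and because this README does not say how the `*` part of the pattern is derived from the input file, the stem is passed in explicitly.

```python
def default_evaluation_html(stem: str, rubric: str) -> str:
    """Return the default HTML file name for a rendered evaluation.

    `stem` stands in for the `*` part of the documented pattern; how the
    CLI derives it from the input file is not specified in this README.
    """
    if rubric == "rubric10":
        return f"{stem}_evaluation.html"
    if rubric == "rubric20":
        return f"{stem}_evaluation_rubric20.html"
    raise ValueError(f"unknown rubric: {rubric!r}")
```

For example, with an assumed stem of `AI_READI`, the two rubrics map to `AI_READI_evaluation.html` and `AI_READI_evaluation_rubric20.html`, so outputs from both rubrics can sit in the same directory without overwriting each other.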
Use the `make` command to generate project artefacts:
- `make all`: make everything
- `make deploy`: deploys site
This project was made with linkml-project-cookiecutter.