bridge2ai/data-sheets-schema

Executive Order 14168: This repository is under review for potential modification in compliance with Administration directives.

data-sheets-schema

📚 View Documentation & Examples · CLI Reference

A LinkML schema for the Datasheets for Datasets model, as published in Datasheets for Datasets. Inspired by datasheets as used in electronics and other industries, Gebru et al. proposed that every dataset "be accompanied with a datasheet that documents its motivation, composition, collection process, recommended uses, and so on". To this end, the authors created a series of topics and over 50 questions addressing different aspects of datasets, which are also useful in an AI/ML context. An example of a completed datasheet can be found here: Structured dataset documentation: a datasheet for CheXpert

Google is working with a different model called Data Cards, which in practice is close to the original Datasheets for Datasets template.

This repository stores a LinkML schema representation for the original Datasheets for Datasets model, representing the topics, sets of questions, and expected entities and fields in the answers (work in progress). Beyond a less structured markdown template for this model (e.g. template for datasheet for dataset) we are not aware of any other structured form representing Datasheets for Datasets.

We are also tracking related developments, such as augmented Datasheets for Datasets models as in Augmented Datasheets for Speech Datasets and Ethical Decision-Making.

Bridge2AI Generating Center Datasheets

Curated comprehensive datasheets for each Bridge2AI data generating project:

  • AI-READI - Retinal imaging and diabetes dataset
  • CM4AI - Cell maps for AI dataset
  • VOICE - Voice biomarker dataset
  • CHORUS - Health data for underrepresented populations

View all D4D examples →

Repository Structure

Browse the source code repository:

D4D CLI

This branch introduces a unified d4d CLI for the Datasheets for Datasets workflow. The command is exposed through Poetry:

poetry install
poetry run d4d --help

After installation you can also invoke it as d4d, but poetry run d4d is the safest form while developing in the repo.

Most subcommands currently expect a repository checkout because they import repo-local code from src/ and .claude/agents/scripts/.

Command Groups

The CLI is organized into six top-level groups:

  • d4d download: fetch, preprocess, and concatenate source materials
  • d4d evaluate: run presence-based and LLM-based evaluations
  • d4d render: render datasheets and evaluation outputs to HTML
  • d4d rocrate: parse, merge, and transform RO-Crate metadata
  • d4d schema: inspect schema metrics and validate D4D YAML
  • d4d utils: inspect pipeline status and validate preprocessing output

Full option-by-option documentation is available in the docs site: CLI Reference.

Common Workflows

Download, preprocess, and concatenate source documents for one project:

poetry run d4d download sources --project AI_READI
poetry run d4d download preprocess --project AI_READI
poetry run d4d download concatenate --project AI_READI
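
The three download steps above depend on each other, so it can help to run them as one unit that stops at the first failure. The following is a minimal sketch, not part of the d4d CLI: the function name `d4d_download_all` and the `D4D` override variable are hypothetical, and it assumes each subcommand exits non-zero on failure.

```shell
# Sketch: run the three download steps in order for one project.
# D4D is a hypothetical override (not a d4d feature) so the sequence
# can be previewed, e.g. D4D="echo poetry run d4d".
D4D="${D4D:-poetry run d4d}"

d4d_download_all() {
  local project="$1"
  local step
  for step in sources preprocess concatenate; do
    # $D4D is intentionally unquoted so it splits into command + arguments.
    # Assumes a failing step returns a non-zero exit status.
    $D4D download "$step" --project "$project" || return 1
  done
}
```

Calling `d4d_download_all AI_READI` runs the same three commands shown above; setting `D4D="echo poetry run d4d"` first prints them instead of executing them.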

Evaluate generated datasheets:

poetry run d4d evaluate presence --project AI_READI --method gpt5
poetry run d4d evaluate llm \
  --file data/d4d_concatenated/gpt5/AI_READI_d4d.yaml \
  --project AI_READI \
  --method gpt5 \
  --rubric both
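
Because d4d evaluate llm requires ANTHROPIC_API_KEY, a small pre-flight check avoids starting a run that cannot complete. This is a sketch only: `require_anthropic_key` is a hypothetical helper, not something the d4d CLI provides.

```shell
# Sketch: refuse to start an LLM evaluation when ANTHROPIC_API_KEY is unset
# or empty. require_anthropic_key is a hypothetical helper name.
require_anthropic_key() {
  if [ -z "${ANTHROPIC_API_KEY:-}" ]; then
    echo "ANTHROPIC_API_KEY is not set; d4d evaluate llm needs it" >&2
    return 1
  fi
}
```

It can be chained in front of the command, for example `require_anthropic_key && poetry run d4d evaluate llm --file data/d4d_concatenated/gpt5/AI_READI_d4d.yaml --project AI_READI --method gpt5 --rubric both`.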

Render and validate outputs:

poetry run d4d render html \
  docs/yaml_output/concatenated/gpt5/AI_READI_d4d.yaml \
  -o /tmp/AI_READI_d4d.html
poetry run d4d render html \
  docs/yaml_output/concatenated/gpt5/AI_READI_d4d.yaml \
  --template linkml \
  -o /tmp/AI_READI_d4d_linkml.html
poetry run d4d render evaluation \
  data/evaluation_llm/rubric10/concatenated/AI_READI_claudecode_agent_evaluation.json \
  -o /tmp/AI_READI_evaluation.html
poetry run d4d schema validate docs/yaml_output/concatenated/gpt5/AI_READI_d4d.yaml
poetry run d4d utils status --quick
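
To validate every datasheet in a directory rather than one file, the schema validate command can be wrapped in a loop. The sketch below makes several assumptions: datasheet files follow the `*_d4d.yaml` naming seen in the example path, validation failures produce a non-zero exit status, and the `validate_all` function and `D4D` override variable are hypothetical names introduced for illustration.

```shell
# Sketch: validate every *_d4d.yaml file in a directory and report
# an overall failure if any single file fails.
# D4D is a hypothetical override, e.g. D4D="echo poetry run d4d" for a preview.
D4D="${D4D:-poetry run d4d}"

validate_all() {
  local dir="$1"
  local status=0
  local f
  for f in "$dir"/*_d4d.yaml; do
    [ -e "$f" ] || continue   # skip when the glob matches nothing
    # Keep going after a failure so every file is checked.
    $D4D schema validate "$f" || status=1
  done
  return "$status"
}
```

For example, `validate_all docs/yaml_output/concatenated/gpt5` would check each datasheet under that directory in turn.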

Current CLI Notes

  • d4d evaluate llm requires ANTHROPIC_API_KEY.
  • d4d render html --template human-readable renders a single datasheet YAML file to the exact --output path you provide and copies datasheet-common.css into the same directory so the HTML remains styled when opened directly.
  • d4d render html --template linkml renders the same structured input into the more technical LinkML-style HTML view.
  • d4d render evaluation renders evaluation JSON directly and auto-detects rubric10 vs rubric20 unless you specify --rubric.
  • Evaluation naming is now consistent: if you omit -o, rubric10 outputs default to *_evaluation.html, while rubric20 outputs default to *_evaluation_rubric20.html.
  • d4d render generate-all is a convenience command that points users to the bulk HTML generation workflow (make gen-d4d-html).
  • d4d schema and d4d rocrate rely on helper scripts in .claude/agents/scripts/, so running from a repository checkout is important.
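
The default naming rule for evaluation output can be illustrated with a small helper. This is only a sketch of the suffix rule stated above: `default_eval_name` is a hypothetical function, and the stem (the part matched by `*`) is passed in explicitly because how d4d derives it is not described here.

```shell
# Sketch: build the default evaluation HTML name for a given stem and rubric.
# rubric10 -> <stem>_evaluation.html
# rubric20 -> <stem>_evaluation_rubric20.html
default_eval_name() {
  local stem="$1"
  local rubric="$2"
  case "$rubric" in
    rubric10) printf '%s_evaluation.html\n' "$stem" ;;
    rubric20) printf '%s_evaluation_rubric20.html\n' "$stem" ;;
    *) echo "unknown rubric: $rubric" >&2; return 1 ;;
  esac
}
```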

Developer Documentation

Use the `make` command to generate project artefacts:
  • make all: build everything
  • make deploy: deploy the site

Credits

This project was made with linkml-project-cookiecutter.

About

Datasheets for Datasets, as LinkML Schema
