[AnVIL DX] Add AnVIL datasets to Google Datasets catalog #4807

@NoopDog

Add AnVIL datasets to the Google Dataset Search catalog so they're discoverable from Google. This is done by embedding schema.org Dataset JSON-LD on dataset detail pages, which Google's crawler then picks up.

Companion to galaxyproject/brc-analytics#1264 and #4806.

Reference implementation

The NCPI Dataset Catalog has already shipped this for studies; mirror its pattern.

Google Dataset required + recommended fields

Per Google's Dataset structured data guidelines:

Required

  • name — descriptive title
  • description — 50–5000 characters

Recommended

  • identifier, url, sameAs
  • creator, funder, license
  • distribution (with contentUrl, encodingFormat)
  • keywords, variableMeasured, measurementTechnique
  • spatialCoverage, temporalCoverage
  • includedInDataCatalog, isAccessibleForFree, version, citation
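
As a rough illustration of the field list above, the required fields and a few recommended ones could be typed and checked as follows. This is only a sketch: the `SchemaDataset` name comes from the implementation steps in this issue, but its exact shape and the `checkRequiredFields` helper are assumptions, not existing code.

```typescript
// Hypothetical type: property names follow schema.org, but the selection of
// fields and their TypeScript types are assumptions made for illustration.
interface SchemaDataset {
  "@context": "https://schema.org";
  "@type": "Dataset";
  name: string;
  description: string;
  identifier?: string[];
  url?: string;
  sameAs?: string[];
  keywords?: string[];
  isAccessibleForFree?: boolean;
}

// Description length bounds from the "Required" list above.
const DESCRIPTION_MIN = 50;
const DESCRIPTION_MAX = 5000;

// Returns a list of problems; an empty list means the required fields pass.
function checkRequiredFields(dataset: SchemaDataset): string[] {
  const problems: string[] = [];
  if (dataset.name.trim().length === 0) {
    problems.push("name is empty");
  }
  const length = dataset.description.length;
  if (length < DESCRIPTION_MIN || length > DESCRIPTION_MAX) {
    problems.push(
      `description length ${length} is outside ${DESCRIPTION_MIN}-${DESCRIPTION_MAX}`
    );
  }
  return problems;
}
```

A check like this could back the "required fields present" unit test without depending on Google's validators.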

Initial mapping — AnVIL dataset (DatasetEntity) → Dataset

Source entity at app/apis/azul/anvil-cmg/common/entities.ts (DatasetEntity).

| schema.org field | Source / value |
| --- | --- |
| @context | "https://schema.org" |
| @type | "Dataset" |
| name | title (fall back to dataset_id) |
| description | description, with HTML stripped and truncated to 5000 chars (pad to ≥50 chars from title/consortium when needed) |
| identifier | [dataset_id, ...registered_identifier] (e.g. dbGaP phs accessions) |
| url | ${browserURL}/datasets/${dataset_id} |
| sameAs | dbGaP and other registered identifier URLs derived from registered_identifier |
| includedInDataCatalog | { "@type": "DataCatalog", name: "AnVIL Data Explorer", url: browserURL } |
| isAccessibleForFree | Map from accessible / data use restrictions (datasets behind controlled access → false) |
| keywords | Union of consortium, data_modality, phenotypic_sex, reported_ethnicity, species, disease (from biosamples/donors), library prep |
| creator | { "@type": "Organization", name: consortium } |
| funder | TBD: confirm whether AnVIL-side funder mapping is available |
| distribution | DataDownload[] derived from manifest/curl/Terra export endpoints (mark as access-controlled where applicable) |
| variableMeasured | Optional PropertyValue[] from dataset summary counts (donors, biosamples, files, libraries) |
| license | TBD: confirm with team (likely DUO-derived terms) |
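
A minimal sketch of how the settled rows of this mapping might translate into a builder. The `DatasetInput` shape is hypothetical: its field names are taken from the mapping, but the real `DatasetEntity` may type them differently (for example as arrays or with different nullability). The `stripHtml` and `normalizeDescription` helpers are likewise assumptions, and the TBD and controlled-access fields are deliberately left out.

```typescript
// Hypothetical input shape; the real DatasetEntity may differ.
interface DatasetInput {
  dataset_id: string;
  title: string | null;
  description: string | null;
  consortium: string | null;
  registered_identifier: string[];
}

const MIN_DESCRIPTION = 50;
const MAX_DESCRIPTION = 5000;

// Naive tag stripper for illustration only; not a full HTML sanitizer.
function stripHtml(html: string): string {
  return html.replace(/<[^>]*>/g, " ").replace(/\s+/g, " ").trim();
}

// Strip HTML, prepend name/consortium when the text is short, then truncate.
// Note: the padding does not by itself guarantee the 50-character minimum.
function normalizeDescription(dataset: DatasetInput, name: string): string {
  let text = stripHtml(dataset.description ?? "");
  if (text.length < MIN_DESCRIPTION) {
    const prefix = [name, dataset.consortium]
      .filter((part): part is string => Boolean(part))
      .join(", ");
    text = `${prefix}. ${text}`.trim();
  }
  return text.slice(0, MAX_DESCRIPTION);
}

function buildDatasetJsonLd(
  dataset: DatasetInput,
  browserURL: string
): Record<string, unknown> {
  // name: title, falling back to dataset_id.
  const name = dataset.title || dataset.dataset_id;
  const jsonLd: Record<string, unknown> = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    name,
    description: normalizeDescription(dataset, name),
    identifier: [dataset.dataset_id, ...dataset.registered_identifier],
    url: `${browserURL}/datasets/${dataset.dataset_id}`,
    includedInDataCatalog: {
      "@type": "DataCatalog",
      name: "AnVIL Data Explorer",
      url: browserURL,
    },
  };
  // Conditional field: omit creator when the source value is missing.
  if (dataset.consortium) {
    jsonLd["creator"] = { "@type": "Organization", name: dataset.consortium };
  }
  return jsonLd;
}
```

Building a plain object and adding conditional fields afterwards keeps null sources out of the output, which matches the "conditional fields omitted when source is null" test case in the implementation steps.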

The open questions (funder mapping, license, and how to express controlled access in isAccessibleForFree and distribution) should be resolved before merge.

Implementation steps

  1. Add app/utils/schemaOrg.ts (or AnVIL-namespaced equivalent) with SchemaDataset types and buildDatasetJsonLd(dataset, browserURL).
  2. Add a DatasetJsonLd component that renders the JSON-LD via next/head with the same HTML-escape helper as NCPI.
  3. Mount the component on the dataset detail page (pages/datasets/[entityId]).
  4. Unit-test the builder: required fields present, description truncation, conditional fields omitted when source is null, controlled-access handling.
  5. Validate output against Google's Rich Results Test and Schema Markup Validator for representative datasets (open access, controlled access, multi-consortium).
  6. Once shipped, request indexing via Google Search Console and confirm dataset pages start appearing in Google Dataset Search.
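
For step 2, the escaping could look roughly like the sketch below. The NCPI helper is not reproduced in this issue, so `serializeJsonLd` is a hypothetical stand-in that shows one common approach: replacing `<`, `>`, and `&` with JSON unicode escapes so dataset text cannot close the surrounding script tag early.

```typescript
// Hypothetical helper; the actual NCPI escape helper may differ.
// The unicode escapes are still valid JSON, so parsers read back the
// original characters.
function serializeJsonLd(jsonLd: Record<string, unknown>): string {
  return JSON.stringify(jsonLd)
    .replace(/</g, "\\u003c")
    .replace(/>/g, "\\u003e")
    .replace(/&/g, "\\u0026");
}
```

In a component, the returned string would typically be placed in a script element of type application/ld+json; the builder unit tests from step 4 can then stay independent of rendering.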

Out of scope (follow-ups)

  • JSON-LD on biosamples/donors/files/libraries detail pages.
  • Sitemap entries for dataset detail pages if not already complete.
  • Mirroring this on the AnVIL Catalog explorer (separate ticket).
