Add AnVIL datasets to Google Dataset Search so they are discoverable from Google. To do this, embed schema.org Dataset JSON-LD on dataset detail pages, where Google's crawler picks it up.
Companion to galaxyproject/brc-analytics#1264 and #4806.
Reference implementation
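NCPI Dataset Catalog has already shipped this for studies. Mirror their pattern:
- app/utils/schemaOrg.ts: SchemaDataset interface and buildStudyJsonLd() factory.
- app/components/Detail/components/StudyJsonLd/studyJsonLd.tsx: wraps the JSON-LD in a <script type="application/ld+json"> inside next/head, with HTML escaping to prevent script injection.
- pages/[entityListType]/[...params].tsx: mounted on the detail route only.
- app/utils/schemaOrg.test.ts: covers required fields, truncation, and conditional fields.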
Google Dataset required + recommended fields
Per Google's Dataset structured data guidelines:
Required
- name: descriptive title
- description: 50–5000 characters
Recommended
- identifier, url, sameAs
- creator, funder, license
- distribution (with contentUrl, encodingFormat)
- keywords, variableMeasured, measurementTechnique
- spatialCoverage, temporalCoverage
- includedInDataCatalog, isAccessibleForFree, version, citation
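As a rough illustration of the required fields above, the sketch below types a small subset of the properties and checks the two required ones. The SchemaDataset name comes from this issue, but the shape shown here is an assumption, and checkRequiredFields is a hypothetical helper, not part of the NCPI code.

```typescript
// Assumed shape: a subset of the schema.org Dataset properties listed above.
// The real SchemaDataset type in NCPI may differ.
interface SchemaDataset {
  "@context": "https://schema.org";
  "@type": "Dataset";
  name: string;
  description: string;
  identifier?: string[];
  url?: string;
  sameAs?: string[];
  keywords?: string[];
  isAccessibleForFree?: boolean;
}

// Description length bounds stated in the required-fields list.
const DESCRIPTION_MIN = 50;
const DESCRIPTION_MAX = 5000;

// Hypothetical helper: returns a list of problems with the required fields.
// An empty list means name and description look acceptable.
function checkRequiredFields(dataset: SchemaDataset): string[] {
  const problems: string[] = [];
  if (dataset.name.trim().length === 0) {
    problems.push("name is empty");
  }
  const length = dataset.description.length;
  if (length < DESCRIPTION_MIN || length > DESCRIPTION_MAX) {
    problems.push(
      `description length ${length} is outside ${DESCRIPTION_MIN}-${DESCRIPTION_MAX}`
    );
  }
  return problems;
}
```

A check like this could back the "required fields present" unit test, since it fails fast on an empty name or an out-of-range description.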
Initial mapping — AnVIL dataset (DatasetEntity) → Dataset
Source entity at app/apis/azul/anvil-cmg/common/entities.ts (DatasetEntity).
| schema.org field | Source / value |
| --- | --- |
| @context | "https://schema.org" |
| @type | "Dataset" |
| name | title (fall back to dataset_id) |
| description | description: strip HTML, truncate to 5000 chars (pad to ≥50 chars from title/consortium when needed) |
| identifier | [dataset_id, ...registered_identifier] (e.g. dbGaP phs accessions) |
| url | ${browserURL}/datasets/${dataset_id} |
| sameAs | dbGaP and other registered identifier URLs derived from registered_identifier |
| includedInDataCatalog | { "@type": "DataCatalog", name: "AnVIL Data Explorer", url: browserURL } |
| isAccessibleForFree | Map from accessible / data use restrictions (datasets behind controlled access → false) |
| keywords | Union of consortium, data_modality, phenotypic_sex, reported_ethnicity, species, disease (from biosamples/donors), library prep |
| creator | { "@type": "Organization", name: consortium } |
| funder | TBD: confirm whether AnVIL-side funder mapping is available |
| distribution | DataDownload[] derived from manifest/curl/Terra export endpoints (mark as access-controlled where applicable) |
| variableMeasured | Optional PropertyValue[] from dataset summary counts (donors, biosamples, files, libraries) |
| license | TBD: confirm with team (likely DUO-derived terms) |
Open questions to resolve before merge: the funder mapping, the license, and how to express controlled access in isAccessibleForFree and distribution.
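The settled rows of the mapping could be sketched as a builder along the following lines. This is a minimal sketch under stated assumptions: DatasetEntityLike is a hypothetical subset of DatasetEntity (the real field types may differ), the padding sentence is placeholder wording, and the regex-based HTML stripping is a simplification. The TBD fields (funder, license) and the controlled-access fields are left out deliberately.

```typescript
// Hypothetical subset of DatasetEntity; the real interface lives in
// app/apis/azul/anvil-cmg/common/entities.ts and may differ.
interface DatasetEntityLike {
  dataset_id: string;
  title: string | null;
  description: string | null;
  registered_identifier: string[];
  consortium: string | null;
}

type JsonLd = Record<string, unknown>;

const MIN_DESCRIPTION = 50;
const MAX_DESCRIPTION = 5000;

// Crude tag stripper for illustration only; real content may need an HTML parser.
function stripHtml(html: string): string {
  return html.replace(/<[^>]*>/g, " ").replace(/\s+/g, " ").trim();
}

// Strips HTML, pads short text from title/consortium, then truncates.
// The padding sentence is placeholder wording, not agreed copy.
function normalizeDescription(dataset: DatasetEntityLike, name: string): string {
  let text = stripHtml(dataset.description ?? "");
  if (text.length < MIN_DESCRIPTION) {
    const prefix = dataset.consortium
      ? `${name} is a dataset from the ${dataset.consortium} consortium, available through the AnVIL Data Explorer.`
      : `${name} is a dataset available through the AnVIL Data Explorer.`;
    text = `${prefix} ${text}`.trim();
  }
  return text.slice(0, MAX_DESCRIPTION);
}

function buildDatasetJsonLd(dataset: DatasetEntityLike, browserURL: string): JsonLd {
  // name: title, falling back to dataset_id when title is null or blank.
  const name = dataset.title?.trim() || dataset.dataset_id;
  const jsonLd: JsonLd = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    name,
    description: normalizeDescription(dataset, name),
    identifier: [dataset.dataset_id, ...dataset.registered_identifier],
    url: `${browserURL}/datasets/${dataset.dataset_id}`,
    includedInDataCatalog: {
      "@type": "DataCatalog",
      name: "AnVIL Data Explorer",
      url: browserURL,
    },
  };
  // Conditional field: omitted entirely when the source value is missing.
  if (dataset.consortium) {
    jsonLd["creator"] = { "@type": "Organization", name: dataset.consortium };
  }
  return jsonLd;
}
```

Omitting a property when its source is null, rather than emitting null or an empty string, keeps the output free of empty values and matches the "conditional fields omitted when source is null" test case.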
Implementation steps
- Add app/utils/schemaOrg.ts (or AnVIL-namespaced equivalent) with SchemaDataset types and buildDatasetJsonLd(dataset, browserURL).
- Add a DatasetJsonLd component that renders the JSON-LD via next/head with the same HTML-escape helper as NCPI.
- Mount the component on the dataset detail page (pages/datasets/[entityId]).
- Unit-test the builder: required fields present, description truncation, conditional fields omitted when source is null, controlled-access handling.
- Validate output against Google's Rich Results Test and Schema Markup Validator for representative datasets (open access, controlled access, multi-consortium).
- Once shipped, request indexing via Google Search Console and confirm dataset pages start appearing in Google Dataset Search.
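For the HTML-escape step in the component, the NCPI helper itself is not reproduced in this issue. The sketch below shows one common approach and is an assumption about what such a helper might look like; serializeJsonLd is a hypothetical name.

```typescript
// Serializes JSON-LD for embedding in a <script type="application/ld+json"> element.
// Replacing "<", ">" and "&" with \uXXXX escapes keeps the output valid JSON
// (a JSON parser decodes them back to the original characters) while stopping
// a value such as "</script>" from closing the script element early.
function serializeJsonLd(data: Record<string, unknown>): string {
  return JSON.stringify(data)
    .replace(/</g, "\\u003c")
    .replace(/>/g, "\\u003e")
    .replace(/&/g, "\\u0026");
}
```

In a React component the resulting string would typically be set as the script element's inner HTML (for example through dangerouslySetInnerHTML), which is why the escaping has to happen before rendering.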
Out of scope (follow-ups)
- JSON-LD on biosamples/donors/files/libraries detail pages.
- Sitemap entries for dataset detail pages if not already complete.
- Mirroring this on the AnVIL Catalog explorer (separate ticket).