[AnVIL DX] Add AnVIL datasets to Google Datasets catalog #4807

Add AnVIL datasets to the [Google Dataset Search](https://datasetsearch.research.google.com/) catalog so they are discoverable through Google. To do this, embed [schema.org Dataset](https://schema.org/Dataset) JSON-LD on each dataset detail page, where Google's crawler picks it up.

Companion to galaxyproject/brc-analytics#1264 and DataBiosphere/data-browser#4806.

## Reference implementation

The NCPI Dataset Catalog has already shipped this for studies. Mirror its pattern:

- Builder + types: [`app/utils/schemaOrg.ts`](https://github.com/NIH-NCPI/ncpi-dataset-catalog/blob/main/app/utils/schemaOrg.ts) — `SchemaDataset` interface and `buildStudyJsonLd()` factory.
- Render component: [`app/components/Detail/components/StudyJsonLd/studyJsonLd.tsx`](https://github.com/NIH-NCPI/ncpi-dataset-catalog/blob/main/app/components/Detail/components/StudyJsonLd/studyJsonLd.tsx) — wraps the JSON-LD in a `<script type="application/ld+json">` inside `next/head`, with HTML escaping to prevent script injection.
- Page integration: [`pages/[entityListType]/[...params].tsx`](https://github.com/NIH-NCPI/ncpi-dataset-catalog/blob/main/pages/%5BentityListType%5D/%5B...params%5D.tsx) — mounted on the detail route only.
- Tests: [`app/utils/schemaOrg.test.ts`](https://github.com/NIH-NCPI/ncpi-dataset-catalog/blob/main/app/utils/schemaOrg.test.ts) — covers required fields, truncation, and conditional fields.
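
The HTML escaping mentioned for the render component can be sketched as follows. This is a minimal illustration, not NCPI's actual helper: the function name `serializeJsonLd` and the exact set of escaped characters are assumptions. It relies on JSON string escapes such as `\u003c` decoding back to the original character, so the payload stays valid JSON-LD while a literal `</script>` can never appear in the markup.

```typescript
// Hypothetical helper (name and escape set are assumptions, not NCPI's code).
// Serializes a JSON-LD object for embedding in <script type="application/ld+json">.
function serializeJsonLd(jsonLd: Record<string, unknown>): string {
  return JSON.stringify(jsonLd)
    .replace(/</g, "\\u003c") // prevents "</script>" from closing the tag early
    .replace(/>/g, "\\u003e")
    .replace(/&/g, "\\u0026");
}

// Example: a hostile title is neutralized but still parses to the same value.
const payload = serializeJsonLd({
  "@context": "https://schema.org",
  "@type": "Dataset",
  name: "Example </script><script>alert(1)</script>",
});
```

Escaping at serialization time keeps the builder free of presentation concerns; the builder can return plain objects and the component decides how to embed them.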

## Google Dataset required + recommended fields

Per [Google's Dataset structured data guidelines](https://developers.google.com/search/docs/appearance/structured-data/dataset):

**Required**
- `name` — descriptive title
- `description` — 50–5000 characters

**Recommended**
- `identifier`, `url`, `sameAs`
- `creator`, `funder`, `license`
- `distribution` (with `contentUrl`, `encodingFormat`)
- `keywords`, `variableMeasured`, `measurementTechnique`
- `spatialCoverage`, `temporalCoverage`
- `includedInDataCatalog`, `isAccessibleForFree`, `version`, `citation`
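
The required and recommended fields above could be captured in a TypeScript type along these lines. This is a sketch only: the interface name `SchemaDataset` comes from the NCPI reference implementation, but the field shapes shown here and the `meetsRequiredFields` helper are assumptions for illustration, and only a subset of the recommended properties is included.

```typescript
// Sketch of a schema.org Dataset shape; field shapes are assumptions,
// not NCPI's actual types.
interface SchemaDataset {
  "@context": "https://schema.org";
  "@type": "Dataset";
  name: string; // required
  description: string; // required, 50 to 5000 characters
  // Recommended fields are optional so they can be omitted when the source is empty.
  identifier?: string[];
  url?: string;
  sameAs?: string[];
  keywords?: string[];
  isAccessibleForFree?: boolean;
  license?: string;
  version?: string;
}

// Hypothetical check mirroring the two required fields.
function meetsRequiredFields(dataset: SchemaDataset): boolean {
  const length = dataset.description.length;
  return dataset.name.trim().length > 0 && length >= 50 && length <= 5000;
}
```

A check like this is convenient in unit tests, where it can be run over representative fixtures before trying the external validators.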

## Initial mapping — AnVIL dataset (`DatasetEntity`) → `Dataset`

Source entity at `app/apis/azul/anvil-cmg/common/entities.ts` (`DatasetEntity`).

| schema.org field | Source / value |
| --- | --- |
| `@context` | `"https://schema.org"` |
| `@type` | `"Dataset"` |
| `name` | `title` (fall back to `dataset_id`) |
| `description` | `description` — strip HTML, truncate to 5000 chars (pad to ≥50 chars from title/consortium when needed) |
| `identifier` | `[dataset_id, ...registered_identifier]` (e.g. dbGaP phs accessions) |
| `url` | `${browserURL}/datasets/${dataset_id}` |
| `sameAs` | dbGaP and other registered identifier URLs derived from `registered_identifier` |
| `includedInDataCatalog` | `{ "@type": "DataCatalog", name: "AnVIL Data Explorer", url: browserURL }` |
| `isAccessibleForFree` | Map from `accessible` / data use restrictions (datasets behind controlled access → `false`) |
| `keywords` | Union of `consortium`, `data_modality`, `phenotypic_sex`, `reported_ethnicity`, `species`, `disease` (from biosamples/donors), library prep |
| `creator` | `{ "@type": "Organization", name: consortium }` |
| `funder` | TBD — confirm whether AnVIL-side funder mapping is available |
| `distribution` | `DataDownload[]` derived from manifest/curl/Terra export endpoints (mark as access-controlled where applicable) |
| `variableMeasured` | Optional `PropertyValue[]` from dataset summary counts (donors, biosamples, files, libraries) |
| `license` | TBD — confirm with team (likely DUO-derived terms) |

Resolve the open questions before merge: the `funder` and `license` mappings, and how to express controlled access in `isAccessibleForFree` and `distribution`.
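
As a rough illustration of the settled rows in the mapping table, a builder might look like the sketch below. Several things are assumptions: the `DatasetInput` shape is hypothetical (the real `DatasetEntity` may nest or type these fields differently), the HTML stripping is a simple regex rather than a proper sanitizer, and the padding sentence is invented wording. Fields with open questions (`funder`, `license`, `isAccessibleForFree`, `distribution`) are deliberately left out.

```typescript
// Hypothetical input shape: field names follow the mapping table, but the
// real DatasetEntity may differ (for example, in nesting or array-valued fields).
interface DatasetInput {
  consortium: string;
  data_modality: string[];
  dataset_id: string;
  description: string;
  registered_identifier: string[];
  title: string;
}

const MAX_DESCRIPTION_LENGTH = 5000;
const MIN_DESCRIPTION_LENGTH = 50;

// Strips tags with a simple regex, pads short descriptions, and truncates.
// The padding sentence is assumed wording, not an agreed one.
function buildDescription(dataset: DatasetInput, name: string): string {
  const text = dataset.description
    .replace(/<[^>]*>/g, " ")
    .replace(/\s+/g, " ")
    .trim();
  if (text.length >= MIN_DESCRIPTION_LENGTH) {
    return text.slice(0, MAX_DESCRIPTION_LENGTH);
  }
  const padding = `Dataset "${name}" from the ${dataset.consortium} consortium in the AnVIL Data Explorer.`;
  return `${text} ${padding}`.trim().slice(0, MAX_DESCRIPTION_LENGTH);
}

function buildDatasetJsonLd(
  dataset: DatasetInput,
  browserURL: string
): Record<string, unknown> {
  // Fall back to the dataset ID when the title is empty.
  const name = dataset.title || dataset.dataset_id;
  // Drop empty values and duplicates from the keyword union.
  const keywords = [dataset.consortium]
    .concat(dataset.data_modality)
    .filter((value, index, all) => value !== "" && all.indexOf(value) === index);
  return {
    "@context": "https://schema.org",
    "@type": "Dataset",
    name,
    description: buildDescription(dataset, name),
    identifier: [dataset.dataset_id, ...dataset.registered_identifier],
    url: `${browserURL}/datasets/${dataset.dataset_id}`,
    includedInDataCatalog: {
      "@type": "DataCatalog",
      name: "AnVIL Data Explorer",
      url: browserURL,
    },
    creator: { "@type": "Organization", name: dataset.consortium },
    // Omit keywords entirely when there is nothing to report.
    ...(keywords.length > 0 ? { keywords } : {}),
  };
}
```

Returning a plain object keeps the builder easy to unit-test; conditional spreads are one way to leave out a field when its source is empty rather than emitting `null`.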

## Implementation steps

1. Add `app/utils/schemaOrg.ts` (or AnVIL-namespaced equivalent) with `SchemaDataset` types and `buildDatasetJsonLd(dataset, browserURL)`.
2. Add a `DatasetJsonLd` component that renders the JSON-LD via `next/head` with the same HTML-escape helper as NCPI.
3. Mount the component on the dataset detail page (`pages/datasets/[entityId]`).
4. Unit-test the builder: required fields present, description truncation, conditional fields omitted when source is null, controlled-access handling.
5. Validate output against [Google's Rich Results Test](https://search.google.com/test/rich-results) and [Schema Markup Validator](https://validator.schema.org/) for representative datasets (open access, controlled access, multi-consortium).
6. Once shipped, request indexing via Google Search Console and confirm dataset pages start appearing in Google Dataset Search.

## Out of scope (follow-ups)

- JSON-LD on biosamples/donors/files/libraries detail pages.
- Sitemap entries for dataset detail pages if not already complete.
- Mirroring this on the AnVIL Catalog explorer (separate ticket).
