diff --git a/.gitignore b/.gitignore index 4acaec28..4698bf88 100644 --- a/.gitignore +++ b/.gitignore @@ -2,3 +2,4 @@ _site/ .jekyll-cache/ .Rhistory node_modules/ +.DS_Store \ No newline at end of file diff --git a/acknowledgements.md b/acknowledgements.md index 9bc62c5b..498b7232 100644 --- a/acknowledgements.md +++ b/acknowledgements.md @@ -2,6 +2,7 @@ Poseidon depends on a large community of contributors and advisors. Here we list project alumni and other contributors. +- **Luca Thale-Bombien**: Student assistant for the Poseidon project in 2024 and 2025. - **Dhananjaya Bandara Aththanayaka Aththanayaka Mudiyanselage**: Student assistant for the Poseidon project from 2021 to 2025. - **Kenana Saeed**: Student assistant for the Poseidon project in 2024. - **Michelle O´Reilly**: Design of the Poseidon logo and colour palette. diff --git a/archive_explorer_old.md b/archive_explorer_old.md deleted file mode 100644 index 46d8719a..00000000 --- a/archive_explorer_old.md +++ /dev/null @@ -1,480 +0,0 @@ - - - - -
- - -
- _  Loading... -
-
- -
- -
- - - - -
-
- -
- Package: {{ selectedPackageTitle }} -
-
- -
- - - - -
- -
- - - - - - - - - - - - - - - - - - - - - - - -
Description{{ selectedPackage.description }}
Package version - v{{ selectedPackage.packageVersion }} - (that is the latest available version) - (that is not the latest available version) - for Poseidon v{{ selectedPackage.poseidonVersion }}. -
- It was last modified on {{ selectedPackage.lastModified }}. -
Resources - See this package on GitHub: - - - - Download this package as .zip archive: - -
Nr of samples{{ selectedPackage.nrIndividuals }}
-
- -
- - - - - - - - - - - - - - - - - - - - -
Poseidon_IDGroupsDetails
{{ sample.poseidonID }}{{ sample.groupNames.toString() }} -
- View sample details -
-
- {{ addCol[0] }}: {{ addCol[1] }}
-
-
- *More variables are available in the complete .janno file. -
-
-
- -
- - -
- -
- - - - - - - - - - - - - - -
- {{ pac.packageTitle }}
- v{{ pac.packageVersion }}, Samples: {{ pac.nrIndividuals }} -
- {{ pac.description }} - - - - - - -
- -
-
-
- -
- - - - - diff --git a/archive_reviewer_guide.md b/archive_reviewer_guide.md index a6cf1bbb..536f11dc 100644 --- a/archive_reviewer_guide.md +++ b/archive_reviewer_guide.md @@ -4,7 +4,7 @@ The role of the Poseidon package reviewer is to help ensuring quality standards for Poseidon's public package archives. Fortunately, many aspects of the Poseidon schema are machine-testable. Automatic validation catches various structural issues right away, for example missing mandatory columns in the Poseidon .janno file (such as the `Poseidon_ID`). -But there are some aspects we cannot check, such as the scientific correctness of the given information. And there are other we don't want to formally check, because they are not included in the core definition of a Poseidon package, but just policy for our public archives. For these, we rely on a checklist every package author has to fill, and finally manual reviews. +But there are some aspects we cannot check, such as the scientific correctness of the given information. And there are others we don't want to formally check, because they are not included in the core definition of a Poseidon package, but just policy for our public archives. For these, we rely on a checklist every package author has to fill, and finally reviews. ## GitHub Pull Requests diff --git a/archive_submission_guide.md b/archive_submission_guide.md index e43c1d34..acd32f15 100644 --- a/archive_submission_guide.md +++ b/archive_submission_guide.md @@ -6,8 +6,6 @@ The Poseidon framework has a strongly decentralized philosophy and relies very m We assume you have some basic knowledge about using a command line software like [`trident`](trident), and how to handle Git and GitHub. If not, then you can become knowledgable quickly about the latter, for example [here](https://githubtraining.github.io/training-manual). -!> Never clone the archive repositories without `GIT_LFS_SKIP_SMUDGE=1`. Always clone with `GIT_LFS_SKIP_SMUDGE=1 git clone ...`. 
- ## Archive curation roles To manage package submissions and modifications in our archives, we define the following roles, which are synonymous to the respective roles within github: @@ -55,29 +53,26 @@ This is mandatory. Please also run [`trident validate`](trident?id=validate-comm ### Submitting the package -The procedure for the actual submission is then as follows (a shorter, slightly more hands-on tutorial is available [here](https://mpi-eva-archaeogenetics.github.io/comp_human_adna_book/poseidon.html#contributing-to-the-community-archive)) +The procedure for the actual submission is then as follows: -**1. Fork and then clone the GitHub repository for the archive you want to modify.** +**1. Fork the GitHub repository for the archive you want to modify.** You need to be logged into github with your user account. You can then navigate to our github repository: and hit the "Fork" button near the top of the page. You will then have a copy of the entire repository under your own user name: `https://github.com//community-archive`. -For the following to work, you need to have setup your github account in a way that allows you to communicate with github via the command line. For this, you need to configure an SSH public-key, so github really knows it's you. Find out more about it here: . +**2. Clone (download) your fork.** -!> To safe our [Git LFS](https://docs.github.com/en/repositories/working-with-files/managing-large-files/about-git-large-file-storage) bandwidth, **please clone in a way that does not download the large data files from GitHub** (they should be downloaded from our webserver with [`trident fetch`](trident?id=fetch-command)). +For the following to work, you need to have setup your github account in a way that allows you to communicate with github via the command line. For this, you need to configure an SSH public-key, so github really knows it's you. Find out more about it here: . -At the same time you need to be able to add new LFS files. 
A proper setup for this includes the following steps: +You need to be able to add new [Git LFS](https://docs.github.com/en/repositories/working-with-files/managing-large-files/about-git-large-file-storage) files. A proper setup for this includes the following two steps: -- downloading and [installing Git LFS](https://git-lfs.github.com/), -- setting it up for your user with `git lfs install` -- cloning the repo **with the `GIT_LFS_SKIP_SMUDGE` environment variable**, which prevents downloading the LFS files despite Git LFS being enabled: +1. downloading and [installing Git LFS](https://git-lfs.github.com/), +2. setting it up for your user with `git lfs install` -``` -GIT_LFS_SKIP_SMUDGE=1 git clone git@github.com:/community-archive.git -``` +You can then clone the fork repository. -As a consequence the large files will not be downloaded, but only stub files, representing the real files on the LFS server. This clone is only for submission purposes after all -- you can not work with the genotype data in it. `2021_Wang_EastAsia/2021_Wang_EastAsia.bed` for example will look like this: +To save some time and storage space on your system, you can clone in a way that does not download the large data files in the repository. You can do so by setting the `GIT_LFS_SKIP_SMUDGE` environment variable. As a consequence, the large files will not be downloaded, but only stub files that represent the real files on the LFS server. This clone is only for submission purposes after all -- you will probably not work with the genotype data in it. `2021_Wang_EastAsia/2021_Wang_EastAsia.bed` for example will look like this: ``` version https://git-lfs.github.com/spec/v1 @@ -85,7 +80,15 @@ oid sha256:766e7c9f79c1659dfb924c901420f01e8720557a0ec37f2a694f6a29cdc0a55e size 177553875 ``` -**2. 
Copy your new package into your local clone.** +The clone command with `GIT_LFS_SKIP_SMUDGE` set is as follows: + +``` +GIT_LFS_SKIP_SMUDGE=1 git clone git@github.com:/community-archive.git +``` + +If you want to download the large files as well, then omit `GIT_LFS_SKIP_SMUDGE=1`. + +**3. Copy your new package into your local clone.** You should now copy your package including the full genotype data into the cloned repository as a new package directory. The directory should include the genotype data. Git (with Git LFS enabled) and GitHub will detect automatically that it should treat them as LFS files. Then commit the changes and push: @@ -97,7 +100,7 @@ git push If you accidentally pushed the large files as normal files, for example if your LFS setup was incomplete, you can fix this with `git lfs migrate import --no-rewrite path/to/file.bed` (see [here](https://github.com/git-lfs/git-lfs/blob/main/docs/man/git-lfs-migrate.adoc#import-without-rewriting-history)). -**3. Submit a pull request from your fork to merge your updates into our repository.** +**4. Submit a pull request from your fork to merge your updates into our repository.** Having successfully pushed your branch to your fork on github, you need to now tell github to propose your branch as a submission to our master repository. This is done through [github Pull Requests](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/about-pull-requests). @@ -111,7 +114,7 @@ If you identify a mistake in any package, be it in the context data (`.janno` fi **1. Fork and clone the GitHub repository that contains the package you want to improve.** -Just as above described for the package submission, please remember to clone with `GIT_LFS_SKIP_SMUDGE=1`. Individual LFS files can be downloaded with `git lfs pull --include "PATH-TO-FILE"`. 
This is necessary if you would like to modify not just the context- and meta data, but also the genotype data of a package. +Just as described above for the package submission. If you cloned with `GIT_LFS_SKIP_SMUDGE=1` but now want to edit individual LFS files, then you can download them with `git lfs pull --include "PATH-TO-FILE"`. **2. Modify the files you want to change.** diff --git a/background.md b/background.md index 258d7a2a..46b037fc 100644 --- a/background.md +++ b/background.md @@ -1,14 +1,14 @@ # Background -Archaeogenetics has become a fast accelerating field, with new data coming out faster than many individual researchers can keep track of and co-analyze. Recently, we have surpassed the threshold of genome-wide data for [10,000 ancient human individuals](https://www.nature.com/articles/d41586-023-01403-4). In addition, for many of those samples we also have rich metadata ranging from archaeological information to radiocarbon dating. +Archaeogenetics has become a fast accelerating field, with new data coming out faster than many individual researchers can keep track of and co-analyze. Already in 2023 we have surpassed the threshold of genome-wide data for [10,000 ancient human individuals](https://www.nature.com/articles/d41586-023-01403-4). In addition, for many of those samples we also have rich metadata ranging from archaeological information to radiocarbon dating. -The way data is currently shared and published via academic papers, at least from genetic analyses, is mainly via releasing raw sequencing data into public repositories such as the [ENA](https://www.ebi.ac.uk/ena), while providing partial metadata on samples via often poorly formatted Excel tables in the Supplement. 
This creates (at least) the following problems: +The way data is currently shared and published via academic papers, at least from genetic analyses, is mainly via releasing raw sequencing data into public repositories such as the [ENA](https://www.ebi.ac.uk/ena), while providing partial metadata on samples via often poorly formatted Excel tables in the supplementary materials. This creates (at least) the following problems: 1. Intermediate data such as genotypes are often not released at all, making it hard for others to reproduce analyses. 2. The connection between individuals, contextual information, and genetic data becomes hard to maintain, bridging between very different repositories and sources (Excel vs. personal homepages vs. public repositories) 3. Meta-analyses spanning datasets require enormous amounts of work on data collection and curation. -A major initiative to address these problems in human archaeogenetics is the [Allen Ancient DNA Resource](https://doi.org/10.1101/2023.04.06.535797) ("AADR"), which is a curated dataset of public ancient DNA data generated, curated and bundled by David Reich's ancient DNA laboratory at Harvard University. In many ways, our initiative is inpiried by and deriving from this resource. In particular, the AADR currently (April 2023) is arguably the most complete resource world-wide that provides genome-wide genotype data for ancient human individuals from nearly all publications in the field. +A major initiative to address these problems in human archaeogenetics is the [Allen Ancient DNA Resource](http://dx.doi.org/10.1038/s41597-024-03031-7) ("AADR"), which is a curated dataset of public ancient DNA data generated, curated and bundled by David Reich's ancient DNA laboratory at Harvard University. In many ways, our initiative is inspired by and derived from this resource. 
In particular, the AADR currently (April 2023) is arguably the most complete resource world-wide that provides genome-wide genotype data for ancient human individuals from nearly all publications in the field. Our [public archives](archive_overview) derive to a large extent directly from the AARD, while many curated packages, in particular from 2019 and later, contain data compiled and generated by us. But our initiative also differs in important aspects from the AARD: diff --git a/dev_notes.md b/dev_notes.md index 152f9dab..38635086 100644 --- a/dev_notes.md +++ b/dev_notes.md @@ -24,11 +24,8 @@ flowchart TD ssfDef[".ssf"] packageDef -- defines --> ssfDef - poseidonAnalysisHS["poseidon-analysis-hs library"] - poseidonHS --> poseidonAnalysisHS - xerxes["xerxes"] - poseidonAnalysisHS --> xerxes + poseidonHS --> xerxes poseidonHS["poseidon-hs library"] poseidonYMLDef --> poseidonHS diff --git a/genotype_data.md b/genotype_data.md index c74376db..70f50feb 100644 --- a/genotype_data.md +++ b/genotype_data.md @@ -2,17 +2,19 @@ ## File formats -Genotype data in Poseidon packages can be stored in either of two (multi)file formats: PLINK (binary) and EIGENSTRAT. +Genotype data in Poseidon packages can be stored in one of three (multi)file formats: PLINK (binary), EIGENSTRAT, and VCF. 
-| | PLINK (binary) | EIGENSTRAT | -|---|---|---| -| genotype file | [`.bed` (binary biallelic genotype table)](https://www.cog-genomics.org/plink/1.9/formats#bed) or `.bed.gz` | [`.geno` (genotype file)](https://github.com/DReichLab/EIG/blob/fb4fb59065055d3622e0f97f0149588eae630a3e/CONVERTF/README#L67) or `.geno.gz` -| SNP file | [`.bim` (extended MAP file)](https://www.cog-genomics.org/plink/1.9/formats#bim) or `.bim.gz` | [`.snp` (snp file)](https://github.com/DReichLab/EIG/blob/fb4fb59065055d3622e0f97f0149588eae630a3e/CONVERTF/README#L67) or `.snp.gz` | -| individual file | [`.fam` (sample information)](https://www.cog-genomics.org/plink/1.9/formats#fam) | [`.ind` (indiv file)](https://github.com/DReichLab/EIG/blob/fb4fb59065055d3622e0f97f0149588eae630a3e/CONVERTF/README#L67) | +| | PLINK (binary) | EIGENSTRAT | VCF | +|---|---|---|---| +| genotype file | [`.bed` (binary biallelic genotype table) or `.bed.gz`](https://www.cog-genomics.org/plink/1.9/formats#bed) | [`.geno` (genotype file) or `.geno.gz`](https://github.com/DReichLab/EIG/blob/fb4fb59065055d3622e0f97f0149588eae630a3e/CONVERTF/README#L67) | [`.vcf` or `.vcf.gz`](https://samtools.github.io/hts-specs/VCFv4.2.pdf) | +| SNP file | [`.bim` (extended MAP file) or `.bim.gz`](https://www.cog-genomics.org/plink/1.9/formats#bim) | [`.snp` (snp file) or `.snp.gz`](https://github.com/DReichLab/EIG/blob/fb4fb59065055d3622e0f97f0149588eae630a3e/CONVERTF/README#L67) | | +| individual file | [`.fam` (sample information)](https://www.cog-genomics.org/plink/1.9/formats#fam) | [`.ind` (indiv file)](https://github.com/DReichLab/EIG/blob/fb4fb59065055d3622e0f97f0149588eae630a3e/CONVERTF/README#L67) | | -The PLINK file format is a well specified, storage efficient data type compatible with many bioinformatic software tools, which made it an obvious choice for Poseidon. The EIGENSTRAT format is also common within archaeogenetics, compatible with many of the important tools developed by the Reich Lab, e.g. 
the ones in the [EIGENSOFT](https://github.com/DReichLab/EIG) and [ADMIXTOOLS](https://github.com/DReichLab/AdmixTools). In the future even more formats might be supported (see e.g. [here](https://reich.hms.harvard.edu/software/InputFileFormats)). +The PLINK file format is a well specified, storage efficient data type compatible with many bioinformatic software tools, which made it an obvious choice for Poseidon. The EIGENSTRAT format is also common within archaeogenetics, compatible with many of the important tools developed by the Reich Lab, e.g. the ones in the [EIGENSOFT](https://github.com/DReichLab/EIG) and [ADMIXTOOLS](https://github.com/DReichLab/AdmixTools) sets. Since Poseidon v3.0.0 the [Variant Call Format](https://samtools.github.io/hts-specs/VCFv4.2.pdf) (VCF) is also supported. In the future even more formats might be added (see e.g. [here](https://reich.hms.harvard.edu/software/InputFileFormats)). -The large genotype data files to store SNP definitions and values can be stored in gzipped files (`*.gz`). +To make VCF files fully convertible to PLINK and EIGENSTRAT, they MUST be biallelic and contain only genotypes coded as `0/0`, `0/1`, `1/1`, `./.`. Furthermore, they CAN encode group names and genetic sex for all samples through special header fields `##group_names=name1,name2,...` and `##genetic_sex=F,U,M,...`, respectively. If these fields are not present, then group names are assumed to be "unknown" and genetic sex "U" (unknown) for all samples. + +For all of these formats the genotype and SNP-definition files can be stored in gzipped form (`*.gz`), i.e.: `*.bed.gz`, `*.geno.gz`, `*.bim.gz`, `*.snp.gz`, `*.vcf.gz`, but note that `*.fam` and `*.ind` files always must remain unzipped. The `genotypeData` field in the `POSEIDON.yml` file documents in which format the data for a package is stored and the relative paths to the respective files. @@ -47,23 +49,40 @@ genotypeData: + + + + + + + +
VCF
+ +``` +genotypeData: + format: VCF + genoFile: X.vcf + snpSet: 1240K +``` +
+ ## Typical setup and SNP panels Poseidon is not limited to a specific panel of single nucleotide polymorphism (SNPs) that should be available for each sample. All known SNPs for an individual derived from one or multiple libraries can be merged and stored in the genotype data accompanying a Poseidon package. The `snpSet` subfield in the `POSEIDON.yml` file documents the shape of the genotype file in the respective package, with the possible entries `HumanOrigins`, `1240K`, and `Other`. As of today (25.01.2021) most ancient genomic data is pulled down to the Affymetrix Human Origins SNP array ([Patterson et al. 2012](https://dx.doi.org/10.1534%2Fgenetics.112.145037)) or the 1240k SNP array ([Mathieson et al. 2015](https://dx.doi.org/10.1038%2Fnature16152)). These are the panels we are relying on for our public Poseidon [repositories](repos) because of their ubiquitous use in public datasets such as the [Allen Ancient DNA Resource](https://reich.hms.harvard.edu/allen-ancient-dna-resource-aadr-downloadable-genotypes-present-day-and-ancient-dna-data) and their design for population genetic research questions. The 1240k SNP array includes "nearly all SNPs on the Affymetrix Human Origins and Illumina 610-Quad arrays, 49,711 SNPs on chromosome X and 32,681 on chromosome Y, and 47,384 SNPs with evidence of functional importance" -- [Mathieson et al. 2015](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4918750/). -## Naming of SNPs, individuals and groups +## Naming of SNPs, samples and groups ### SNP IDs All SNPs listed in the SNP file must adhere to the Reference SNP cluster ID naming scheme ("RS number/id") as provided and maintained by the [dbSNP database](https://www.ncbi.nlm.nih.gov/snp/). Other SNP naming schemes are not explicitly forbidden or prevented in POSEIDON, but we encourage users to rely on this established standard to keep merging data from different sources as simple as possible. 
-### Individual IDs +### Poseidon IDs -The individual IDs in the individual file of a Poseidon package must be identical to the respective IDs in the `.janno` file (column `Individual_ID`). This ID together with identical ordering is what allows seamless linkage of genotype and context data. +The sample IDs in the individual file of a Poseidon package must be identical to the respective IDs in the `.janno` file (column `Poseidon_ID`). This ID together with identical ordering is what allows seamless linkage of genotype and context data. -Poseidon requires the individual IDs within a set of Poseidon packages to be unique, at least for many useful operations within our toolset. Contrary to the usually decentralised philosophy of the framework we would very much like to establish unique identifiers for individuals. Unambiguous IDs would be a tremendous advantage for all fields working with ancient genomic data, including archaeology, which often collects substantially more context information about prehistoric human individuals. +Poseidon requires the `Poseidon_ID`s within a set of Poseidon packages to be unique, at least for many useful operations within our toolset. Contrary to the usually decentralised philosophy of the framework we would very much like to establish unique identifiers for samples and individuals. Unambiguous IDs would be a tremendous advantage for all fields working with ancient genomic data, including archaeology, which often collects substantially more context information about prehistoric human individuals. ### Group IDs diff --git a/getting_started.md b/getting_started.md index 6618c15c..d8a2b30e 100644 --- a/getting_started.md +++ b/getting_started.md @@ -2,7 +2,7 @@ This is a short tutorial that runs you through some basic functionality of the poseidon framework and tooling. 
-Please also see the slides [A short introduction to the Poseidon framework](https://nevrome.github.io/uni.tuebingen.poseidon.intro.2h.2024/) by Clemens Schmid, which showcases many key aspects of the framework, as well as our [preprint on biorxiv](https://www.biorxiv.org/content/10.1101/2024.04.12.589180). +Please also see the slides [A short introduction to the Poseidon framework](https://nevrome.github.io/uni.tuebingen.poseidon.intro.2h.2024/) by Clemens Schmid, which showcases many key aspects of the framework, as well as our [reviewed preprint on eLife](https://doi.org/10.7554/eLife.98317.1). ## Preparation You will need to install our two command-line tools `trident` and `xerxes`. They are available as pre-compiled binaries for MacOS and Linux, and in case of trident also for Windows: diff --git a/janno_details.md b/janno_details.md index e1ce20e7..ab3208e7 100644 --- a/janno_details.md +++ b/janno_details.md @@ -1,16 +1,20 @@ # .janno file details -## Background +## Overview -The `.janno` file columns are specified in the Poseidon package specification [here](https://github.com/poseidon-framework/poseidon-schema/blob/master/janno_columns.tsv). The following documentation includes additional background information for many of the variables. This should make it more easy to compile the necessary information for both published and unpublished data. The `.pdf` version of the latest version of this document is available [here](https://github.com/poseidon-framework/poseidon-framework.github.io/blob/master/janno_details.pdf). +A `.janno` file is a tabular, tab-separated (`.tsv`) file. A base set of `.janno` file columns are specified in the Poseidon package specification [here](https://github.com/poseidon-framework/poseidon-schema/blob/master/janno_columns.tsv), including information on which columns are mandatory, which ones are list columns that can hold multiple entries, and which ones limit the allowed set of entries to a strict enumeration. 
Beyond that the `.janno` file can include any number and type of additional columns to hold project- and context-specific variables. These arbitrary additional columns should be named so that they do not conflict with the base set. They are not validated (assumed to be free-form text) by the Poseidon tooling, but they will be preserved in the Poseidon package, and propagated in operations like `trident forge`. -### The `Poseidon_ID` +The following documentation includes additional background information on the base set. This should make it easier to understand and use the columns for both published and unpublished data. A `.pdf` of the latest version of this document is available for download [here](https://github.com/poseidon-framework/poseidon-framework.github.io/blob/master/janno_details.pdf). -The `Poseidon_ID` column assigns each entity in a Poseidon package (so one row of the .janno file) a unique identifier string. +While previous versions of the `.janno` base set included various explicit `_Note` columns to add free-form information to specific columns or column blocks, these explicit columns were removed in Poseidon v3.0.0. The schema supports arbitrary additional columns, so the user can add ANY `_Note` column they deem relevant or useful. The Poseidon tooling, e.g. the `trident` CLI software, still gives special consideration to columns with the `_Note` suffix when sorting columns. For example, a column `Relation_Note` will be placed after all other `Relation_*` columns, but a more specific `Relation_Degree_Note` directly after `Relation_Degree`. -Often the `Poseidon_ID` can be readily taken from the respective accompanying publication introducing a given sample. If there are multiple samples from one ancient human individual, then they may share this identifier in the publication. For the Poseidon package they have to be clearly distinguished with relevant suffixes, though, added to the `Poseidon_ID`. 
`Poseidon_ID`s are also employed in the genetic data files in a Poseidon package and therefore have to adhere to certain constraints. -#### What does the `Poseidon_ID` represent exactly? +## The `Poseidon_ID` +The `Poseidon_ID` column assigns each entity in a Poseidon package (so one row of the `.janno` file) a unique identifier string. It links the `.janno` file entries to the genetic data in a Poseidon package. +Often the `Poseidon_ID` can be readily taken from the respective accompanying publication introducing a given sample or analysis-version of a sample. If there are multiple samples from one ancient human individual, or multiple versions of the same dataset resulting from different filtering or bioinformatic treatment, then they may share this identifier in the publication. In the Poseidon package, however, they have to be clearly distinguished by suffixes added to the `Poseidon_ID`. For good compatibility with Poseidon tooling, e.g. `trident`'s subsetting and merging language, it is recommended to only use the ASCII characters `A-Za-z0-9_-.` for `Poseidon_ID`s. +### What does the `Poseidon_ID` represent exactly? Generally, archaeogenetics operates on burial contexts, e.g. graves, with one or multiple ancient human individuals. Usually, though not always, it is possible to attribute the skeletal remains within these graves to individuals based on the archaeological context and physical-anthropological analysis. Each individual can get sampled one or multiple times, either by directly probing their preserved tissue, mostly bones, or by sampling any reagent that contains their DNA (through whatever pathway or taphonomic process). From one such sample one or multiple extracts can be derived, which can be transformed into one or multiple libraries, which may or may not be subjected to a DNA capture protocol and then sequenced one or multiple times. 
The raw sequencing data can undergo various different forms of computational processing and eventually genotyping to produce the data relevant for most derived analyses and thus stored in Poseidon. @@ -18,45 +22,62 @@ While the wetlab-processes can be understood as a relatively predictable tree of A `Poseidon_ID`, and therefore the identifier for the main singular entity in a Poseidon package, could approximately be described as representing one end-point in the data preparation graph laid out above. Typically this end-point corresponds to an optimal result, consciously selected for a given individual, research question and publication. Unfortunately, in reality a `Poseidon_ID` is not suited to uniquely identify exactly one such end-point. The reality in the Poseidon ecosystem is rather that slightly different end-points can have the same `Poseidon_ID`, e.g. across package versions or public Poseidon archives. A single endpoint can only be uniquely identified from a combination of `Poseidon_ID`, Poseidon package and package version. -### Other identifiers +## Other identifiers + +The `Individual_ID` column (introduced in Poseidon v3.0.0) acts as an identifier on the level of (human/animal) individuals in a Poseidon package. That means multiple `Poseidon_ID`s can share an `Individual_ID`. In practice these IDs are often identical for a given sample, or only differ in additional suffixes appended to the `Poseidon_ID`. The distinction of an individual- and analysis endpoint-level ID also exists in the AADR dataset [@Mallick2024](https://doi.org/10.1038/s41597-024-03031-7), e.g. in v62.0, with the `Master ID` and `Genetic ID` columns. It is recommended to only use the ASCII characters `A-Za-z0-9_-.` for `Individual_ID`s. + +The column `Alternative_IDs` provides a way to list other IDs used for the respective individual. These might be formal identifiers in datasets beyond Poseidon, e.g. 
`Master ID`s in specific AADR releases, or identifiers used in different publications, or even just popular names like ["Iceman"/"Ötzi"](https://en.wikipedia.org/wiki/%C3%96tzi), ["Girl of the Uchter Moor"](https://en.wikipedia.org/wiki/Girl_of_the_Uchter_Moor), or ["Tollund Man"](https://en.wikipedia.org/wiki/Tollund_Man). +The `Alternative_IDs_Context` column (introduced in Poseidon v3.0.0) documents the context of each `Alternative_IDs` entry. It is a list column with the same length and order as the `Alternative_IDs` list column, where the name of the respective source database, e.g. `AADRv62`, must be entered. This indicates where an alternative identifier may work as a "foreign key". For the non-scientific names used in media and public discussion, the term `popular` can be entered. -The column `Alternative_IDs` provides a way to list other IDs used for the respective individual. These might for example be names used in different publications or popular names like "Iceman", "Ötzi", "Girl of the Uchter Moor", "Tollund Man", etc.. The `Relation_*` columns described below allow to more precisely express the relationship type "identical" among samples in a Poseidon package. +The `Collection_ID` column stores additional, secondary identifiers used by collaboration partners (archaeologists, museums, collections) that provide the specimen for archaeogenetic research (see also `Custodian_Institution` below). These identifiers can have a very heterogenous structure and may not be unique across different projects or institutions. The `Collection_ID` column is therefore a free-form text list column. -The `Collection_ID` column stores an additional, secondary identifier as it is often provided by collaboration partners (archaeologists, museums, collections) that provide the specimen for archaeogenetic research. 
These identifiers can have a very heterogenous structure and may not be unique across different projects or institutions. The `Collection_ID` column is therefore a free-form text field. +The `Group_Name` column contains one or multiple group or population names for each sample, separated by `;`. The first entry must be identical to the one used in the genotype data for the respective sample in a Poseidon package. Especially for the first entry it is recommended to only use the ASCII characters `A-Za-z0-9_-.`. Whitespace is not allowed in any of the entries. The names can follow the geographic-temporal nomenclature proposed by [@Eisenmann2018](https://doi.org/10.1038/s41598-018-31123-z), or communicate additional categories that are meaningful for groupings in specific analyses, such as cultural labels, outlier status or relatedness to other samples. -The `Group_Name` column contains one or multiple group or population names for each individual, separated by `;`. The first entry must be identical to the one used in the genotype data for the respective sample in a Poseidon package, and whitespace is not allowed in any of the entries. Assigning group and population names is a hard problem in archeogenetics [@Eisenmann2018](https://doi.org/10.1038/s41598-018-31123-z), so the `.janno` file allows for more than one identifier. +## The sampled species + +The `Species` column (introduced in Poseidon v3.0.0) should contain the species of the respective sample. The entry should follow binomial nomenclature as standard in biology, e.g. `Homo sapiens`. + +Poseidon is geared towards human data, but is to a large extent species-agnostic and can be used to track archaeogenetic data also of non-human species. If it is used for non-human data, then various other `.janno` file columns of the base set may not be applicable or may not include the required choice options. As none of these columns are mandatory they can just be left out in this case. 
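The identifier rules added in this hunk (recommended character set for `Individual_ID`, matching lengths of `Alternative_IDs` and `Alternative_IDs_Context`, no whitespace in `Group_Name` entries) can be pictured with a small Python sketch. It is not part of the janno R package or trident; the function names `split_list_cell` and `check_row` are hypothetical, a row is assumed to be a plain dictionary of strings, and an empty string is assumed to stand for an empty cell.

```python
import re

# Recommended characters for Individual_ID, as stated in the text: A-Za-z0-9_-.
RECOMMENDED_ID = re.compile(r"[A-Za-z0-9_.-]+")


def split_list_cell(cell: str) -> list[str]:
    """Split a `;`-separated list cell; an empty cell yields an empty list (assumption)."""
    return cell.split(";") if cell else []


def check_row(row: dict[str, str]) -> list[str]:
    """Return human-readable issues for one row (hypothetical helper, not a Poseidon API)."""
    issues = []

    individual_id = row.get("Individual_ID", "")
    if individual_id and not RECOMMENDED_ID.fullmatch(individual_id):
        issues.append(f"Individual_ID '{individual_id}' uses non-recommended characters")

    # Alternative_IDs_Context must have the same length and order as Alternative_IDs.
    alt_ids = split_list_cell(row.get("Alternative_IDs", ""))
    contexts = split_list_cell(row.get("Alternative_IDs_Context", ""))
    if contexts and len(alt_ids) != len(contexts):
        issues.append("Alternative_IDs and Alternative_IDs_Context differ in length")

    # Whitespace is not allowed in any Group_Name entry.
    groups = split_list_cell(row.get("Group_Name", ""))
    if any(ch.isspace() for group in groups for ch in group):
        issues.append("Group_Name entries must not contain whitespace")

    return issues
```

A row with `Alternative_IDs` set to `I0001.AG;Iceman` and `Alternative_IDs_Context` set to `AADRv62;popular` would pass the length check, because both lists have two entries in the same order.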
## Relations among samples/individuals -To systematically document biological relationships uncovered among samples/individuals in one or multiple Poseidon datasets (e.g. with software like READ [@MonroyKuhn2018](https://doi.org/10.1371/journal.pone.0195491) or BREADR [@Rohrlach2023](https://doi.org/10.1101/2023.04.17.537144)), the `.janno` file can be fit with a set of columns featuring the `Relation_*` prefix. Across these columns it should be possible to encode all kinds of pairwise, biological relationships an individual might have. +To systematically document biological relationships uncovered among individuals in one or multiple Poseidon datasets (e.g. with software like READ [@MonroyKuhn2018](https://doi.org/10.1371/journal.pone.0195491) or BREADR [@Rohrlach2023](https://doi.org/10.1101/2023.04.17.537144)), the `.janno` file can be fit with a set of columns featuring the `Relation_*` prefix. Across these columns it should be possible to encode all kinds of pairwise, biological relationships an individual might have. -`Relation_To` is a string list column (so: multiple values are possible if separated by `;`) that stores the `Poseidon_ID`s of other samples/individuals to which the current individual has some relationship. +`Relation_To` is a string list column (so: multiple values are possible if separated by `;`) that stores the `Individual_ID`s of other individuals to which the current individual has some relationship. `Relation_Degree` stores a formal description of the closeness of this relationship as measured purely from aDNA data. It is therefore also a list column that can hold the following values for each relationship: -- `identical`: The two samples are from the same individual or from identical twins -- `first`: The two individuals are closely related -- a first degree relationship (e.g. siblings, parent-offspring) -- `second`: A second degree relationship (e.g. 
cousins, grandparent to grandchild) -- `thirdToFifth`: A third to fifth degree relationship (e.g. great-grandparent to great-grandchild) -- `sixthToTenth`: A sixth to tenth degree relationship -- `unrelated`: Unrelated -- this is the default state among all individuals, which does not have to be expressed explicitly. This category will therefore probably never be used -- `other`: Any other kind of relationship not covered by the aforementioned categories +- `identical`: The two samples are from identical twins. +- `first`: The two individuals are closely related -- a first degree relationship (e.g. siblings, parent-offspring). +- `second`: A second degree relationship (e.g. cousins, grandparent to grandchild). +- `thirdToFifth`: A third to fifth degree relationship (e.g. great-grandparent to great-grandchild). +- `sixthToTenth`: A sixth to tenth degree relationship. +- `unrelated`: Unrelated -- this is the default state among all individuals, which does not have to be expressed explicitly. +- `other`: Any other kind of relationship not covered by the aforementioned categories. For each entry in `Relation_To` there must be a corresponding entry in `Relation_Degree`. `Relation_Type` allows to add more verbose details about the relationship type, if it was possible to reconstruct that from the archaeological or historical context. Because there are too many possible permutations, there is no pre-defined set of values for what can and cannot be entered here. 
It is advisable, though, to stick to a general scheme like the following, which describes a given relationship from the point of view of the current individual: -- `same_as`: This sample is from the same inividual as another sample -- `identical_twin_of`: This individual is likely an identical twin of another individual -- `father_of`: This individual is likely the father of the partner individual -- `grandchild_of`: This individual is likely the grandchild of the partner individual -- `mother_or_daughter_of`: This individual is likely either the mother or daughter of the partner individual (which might be unclear, in case of imprecise archaeological dating) +- `identical_twin_of`: This individual is likely an identical twin of another individual. +- `father_of`: This individual is likely the father of the partner individual. +- `grandchild_of`: This individual is likely the grandchild of the partner individual. +- `mother_or_daughter_of`: This individual is likely either the mother or daughter of the partner individual (which might be unclear, in case of imprecise archaeological dating). - `unknown`: The relationship is unclear or not yet determined. This is the default state and does not have to be expressed, unless multiple relationships are present and some but not all are known. - `...` Unlike `Relation_Degree`, `Relation_Type` can be left empty even if there are entries in `Relation_To`. But if it is filled, then the number of values must be equal to the number of entries in both `Relation_To` and `Relation_Degree`. -The `Relation_Note` column allows to add free-form text information about the relationships of this individual. This might also include information about the method used to infer the degree and type. +## Cultural and archaeological context + +Poseidon v3.0.0 introduced the following four columns to add archaeological context information for a given sample -- at least on the level of era- and archaeological culture-attribution. 
Given the nature of human behaviour and archaeological inference these attributions must not be understood as absolute, objective classifications, but rather as preliminary model assumptions and interpretative tools. + +The `Cultural_Era` column serves to list one or multiple cultural eras approximating the period in which the sampled individual lived. These can be classes like, for example "Danish Bronze Age" or "Pre-Pottery Neolithic A". If possible these classes should be taken from an established space-time gazetteer like ChronOntology (https://chronontology.dainst.org) or PeriodO (https://perio.do) to link relevant background information about the referenced phenomena, so their spatiotemporal extent and research history. + +The `Cultural_Era_URL` column allows to complement the human-readable era terms given in `Cultural_Era` with persistent URLs pointing to definitions of said entities. Length and order of both columns must therefore match. https://n2t.net/ark:/99152/p0zj6g8ks9s, for example, points to an entry for "Danish Bronze Age", and https://chronontology.dainst.org/period/Gx4uxaeTCbbg to one for "Pre-Pottery Neolithic A". Note how the entries in said gazetteers go back to an authoritative source, e.g. in the form of an archaeological publication presenting a typo-chronological scheme. Most archaeological and archaeogenetic publications implicitly or explicitly adopt such a scheme for the spatio-temporal context they work on. Ideally the scheme referenced in the Poseidon package and the one in the publication should match, but in practice this may be difficult to ascertain. + +The column pair `Archaeological_Culture` and `Archaeological_Culture_URL` functions just as the cultural era pair, but now on a more fine-grained level. 
It allows to attribute a given ancient individual to specific archaeological cultures, technocomplexes, pottery styles or political entities, for example the "Hallstatt culture in Hungary" (https://n2t.net/ark:/99152/p0nxc78fxgt), or the "Neo-Assyrian Empire" (https://chronontology.dainst.org/period/bvLwqFcGyoaL). ## Spatial position @@ -100,8 +121,6 @@ In the columns `Date_BC_AD_Median`, `Date_BC_AD_Start`, `Date_BC_AD_Stop` ages a - If only contextual (e.g. from archaeological typology) age information is available (`Date_Type = contextual`): `Date_BC_AD_Start` and `Date_BC_AD_Stop` should simply report the approximate start and end date determined by the respective source of scientific authority (e.g. an archaeologist knowledgable about the relevant typological sequences). In this case `Date_BC_AD_Median` should be calculated as the mean of `Date_BC_AD_Start` and `Date_BC_AD_Stop` rounded to an integer value. - If the sample is a modern reference sample (`Date_Type = modern`): `Date_BC_AD_Median`, `Date_BC_AD_Start`, `Date_BC_AD_Stop` should all be set to the value 2000, for 2000 AD. -The column `Date_Note` stores arbitrary free-form text information about the dating of a sample. - ## Genetic summary data ### Individual properties @@ -118,9 +137,11 @@ The `MT_Haplogroup` column is meant to store the human mitochondrial DNA haplogr The `Y_Haplogroup` column holds the respective human Y-chromosome DNA haplogroup in a simple string. To avoid confusion from using different haplotype naming systems, the notation should follow a syntax with the main branch + the most terminal derived Y-SNP separated with a minus symbol (e.g. R1b-P312), similar to that used by [Yfull](https://www.yfull.com/sc/tree/). +The `Chromosomal_Anomalies` column (introduced with Poseidon v3.0.0) allows to note one or multiple genetic chromosomal anomalies detected for the individual, so extra, missing or irregular portions of chromosomal DNA. 
This includes both gonosomal and autosomal aneuploidies. As there are many such possible anomalies there is no fixed list of valid entries for this column. The following terminology is recommended for some of the most common aneuploidies: `XXY` for Klinefelter syndrome, `XYY` for Jacobs syndrome, `XXX` for Triple X syndrome, `X0` for Monosomy X, `Trisomy21` for Down syndrome, and `Trisomy18` for Edwards syndrome. + ### Library properties -The `Source_Tissue` column documents the skeletal, soft tissue or other elements from which source material for DNA library preparation was extracted. If multiple samples have been taken from different elements, these can be listed separated by `;`. Specific bone names should be reported with an underscore (e.g. bone_phalanx, tooth_molar). +The `Source_Material` column (formerly `Source_Tissue`, before Poseidon v3.0.0) documents the skeletal, soft tissue or other elements from which source material for DNA library preparation was extracted. The following entries are allowed: `petrous`, `bone`, `tooth`, `hair`, `soft`, `sediment`, and `other`. `soft` encompasses (archaeologically rarely preserved) soft tissues like skin, muscle, tendons, or fat. If multiple DNA libraries have been prepared from different sampled elements, then these can be listed separated by `;` as in other list columns. Further details, e.g. specific bone names, can be reported in a `Source_Material_Note` column. The `Nr_Libraries` column holds a simple integer value of the number of libraries that have been prepared for an individual. @@ -129,11 +150,11 @@ The `Library_Names` column should list the names for the libraries as used in th The `Capture_Type` column specifies the general pre-sequencing preparation methods that have been applied to the library. See [@Knapp2010](https://doi.org/10.3390/genes1020227) for a review of the different techniques (not including newer developments). 
This field can hold one of multiple different values, but also multiple of these separated by `;` if different methods have been applied for different libraries. - `Shotgun`: Sequencing without any enrichment (whole genome sequencing, screening etc.). -- `1240K`: Target enrichment with hybridization capture optimised for sequences covering the 1240k SNP array [@Fu2015](https://doi.org/10.1038/nature14558), [@Haak2015](https://doi.org/10.1038/nature14317), [@Mathieson2015](https://doi.org/10.1038/nature16152). +- `1240K`: Target enrichment with hybridization capture optimised for sequences covering the 1240k SNP array, see [@Fu2015](https://doi.org/10.1038/nature14558), [@Haak2015](https://doi.org/10.1038/nature14317), [@Mathieson2015](https://doi.org/10.1038/nature16152). - `ArborComplete`, `ArborPrimePlus`, `ArborAncestralPlus`: Target enrichment with hybridization capture as provided by Arbor Biosciences in three different kits branded [myBaits Expert Human Affinities](https://arborbiosci.com/genomics/targeted-sequencing/mybaits/mybaits-expert/mybaits-expert-human-affinities). - `TwistAncientDNA`: Target enrichment with hybridization capture as provided by Twist Bioscience [@Rohland2022](https://doi.org/10.1101/gr.276728.122). +- `WISC2013`: Whole genome capture as described by [@Carpenter2013](https://doi.org/10.1016/j.ajhg.2013.10.002). - `OtherCapture`: Target enrichment with hybridization capture for any other set of sequences. -- `ReferenceGenome`: Modern reference genomes where aDNA fragmentation is not an issue and other sample preparation techniques apply. The `UDG` column documents if the libraries for the respective individual went through UDG (or USER enzyme) treatment. This wet lab protocol step removes molecular damage in the form of deaminated cytosines characteristic of ancient DNA. 
@@ -157,7 +178,7 @@ The column `Data_Preparation_Pipeline_URL` should finally store an URL that link ### Data yield -The `Endogenous` column holds the percentage of mapped reads over the total amount of reads that went into the mapping pipeline. That boils down to the DNA percentage of the library that matches the (human) reference. It should be determined from Shotgun libraries (so before any hybridization capture), not on target (i.e. across the whole genome, not specific positions), and before any mapping quality filtering. In case of multiple libraries only the highest value should be reported. The % endogenous DNA can be calculated for example with the [endorS.py](https://github.com/aidaanva/endorS.py) script. +The `Endogenous` column holds the fraction (between 0 and 1, only before Poseidon v3.0.0 between 0 and 100%) of mapped reads over the total amount of reads that went into the mapping pipeline. That boils down to the DNA percentage of the library that matches the (human) reference. It should be determined from Shotgun libraries (so before any hybridization capture), not on target (i.e. across the whole genome, not specific positions), and before any mapping quality filtering. In case of multiple libraries only the highest value should be reported. The endogenous DNA fraction can be calculated for example with the [endorS.py](https://github.com/aidaanva/endorS.py) script. The `Nr_SNPs` column gives the number of SNPs reported in the genotype data files for this individual. @@ -165,7 +186,7 @@ The `Coverage_on_Target_SNPs` column reports the mean fold coverage on the SNP s ### Data quality -The `Damage` column contains the % damage on the first position of the 5' end for the main Shotgun library used for sequencing or capture. This is an important statistic to verify the age of ancient DNA. In case of multiple libraries you should report a value from the merged read alignment. 
+The `Damage` column contains the fraction (between 0 and 1, only before Poseidon v3.0.0 between 0 and 100%) damage on the first position of the 5' end for the main Shotgun library used for sequencing or capture. This is an important statistic to verify the age of ancient DNA. In case of multiple libraries either report multiple values separated by ;, or a single value from the merged read alignment. Contamination of ancient DNA with foreign reads is a major challenge for archaeogenetics. There exist multiple competing ideas, algorithms and software tools to estimate the degree of contamination for individual samples (e.g. ANGSD [@Korneliussen2014](https://doi.org/10.1186/s12859-014-0356-4), contamLD [@Nakatsuka2020](https://doi.org/10.1186/s13059-020-02111-2) or hapCon [@Huang2022](https://doi.org/10.1093/bioinformatics/btac390)), with some methods only applicable under certain circumstances (e.g. popular X-chromosome based approaches only work on male individuals). Also the results of different methods tend to differ both in the degree of contamination they estimate and in the way the output is usually encoded. To cover the multitude of methods in this domain, and to make the results representable in the `.janno` file, we offer the `Contamination_*` column family. @@ -181,9 +202,9 @@ Some tools for contamination estimation do not return a mean plus a standard err - `hapCon v0.4a1` - `custom script` -This setup has the consequence that the columns `Contamination`, `Contamination_Err`, `Contamination_Meas` always have to have the same number of `;`-separated values. +More specific information about which parameters were chosen can be added in a `Contamination_Note` column. -The `Contamination_Note` column is a free text field to add additional information about the contamination estimates, e.g. which parameters where used with the respective software tools. 
+This setup has the consequence that the columns `Contamination`, `Contamination_Err`, `Contamination_Meas` always have to have the same number of `;`-separated values. ## Context information @@ -191,6 +212,8 @@ The `Genetic_Source_Accession_IDs` column was introduced to link the derived gen The `Primary_Contact` column is a free-form text field that stores the name of the main or the corresponding author of the respective paper for published data. +The `Custodian_Institution` column (introduced in Poseidon v3.0.0) allows to document one or multiple institutions that curated the sampled remains at the time of sampling. Each institution should be given with name, city and country. The `Collection_ID` column may allow to link to the internal bookkeeping of these institutions. + The `Publication` column holds either the value `unpublished` for (yet) unpublished samples or -- for published data -- one or multiple citation-keys of the form `AuthorJournalYear` without any spaces or special characters. These keys have to be identical to the [BibTeX](http://www.bibtex.org) citation-keys identifying the respective entries in the `.bib` file of the package. BibTeX is a file format to store bibliographic information, where each entry (article, book, website, ...) is defined by a series of parameters (authors, year of publication, journal, ...). Here's an example `.bib` file with two entries for [@Cassidy2015](https://doi.org/10.1073/pnas.1518445113) and [@Feldman2019](https://doi.org/10.1126/sciadv.aax0061): ```default @@ -233,10 +256,8 @@ The string `CassidyPNAS2015` is the citation-key of the first entry. To cite bot When creating a new Poseidon package the `.bib` file should be filled together with the `Publication` column. One of the most simple ways to obtain the BibTeX entries may be to request them with the doi from the [doi2bib](https://doi2bib.org) wep app. It could be necessary to adjust the result manually, though. 
The citation-key, for example, has to be replaced by the one used in the `Publication` column. -The `Note` column is a free-form text field that can contain small amounts of additional information that is not yet expressed in a more systematic form in the the other `.janno` file columns. +The `Note` column is a free-form text field that can contain small amounts of additional information that is not yet expressed in a more systematic form in the other `.janno` file columns. The `Keywords` column was introduced to allow for tagging individuals with arbitrary keywords. This should simplify sorting and filtering in personal Poseidon package repositories. Each keyword is a string and multiple keywords can be separated with `;`. -Arbitrary additional columns can be included in a `.janno` file, but they should be named in a way that they do not conflict with the Poseidon package specification. These columns will not be validated (assumed free-form text), but they will be preserved in the Poseidon package, and propagated during operations with `trident forge`. - --- diff --git a/janno_r_package.md b/janno_r_package.md index cc1d1987..a8c77fc3 100644 --- a/janno_r_package.md +++ b/janno_r_package.md @@ -13,9 +13,9 @@ remotes::install_github('poseidon-framework/janno') The guide below explains the main functions in the package. 
It is available in .pdf format here: -- [🗎 Guide for the janno R package v1.0.0](https://github.com/poseidon-framework/poseidon-framework.github.io/blob/master/janno_r_package.pdf) (shown below) +- [🗎 Guide for the janno R package v1.0.0 to v1.1.0](https://github.com/poseidon-framework/poseidon-framework.github.io/blob/master/janno_r_package.pdf) (shown below) -# Guide for the janno R package v1.0.0 +# Guide for the janno R package v1.0.0 to v1.1.0 ## Installation @@ -39,6 +39,8 @@ Before loading the `.janno` files they are validated with `janno::validate_janno Usually the `.janno` files are first loaded as normal `.tsv` files with every column type set to `character` and then the columns are transformed to the specified types. This transformation can be turned off with `to_janno = FALSE`. +Note that the transformation is always done according to the specific Poseidon schema version a given janno R package version supports. `read_janno()` is not aware of the schema version a `.janno` file was intended to follow. See the [version table](version_table.md) for a lookup table which janno version supports which schema version. The package also reports this upon loading in its start-up message. + `read_janno()` returns an object of class `janno`. This class is derived from the [`tibble`](https://tibble.tidyverse.org/) class, which integrates well with the tidyverse [@Wickham2019](https://doi.org/10.21105/joss.01686) and its packages, e.g. `dplyr` or `ggplot2`. ## Validate `.janno` files @@ -51,6 +53,8 @@ my_janno_issues <- janno::validate_janno("path/to/my/janno_file.janno") `validate_janno` returns a `tibble` with issues in the respective `.janno` files. For edge cases this validation may yield slightly different results than `trident validate`. +Note that the validation is always done against the specific Poseidon schema version a given janno R package version supports. `validate_janno()` is not aware of the schema version a `.janno` file was intended to follow. 
+ ## Write `janno` objects back to `.janno` files `janno` objects usually contain list columns, that can not directly be written to a flat text file like the `.janno` file. The function `write_janno` solves that. It employs a helper function `flatten_janno()`, which translates list columns to the string list format in `.janno` files (so: multiple values for one cell separated by `;`). diff --git a/pdf_conversion/pdf_conversion_list.tsv b/pdf_conversion/pdf_conversion_list.tsv index 5c335aa7..1627f0bb 100644 --- a/pdf_conversion/pdf_conversion_list.tsv +++ b/pdf_conversion/pdf_conversion_list.tsv @@ -14,6 +14,7 @@ trident_guide_archive/trident_guide_1.4.1.0_to_1.5.0.1.md trident_guide_archive/ trident_guide_archive/trident_guide_1.5.4.0.md trident_guide_archive/trident_guide_1.5.4.0.pdf trident_guide_archive/trident_guide_1.5.7.0_to_1.5.7.3.md trident_guide_archive/trident_guide_1.5.7.0_to_1.5.7.3.pdf trident_guide_archive/trident_guide_1.6.2.1.md trident_guide_archive/trident_guide_1.6.2.1.pdf +trident_guide_archive/trident_guide_1.6.7.1_to_1.6.7.3.md trident_guide_archive/trident_guide_1.6.7.1_to_1.6.7.3.pdf xerxes_guide_archive/xerxes_guide_0.2.0.0.md xerxes_guide_archive/xerxes_guide_0.2.0.0.pdf xerxes_guide_archive/xerxes_guide_1.0.0.2.md xerxes_guide_archive/xerxes_guide_1.0.0.2.pdf janno_details.md janno_details.pdf diff --git a/qjanno.md b/qjanno.md index a2bf9bac..451be262 100644 --- a/qjanno.md +++ b/qjanno.md @@ -29,7 +29,7 @@ The guide below explains the inner workings of qjanno and gives some examples fo - [🗎 Guide for qjanno v1.0.0.0 to v1.0.0.1](https://github.com/poseidon-framework/poseidon-framework.github.io/blob/master/qjanno.pdf) (shown below) - [🗎 Guide for qjanno v1.0.0](https://github.com/poseidon-framework/poseidon-framework.github.io/blob/master/qjanno_guide_archive/qjanno_guide_1.0.0.pdf) -# Guide for qjanno v1.0.0.0 to v1.0.0.1 +# Guide for qjanno v1.0.0.0 to v1.0.1.0 ## Background diff --git a/references.bib b/references.bib index 
02d19834..a0bb281f 100644 --- a/references.bib +++ b/references.bib @@ -411,3 +411,17 @@ @article{Bhatia2013 month = jul, pages = {1514–1521} } + +@article{Mallick2024, + title = {The Allen Ancient DNA Resource (AADR) a curated compendium of ancient human genomes}, + volume = {11}, + ISSN = {2052-4463}, + url = {http://dx.doi.org/10.1038/s41597-024-03031-7}, + DOI = {10.1038/s41597-024-03031-7}, + number = {1}, + journal = {Scientific Data}, + publisher = {Springer Science and Business Media LLC}, + author = {Mallick, Swapan and Micco, Adam and Mah, Matthew and Ringbauer, Harald and Lazaridis, Iosif and Olalde, Iñigo and Patterson, Nick and Reich, David}, + year = {2024}, + month = feb +} diff --git a/ssf_details.md b/ssf_details.md index 5509973d..0232b332 100644 --- a/ssf_details.md +++ b/ssf_details.md @@ -1,6 +1,6 @@ # .ssf file details -Poseidon 2.7.0 added an option to specify sequencing source data. This is a tab-separated table, much like the Janno file, but following [a different schema](https://github.com/poseidon-framework/poseidon-schema/blob/master/ssf_columns.tsv), typically with file ending `*.ssf` for "Sequencing Source File". The primary entities in this table are Sequencing entities (typically corresponding to DNA libraries or even multiple runs/lanes of the same library). The link to the Individuals listed in the Janno-file are made through a foreign-key relationship from the column `poseidon_IDs` in this file to `Poseidon_ID` in the Janno-file. The relationship is many-to-many, so each row in the SSF file can contain multiple Poseidon_IDs, and multiple rows can link to the same Poseidon_ID. +Poseidon v2.7.0 added an option to specify sequencing source data. This is a tab-separated table, much like the `.janno` file, but following [a different schema](https://github.com/poseidon-framework/poseidon-schema/blob/master/ssf_columns.tsv), typically with file ending `.ssf` for "Sequencing Source File". 
The primary entities in this table are sequencing entities (typically corresponding to DNA libraries or even multiple runs/lanes of the same library). The link to the samples listed in the `.janno` file is made through a foreign-key relationship from the column `poseidon_IDs` in this file to `Poseidon_ID` in the Janno-file. The relationship is many-to-many, so each row in the SSF file can contain multiple Poseidon_IDs, and multiple rows can link to the same Poseidon_ID. Here is an example for such a file: diff --git a/trident.md b/trident.md index 60cc50d9..5411012a 100644 --- a/trident.md +++ b/trident.md @@ -30,7 +30,8 @@ On GitHub you will also find [older release versions](https://github.com/poseido With `trident --help` and `trident <subcommand> --help` you can get information about each subcommand and parameter directly on the command line. The guide below explains the subcommands in more detail. It is available in .pdf format for the current and previous versions here: -- [🗎 Guide for trident v1.6.7.1 to v1.6.7.3](https://github.com/poseidon-framework/poseidon-framework.github.io/blob/master/trident.pdf) (shown below) +- [🗎 Guide for trident v1.7.0.0](https://github.com/poseidon-framework/poseidon-framework.github.io/blob/master/trident.pdf) (shown below) +- [🗎 Guide for trident v1.6.7.1 to v1.6.7.3](https://github.com/poseidon-framework/poseidon-framework.github.io/blob/master/trident_guide_archive/trident_guide_1.6.7.1_to_1.6.7.3.pdf) - [🗎 Guide for trident v1.6.2.1](https://github.com/poseidon-framework/poseidon-framework.github.io/blob/master/trident_guide_archive/trident_guide_1.6.2.1.pdf) - [🗎 Guide for trident v1.5.7.0 to v1.5.7.3](https://github.com/poseidon-framework/poseidon-framework.github.io/blob/master/trident_guide_archive/trident_guide_1.5.7.0_to_1.5.7.3.pdf) - [🗎 Guide for trident v1.5.4.0](https://github.com/poseidon-framework/poseidon-framework.github.io/blob/master/trident_guide_archive/trident_guide_1.5.4.0.pdf) @@ -47,7 +48,7 @@ With `trident --help` and `trident <subcommand> --help` 
you can get information - [🗎 Guide for trident v0.29.0](https://github.com/poseidon-framework/poseidon-framework.github.io/blob/master/trident_guide_archive/trident_guide_0.29.0.pdf) - [🗎 Guide for trident v0.28.0](https://github.com/poseidon-framework/poseidon-framework.github.io/blob/master/trident_guide_archive/trident_guide_0.28.0.pdf) -# Guide for trident v1.6.7.1 to v1.6.7.3 +# Guide for trident v1.7.0.0 ## Installation @@ -58,7 +59,7 @@ See the Poseidon website () or the GitH Trident is a command line software tool structured in multiple subcommands. If you installed it properly you can call it on the command line by typing `trident`. This will show an overview of the general options and all subcommands, which are explained in detail below. ```default -Usage: trident [--version] [--logMode MODE | --debug] [--errLength INT] +Usage: trident [--version] [--logMode MODE | --debug] [--errLength INT] [--inPlinkPopName MODE] (COMMAND | COMMAND) trident is a management and analysis tool for Poseidon packages. Report issues @@ -815,22 +816,27 @@ With `-z|--zip` the genotype data output can be wrapped in gzipped archives with Command line details ```default -Usage: trident jannocoalesce ((-s|--sourceFile FILE) | (-d|--baseDir DIR)) - (-t|--targetFile FILE) [-o|--outFile FILE] - [--includeColumns ARG | --excludeColumns ARG] - [-f|--force] [--sourceKey ARG] [--targetKey ARG] +Usage: trident jannocoalesce ([--pvSource VERSION] (-s|--sourceFile FILE) | + (-d|--baseDir DIR)) [--pvTarget VERSION] + (-t|--targetFile FILE) (-o|--outFile FILE) + [--includeColumns ARG | --excludeColumns ARG] + [-f|--force] [--sourceKey ARG] [--targetKey ARG] [--stripIdRegex ARG] Coalesce information from one or multiple janno files to another one Available options: -h,--help Show this help text + --pvSource VERSION Poseidon version (e.g. 2.7.1). (default: 3.0.0) -s,--sourceFile FILE The source .janno file. -d,--baseDir DIR A base directory to search for Poseidon packages. 
+ --pvTarget VERSION Poseidon version (e.g. 2.7.1). (default: 3.0.0) -t,--targetFile FILE The target .janno file to fill. - -o,--outFile FILE An optional file to write the results to. If not - specified, change the target file in place. - (default: Nothing) + -o,--outFile FILE File path to write the result to. Can be identical to + --targetFile to overwrite the target file in place. + Note that trident only writes .janno files in the + latest Poseidon version it supports, so in this case + v3.0.0. --includeColumns ARG A comma-separated list of .janno column names to coalesce. If not specified, all columns that can be found in the source and target will get filled. @@ -858,7 +864,8 @@ A most basic run may just include two arguments: ```bash trident jannocoalesce \ --sourceFile path/to/source.janno \ - --targetFile path/to/target.janno + --targetFile path/to/target.janno \ + --outFile path/to/coalesced.janno ``` `jannocoalesce` generally works by reading a source `.janno` file with `-s|--sourceFile` (or all `.janno` files in a `-d|--baseDir`) and a target `.janno` file with `-t|--targetFile`. @@ -867,6 +874,14 @@ It then merges these files by a key column, which can be selected with `--source `jannocoalesce` generally attempts to fill **all** empty cells in the target `.janno` file with information from the source. `--includeColumns` and `--excludeColumns` allow to select specific columns for which this should be done. In some cases it may be desirable to not just fill empty fields in the target, but overwrite the information already there with the `-f|--force` option. If the target file should be preserved, then the output can be directed to a new output `.janno` file with `-o|--outFile`. +Note that all three files, the source, the target, and the outfile, are mandatory. The roles are: + +- `targetFile` -> This is the file which is taken as the starting point for the new janno file. 
+- `sourceFile` -> This is the file from which to read additional columns that might be missing in the target. +- `outFile` -> This is the file that will contain the coalesced result. + +In addition to these three files, you can choose a Poseidon Version for both the source (`--pvSource`) and the target (`--pvTarget`), which is useful for backwards compatibility, in case you have `.janno` files in older Poseidon versions. **_But_**: Note that the output file will _always_ be written in the newest Poseidon version. Note that you _can_ choose the outFile to be the targetFile, which will then overwrite the target file. This can be useful if you know what you are doing. Otherwise it is of course safer to choose a new filename for the output. + ## Rectify command `rectify` automatically harmonizes POSEIDON.yml files of one or multiple packages. This is not an automatic update from one Poseidon version to the next, but rather a clean-up wizard after manual modifications. It also includes additional, automatic package editing features. 
@@ -1094,14 +1109,17 @@ Again you can use the `--raw` option to output the survey table in a tab-delimit Command line details ```default -Usage: trident validate ((-d|--baseDir DIR) [--ignoreGeno] [--fullGeno] - [--ignoreDuplicates] [-c|--ignoreChecksums] +Usage: trident validate ((-d|--baseDir DIR) [--ignoreGeno] [--fullGeno] + [--ignoreDuplicates] [-c|--ignoreChecksums] [--ignorePoseidonVersion] | --pyml FILE | (-p|--genoOne FILE) | --genoFile FILE --snpFile FILE --indFile FILE | - --bedFile FILE --bimFile FILE --famFile FILE | - --vcfFile FILE | --janno FILE | --ssf FILE | - --bib FILE) [--noExitCode] [--onlyLatest] + --bedFile FILE --bimFile FILE --famFile FILE | + --vcfFile FILE | [--pvJanno VERSION] --janno FILE | + [--pvSSF VERSION] --ssf FILE | + --bib FILE) [-j|--mandatoryJannoColumn COLNAME] + [-s|--mandatorySSFColumn COLNAME] [--noExitCode] + [--onlyLatest] Check Poseidon packages or package components for structural correctness @@ -1142,9 +1160,19 @@ Available options: --famFile FILE Plink individual file. Accepted file endings are .fam --vcfFile FILE VCF (Variant Call Format) file, optionall gzipped. Accepted file endings are .vcf, .vcf.gz + --pvJanno VERSION Poseidon version (e.g. 2.7.1). (default: 3.0.0) --janno FILE Path to a .janno file. + --pvSSF VERSION Poseidon version (e.g. 2.7.1). (default: 3.0.0) --ssf FILE Path to a .ssf file. --bib FILE Path to a .bib file. + -j,--mandatoryJannoColumn COLNAME + Usually optional .janno file column that should be + treated as mandatory, such as e.g. Individual_ID. Can + be given multiple times. + -s,--mandatorySSFColumn COLNAME + Usually optional .ssf file column that should be + treated as mandatory, such as e.g. poseidon_IDs. Can + be given multiple times. --noExitCode Do not produce an explicit exit code. 
--onlyLatest Consider only the latest versions of packages, or the groups and individuals within the latest versions of diff --git a/trident_guide_archive/trident_guide_1.6.7.1_to_1.6.7.3.md b/trident_guide_archive/trident_guide_1.6.7.1_to_1.6.7.3.md new file mode 100644 index 00000000..e3f6e46c --- /dev/null +++ b/trident_guide_archive/trident_guide_1.6.7.1_to_1.6.7.3.md @@ -0,0 +1,1129 @@ +# Guide for trident v1.6.7.1 to v1.6.7.3 + +## Installation + +See the Poseidon website () or the GitHub repository () for up-to-date installation instructions. + +## Overview + +Trident is a command line software tool structured in multiple subcommands. If you installed it properly you can call it on the command line by typing `trident`. This will show an overview of the general options and all subcommands, which are explained in detail below. + +```default +Usage: trident [--version] [--logMode MODE | --debug] [--errLength INT] + [--inPlinkPopName MODE] (COMMAND | COMMAND) + + trident is a management and analysis tool for Poseidon packages. Report issues + here: https://github.com/poseidon-framework/poseidon-hs/issues + +Available options: + -h,--help Show this help text + --version Show version number + --logMode MODE How information should be reported: NoLog, SimpleLog, + DefaultLog, ServerLog or VerboseLog. + (default: DefaultLog) + --debug Short for --logMode VerboseLog. + --errLength INT After how many characters should a potential genotype + data parsing error message be truncated. "Inf" for no + truncation. (default: CharCount 1500) + --inPlinkPopName MODE Where to read the population/group name from the FAM + file in Plink-format. Three options are possible: + asFamily (default) | asPhenotype | asBoth. 
+ +Package creation and manipulation commands: + init Create a new Poseidon package from genotype data + fetch Download data from a remote Poseidon repository + forge Select packages, groups or individuals and create a + new Poseidon package from them + genoconvert Convert the genotype data in a Poseidon package to a + different file format + jannocoalesce Coalesce information from one or multiple janno files + to another one + rectify Adjust POSEIDON.yml files automatically to package + changes + +Inspection commands: + list List packages, groups or individuals from local or + remote Poseidon repositories + summarise Get an overview over the content of one or multiple + Poseidon packages + survey Survey the degree of context information completeness + for Poseidon packages + validate Check Poseidon packages or package components for + structural correctness +``` + +`trident` allows to work directly with genotype data (see `-p` below), but it is optimized for the interaction with Poseidon packages, which wrap and contextualize the data. Most `trident` subcommands therefore have a central parameter, called `--baseDir` or simply `-d` to specify one or more base directories to look for packages. For example, if all Poseidon packages live inside a repository at `/path/to/poseidon/packages` you would simply say `trident -d /path/to/poseidon/dirs/` and `trident` would automatically search all subdirectories inside of the repository for valid Poseidon packages (as identified by valid POSEIDON.yml files). + +You can arrange a Poseidon repository in a hierarchical way. For example: + +```default +/path/to/poseidon/packages + /modern + /2019_poseidon_package1 + /2019_poseidon_package2 + /ancient + /... + /... + /Reference_Genomes + /... + /... +``` + +This structure then allows to select only the level of packages you are interested in, even individual ones. 
`-d` can be given multiple times, which is particularly useful as you may have your own data to co-analyse with external reference data. In this case you simply need to provide your own genotype data as yet another Poseidon package to be added to your `trident` command. For example, you may have genotype data in `EIGENSTRAT` format (`trident` supports `EIGENSTRAT`, `PLINK` and `VCF` as formats): + +```default +~/my_project/my_project.geno +~/my_project/my_project.snp +~/my_project/my_project.ind +``` + +Then you can transform that into a skeleton Poseidon package with the `init` command. You can also do it manually by simply adding a POSEIDON.yml file, with, for example, the following content: + +```yml +poseidonVersion: 2.7.1 +title: My_awesome_project +description: Unpublished genetic data from my awesome project +contributor: + - name: Stephan Schiffels + email: schiffels@institute.org +packageVersion: 0.1.0 +lastModified: 2020-10-07 +genotypeData: + format: EIGENSTRAT + genoFile: my_project.geno + snpFile: my_project.snp + indFile: my_project.ind +jannoFile: my_project.janno +bibFile: sources.bib +``` + +Two remarks: 1) All file paths in this POSEIDON.yml file are considered _relative_ to the directory in which POSEIDON.yml resides. For this example we assume that this file is added into the same directory as the three genotype files. 2) Besides the genotype data files there are two (technically optional) files referenced in this example: `sources.bib` and `my_project.janno`. Of course you can add them manually - `init` automatically creates empty dummy versions. 
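
The first remark (relative paths) can be illustrated with a small Python sketch. This is not how `trident` itself is implemented; the function name `resolve_package_files` and the directory `/home/user/my_project` are hypothetical and only serve to show how file entries in a POSEIDON.yml would be interpreted relative to the directory that contains it.

```python
from pathlib import PurePosixPath


def resolve_package_files(yml_path, relative_files):
    """Resolve file entries of a POSEIDON.yml relative to the yml's own directory.

    yml_path: path to the POSEIDON.yml file (hypothetical location).
    relative_files: mapping of field name -> relative path as written in the yml.
    """
    base_dir = PurePosixPath(yml_path).parent
    return {field: base_dir / rel for field, rel in relative_files.items()}


# File entries as they appear in the example POSEIDON.yml above
files = {
    "genoFile": "my_project.geno",
    "snpFile": "my_project.snp",
    "indFile": "my_project.ind",
    "jannoFile": "my_project.janno",
    "bibFile": "sources.bib",
}

resolved = resolve_package_files("/home/user/my_project/POSEIDON.yml", files)
for field, path in resolved.items():
    print(f"{field}: {path}")
```

Moving the whole directory therefore keeps the package intact, because no entry in the POSEIDON.yml depends on an absolute location.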
+ +Once you have set up your own Poseidon package (which is really only a skeleton so far), you can add it to your `trident` analysis, by simply adding your project directory to the command using `-d`, for example: + +```bash +trident list -d /path/to/poseidon/packages/modern \ + -d /path/to/poseidon/packages/Reference_Genomes \ + -d ~/my_project \ + --packages +``` + +### Logging and command line output + +For all subcommands the general argument `--logMode` defines how `trident` reports messages (to stderr) on the command line: + +- *NoLog*: Hides all messages. +- *SimpleLog*: Plain and simple output. +- *DefaultLog*: Adds the severity indicators (log levels) `Info`, `Warning` and `Error` before each message. This is the default setting. +- *ServerLog*: Additionally adds timestamps before each message. +- *VerboseLog*: Shows not just messages on the log levels `Info`, `Warning` and `Error` like the other modes, but also on the more verbose level `Debug`. This is mostly relevant for debugging. + +`--debug` is short for `--logMode VerboseLog` to activate this important log level more easily. + +### Package duplicates and versions + +- For `trident` multiple packages in a set of base directories can share the same `title`, if they have different `packageVersion` numbers. If the version numbers are also identical or missing, then `trident` stops with an exception. +- The `trident` subcommands `genoconvert`, `list`, `rectify`, `survey` and `validate` by default consider all versions of each Poseidon package in the given base directories. The `--onlyLatest` flag causes them to instead only consider the latest versions. +- `fetch` and `forge` generally consider all package versions. Their selection language (see below) allows for detailed version handling. +- `summarise` and `jannocoalesce` always consider only the latest package versions. 
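
The version rules above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not `trident`'s actual code: packages are reduced to hypothetical `(title, version)` pairs, versions are dot-separated integers, and a missing version is represented as `None` and treated as older than any explicit version.

```python
def version_key(version):
    """Turn '1.10.0' into (1, 10, 0); a missing version (None) sorts lowest (assumption)."""
    if version is None:
        return ()
    return tuple(int(part) for part in version.split("."))


def check_duplicates(packages):
    """Raise if two packages share both title and version (or both lack a version)."""
    seen = set()
    for title, version in packages:
        if (title, version) in seen:
            raise ValueError(f"Duplicate package: {title} {version}")
        seen.add((title, version))


def only_latest(packages):
    """Keep only the highest version per package title, as --onlyLatest would."""
    check_duplicates(packages)
    latest = {}
    for title, version in packages:
        if title not in latest or version_key(version) > version_key(latest[title]):
            latest[title] = version
    return sorted(latest.items())


packages = [
    ("pac1", "1.0.0"),
    ("pac1", "1.2.0"),
    ("pac2", "0.9.1"),
    ("pac1", "1.10.0"),
]
print(only_latest(packages))
```

Comparing versions as integer tuples rather than strings matters here: as strings, `1.2.0` would sort after `1.10.0`.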
+ +### Individual/Sample duplicates + +- `Poseidon_ID`s (so individual/sample names) within one package have to be unique, or `trident` will stop. +- We also discourage sample duplicates across packages in package repositories, but `trident` will generally continue with them. `validate` will fail though, if the `--ignoreDuplicates` flag is not set. +- `forge` offers a special mechanism to resolve sample duplicates within its selection language. + +### Group names in `.fam` files + +The `.fam` file of PLINK-formatted genotype data is used inconsistently across different popular aDNA software tools to store group/population name information. The (global) `trident` option `--inPlinkPopName` with the arguments `asFamily` (default), `asPhenotype` and `asBoth` allows to control the reading of the population name from PLINK `.fam` files. The subcommands that write genotype data (`forge`, `genoconvert`) have a corresponding option `--outPlinkPopName` to specify this for the output. + +### Whitespaces in the `.janno` file + +While reading the `.janno` file `trident` trims all leading and trailing whitespaces around individual cells. Also all instances of the `No-Break Space` unicode character will be removed. This means these whitespaces will not be preserved when a package is `forge`d. + +## Init command + +`init` creates a new Poseidon package from genotype data files. It adds a POSEIDON.yml file, a dummy `.janno` file for context information and an empty `.bib` file for literature references. + +
+ Command line details + +```default +Usage: trident init ((-p|--genoOne FILE) | --genoFile FILE --snpFile FILE + --indFile FILE | + --bedFile FILE --bimFile FILE --famFile FILE | + --vcfFile FILE) [--snpSet SET] (-o|--outPackagePath DIR) + [-n|--outPackageName STRING] [--minimal] + + Create a new Poseidon package from genotype data + +Available options: + -h,--help Show this help text + -p,--genoOne FILE One of the input genotype data files. Expects .bed, + .bed.gz, .bim, .bim.gz or .fam for PLINK, .geno, + .geno.gz, .snp, .snp.gz or .ind for EIGENSTRAT, + or.vcf or .vcf.gz for VCF. In case of EIGENSTRAT and + PLINK, the two other files must be in the same + directory and must have the same base name. If a + gzipped file is given, it is assumed that the file + pairs (.geno.gz, .snp.gz) or (.bim.gz, .bed.gz) are + both zipped, but not the .fam or .ind file. If a .ind + or .fam file is given, it is assumed that none of the + file triples is zipped. + --genoFile FILE Eigenstrat genotype matrix, optionally gzipped. + Accepted file endings are .geno, .geno.gz + --snpFile FILE Eigenstrat snp positions file, optionally gzipped. + Accepted file endings are .snp, .snp.gz + --indFile FILE Eigenstrat individual file. Accepted file endings are + .ind + --bedFile FILE Plink genotype matrix, optionally gzipped. Accepted + file endings are .bed, .bed.gz + --bimFile FILE Plink snp positions file, optionally gzipped. + Accepted file endings are .bim, .bim.gz + --famFile FILE Plink individual file. Accepted file endings are .fam + --vcfFile FILE VCF (Variant Call Format) file, optionall gzipped. + Accepted file endings are .vcf, .vcf.gz + --snpSet SET The snpSet of the package: 1240K, HumanOrigins or + Other. Only relevant for data input with -p|--genoOne + or --genoFile + --snpFile + --indFile, because the + packages in a -d|--baseDir already have this + information in their respective POSEIDON.yml files. 
+ (default: Other) + -o,--outPackagePath DIR Path to the output package directory. + -n,--outPackageName STRING + The output package name. This is optional: If no name + is provided, then the package name defaults to the + basename of the (mandatory) --outPackagePath + argument. (default: Nothing) + --minimal Should the output Poseidon package be reduced to a + necessary minimum? +``` + +
+ +The command + +```bash +trident init \ + --genoFile path/to/genoFile.geno \ + --snpFile path/to/snpFile.snp \ + --indFile path/to/indFile.ind \ + --snpSet 1240K|HumanOrigins|Other \ + -o path/to/new_package_name +``` + +requires the paths to the respective files (`--genoFile --snpFile --indFile | --bedFile --bimFile --famFile | --vcfFile`), and optionally the "shape" of these files (`--snpSet`), so if they cover the `1240K`, the `HumanOrigins` or an `Other` SNP set. + +A simpler interface is available with `-p (+ --snpSet)`, which only requires a path to one of the genotype data files and automatically discovers the others if they share the same base name: + +```bash +trident init \ + -p path/to/genoFile \ + --snpSet 1240K|HumanOrigins|Other \ + -o path/to/new_package_name +``` + +The following file extensions are expected: + +| | EIGENSTRAT | PLINK | VCF | +|----------|--------------|---------|--------| +| genoFile | `.geno` | `.bed` | `.vcf` | +| snpFile | `.snp` | `.bim` | --- | +| indFile | `.ind` | `.fam` | --- | + +The output package created by `init` is located in a new directory `-o`, which should not already exist when `init` is called, and gets the package title corresponding to the basename of `-o`. You can also set the title explicitly with `-n`. + +The `--minimal` flag causes `init` to create a minimal package with a very basic POSEIDON.yml and no `.bib` and `.janno` files. + +## Fetch command + +`fetch` allows to download Poseidon packages from a remote Poseidon server via a Web API. This server provides all packages in the Poseidon public archives. + +
+ Command line details + +```default +Usage: trident fetch (-d|--baseDir DIR) + (--downloadAll | + (--fetchFile FILE | (-f|--fetchString DSL))) + [--remoteURL URL] [--archive STRING] + + Download data from a remote Poseidon repository + +Available options: + -h,--help Show this help text + -d,--baseDir DIR A base directory to search for Poseidon packages. + --downloadAll Download all packages the server is offering. + --fetchFile FILE A file with a list of packages. Works just as -f, but + multiple values can also be separated by newline, not + just by comma. -f and --fetchFile can be combined. + -f,--fetchString DSL List of packages to be downloaded from the remote + server. Package names should be wrapped in asterisks: + *package_title*. You can combine multiple values with + comma, so for example: "*package_1*, *package_2*, + *package_3*". fetchString uses the same parser as + forgeString, but does not allow excludes. If groups + or individuals are specified, then packages which + include these groups or individuals are included in + the download. + --remoteURL URL URL of the remote Poseidon server. + (default: "https://server.poseidon-adna.org") + --archive STRING The name of the Poseidon package archive that should + be queried. If not given, then the query falls back + to the default archive of the server selected with + --remoteURL. See the archive documentation at + https://www.poseidon-adna.org/#/archive_overview for + a list of archives currently available from the + official Poseidon Web API. (default: Nothing) +``` + +
+ +It works with + +```bash +trident fetch -d ... -d ... \ + -f "*package_title_1*,*package_title_2-1.0.1*,group_name," +``` + +and the entities you want to download must be listed either in a simple string of comma-separated values, which can be passed via `-f`/`--fetchString`, or in a text file (`--fetchFile`). Entities are then combined from these sources. + +Entities are specified using a special syntax (see also the documentation of `forge` below): packages are wrapped in asterisks, with or without a version number appended after a dash (e.g. `*package_title*` or `*package_title-1.2.3*`), group names are spelled as is, and individual names are wrapped in angular brackets (e.g. `<individual>`). Fetch will figure out which packages need to be downloaded to include all specified entities. + +`--downloadAll`, which can be given instead of `-f` and `--fetchFile`, causes fetch to download all packages from the server. The downloaded packages are added in the first (!) `-d` directory (which gets created if it doesn't exist), but downloads are only performed if the respective packages are not already present in the latest version in any of the `-d` directories. + +Note that `trident fetch` is usually used in a workflow with `trident list --remote`: First one inspects what is available on the server with `list`, to then compile a custom, targeted `fetch` command. + +`fetch` has the optional arguments `--remoteURL https://...` to name an alternative Poseidon server and `--archive` to select a specific Poseidon archive on the server. + +## Forge command + +`forge` creates new Poseidon packages by extracting and merging packages, populations and individuals/samples from Poseidon repositories. + +
+ Command line details + +```default +Usage: trident forge ((-d|--baseDir DIR) | + ((-p|--genoOne FILE) | --genoFile FILE --snpFile FILE + --indFile FILE | + --bedFile FILE --bimFile FILE --famFile FILE | + --vcfFile FILE) [--snpSet SET]) + [--forgeFile FILE | (-f|--forgeString DSL)] + [--selectSnps FILE] [--intersect] [--outFormat FORMAT] + [--onlyGeno | --minimal | --preservePyml] [-z|--zip] + (-o|--outPackagePath DIR) [-n|--outPackageName STRING] + [--packagewise] [--outPlinkPopName MODE] [--ordered] + + Select packages, groups or individuals and create a new Poseidon package from + them + +Available options: + -h,--help Show this help text + -d,--baseDir DIR A base directory to search for Poseidon packages. + -p,--genoOne FILE One of the input genotype data files. Expects .bed, + .bed.gz, .bim, .bim.gz or .fam for PLINK, .geno, + .geno.gz, .snp, .snp.gz or .ind for EIGENSTRAT, + or.vcf or .vcf.gz for VCF. In case of EIGENSTRAT and + PLINK, the two other files must be in the same + directory and must have the same base name. If a + gzipped file is given, it is assumed that the file + pairs (.geno.gz, .snp.gz) or (.bim.gz, .bed.gz) are + both zipped, but not the .fam or .ind file. If a .ind + or .fam file is given, it is assumed that none of the + file triples is zipped. + --genoFile FILE Eigenstrat genotype matrix, optionally gzipped. + Accepted file endings are .geno, .geno.gz + --snpFile FILE Eigenstrat snp positions file, optionally gzipped. + Accepted file endings are .snp, .snp.gz + --indFile FILE Eigenstrat individual file. Accepted file endings are + .ind + --bedFile FILE Plink genotype matrix, optionally gzipped. Accepted + file endings are .bed, .bed.gz + --bimFile FILE Plink snp positions file, optionally gzipped. + Accepted file endings are .bim, .bim.gz + --famFile FILE Plink individual file. Accepted file endings are .fam + --vcfFile FILE VCF (Variant Call Format) file, optionall gzipped. 
+ Accepted file endings are .vcf, .vcf.gz + --snpSet SET The snpSet of the package: 1240K, HumanOrigins or + Other. Only relevant for data input with -p|--genoOne + or --genoFile + --snpFile + --indFile, because the + packages in a -d|--baseDir already have this + information in their respective POSEIDON.yml files. + (default: Other) + --forgeFile FILE A file with a list of packages, groups or individual + samples. Works just as -f, but multiple values can + also be separated by newline, not just by comma. + Empty lines are ignored and comments start with "#", + so everything after "#" is ignored in one line. + Multiple instances of -f and --forgeFile can be + given. They will be evaluated according to their + input order on the command line. + -f,--forgeString DSL List of packages, groups or individual samples to be + combined in the output package. Packages follow the + syntax *package_title*, populations/groups are simply + group_id and individuals . You can + combine multiple values with comma, so for example: + "*package_1*, , , + group_1". Duplicates are treated as one entry. + Negative selection is possible by prepending "-" to + the entity you want to exclude (e.g. "*package_1*, + -, -group_1"). forge will apply + excludes and includes in order. If the first entity + is negative, then forge will assume you want to merge + all individuals in the packages found in the baseDirs + (except the ones explicitly excluded) before the + exclude entities are applied. An empty forgeString + (and no --forgeFile) will therefore merge all + available individuals. If there are individuals in + your input packages with equal individual id, but + different main group or source package, they can be + specified with the special syntax + "". + --selectSnps FILE To extract specific SNPs during this forge operation, + provide a Snp file. Can be either Eigenstrat (file + ending must be '.snp' or '.snp.gz') or Plink (file + ending must be '.bim' or '.bim.gz'). 
When this option + is set, the output package will have exactly the SNPs + listed in this file. Any SNP not listed in the file + will be excluded. If option '--intersect' is also + set, only the SNPs overlapping between the SNP file + and the forged packages are output. + (default: Nothing) + --intersect Whether to output the intersection of the genotype + files to be forged. The default (if this option is + not set) is to output the union of all SNPs, with + genotypes defined as missing in those packages which + do not have a SNP that is present in another package. + With this option set, the forged dataset will + typically have fewer SNPs, but less missingness. + --outFormat FORMAT The format of the output genotype data: EIGENSTRAT, + PLINK or VCF. (default: PLINK) + --onlyGeno Should only the resulting genotype data be returned? + This means the output will not be a Poseidon package. + --minimal Should the output Poseidon package be reduced to a + necessary minimum? + --preservePyml Should the output Poseidon package mimic the input + package? With this option some fields of the source + package's POSEIDON.yml file, its README file and its + CHANGELOG file (if available) are copied to the + output package. Only works for a singular source + package. + -z,--zip Should the resulting genotype- and snp-files be + gzipped? + -o,--outPackagePath DIR Path to the output package directory. + -n,--outPackageName STRING + The output package name. This is optional: If no name + is provided, then the package name defaults to the + basename of the (mandatory) --outPackagePath + argument. (default: Nothing) + --packagewise Skip the within-package selection step in forge. This + will result in outputting all individuals in the + relevant packages, and hence a superset of the + requested individuals/groups. It may result in better + performance in cases where one wants to forge entire + packages or almost entire packages. 
Details: Forge + conceptually performs two types of selection: First, + it identifies which packages in the supplied base + directories are relevant to the requested forge, i.e. + whether they are either explicitly listed using + *PackageName*, or because they contain selected + individuals or groups. Second, within each relevant + package, individuals which are not requested are + removed. This option skips only the second step, but + still performs the first. + --outPlinkPopName MODE Where to write the population/group name into the FAM + file in Plink-format. Three options are possible: + asFamily (default) | asPhenotype | asBoth. See also + --inPlinkPopName. + --ordered With this option, the output of forge is ordered + according to the entities given. +``` + +
+ +`forge` can be used with + +```bash +trident forge -d ... -d ... \ + -f "*package_name*, group_id, <individual_id>" \ + -o path/to/new_package_name +``` + +where the entities (packages, groups/populations, individuals/samples) you want in the output package can be denoted either as a string on the command line (`-f`/`--forgeString`), or in an input text file (`--forgeFile`). See the section below for the syntax of this selection language. Do not forget to wrap the `--forgeString` query in quotes. + +Including one or multiple Poseidon packages with `-d` is not the only way to include data for a forge operation. It is also possible to consider unpackaged genotype data directly with `-p (+ --snpSet)`, `--genoFile + --snpFile + --indFile (+ --snpSet)` (for EIGENSTRAT data), `--bedFile + --bimFile + --famFile (+ --snpSet)` (for PLINK data) or `--vcfFile (+ --snpSet)` (for VCF data). This makes the following example possible, where we merge data from one Poseidon package and two unpackaged genotype datasets to get a new EIGENSTRAT dataset. + +```bash +trident forge \ + -d 2017_GonzalesFortesCurrentBiology \ + -p 2018_VeeramahPNAS/2018_VeeramahPNAS.fam \ + --bedFile 2017_HaberAJHG/2017_HaberAJHG.bed \ + --bimFile 2017_HaberAJHG/2017_HaberAJHG.bim \ + --famFile 2017_HaberAJHG/2017_HaberAJHG.fam \ + -f ",,Iberia_HG.SG" \ + -o testpackage \ + --outFormat EIGENSTRAT \ + --onlyGeno +``` + +### The forge selection language + +The text in `--forgeString` and `--forgeFile` (and with a reduced syntax also in `--fetchString` and `--fetchFile`) is parsed as a domain-specific query language that describes precisely which entities should be compiled in the output package of a given `forge` operation. The language has multiple syntactic elements and a specific evaluation logic. + +In general a `--forgeString` query consists of multiple entities, separated by `,`.
The main entities are Poseidon packages, groups/populations and individuals/samples: + +- Each package title is surrounded by `*`: `*package*`. That means if you want all individuals of the Poseidon package `2019_Jeong_InnerEurasia` in the output package you would add `*2019_Jeong_InnerEurasia*` to the query. +- Groups/populations are not specially marked: `group`. So to get all individuals of the group `Swiss_Roman_Period`, you would simply add `Swiss_Roman_Period`. +- Individuals/samples are surrounded by `<` and `>`: `<individual>`. `ALA026` therefore becomes `<ALA026>`. A second way to denote individuals is with the more verbose and specific syntax `<package:group:individual>`. Individuals defined this way take precedence over differently defined ones (so directly with `<individual>` or as a subset of `*package*` or `group`). This makes it possible to resolve duplication issues precisely -- at least in cases where the duplicated individuals differ in source package or primary group. +- Package versions can be appended to package names, such as `*package-1.2.3*`. +- This also works with the verbose individual syntax: `<package-1.2.3:group:individual>`. + +In the `--forgeFile` each line is treated as a separate forgeString, empty lines are ignored and `#` symbols start comments. So this is a valid example of a forgeFile: + +```default +# Packages +*pac1*, *pac2-1.2.3* + +# Groups and individuals from other packages beyond pac1 and pac2 +group1, , group2, , + +# pac2 has two outlier individuals that should be ignored +- # This one has very low coverage +- # This one is from a different time period +``` + +By prepending `-` to entities, we can exclude them from the forged package (this feature is not available for `fetch`). `forge` figures out the final list of samples to include by interpreting all forge-entities in order. So an entity list `*pac1*,-<ind1>,group1` may result in a different outcome than `*pac1*,group1,-<ind1>`, depending on whether `<ind1>` belongs to `group1` or not. 
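
The order dependence described above can be made concrete with a toy model in Python. This is a deliberately simplified sketch and not `trident`'s parser or selection algorithm: it ignores package versions, the verbose `<package:group:individual>` syntax and the implicit "select everything" rule for queries that start with an exclusion. The sample table and the names `pac1`, `pac2`, `group1`, `group2`, `ind1` to `ind3` are hypothetical.

```python
def parse_entity(text):
    """Classify one entity string as (is_exclusion, kind, name)."""
    text = text.strip()
    exclude = text.startswith("-")
    if exclude:
        text = text[1:]
    if text.startswith("*") and text.endswith("*"):
        return exclude, "package", text[1:-1]
    if text.startswith("<") and text.endswith(">"):
        return exclude, "individual", text[1:-1]
    return exclude, "group", text


def select(samples, forge_string):
    """Apply includes and excludes strictly in the order they are written."""
    column = {"package": 0, "group": 1, "individual": 2}
    selected = []
    for raw in forge_string.split(","):
        exclude, kind, name = parse_entity(raw)
        idx = column[kind]
        if exclude:
            # An exclusion removes matching samples that were selected so far
            selected = [s for s in selected if s[idx] != name]
        else:
            # An inclusion adds matching samples that are not yet selected
            for sample in samples:
                if sample[idx] == name and sample not in selected:
                    selected.append(sample)
    return [individual for _, _, individual in selected]


# Each sample is (package, group, individual)
samples = [
    ("pac1", "group1", "ind1"),
    ("pac1", "group2", "ind2"),
    ("pac2", "group1", "ind3"),
]

first = select(samples, "*pac1*,-<ind1>,group1")
second = select(samples, "*pac1*,group1,-<ind1>")
print(first)   # ind1 is removed, then re-added through group1
print(second)  # ind1 is added, then removed at the end
```

Because `ind1` belongs to `group1` in this toy table, the first query ends up containing it and the second does not.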
+ +If the forge entity list starts with a negative entity, or if the entity list is empty, `forge` will implicitly assume you want to include all individuals in all **latest** versions of packages found in the base directories (except the ones explicitly excluded, of course). + +The specific semantics of the various ways to include or exclude entities are as follows: + +#### Inclusion queries + +* `*pac1*`: Select all individuals in the latest version of package "pac1" +* `*pac1-1.0.1*`: Select all individuals in package "pac1" with version "1.0.1" +* `group1`: Select all individuals associated with "group1" in all latest versions of all packages +* `<ind1>`: Select the individual named "ind1", searching in all latest packages. +* `<pac1:group1:ind1>`: Select the individual named "ind1" associated with "group1" in the latest version of package "pac1" +* `<pac1-1.0.1:group1:ind1>`: Select the individual named "ind1" associated with "group1" in the package "pac1" with version "1.0.1" + +#### Exclusion queries + +* `-*pac1*`: Remove all individuals in all versions of package "pac1" +* `-*pac1-1.0.1*`: Remove only individuals in package "pac1" with version "1.0.1" (but leave other versions in) +* `-group1`: Remove all individuals associated with "group1" in all versions of all packages (not just the latest) +* `-<ind1>`: Remove all individuals named "ind1" in all versions of all packages (not just the latest) +* `-<pac1:group1:ind1>`: Remove the individual named "ind1" associated with "group1", searching in all versions of package "pac1" +* `-<pac1-1.0.1:group1:ind1>`: Remove the individual named "ind1" associated with "group1", but only if they are in "pac1" with version "1.0.1" + +If a query results in multiple individuals with the same name, forge will throw an error. + +### Ordered output + +By default the order of samples in a Poseidon package created with `forge` depends on the order in which the relevant source packages are discovered by `trident` (e.g.
when it crawls for packages in the `-d` base directories) and then the sample order within these packages. + +The option `--ordered` gives more control over the output order. It causes `trident` to output the resulting package with samples ordered according to the selection in `-f` or `--forgeFile`. This works through an alternative, slower sample selection algorithm that loops through the list of entities and checks for each entity which samples it adds or removes respectively to and from the final selection. + +For simple, positive selection, packages, groups and samples are added as expected. Negative selection removes samples from the list again. If an entity is selected twice via positive selection, then its first occurrence is considered for the ordering. + +#### Reordering samples in a package + +One particular application of `--ordered` is the reordering of samples in an existing Poseidon package, here for example `MyPac`. We suggest the following workflow for this application: + +1. Generate a `--forgeFile` with the desired order of the samples in `MyPac`. This can be done manually or with any suitable tool. Here is an example, where we employ `qjanno` to generate a `forge` selection so that the samples are ordered alphabetically by their `Poseidon_ID`: + +```bash +qjanno "SELECT '<'||Poseidon_ID||'>' FROM d(MyPac) ORDER BY Poseidon_ID" \ + --raw --noOutHeader > myOrder.txt +``` + +2. Use `trident forge` with `--ordered` and `--preservePyml` (see below) to create the package with the specified order: + +```bash +trident forge -d MyPac --forgeFile myOrder.txt -o MyPac2 --ordered --preservePyml +``` + +3. 
Apply `trident rectify` to increment the package version number and document the reordering: + +```bash +trident rectify -d MyPac2 --packageVersion Minor \ + --logText "reordered the samples alphabetically by Poseidon_ID" +``` + +`MyPac2` then acts as a stand-in replacement for `MyPac` that only differs in the order of samples (and maybe the order of variables/fields in the `POSEIDON.yml`, `.janno`, `.ssf` or `.bib` files). + +### Treatment of the genotype data while merging + +Forge performs a series of steps to merge the genotype data of multiple source files: + +1. Genotype data from each package is streamed in parallel. Because our packages may have different SNP locations (specified by chromosome-position pairs) listed in their `.bim`/`.snp` or `.vcf` file, we first perform a zipping-operation, whose behaviour depends on whether `--intersect` is set or not. Without `--intersect`, any SNP position listed in any package will be forwarded to the output, with missing values being filled in in all packages that do not list that particular SNP. With `--intersect`, only SNP positions that are present in all packages are considered. Note that relevant for this step is only whether a given SNP position is part of the genotype data, not whether the actual genotypes are missing or not. +2. At each SNP, the consensus alleles are selected, by collecting all reference and alternative alleles from all sources. If more than two non-dummy alleles (alleles different from `N`) are present in that collection, an error is thrown. If exactly two non-dummy alleles are present (which should be the case for binary SNPs), the two alleles are declared "reference" and "alternative" alleles for the output. If only one non-dummy allele is present, it is set to be the reference allele, and "N" is set to be the alternative. +3. All source genotype data is then read and recoded in terms of the two chosen consensus alleles. 
This will make sure that source data with flipped reference and alternative alleles gets correctly merged in. +4. SNP IDs, as part of PLINK `.bim` and `.vcf` files, are checked across the source files. If all SNP IDs for a given SNP are missing, then the result will also be missing. If there is only one SNP ID present in some or all source packages, that ID gets forwarded to the output. In the (unusual) case that there are multiple different non-missing SNP IDs (of the form "rs" followed by a number), then a debug warning is output (which gets printed to the screen when `--debug` is selected), and simply the first value is chosen to be output into the forged `.bim` file. We decided not to throw an error in that case, because we consider the physical position of the SNP (specified by chromosome and position) to be definitive, and the SNP ID to be of secondary importance. +5. Genetic positions, as part of PLINK `.bim` files, are checked in a similar manner, with "0.0" being interpreted as missing. + +### Treatment of the `.janno` file while merging + +`forge` merges and subsets `.janno` files along with the genotype data. If a package lacks a `.janno` file, then a basic one will be created internally and on the fly based on the information in the genotype data, and used for the output. Missing columns across packages will be filled with `n/a`. + +For merging two `.janno` files **A** and **B** the following rules apply regarding undefined, arbitrary additional columns: + +- If **A** has an additional column which is not in **B** then empty cells in the rows imported from **B** are filled with `n/a`. +- If **A** and **B** share additional columns with identical column names, then they are treated as semantically identical units and merged accordingly. +- In the resulting `.janno` file, all additional columns from both **A** and **B** are sorted alphabetically and appended after the normal, specified variables. 
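
The three rules above can be sketched in a few lines of Python. This is only an illustration of the described behaviour, not `trident`'s actual implementation: the function name `merge_janno`, the representation of rows as dictionaries, and the caller-supplied list of specified columns are all assumptions made for this example.

```python
def merge_janno(a_rows, b_rows, core_columns):
    """Merge two .janno tables given as lists of {column: value} dicts.

    core_columns is a hypothetical, caller-supplied list of the specified
    .janno variables in their desired order; everything else is treated
    as an additional column.
    """
    rows = a_rows + b_rows
    seen = {col for row in rows for col in row}
    # Additional columns from both inputs are sorted alphabetically ...
    additional = sorted(seen - set(core_columns))
    # ... and appended after the specified variables.
    core = [col for col in core_columns if col in seen]
    columns = core + additional
    # Cells that a row cannot provide are filled with "n/a".
    merged = [{col: row.get(col, "n/a") for col in columns} for row in rows]
    return columns, merged


a = [{"Poseidon_ID": "XXX011", "Group_Name": "POP1", "Genetic_Sex": "M",
      "AdditionalColumn1": "A", "AdditionalColumn2": "D"}]
b = [{"Poseidon_ID": "YYY022", "Group_Name": "POP5", "Genetic_Sex": "F",
      "AdditionalColumn3": "G", "AdditionalColumn2": "J"}]
columns, merged = merge_janno(a, b, ["Poseidon_ID", "Group_Name", "Genetic_Sex"])
```

Columns with the same name are merged simply because both inputs write into the same dictionary key.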
+ +The following example illustrates the described behaviour: + +**A.janno** + +| Poseidon_ID | Group_Name | Genetic_Sex | AdditionalColumn1 | AdditionalColumn2 | +|-------------|------------|-------------|-------------------|-------------------| +| XXX011 | POP1 | M | A | D | +| XXX012 | POP2 | F | B | E | +| XXX013 | POP1 | M | C | F | + +**B.janno** + +| Poseidon_ID | Group_Name | Genetic_Sex | AdditionalColumn3 | AdditionalColumn2 | +|-------------|------------|-------------|-------------------|-------------------| +| YYY022 | POP5 | F | G | J | +| YYY023 | POP5 | F | H | K | +| YYY024 | POP5 | M | I | L | + +**A.janno + B.janno** + +| Poseidon_ID | Group_Name | Genetic_Sex | AdditionalColumn1 | AdditionalColumn2 | AdditionalColumn3 | +|-------------|------------|-------------|-------------------|-------------------|-------------------| +| XXX011 | POP1 | M | A | D | n/a | +| XXX012 | POP2 | F | B | E | n/a | +| XXX013 | POP1 | M | C | F | n/a | +| YYY022 | POP5 | F | n/a | J | G | +| YYY023 | POP5 | F | n/a | K | H | +| YYY024 | POP5 | M | n/a | L | I | + +### Treatment of the `.ssf` file while merging + +The Sequencing Source File (short `.ssf` file) is forged in exactly the same way as the `.janno` file. `.ssf` files that are present are included in the forge product, following selection of those entities which are listed in the `poseidon_IDs` columns. Columns that are only present in some packages, including those not defined in the Poseidon package specification, are also included in the forged product in the same way as described for `.janno` files above. + +### Treatment of the `.bib` file while merging + +In the forge process all relevant samples for the output package are determined. This includes their `.janno` entries and therefore the information on the publication keys documented for them in the `.janno` `Publication` column. The output `.bib` file compiles only the relevant references for the samples in the output package. 
It includes the references exactly once and is sorted alphabetically by key. + +### Output modes + +The output package of `forge` is created as a new directory `-o`. The title can also be explicitly defined with `-n`. + +`forge` by default returns a new output package with a generic `POSEIDON.yml` file, the genotype data as created from the input and the selection, and a `.janno` file. If the input includes `.bib` or `.ssf` files, the output will as well. + +Other output formats can be selected with these mutually exclusive flags: + +**`--onlyGeno`:** + +Only the genotype data is returned without any Poseidon package wrapping around it. This is especially useful for data analysis pipelines, where only the genotype data is required. + +**`--minimal`:** + +A minimal output package without `.janno`, `.bib` and `.ssf`. This wraps the genotype data in a very basic Poseidon package. + +**`--preservePyml`:** + +A full Poseidon package just as the default, but with various settings copied from the source package. This only works in case of a single source package. + +For the specific task of sub-setting or reordering (see above) a singular, existing Poseidon package it can be useful to preserve some fields of the `POSEIDON.yml` file of this input package, as well as supplementary information in the `README.md` and the `CHANGELOG.md` file. These are typically discarded by `forge`, but can be copied over to the output package with the new `--preservePyml` output mode. + +`--preservePyml` specifically preserves the following `POSEIDON.yml` fields: + +- `description` +- `contributor` +- `packageVersion` +- `lastModified` +- `readmeFile` +- `changelogFile` + +This does not include the package `title`, which can be easily set to be identical to the source with `-n` or `-o` if it is desired. The `poseidonVersion` field is also not copied, because `trident` can only ever produce output packages with the latest Poseidon schema version. 
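
As a rough mental model of `--preservePyml`, the following Python sketch copies exactly the six listed fields from a source `POSEIDON.yml` (represented here as a plain dictionary) into a freshly generated one. The function name `preserve_pyml` and the dictionary representation are assumptions for illustration; how `trident` handles this internally is not described here.

```python
# The POSEIDON.yml fields that --preservePyml carries over, as listed above.
PRESERVED_FIELDS = (
    "description",
    "contributor",
    "packageVersion",
    "lastModified",
    "readmeFile",
    "changelogFile",
)


def preserve_pyml(source_yml, fresh_yml):
    """Return a copy of fresh_yml with the preserved fields taken from source_yml."""
    out = dict(fresh_yml)
    for field in PRESERVED_FIELDS:
        if field in source_yml:
            out[field] = source_yml[field]
    return out


source = {"title": "MyPac", "poseidonVersion": "2.5.0",
          "description": "My package", "packageVersion": "1.2.0"}
fresh = {"title": "MyPac2", "poseidonVersion": "X.X.X", "packageVersion": "0.1.0"}
result = preserve_pyml(source, fresh)
```

In this sketch `title` and `poseidonVersion` keep the values of the fresh file, in line with the two exceptions named above.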
+ +With `-z|--zip` the genotype data output (independent of the selected output mode) can be wrapped in gzipped archives with the additional file extension `.gz`. `trident` can seamlessly interact with genotype data in this format. + +### Other options + +`forge` has an optional flag `--intersect` that defines whether the genotype data from different packages should be merged with a union or an intersect operation. See *Treatment of the genotype data while merging* above. + +`--intersect` also influences the automatic determination of the `snpSet` field in the POSEIDON.yml file for the resulting package. If the `snpSet`s of all input packages are identical, then the resulting package will just inherit this configuration. Otherwise `forge` applies the following pairwise merging logic: + +| Input snpSet A | Input snpSet B | `--intersect` | Output snpSet | +|----------------|----------------|---------------|---------------| +| Other | * | * | Other | +| 1240K | HumanOrigins | True | HumanOrigins | +| 1240K | HumanOrigins | False | 1240K | + +`--selectSnps` allows you to provide `forge` with a SNP file in EIGENSTRAT (`.snp`) or PLINK (`.bim`) format to create a package with a specific selection. When this option is set, the output package will have exactly the SNPs listed in this file. Any SNP not listed in the file will be excluded. If `--intersect` is also set, only the SNPs overlapping between the SNP file and the forged packages are output. + +With `--packagewise` the within-package selection step in `forge` can be skipped. This will result in outputting all individuals in the relevant packages, and hence a superset of the requested individuals/groups. It may result in better performance in cases where one wants to forge entire packages. + +## Genoconvert command + +`genoconvert` converts the genotype data in a Poseidon package to a different file format. The respective entries in the POSEIDON.yml file are changed accordingly. + +
+ Command line details + +```default +Usage: trident genoconvert ((-d|--baseDir DIR) | + ((-p|--genoOne FILE) | --genoFile FILE + --snpFile FILE --indFile FILE | + --bedFile FILE --bimFile FILE --famFile FILE | + --vcfFile FILE) [--snpSet SET]) + --outFormat FORMAT [-o|--outPackagePath DIR] + [--removeOld] [--outPlinkPopName MODE] [--onlyLatest] + [-z|--zip] + + Convert the genotype data in a Poseidon package to a different file format + +Available options: + -h,--help Show this help text + -d,--baseDir DIR A base directory to search for Poseidon packages. + -p,--genoOne FILE One of the input genotype data files. Expects .bed, + .bed.gz, .bim, .bim.gz or .fam for PLINK, .geno, + .geno.gz, .snp, .snp.gz or .ind for EIGENSTRAT, + or.vcf or .vcf.gz for VCF. In case of EIGENSTRAT and + PLINK, the two other files must be in the same + directory and must have the same base name. If a + gzipped file is given, it is assumed that the file + pairs (.geno.gz, .snp.gz) or (.bim.gz, .bed.gz) are + both zipped, but not the .fam or .ind file. If a .ind + or .fam file is given, it is assumed that none of the + file triples is zipped. + --genoFile FILE Eigenstrat genotype matrix, optionally gzipped. + Accepted file endings are .geno, .geno.gz + --snpFile FILE Eigenstrat snp positions file, optionally gzipped. + Accepted file endings are .snp, .snp.gz + --indFile FILE Eigenstrat individual file. Accepted file endings are + .ind + --bedFile FILE Plink genotype matrix, optionally gzipped. Accepted + file endings are .bed, .bed.gz + --bimFile FILE Plink snp positions file, optionally gzipped. + Accepted file endings are .bim, .bim.gz + --famFile FILE Plink individual file. Accepted file endings are .fam + --vcfFile FILE VCF (Variant Call Format) file, optionall gzipped. + Accepted file endings are .vcf, .vcf.gz + --snpSet SET The snpSet of the package: 1240K, HumanOrigins or + Other. 
Only relevant for data input with -p|--genoOne + or --genoFile + --snpFile + --indFile, because the + packages in a -d|--baseDir already have this + information in their respective POSEIDON.yml files. + (default: Other) + --outFormat FORMAT the format of the output genotype data: EIGENSTRAT, + PLINK or VCF. + -o,--outPackagePath DIR Path for the converted genotype files to be written + to. If a path is provided, only the converted + genotype files are written out, with no change of the + original package. If no path is provided, genotype + files will be converted in-place, including a change + in the POSEIDON.yml file to yield an updated valid + package (default: Nothing) + --removeOld Remove the old genotype files when creating the new + ones. + --outPlinkPopName MODE Where to write the population/group name into the FAM + file in Plink-format. Three options are possible: + asFamily (default) | asPhenotype | asBoth. See also + --inPlinkPopName. + --onlyLatest Consider only the latest versions of packages, or the + groups and individuals within the latest versions of + packages, respectively. + -z,--zip Should the resulting genotype- and snp-files be + gzipped? +``` + +
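
The file-name rule that the help text gives for `-p|--genoOne` can be sketched as follows. This is a Python illustration under stated assumptions, not the code `trident` runs: the function name `infer_geno_files` and the returned dictionary layout are made up for this example, and the behaviour for an unzipped `.bed`, `.bim`, `.geno` or `.snp` file (no companion file is zipped) is an assumption, since the help text only spells out the gzipped and the `.ind`/`.fam` cases.

```python
def infer_geno_files(path):
    """Derive the companion genotype files from one file given to -p|--genoOne."""
    gz = path.endswith(".gz")
    stem = path[:-3] if gz else path
    base, dot, ext = stem.rpartition(".")
    if not dot:
        raise ValueError(f"no file extension found in {path!r}")
    # Only the genotype and SNP files are expected to be gzipped,
    # never the .fam or .ind file.
    if gz and ext in ("fam", "ind"):
        raise ValueError(f"{path!r}: .fam and .ind files are not expected to be gzipped")
    suffix = ".gz" if gz else ""
    if ext in ("bed", "bim", "fam"):
        return {"format": "PLINK",
                "geno": f"{base}.bed{suffix}",
                "snp": f"{base}.bim{suffix}",
                "ind": f"{base}.fam"}
    if ext in ("geno", "snp", "ind"):
        return {"format": "EIGENSTRAT",
                "geno": f"{base}.geno{suffix}",
                "snp": f"{base}.snp{suffix}",
                "ind": f"{base}.ind"}
    if ext == "vcf":
        return {"format": "VCF", "vcf": path}
    raise ValueError(f"unsupported genotype file: {path!r}")


plink = infer_geno_files("2018_Mittnik_Baltic/Mittnik_Baltic.bed")
eigenstrat = infer_geno_files("my_data/data.snp.gz")
```

Here `plink` lists `Mittnik_Baltic.bed`, `.bim` and `.fam` in the same directory, while `eigenstrat` pairs the gzipped `.geno.gz` and `.snp.gz` files with an unzipped `.ind` file.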
+ +With the default setting + +```bash +trident genoconvert -d ... -d ... --outFormat EIGENSTRAT|PLINK +``` + +all packages in `-d` will be converted to the desired `--outFormat` (either `EIGENSTRAT` or `PLINK`), if the data is not already in this format. This includes updating the respective POSEIDON.yml files. + +The "old" data is not deleted, but kept around. That means conversion can result in a package with both PLINK and EIGENSTRAT data, but only one is linked in the POSEIDON.yml file, and that is what will be used by `trident`. To delete the old data in the conversion you can add the `--removeOld` flag. + +`-p (+ --snpSet)`, `--genoFile + --snpFile + --indFile (+ --snpSet)` (for EIGENSTRAT data), `--bedFile + --bimFile + --famFile (+ --snpSet)` (for PLINK data) or `--vcfFile (+ --snpSet)` (for VCF data) allow you to directly convert genotype data that is not wrapped in a Poseidon package and store it in a directory given in `-o`. See this example: + +```bash +trident genoconvert \ + -p 2018_Mittnik_Baltic/Mittnik_Baltic.bed \ + --outFormat EIGENSTRAT \ + -o my_directory +``` + +With `-z|--zip` the genotype data output can be wrapped in gzipped archives with the additional file extension `.gz`. + +## Jannocoalesce command + +`jannocoalesce` merges information from one or multiple source `.janno` files into a target `.janno` file. + +
+ Command line details + +```default +Usage: trident jannocoalesce ((-s|--sourceFile FILE) | (-d|--baseDir DIR)) + (-t|--targetFile FILE) [-o|--outFile FILE] + [--includeColumns ARG | --excludeColumns ARG] + [-f|--force] [--sourceKey ARG] [--targetKey ARG] + [--stripIdRegex ARG] + + Coalesce information from one or multiple janno files to another one + +Available options: + -h,--help Show this help text + -s,--sourceFile FILE The source .janno file. + -d,--baseDir DIR A base directory to search for Poseidon packages. + -t,--targetFile FILE The target .janno file to fill. + -o,--outFile FILE An optional file to write the results to. If not + specified, change the target file in place. + (default: Nothing) + --includeColumns ARG A comma-separated list of .janno column names to + coalesce. If not specified, all columns that can be + found in the source and target will get filled. + --excludeColumns ARG A comma-separated list of .janno column names NOT to + coalesce. All columns that can be found in the source + and target will get filled, except the ones listed + here. + -f,--force With this option, potential non-missing content in + target columns gets overridden with non-missing + content in source columns. By default, only missing + data gets filled-in. + --sourceKey ARG The .janno column to use as the source key. + (default: "Poseidon_ID") + --targetKey ARG The .janno column to use as the target key. + (default: "Poseidon_ID") + --stripIdRegex ARG An optional regular expression to identify parts of + the IDs to strip before matching between source and + target. Uses POSIX Extended regular expressions. +``` + +
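
How the options in the help text above interact can be sketched in Python. This is an illustrative model only, with several assumptions: rows are plain dictionaries, both the empty string and `n/a` count as missing, the first matching source row wins, and Python's `re` module stands in for the POSIX Extended regular expressions that `--stripIdRegex` actually uses. The function name `coalesce` and its parameters are hypothetical.

```python
import re

MISSING = ("", "n/a")  # assumption: both values count as "missing"


def coalesce(source_rows, target_rows, source_key="Poseidon_ID",
             target_key="Poseidon_ID", include=None, exclude=(),
             force=False, strip_id_regex=None):
    """Fill cells of target_rows with values from matching source_rows."""
    def normalise(value):
        # Strip the matched part of the ID before comparing keys.
        return re.sub(strip_id_regex, "", value) if strip_id_regex else value

    lookup = {}
    for row in source_rows:
        lookup.setdefault(normalise(row[source_key]), row)

    result = []
    for row in target_rows:
        new_row = dict(row)
        src = lookup.get(normalise(row[target_key]))
        if src is not None:
            for col, old in row.items():
                # Only columns present in both source and target are considered.
                if col == target_key or col not in src:
                    continue
                if include is not None and col not in include:
                    continue
                if col in exclude:
                    continue
                new = src[col]
                # Missing source values never replace anything; existing
                # target values are only overwritten with force=True.
                if new not in MISSING and (old in MISSING or force):
                    new_row[col] = new
        result.append(new_row)
    return result


source = [{"Poseidon_ID": "S1", "Country": "Sweden", "Site": "A"}]
target = [{"Poseidon_ID": "S1_SG", "Country": "n/a", "Site": "B"}]
filled = coalesce(source, target, strip_id_regex=r"_SG$")
forced = coalesce(source, target, strip_id_regex=r"_SG$", force=True)
```

In `filled` only the missing `Country` cell is taken from the source, whereas `forced` also replaces the existing `Site` value.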
+ +The most basic run includes just two arguments: + +```bash +trident jannocoalesce \ + --sourceFile path/to/source.janno \ + --targetFile path/to/target.janno +``` + +`jannocoalesce` generally works by reading a source `.janno` file with `-s|--sourceFile` (or all `.janno` files in a `-d|--baseDir`) and a target `.janno` file with `-t|--targetFile`. + +It then merges these files by a key column, which can be selected with `--sourceKey` and `--targetKey`. The default for both of these key columns is the `Poseidon_ID`. In case the entries in the key columns differ slightly and systematically, e.g. because the `Poseidon_ID`s in one of the files have a special suffix (for example `_SG`), then the `--stripIdRegex` option allows you to strip these with a regular expression so that the keys match. + +`jannocoalesce` generally attempts to fill **all** empty cells in the target `.janno` file with information from the source. `--includeColumns` and `--excludeColumns` allow you to select specific columns for which this should be done. In some cases it may be desirable to not just fill empty fields in the target, but to overwrite the information already there with the `-f|--force` option. If the target file should be preserved, then the output can be directed to a new output `.janno` file with `-o|--outFile`. + +## Rectify command + +`rectify` automatically harmonizes POSEIDON.yml files of one or multiple packages. This is not an automatic update from one Poseidon version to the next, but rather a clean-up wizard after manual modifications. It also includes additional, automatic package editing features. + +
+ Command line details + +```default +Usage: trident rectify (-d|--baseDir DIR) [--ignorePoseidonVersion] + [--poseidonVersion ?.?.?] + [--packageVersion VPART [--logText STRING]] + [--checksumAll | [--checksumGeno] [--checksumJanno] + [--checksumSSF] [--checksumBib]] + [--newContributors DSL] [--jannoRemoveEmpty] + [--onlyLatest] + + Adjust POSEIDON.yml files automatically to package changes + +Available options: + -h,--help Show this help text + -d,--baseDir DIR A base directory to search for Poseidon packages. + --ignorePoseidonVersion Read packages even if their poseidonVersion is not + compatible with trident. + --poseidonVersion ?.?.? Poseidon version the packages should be updated to: + e.g. "2.5.3". + --packageVersion VPART Part of the package version number in the + POSEIDON.yml file that should be updated: Major, + Minor or Patch (see https://semver.org). + --logText STRING Log text for this version in the CHANGELOG file. + --checksumAll Update all checksums. + --checksumGeno Update genotype data checksums. + --checksumJanno Update .janno file checksum. + --checksumSSF Update .ssf file checksum + --checksumBib Update .bib file checksum. + --newContributors DSL Contributors to add to the POSEIDON.yml file in the + form "[Firstname Lastname](Email address);...". + --jannoRemoveEmpty Reorder the .janno file and remove empty colums. + Remember to pair this option with --checksumJanno to + also update the checksum. + --onlyLatest Consider only the latest versions of packages, or the + groups and individuals within the latest versions of + packages, respectively. +``` + +
+ +It can be called with a lot of optional arguments. Note that `rectify` by default does **not** apply any changes if none of these arguments are set. Each change requires explicit opt-in. + +```bash +trident rectify -d ... -d ... \ + --poseidonVersion "X.X.X" \ + --packageVersion Major|Minor|Patch \ + --logText "short description of the update" \ + --checksumAll \ + --newContributors "[Firstname Lastname](Email address);..." \ + --jannoRemoveEmpty +``` + +The following arguments determine which fields of the POSEIDON.yml file should be modified: + +- `--poseidonVersion` allows a simple change of the `poseidonVersion` field in the POSEIDON.yml file. +- `--packageVersion` increments the package version number in the first, the second or the third position. It can optionally be called with `--logText`, which appends an entry to the CHANGELOG file for the respective package version update. `--logText` also creates a new CHANGELOG.md file if it does not exist yet. +- `--checksumGeno`, `--checksumJanno`, `--checksumSSF` and `--checksumBib` add or modify the respective checksum fields in the POSEIDON.yml file. `--checksumAll` is a wrapper to call all of them at once. +- `--newContributors` adds new contributors. + +As `rectify` reads and rewrites POSEIDON.yml files, it may change their inner order, layout or even content (e.g. if they have fields which are not in the POSEIDON.yml specification). Create a backup of the POSEIDON.yml file before running `rectify` if you are uncertain whether this might affect you negatively. + +`--jannoRemoveEmpty` is the first option that does not edit POSEIDON.yml, but .janno files. It removes empty columns from .janno files, that is, columns that only contain empty strings or `n/a` values. As part of this process it reorders the columns of the .janno file. Remember to pair this option with `--checksumJanno` or `--checksumAll` to automatically update the .janno checksum in the POSEIDON.yml file afterwards. 
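
The column filter behind `--jannoRemoveEmpty` can be pictured with a short Python sketch. It only illustrates the rule stated above (a column is empty if every cell is an empty string or `n/a`); the function name `remove_empty_columns` and the dictionary-based rows are assumptions, and the column reordering that `rectify` also performs is deliberately left out because its order is not described here.

```python
EMPTY_VALUES = ("", "n/a")


def remove_empty_columns(columns, rows):
    """Drop every column whose cells are all empty strings or "n/a"."""
    keep = [
        col for col in columns
        if any(row.get(col, "") not in EMPTY_VALUES for row in rows)
    ]
    # Rebuild the rows with the surviving columns only.
    return keep, [{col: row.get(col, "") for col in keep} for row in rows]


columns = ["Poseidon_ID", "Country", "Site"]
rows = [
    {"Poseidon_ID": "XXX011", "Country": "n/a", "Site": ""},
    {"Poseidon_ID": "XXX012", "Country": "n/a", "Site": "Cave"},
]
kept, cleaned = remove_empty_columns(columns, rows)
```

`Country` is dropped because it holds only `n/a`, while `Site` survives because one row has a value.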
+ +## List command + +`list` lists packages, groups, individuals and bibliography entries of local Poseidon package datasets, or of packages available in the archives on the web server. + +
+ Command line details + +```default +Usage: trident list ((-d|--baseDir DIR) | --remote [--remoteURL URL] + [--archive STRING]) + (--packages [--fullOutput] | --groups | --individuals + [--fullJanno | [-j|--jannoColumn COLNAME]] | + --bibliography [--fullBib | [-b|--bibField BIB-FIELD]]) + [--raw] [--onlyLatest] + + List packages, groups or individuals from local or remote Poseidon + repositories + +Available options: + -h,--help Show this help text + -d,--baseDir DIR A base directory to search for Poseidon packages. + --remote List packages from a remote server instead the local + file system. + --remoteURL URL URL of the remote Poseidon server. + (default: "https://server.poseidon-adna.org") + --archive STRING The name of the Poseidon package archive that should + be queried. If not given, then the query falls back + to the default archive of the server selected with + --remoteURL. See the archive documentation at + https://www.poseidon-adna.org/#/archive_overview for + a list of archives currently available from the + official Poseidon Web API. (default: Nothing) + --packages List all packages. + --fullOutput extend the output to include information contained + the POSEIDON.yml file + --groups List all groups, ignoring any group names after the + first as specified in the .janno-file. + --individuals List all individuals/samples. + --fullJanno output all Janno Columns + -j,--jannoColumn COLNAME List additional fields from the janno files, using + the .janno column heading name, such as "Country", + "Site", "Date_C14_Uncal_BP", etc... Can be given + multiple times + --bibliography output bibliography information for packages + --fullBib output all bibliography fields found in any + bibliography item + -b,--bibField BIB-FIELD List information from the given bibliography field, + for example "abstract" or "publisher". Can be given + multiple times. + --raw Return the output table as tab-separated values + without header. 
This is useful for piping into grep + or awk. + --onlyLatest Consider only the latest versions of packages, or the + groups and individuals within the latest versions of + packages, respectively. +``` + +
+ +To list packages from your local repositories, as seen above, you can run + +```bash +trident list -d ... -d ... --packages +``` + +This will yield a nicely formatted table of all packages, their versions and the number of individuals in them. With `--fullOutput` the table includes additional fields from the packages' POSEIDON.yml files. + +You can use `--remote` to show packages on the remote server. For example + +```bash +trident list --packages --remote --archive "community-archive" +``` + +will result in a view of all packages available in one of the public Poseidon archives. Just as for `fetch`, the `--archive` flag allows you to choose which public archive to query. + +Independent of whether you query a local or an online archive, you can list not just packages, but also groups, as defined in the third column of EIGENSTRAT `.ind` files (or the first/last column of a PLINK `.fam` file), and individuals with the flags `--groups` and `--individuals` (instead of `--packages`). `--bibliography` returns publication-wise bibliography information. + +The `--individuals` flag additionally provides a way to immediately access information from `.janno` files on the command line. This works with the `-j|--jannoColumn` option. For example, adding `-j Country -j Date_C14_Uncal_BP` to the commands above will add the `Country` and the `Date_C14_Uncal_BP` columns to the respective output tables. `--fullJanno` outputs all available columns. + +Analogously, with `--bibliography` additional fields from the .bib files can be added to the output table with `-b|--bibField ...` and `--fullBib`. `-b journal`, for example, adds a column with the publication's journal. + +Note that if you want a less ornate table, for example because you want to load it into Excel, or pipe it into another command that cannot deal with the table layout, you can use the `--raw` option to output that table as a simple tab-delimited stream. 
+ +## Summarise command + +`summarise` prints some general summary statistics for a given Poseidon dataset taken from the `.janno` files. + +
+ Command line details + +```default +Usage: trident summarise (-d|--baseDir DIR) [--raw] + + Get an overview over the content of one or multiple Poseidon packages + +Available options: + -h,--help Show this help text + -d,--baseDir DIR A base directory to search for Poseidon packages. + --raw Return the output table as tab-separated values + without header. This is useful for piping into grep + or awk. +``` + +
+ +You can run it with + +```bash +trident summarise -d ... -d ... +``` + +which will show you context information like -- among others -- the number of individuals in the dataset, their sex distribution, the mean age of the samples or the mean coverage on the 1240K SNP array in a table. `summarise` depends on complete `.janno` files and will silently ignore missing information. + +You can use the `--raw` option to output the summary table in a simple, tab-delimited layout. + +## Survey command + +`survey` tries to indicate package completeness (mostly focused on `.janno` files) for Poseidon datasets. + +
+ Command line details + +```default +Usage: trident survey (-d|--baseDir DIR) [--raw] [--onlyLatest] + + Survey the degree of context information completeness for Poseidon packages + +Available options: + -h,--help Show this help text + -d,--baseDir DIR A base directory to search for Poseidon packages. + --raw Return the output table as tab-separated values + without header. This is useful for piping into grep + or awk. + --onlyLatest Consider only the latest versions of packages, or the + groups and individuals within the latest versions of + packages, respectively. +``` + +
+ +Running + +```bash +trident survey -d ... -d ... +``` + +will yield a table with one row for each package. See `trident survey -h` for a legend explaining what each cell of this table means. + +Again you can use the `--raw` option to output the survey table in a tab-delimited format. + +## Validate command + +`validate` checks Poseidon packages and individual package components for structural correctness. + +
+ Command line details + +```default +Usage: trident validate ((-d|--baseDir DIR) [--ignoreGeno] [--fullGeno] + [--ignoreDuplicates] [-c|--ignoreChecksums] + [--ignorePoseidonVersion] | + --pyml FILE | (-p|--genoOne FILE) | --genoFile FILE + --snpFile FILE --indFile FILE | + --bedFile FILE --bimFile FILE --famFile FILE | + --vcfFile FILE | --janno FILE | --ssf FILE | + --bib FILE) [--noExitCode] [--onlyLatest] + + Check Poseidon packages or package components for structural correctness + +Available options: + -h,--help Show this help text + -d,--baseDir DIR A base directory to search for Poseidon packages. + --ignoreGeno Ignore snp and geno file. + --fullGeno Test parsing of all SNPs (by default only the first + 100 SNPs are probed). + --ignoreDuplicates Do not stop on duplicated individual names in the + package collection. + -c,--ignoreChecksums Whether to ignore checksums. Useful for speedup in + debugging. + --ignorePoseidonVersion Read packages even if their poseidonVersion is not + compatible with trident. + --pyml FILE Path to a POSEIDON.yml file. + -p,--genoOne FILE One of the input genotype data files. Expects .bed, + .bed.gz, .bim, .bim.gz or .fam for PLINK, .geno, + .geno.gz, .snp, .snp.gz or .ind for EIGENSTRAT, + or.vcf or .vcf.gz for VCF. In case of EIGENSTRAT and + PLINK, the two other files must be in the same + directory and must have the same base name. If a + gzipped file is given, it is assumed that the file + pairs (.geno.gz, .snp.gz) or (.bim.gz, .bed.gz) are + both zipped, but not the .fam or .ind file. If a .ind + or .fam file is given, it is assumed that none of the + file triples is zipped. + --genoFile FILE Eigenstrat genotype matrix, optionally gzipped. + Accepted file endings are .geno, .geno.gz + --snpFile FILE Eigenstrat snp positions file, optionally gzipped. + Accepted file endings are .snp, .snp.gz + --indFile FILE Eigenstrat individual file. 
Accepted file endings are + .ind + --bedFile FILE Plink genotype matrix, optionally gzipped. Accepted + file endings are .bed, .bed.gz + --bimFile FILE Plink snp positions file, optionally gzipped. + Accepted file endings are .bim, .bim.gz + --famFile FILE Plink individual file. Accepted file endings are .fam + --vcfFile FILE VCF (Variant Call Format) file, optionall gzipped. + Accepted file endings are .vcf, .vcf.gz + --janno FILE Path to a .janno file. + --ssf FILE Path to a .ssf file. + --bib FILE Path to a .bib file. + --noExitCode Do not produce an explicit exit code. + --onlyLatest Consider only the latest versions of packages, or the + groups and individuals within the latest versions of + packages, respectively. +``` + +
+ +You can run it with + +```bash +trident validate -d ... -d ... +``` + +to check packages and it will either report a success (`Validation passed`) or failure with specific error messages. + +Instead of validating entire packages with `-d` you can also apply it to individual files and package components: `--pyml` (POSEIDON.yml), `-p | --genoFile + --snpFile + --indFile | --bedFile + --bimFile + --famFile | --vcfFile` (genotype data), `--janno` (.janno file), `--ssf` (.ssf file) or `--bib` (.bib file). In this case `validate` attempts to read and parse the respective files individually and reports any issues it encounters. Note that this considers the files in isolation and does not include any cross-file consistency checks. + +When applied to packages, `validate` tries to ensure that each package adheres to the Poseidon package specification. Here is a list of what is checked: + +- Structural correctness of the POSEIDON.yml file. +- Presence of all files referenced in the POSEIDON.yml file. +- Full structural correctness of the `.janno`, `.ssf` and `.bib` files. +- Superficial correctness of genotype data files by parsing the first 100 SNPs. A full check that parses all SNPs can be triggered with the `--fullGeno` option. `--ignoreGeno`, on the other hand, causes `validate` to ignore the genotype data entirely, which speeds up the validation significantly. +- Correspondence of BibTeX keys in `.bib` and `.janno`. +- Correspondence of sample IDs in `.janno` and `.ssf`. +- Correspondence of sample and group IDs in `.janno` and genotype data files. + +In fact much of this validation already runs as part of the general package reading pipeline invoked for other `trident` subcommands (e.g. `forge`). `validate` is meant to be more thorough and brittle, though, and will explicitly fail if even a single package is broken. For special cases more flexibility can be enabled with the options `--ignoreDuplicates`, `--ignoreChecksums` and `--ignorePoseidonVersion`. 
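
To make the idea of a "correspondence" check between two files more concrete, here is a minimal Python sketch that compares two collections of identifiers, for example the sample IDs of a `.janno` and a `.ssf` file. It is not how `validate` is implemented, and it deliberately does not decide which kind of mismatch counts as an error, because that is not specified above; it only reports both directions. The function name `id_mismatches` is hypothetical.

```python
def id_mismatches(janno_ids, ssf_ids):
    """Report identifiers that occur in only one of the two files."""
    janno, ssf = set(janno_ids), set(ssf_ids)
    return {
        "only_in_janno": sorted(janno - ssf),
        "only_in_ssf": sorted(ssf - janno),
    }


report = id_mismatches(["XXX011", "XXX012", "XXX013"], ["XXX012", "XXX014"])
```

The same comparison applies to BibTeX keys in `.bib` and `.janno` files, or to sample IDs in `.janno` and genotype data.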
+ +Remember to run `validate` with `--debug` to get more information in case the default output is not sufficient to analyse an issue. diff --git a/version_table.md b/version_table.md index 639ab93f..7ff34b2f 100644 --- a/version_table.md +++ b/version_table.md @@ -2,6 +2,8 @@ The following table documents which versions of the Poseidon standard are compatible with which versions of the software tools. +**✓** indicates full support and **⚠** marks partial or imperfect support, but usually still to a well usable level. **`_`** means the tool may or may not work or break in unexpected ways depending on the input data. +