diff --git a/docs/v4.3.1/Citations.md b/docs/v4.3.1/Citations.md new file mode 100644 index 0000000..d600b86 --- /dev/null +++ b/docs/v4.3.1/Citations.md @@ -0,0 +1,30 @@ +# Citing Truvari + +English, A.C., Menon, V.K., Gibbs, R.A. et al. Truvari: refined structural variant comparison preserves allelic diversity. Genome Biol 23, 271 (2022). https://doi.org/10.1186/s13059-022-02840-6 + +# Citations + +List of publications using Truvari. Most of these are just pulled from a [Google Scholar Search](https://scholar.google.com/scholar?q=truvari). Please post in the [show-and-tell](https://github.com/spiralgenetics/truvari/discussions/categories/show-and-tell) to have your publication added to the list. +* [A robust benchmark for detection of germline large deletions and insertions](https://www.nature.com/articles/s41587-020-0538-8) +* [Leveraging a WGS compression and indexing format with dynamic graph references to call structural variants](https://www.biorxiv.org/content/10.1101/2020.04.24.060202v1.abstract) +* [Duphold: scalable, depth-based annotation and curation of high-confidence structural variant calls](https://academic.oup.com/gigascience/article/8/4/giz040/5477467?login=true) +* [Parliament2: Accurate structural variant calling at scale](https://academic.oup.com/gigascience/article/9/12/giaa145/6042728) +* [Learning What a Good Structural Variant Looks Like](https://www.biorxiv.org/content/10.1101/2020.05.22.111260v1.full) +* [Long-read trio sequencing of individuals with unsolved intellectual disability](https://www.nature.com/articles/s41431-020-00770-0) +* [lra: A long read aligner for sequences and contigs](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009078) +* [Samplot: a platform for structural variant visual validation and automated filtering](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02380-5) +* [AsmMix: A pipeline for high quality diploid de novo assembly](https://www.biorxiv.org/content/10.1101/2021.01.15.426893v1.abstract) +* [Accurate chromosome-scale haplotype-resolved assembly of human genomes](https://www.nature.com/articles/s41587-020-0711-0) +* [Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome](https://www.nature.com/articles/s41587-019-0217-9) +* [NPSV: A simulation-driven approach to genotyping structural variants in whole-genome sequencing data](https://academic.oup.com/bioinformatics/article-abstract/37/11/1497/5466452) +* [SVIM-asm: structural variant detection from haploid and diploid genome assemblies](https://academic.oup.com/bioinformatics/article/36/22-23/5519/6042701?login=true) +* [Readfish enables targeted nanopore sequencing of gigabase-sized genomes](https://www.nature.com/articles/s41587-020-00746-x) +* [stLFRsv: A Germline Structural Variant Analysis Pipeline Using Co-barcoded Reads](https://internal-journal.frontiersin.org/articles/10.3389/fgene.2021.636239/full) +* [Long-read-based human genomic structural variation detection with cuteSV](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02107-y) +* [An international virtual hackathon to build tools for the analysis of structural variants within species ranging from coronaviruses to vertebrates](https://f1000research.com/articles/10-246) +* [Paragraph: a graph-based structural variant genotyper for short-read sequence data](https://link.springer.com/article/10.1186/s13059-019-1909-7) +* [Genome-wide investigation identifies a rare copy-number variant burden associated with human spina bifida](https://www.nature.com/articles/s41436-021-01126-9) +* [TT-Mars: Structural Variants Assessment Based on Haplotype-resolved Assemblies](https://www.biorxiv.org/content/10.1101/2021.09.27.462044v1.abstract) +* [An ensemble deep learning framework to refine large deletions in linked-reads](https://www.biorxiv.org/content/10.1101/2021.09.27.462057v1.abstract) +* [MAMnet: detecting and genotyping deletions and insertions based on long reads and a deep learning approach](https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbac195/6587170)](https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbac195/6587170) +* [Automated filtering of genome-wide large deletions through an ensemble deep learning framework](https://www.sciencedirect.com/science/article/pii/S1046202322001712#b0110) diff --git "a/docs/v4.3.1/Collapse-on-Regions-with-a-High\342\200\220Density-of-SVs.md" "b/docs/v4.3.1/Collapse-on-Regions-with-a-High\342\200\220Density-of-SVs.md" new file mode 100644 index 0000000..1ea2421 --- /dev/null +++ "b/docs/v4.3.1/Collapse-on-Regions-with-a-High\342\200\220Density-of-SVs.md" @@ -0,0 +1,23 @@ +Some regions of the reference genome can give rise to a huge number of SVs. These regions are typically in telomeres or centromeres, especially on chm13. Furthermore, some multi-sample VCFs might contain samples with a high genetic distance from the reference can also create an unreasonable number of SVs. These 'high-density' regions can cause difficulties for `truvari collapse` when there are too many comparisons that need to be made. + +If you find `truvari collapse` 'freezing' during processing where it is no longer writing variants to the output VCFs, you should investigate if there are regions which should be excluded from analysis. To do this, first run: + +``` +truvari anno chunks input.vcf.gz > counts.bed +``` + +The `counts.bed` will have the chromosome, start, and end position of each chunk in the VCF as well as three additional columns. The 3rd column has the number of SVs inside the chunk while the 4th and 5th have a comma-separated counts of the number of variants in 'sub-chunks'. These sub-chunk counts correspond to advanced separation of the variants beyond just their window. The 4th is after separating by svtype and svlen, the 5th is re-chunking the svtype/svlen separation by distance. + +If you find spans in the `counts.bed` with a huge number of SVs, these are prime candidates for exclusion to speed up `truvari collapse`. To exclude them, you first subset the regions of interest and then use `bedtools` to create a bed file that will skip them. + +``` +# exclude regions with more than 30k SVs. You should test this threshold. +awk '$4 >= 30000' counts.bed > to_exclude.bed +bedtools complement -g genome.bed -i to_exclude.bed > to_analyize.bed + +truvari collapse --bed to_analyze.bed -i input.vcf.gz +``` + +When considering what threshold you would like to use, just looking at the 3rd column may not be sufficient as the 'sub-chunks' may be smaller and will therefore run faster. There also is no guarantee that a region with high SV density will be slow. For example, if there are 100k SVs in a window, but they all could collapse, it would only take O(N - 1) comparisons to perform the collapse. The problems arise when the 100k SVs have few redundant SVs and therefore requires O(N**2) comparisons. + +A conservative workflow for figuring out which regions to exclude would run `truvari collapse` without a `--bed`, wait for regions to 'freeze', and then check if the last variant written out by `truvari collapse` is just before a high-density chunk in the `counts.bed`. That region could then be excluded an a collapsing could be repeated until success. \ No newline at end of file diff --git a/docs/v4.3.1/Development.md b/docs/v4.3.1/Development.md new file mode 100644 index 0000000..ccb3849 --- /dev/null +++ b/docs/v4.3.1/Development.md @@ -0,0 +1,90 @@ +# Truvari API +Many of the helper methods/objects are documented such that developers can reuse truvari in their own code. To see developer documentation, visit [readthedocs](https://truvari.readthedocs.io/en/latest/). + +Documentation can also be seen using +```python +import truvari +help(truvari) +``` + +# docker + +A Dockerfile exists to build an image of Truvari. To make a Docker image, clone the repository and run +```bash +docker build -t truvari . +``` + +You can then run Truvari through docker using +```bash +docker run -v `pwd`:/data -it truvari +``` +Where `pwd` can be whatever directory you'd like to mount in the docker to the path `/data/`, which is the working directory for the Truvari run. You can provide parameters directly to the entry point. +```bash +docker run -v `pwd`:/data -it truvari anno svinfo -i example.vcf.gz +``` + +If you'd like to interact within the docker container for things like running the CI/CD scripts +```bash +docker run -v `pwd`:/data --entrypoint /bin/bash -it truvari +``` +You'll now be inside the container and can run FuncTests or run Truvari directly +```bash +bash repo_utils/truvari_ssshtests.sh +truvari anno svinfo -i example.vcf.gz +``` + +# CI/CD + +Scripts that help ensure the tool's quality. Extra dependencies need to be installed in order to run Truvari's CI/CD scripts. + +```bash +pip install pylint anybadge coverage +``` + +Check code formatting with +```bash +python repo_utils/pylint_maker.py +``` +We use [autopep8](https://pypi.org/project/autopep8/) (via [vim-autopep8](https://github.com/tell-k/vim-autopep8)) for formatting. + +Test the code and generate a coverage report with +```bash +bash repo_utils/truvari_ssshtests.sh +``` + +Truvari leverages github actions to perform these checks when new code is pushed to the repository. We've noticed that the actions sometimes hangs through no fault of the code. If this happens, cancel and resubmit the job. Once FuncTests are successful, it uploads an artifact of the `coverage html` report which you can download to see a line-by-line accounting of test coverage. + +# git flow + +To organize the commits for the repository, we use [git-flow](https://danielkummer.github.io/git-flow-cheatsheet/). Therefore, `develop` is the default branch, the latest tagged release is on `master`, and new, in-development features are within `feature/` + +When contributing to the code, be sure you're working off of develop and have run `git flow init`. + +# versioning + +Truvari uses [Semantic Versioning](https://semver.org/) and tries to stay compliant to [PEP440](https://peps.python.org/pep-0440/). As of v3.0.0, a single version is kept in the code under `truvari/__init__.__version__`. We try to keep the suffix `-dev` on the version in the develop branch. When cutting a new release, we may replace the suffix with `-rc` if we've built a release candidate that may need more testing/development. Once we've committed to a full release that will be pushed to PyPi, no suffix is placed on the version. If you install Truvari from the develop branch, the git repo hash is appended to the installed version as well as '.uc' if there are un-staged commits in the repo. + +# docs + +The github wiki serves the documentation most relevant to the `develop/` branch. When cutting a new release, we freeze and version the wiki's documentation with the helper utility `docs/freeze_wiki.sh`. + +# Creating a release +Follow these steps to create a release + +0) Bump release version +1) Run tests locally +2) Update API Docs +3) Change Updates Wiki +4) Freeze the Wiki +5) Ensure all code is checked in +6) Do a [git-flow release](https://danielkummer.github.io/git-flow-cheatsheet/) +7) Use github action to make a testpypi release +8) Check test release +```bash +python3 -m venv test_truvari +python3 -m pip install --index-url https://test.pypi.org/simple --extra-index-url https://pypi.org/simple/ truvari +``` +9) Use GitHub action to make a pypi release +10) Download release-tarball.zip from step #9’s action +11) Create release (include #9) from the tag +12) Checkout develop and Bump to dev version and README ‘commits since’ badge \ No newline at end of file diff --git a/docs/v4.3.1/Home.md b/docs/v4.3.1/Home.md new file mode 100644 index 0000000..e76a3cb --- /dev/null +++ b/docs/v4.3.1/Home.md @@ -0,0 +1,36 @@ +The wiki holds documentation most relevant for develop. For information on a specific version of Truvari, see [`docs/`](https://github.com/spiralgenetics/truvari/tree/develop/docs) + +Citation: +English, A.C., Menon, V.K., Gibbs, R.A. et al. Truvari: refined structural variant comparison preserves allelic diversity. Genome Biol 23, 271 (2022). https://doi.org/10.1186/s13059-022-02840-6 + +# Before you start +VCFs aren't always created with a strong adherence to the format's specification. + +Truvari expects input VCFs to be valid so that it will only output valid VCFs. + +We've developed a separate tool that runs multiple validation programs and standard VCF parsing libraries in order to validate a VCF. + +Run [this program](https://github.com/acenglish/usable_vcf) over any VCFs that are giving Truvari trouble. + +Furthermore, Truvari expects 'resolved' SVs (e.g. DEL/INS) and will not interpret BND signals across SVTYPEs (e.g. combining two BND lines to match a DEL call). A brief description of Truvari bench methodology is linked below. + +Finally, Truvari does not handle multi-allelic VCF entries and as of v4.0 will throw an error if multi-allelics are encountered. Please use `bcftools norm` to split multi-allelic entries. + +# Index + +- [[Updates|Updates]] +- [[Installation|Installation]] +- Truvari Commands: + - [[anno|anno]] + - [[bench|bench]] + - [[collapse|collapse]] + - [[consistency|consistency]] + - [[divide|divide]] + - [[ga4gh|ga4gh]] + - [[phab|phab]] + - [[refine|refine]] + - [[segment|segment]] + - [[stratify|stratify]] + - [[vcf2df|vcf2df]] +- [[Development|Development]] +- [[Citations|Citations]] \ No newline at end of file diff --git a/docs/v4.3.1/Installation.md b/docs/v4.3.1/Installation.md new file mode 100644 index 0000000..a8aae74 --- /dev/null +++ b/docs/v4.3.1/Installation.md @@ -0,0 +1,72 @@ +Recommended +=========== +For stable versions of Truvari, use pip +``` +python3 -m pip install truvari +``` +Specific versions can be installed via +``` +python3 -m pip install truvari==3.2.0 +``` +See [pypi](https://pypi.org/project/Truvari/#history) for a history of all distributed releases. + +Manual Installation +=================== +To build Truvari directly, clone the repository and switch to a specific tag. +``` +git clone https://github.com/ACEnglish/truvari.git +git checkout tags/v3.0.0 +python3 -m pip install . +``` + +To see a list of all available tags, run: +``` +git tag -l +``` + +If you have an older clone of the repository and don't see the version you're looking for in tags, make sure to pull the latest changes: +``` +git pull +git fetch --all --tags +``` + +Mamba / Conda +============= +NOTE!! There is a very old version of Truvari on bioconda that - for unknown reasons - supersedes the newer, supported versions. Users may need to specify to conda which release to build. See [this ticket](https://github.com/ACEnglish/truvari/issues/130#issuecomment-1196607866) for details. + +Truvari releases are automatically deployed to bioconda. +Users can follow instructions here (https://mamba.readthedocs.io/en/latest/installation.html) to install mamba. (A faster alternative conda compatible package manager.) + +Creating an environment with Truvari and its dependencies. +``` +mamba create -c conda-forge -c bioconda -n truvari truvari +``` + +Alternatively, see the [conda page](https://anaconda.org/bioconda/truvari) for details +``` +conda install -c bioconda truvari +``` + +Building from develop +===================== +The default branch is `develop`, which holds in-development changes. This is for developers or those wishing to try experimental features and is not recommended for production. Development is versioned higher than the most recent stable release with an added suffix (e.g. Current stable release is `3.0.0`, develop holds `3.1.0-dev`). If you'd like to install develop, repeat the steps above but without `git checkout tags/v3.0.0`. See [wiki](https://github.com/spiralgenetics/truvari/wiki/Development#git-flow) for details on how branching is handled. + +Docker +====== +See [Development](https://github.com/spiralgenetics/truvari/wiki/Development#docker) for details on building a docker container. + +edlib error +=========== +Some environments have a hard time installing edlib due to a possible cython bug ([source](https://bugs.launchpad.net/ubuntu/+source/cython/+bug/2006404)). If you seen an error such as the following: +``` +edlib.bycython.cpp:198:12: fatal error: longintrepr.h: No such file or directory + 198 | #include "longintrepr.h" +``` + +One method to prevent the error message is to run the following commands ([source](https://github.com/Martinsos/edlib/issues/212)) +```bash +python3 -m venv my_env +source my_env/bin/activate +python3 -m pip install --no-cache-dir cython setuptools wheel +EDLIB_USE_CYTHON=1 python3 -m pip install truvari --no-cache +``` \ No newline at end of file diff --git a/docs/v4.3.1/MatchIds.md b/docs/v4.3.1/MatchIds.md new file mode 100644 index 0000000..ca52076 --- /dev/null +++ b/docs/v4.3.1/MatchIds.md @@ -0,0 +1,74 @@ +MatchIds are used to tie base/comparison calls together in post-processing for debugging or other exploring. MatchIds have a structure of `{chunkid}.{callid}`. The chunkid is unique id per-chunk of calls. All calls sharing chunkid were within `--chunksize` distance and were compared. The callid is unique to a call in a chunk for each VCF. Because `bench` processes two VCFs (the base and comparison VCFs), the `MatchId` has two values: the first is the base variant's MatchId and the second the comparison variant's MatchId. + +For `--pick single`, the two MatchIds will be identical in the e.g. tp-base.vcf.gz and tp-comp.vcf.gz. However, for `--pick ac|multi`, it's possible to have cases such as one base variant matching to multiple comparison variants. That would give us MatchIds like: + +``` +# tp-base.vcf +MatchId=4.0,4.1 + +# tp-comp.vcf +MatchId=4.0,4.1 +MatchId=4.0,4.2 +``` + +This example tells us that the tp-comp variants are both pointing to `4.0` in tp-base. The tp-base variant has a higher match to the tp-comp `4.1` variant. + +One easy way to combine matched variants is to use `truvari vcf2df` to convert a benchmarking result to a pandas DataFrame and leverage pandas' merge operation. First, we convert the `truvari bench` result. + +```bash +truvari vcf2df --info --bench-dir bench_result/ data.jl +``` + +Next, we combine rows of matched variants: +```python +import joblib +import pandas as pd + +# Load the data +data = joblib.load("data.jl") + +# Separate out the variants from the base VCF and add new columns of the base/comp ids +base = data[data['state'].isin(['tpbase', 'fn'])].copy() +base['base_id'] = base['MatchId'].apply(lambda x: x[0]) +base['comp_id'] = base['MatchId'].apply(lambda x: x[1]) + +# Separate out the variants from the comparison VCF and add new columns of the base/comp ids +comp = data[data['state'].isin(['tp', 'fp'])].copy() +comp['base_id'] = comp['MatchId'].apply(lambda x: x[0]) +comp['comp_id'] = comp['MatchId'].apply(lambda x: x[1]) + +# Merge the base/comparison variants +combined = pd.merge(base, comp, left_on='base_id', right_on='comp_id', suffixes=('_base', '_comp')) + +# How many comp variants matched to multiple base variants? +counts1 = combined['base_id_comp'].value_counts() +print('multi-matched comp count', (counts1 != 1).sum()) + +# How many base variants matched to multiple comp variants? +counts2 = combined['comp_id_base'].value_counts() +print('multi-matched base count', (counts2 != 1).sum()) +``` + +The `MatchId` is also used by `truvari collapse`. However there are two differences. First, in the main `collapse` output, the relevant INFO field is named `CollapsedId`. Second, because collapse only has a single input VCF, it is much easier to merge DataFrames. To merge collapse results kept variants with those that were removed, we again need to convert the VCFs to DataFrames: + +```bash +truvari vcf2df -i kept.vcf.gz kept.jl +truvari vcf2df -i removed.vcf.gz remov.jl +``` + +Then we combine them: +```python +import joblib +import pandas as pd + +# Load the kept variants and set the index. +kept = joblib.load("kept.jl").set_index('CollapseId') + +# Load the removed variants and set the index. +remov = joblib.load("remov.jl") +remov['CollapseId'] = remov['MatchId'].apply(lambda x: x[0]) +remov.set_index('CollapseId', inplace=True) + +# Join the two sets of variants +result_df = kept.join(remov, how='right', rsuffix='_removed') +``` \ No newline at end of file diff --git a/docs/v4.3.1/Multi-allelic-VCFs.md b/docs/v4.3.1/Multi-allelic-VCFs.md new file mode 100644 index 0000000..fd0eb23 --- /dev/null +++ b/docs/v4.3.1/Multi-allelic-VCFs.md @@ -0,0 +1,11 @@ +Truvari only compares the first alternate allele in VCFs. If a VCF contains multi-allelic sites such as: + +``` +chr2 1948201 . T TACAACACGTACGATCAGTAGAC,TCAACACACAACACGTACGATCAGTAGAC .... +``` + +Then pre-process the VCFs with bcftools: + +```bash +bcftools norm -m-any base_calls.vcf.gz | bgzip > base_calls_split.vcf.gz +``` \ No newline at end of file diff --git a/docs/v4.3.1/Updates.md b/docs/v4.3.1/Updates.md new file mode 100644 index 0000000..1cb41cf --- /dev/null +++ b/docs/v4.3.1/Updates.md @@ -0,0 +1,297 @@ +# Truvari 4.3.1 +*September 9, 2024* +* `bench` + * Correctly filtering `ALT=*` alleles ([details](https://github.com/ACEnglish/truvari/discussions/219)) and monomorphic reference + * including test coverage this time +* `stratify` + * Default behavior is to count variants within ([#221](https://github.com/ACEnglish/truvari/issues/221)) +* `collapse` + * Faster sub-chunking operations by dropping use of pyintervaltree +* `anno chunks` + * New command for identifying windows with a high number of SVs ([details](https://github.com/ACEnglish/truvari/wiki/Collapse-on-Regions-with-a-High%E2%80%90Density-of-SVs)) + + +# Truvari 4.3.0 +*July 31, 2024* +* `refine` & `stratify` + * Fixed variant and bed boundary overlapping issue +* general + * Definition of variants within a region now includes replacement style from TR callers having an anchor base 1bp upstream of catalog/includebed regions + * Propagating MAFFT errors ([#204](https://github.com/ACEnglish/truvari/issues/204)) + * FIPS compliance ([#205](https://github.com/ACEnglish/truvari/issues/205)) + * Allow csi-indexed vcf files ([#209](https://github.com/ACEnglish/truvari/issues/209)) + * bcftools sort tempfile ([#213](https://github.com/ACEnglish/truvari/issues/213)) + +# Truvari 4.2.2 +*March 28, 2024* + +* `collapse` + * Fewer comparisons needed per-chunk on average + * Fixed `--chain` functionality ([details](https://github.com/ACEnglish/truvari/issues/196)) + * Fixed `--gt` consolidation of format fields +* `bench` + * Faster result at the cost of less complete annotations with `--short` flag +* `refine` + * Assures variants are sequence resolved before incorporating into consensus + * `bench --passonly --sizemax` parameters are used when building consensus for a region. Useful for `refine --use-original-vcfs` + * When a refined region has more than 5k variants, it is skipped and a warning of the region is written to the log + * Flag `--use-region-coords` now expands `--region` coordinates by 100bp (`phab --buffer` default) to allow variants to harmonize out of regions. +* general + * Dynamic bed/vcf parsing tries to choose faster of streaming/fetching variants + +# Truvari 4.2.1 +*February 6, 2024* +* `collapse` + * Faster handling of genotype data for `--gt` and `--keep common` +* general + * Fix to bed end position bug for including variants ([details](https://github.com/ACEnglish/truvari/issues/193)) + * Fix to Dockerfile +* `refine` + * Changes to `--recount` that accompany the fix to bed end positions. +* New command `ga4gh` to convert Truvari results into GA4GH truth/query VCFs with intermediates tags + +# Truvari 4.2 +*January 12, 2024* +* `collapse` + * New parameter `--gt` disallows intra-sample events to collapse ([details](https://github.com/ACEnglish/truvari/wiki/collapse#--gt)) + * New parameter `--intra` for consolidating SAMPLE information during intra-sample collapsing ([details](https://github.com/ACEnglish/truvari/wiki/collapse#--intra)) + * Preserve phasing information when available + * Faster O(n-1) algorithm instead of O(n^2) + * Faster sub-chunking strategy makes smaller chunks of variants needing fewer comparisons + * Fixed rare non-determinism error in cases where multiple variants are at the same position and equal qual/ac could be ordered differently. +* `phab` + * Correct sample handling with `--bSamples` `--cSamples` parameters + * Faster generation of consensus sequence + * Resolved 'overlapping' variant issue causing variants to be dropped + * New `poa` approach to harmonization. Faster than mafft but less accurate. Slower than wfa but more accurate. +* `bench` + * New, easier `MatchId` field to track which baseline/comparison variants match up [details](https://github.com/ACEnglish/truvari/wiki/MatchIds) + * `entry_is_present` method now considers partial missing variants (e.g. `./1`) as present + * Removed the 'weighted' metrics from `summary.json` +* `consistency` + * Fixed issue with counting duplicate records + * Added flag to optionally ignore duplicate records +* `anno svinfo` now overwrites existing SVLEN/SVTYPE info fields +* general + * Reduced fn matches for unroll sequence similarity by reporting maximum of multiple manipulations of variant sequence (roll up/down/none). Comes at a small, but reasonable, expense of some more fp matches. + * Bump pysam version + * Fixed bug in `unroll` sequence similarity that sometimes rolled from the wrong end + * Fixed bug for handling of None in ALT field + * `truvari.compress_index_vcf` forces overwriting of tabix index to prevent annoying crashes + + +# Truvari 4.1 +*August 7, 2023* + +* `bench` + * Creates `candidate.refine.bed` which hooks into `refine` on whole-genome VCFs [details](https://github.com/ACEnglish/truvari/wiki/bench#refining-bench-output) + * `--recount` for correctly assessing whole-genome refinement results + * experimental 'weighted' summary metrics [details](https://github.com/ACEnglish/truvari/wiki/bench#weighted-performance) + * Unresolved SVs (e.g. `ALT == `) are filtered when `--pctseq != 0` +* `phab` + * ~2x faster via reduced IO from operating in stages instead of per-region + * Removed most external calls (e.g. samtools doesn't need to be in the environment anymore) + * new `--align wfa` allows much faster (but slightly less accurate) variant harmonization + * increased determinism of results [detals](https://github.com/ACEnglish/truvari/commit/81a9ab85b91b0c530f9faeedfa4e7e0d68a5e8c2) +* `refine` + * Faster bed file intersection of `--includebed` and `--regions` + * Refine pre-flight check + * Correct refine.regions.txt end position from IntervalTree correction + * Better refine region selection with `--use-original` + * `--use-includebed` switched to `--use-region-coords` so that default behavior is to prefer the includebed's coordinates + * `--use-original-vcfs` to use the original pre-bench VCFs + * `refine.variant_summary.json` is cleaned of uninformative metrics +* `stratify` + * parallel parsing of truvari directory to make processing ~4x faster +* `msa2vcf` Fixed REPL decomposition bug to now preserve haplotypes +* `anno grpaf` - expanded annotation info fields +* `anno density` - new parameter `--stepsize` for sliding windows +* `collapse` + * New optional `--median-info` fields [#146](https://github.com/ACEnglish/truvari/issues/146) +* Minor updates + * Fix some `anno` threading on macOS [#154](https://github.com/ACEnglish/truvari/issues/154) + * Monomorphic/multiallelic check fix in `bench` + * `PHAB_WRITE_MAFFT` environment variable to facilitate updating functional test answer key + * Slightly slimmer docker container + +# Truvari 4.0 +*March 13, 2023* + +As part of the GIAB TR effort, we have made many changes to Truvari's tooling to enable comparison of variants in TR regions down to 5bp. Additionally, in order to keep Truvari user friendly we have made changes to the UI. Namely, we've updated some default parameters, some command-line arguments, and some outputs. There are also a few new tools and how a couple of tools work has changed. Therefore, we decided to bump to a new major release. If you're using Truvari in any kind of production capacity, be sure to test your pipeline before moving to v4.0. + +* New `refine` command for refining benchmarking results. [Details](refine) +* `bench` + * [Unroll](bench#unroll) is now the default sequence comparison approach. + * New `--pick` parameter to control the number of matches a variant can participate in [details](bench#controlling-the-number-of-matches) + * The `summary.txt` is now named `summary.json` + * Outputs parameters to `params.json` + * Output VCFs are sorted, compressed, and indexed + * Ambiguous use of 'call' in outputs corrected to 'comp' (e.g. `tp-call.vcf.gz` is now `tp-comp.vcf.gz`) + * Renamed `--pctsim` parameter to `--pctseq` + * Fixed bug where FP/FN weren't getting the correct, highest scoring match reported + * Fixed bug where `INFO/Multi` wasn't being properly applied + * Fixed bug where variants spanning exactly one `--includebed` region were erroneously being counted. + * Removed parameters: `--giabreport`, `--gtcomp`,`--multimatch`, `--use-lev`, `--prog`, `--unroll` +* `collapse` + * Renamed `--pctsim` parameter to `--pctseq` + * Runtime reduction by ~40% with short-circuiting during `Matcher.build_match` + * Better output sorting which may allow pipelines to be a little faster. +* `vcf2df` + * More granular sizebins for `[0,50)` including better handling of SNPs + * `--multisample` is removed. Now automatically add all samples with `--format` + * key index column removed and replaced by chrom, start, end. Makes rows easier to read and easier to work with e.g. pyranges +* `anno` + * Simplified ui. Commands that work on a single VCF and can stream (stdin/stdout) no longer use `--input` but a positional argument. + * Added `addid` +* `consistency` + * Slight speed improvement + * Better json output format +* `segment` + * Added `--passonly` flag + * Changed UI, including writing to stdout by default + * Fixed END and 1bp DEL bugs, now adds N to segmented variants' REF, and info fields SVTYPE/SVLEN +* API + * Began a focused effort on improving re-usability of Truvari code. + * Entry point to run benchmarking programmatically with [Bench object](https://truvari.readthedocs.io/en/latest/truvari.html#bench). + * Better development version tracking. [details](https://github.com/ACEnglish/truvari/commit/4bbf8d9a5be3b6a3f935afbd3a9b323811b676a0) + * Improved developer documentation. See [readthedocs](https://truvari.readthedocs.io/) +* general + * msa2vcf now left-trims and decomposes variants into indels + * Functional tests reorganization + * Fix for off-by-one errors when using pyintervaltree. See [ticket](https://github.com/ACEnglish/truvari/issues/137) + * Removed progressbar and Levenshtein dependencies as they are no longer used. + +# Truvari 3.5 +*August 27, 2022* + +* `bench` + * `--dup-to-ins` flag automatically treats SVTYPE==DUP as INS, which helps compare some programs/benchmarks + * New `--unroll` sequence comparison method for `bench` and `collapse` ([details](bench#unroll)) +* Major `anno trf` refactor (TODO write docs) including: + * annotation of DEL is fixed (was reporting the ALT copy numbers, not the sample's copy numbers after incorporating the ALT + * allow 'denovo' annotation by applying any TRF annotations found, not just those with corresponding annotations +* New `anno grpaf` annotates vcf with allele frequency info for groups of samples +* New `phab` for variant harmonization ([details](../phab)) +* backend + * `truvari.entry_size` returns the length of the event in the cases where len(REF) == len(ALT) (e.g. SNPs entry_size is 1) + * New key utility for `truvari.build_anno_trees` +* general + * Float metrics written to the VCF (e.g. PctSizeSimilarity) are rounded to precision of 4 + * Nice colors in some `--help` with [rich](https://github.com/Textualize/rich/) +* `divide` + * output shards are now more easily sorted (i.e. `ls divide_result/*.vcf.gz` will return the shards in the order they were made) + * compression/indexing of sub-VCFs in separate threads, reducing runtime +* user issues + * Monomorphic reference ALT alleles no longer throw an error in `bench` ([#131](https://github.com/ACEnglish/truvari/issues/131)) + * `SVLEN Number=A` fix ([#132](https://github.com/ACEnglish/truvari/issues/132)) + +# Truvari 3.4 +*July 7, 2022* + +* Improved performance of `consistency` (see [#127](https://github.com/ACEnglish/truvari/pull/127)) +* Added optional json output of `consistency` report +* Allow GT to be missing, which is allowed by VCF format specification +* TRF now uses `truvari.entry_variant_type` instead of trying to use `pysam.VariantRecord.info["SVLEN"]` +directly which allows greater flexibility. +* vcf2df now parses fields with `Number=\d` (e.g. 2+), which is a valid description +* `truvari.seqsim` is now case insensitive (see [#128](https://github.com/ACEnglish/truvari/issues/128)) +* Collapse option to skip consolidation of genotype information so kept alleles are unaltered +* `truvari anno dpcnt --present` will only count the depths of non ./. variants +* New collapse annotation `NumConsolidate` records how many FORMATs were consolidated +* Official [conda](https://anaconda.org/bioconda/truvari) support + +# Truvari 3.3 +*May 25, 2022* + +* New utilities `vcf_ranges` and `make_temp_filename` +* New annotations `dpcnt` and `lcr` +* Fixed a bug in `truvari collapse --keep` that prevented the `maxqual` or `common` options from working +* Increased determinism for `truvari collapse` so that in cases of tied variant position the longer allele is returned. If the alleles also have the same length, they are sorted alphabetically by the REF +* New `truvari bench --extend` functionality. See [discussion](https://github.com/ACEnglish/truvari/discussions/99) for details + +# Truvari 3.2 +*Apr 1, 2022* + +* Removed `truvari.copy_entry` for `pysam.VariantRecord.translate` a 10x faster operation +* Faster `truvari collapse` ([@c8b319b](https://github.com/ACEnglish/truvari/commit/c8b319b0e717a9e342f52e4a5e927f154eeb0e4a)) +* When building `MatchResult` between variants with shared start/end positions, we save processing work by skipping haplotype creation and just compare REFs/ALTs directly. +* Updated documentation to reference the paper https://doi.org/10.1101/2022.02.21.481353 +* New `truvari anno density` for identifying regions with 'sparse' and 'dense' overlapping SVs ([details](https://github.com/spiralgenetics/truvari/wiki/anno#truvari-anno-density)) +* Better `bench` genotype reporting with `summary.txt` having a `gt_matrix` of Base GT x Comp GT for all Base calls' best, TP match. +* New `truvari anno bpovl` for intersecting against tab-delimited files ([details](https://github.com/spiralgenetics/truvari/wiki/anno#truvari-anno-bpovl)) +* New `truvari divide` command to split VCFs into independent parts ([details](https://github.com/ACEnglish/truvari/wiki/divide)) +* Replaced `--buffer` parameter with `--minhaplen` for slightly better matching specificity +* Bugfix - `truvari anno trf` no longer duplicates entries spanning multple parallelization regions +* Bugfix - `collapse` MatchId/CollapseId annotation wasn't working +* Bugfixes - from [wwliao](https://github.com/wwliao) ([@4dd9968](https://github.com/ACEnglish/truvari/commit/4dd99683912236f433166889bb0b5667e9fa936d) [@ef2cfb3](https://github.com/ACEnglish/truvari/commit/ef2cfb366b60a5af4671d65d3ed987b08da72227)) +* Bugfixes - Issues [#107](https://github.com/ACEnglish/truvari/issues/107), [#108](https://github.com/ACEnglish/truvari/issues/108) + +# Truvari 3.1 +*Dec 22, 2021* + +* `bench` now annotates FPs by working a little differently. See [[bench|bench#methodology]] for details. +* Recalibrated TruScore and new reciprocal overlap measurement for sequence resolved `INS` ([details](https://github.com/spiralgenetics/truvari/discussions/92)) +* Match objects are now usable via the SDK. See [#94](https://github.com/spiralgenetics/truvari/discussions/94) for an example of using Truvari programmatically +* `file_zipper` VCF iteration strategy (`GenomeTree` -> `RegionVCFIterator`) that improves speed, particularly when using `--includebed` +* `collapse` refactored to use Match object and for prettier code, cleaner output. +* `anno remap` now optionally adds `INFO` field of the location of the top N hits. +* An experimental tool `truvari segment` added to help SV association analysis. +* `vcf2df` now supports pulling `FORMAT` fields from multiple samples. +* `vcf2df` now adds `('_ref', '_alt')`, or `('_ref', '_het', '_hom')` for `INFO,Number=[R|G]` fields, respectively. +* Improved documentation, including http://truvari.readthedocs.io/ for developers. +* Increasing/diversifying test coverage exposed minor bugs which were fixed. +* `bench --no-ref --cSample` bug fixes. +* Minor usability feature implemented in `help_unknown_cmd`. + +# Truvari 3.0 +*Sep 15, 2021* + +As Truvari's adoption and functionality grows, we decided to spend time working on sustainability and performance of the tool. Multiple [Actions](https://github.com/spiralgenetics/truvari/actions) for CI/CD have been added. Many components have been refactored for speed, and other 'cruft' code has been removed. Some of these changes (particularly the switch to using edlib for sequence similarity) affects the results. Therefore, we've bumped to a new major release version. + +* Working on speed improvements +* Added edlib as the default when calculating pctseq_sim, keeping Levenstein as an option (`--use-lev`). +* `truvari bench` summary's gt_precision/gt_recall are replaced by gt_concordance, which is just the percent of TP-comp calls with a concordant genotype. `--no-ref` has better functionality. `--giabreport` is different. +* Added `—keep common` to `truvari collapse`, which allows one to choose to keep the allele with the highest MAC. +* `truvari collapse --hap` wasn't working correctly. The assumptions about the calls being phased wasn't being +properly used (e.g. don't collapse 1|1) and the NumCollapsed was being populated before the single-best +match was chosen. The latter is a reporting problem, but the former had an effect on the results with +~3% of collapsed calls being mis-collapsed. +* `truvari anno trf` is now faster and simpler in its approach and whats reported.. and hopefully more useful. +* `truvari anno grm` has min_size and regions arguments added. +* truv2df has become `truvari vcf2df` where the default is vcf conversion with options to run on a `truvari bench` output directory. It also allows a specific sample to be parsed with `--format` and better Number=A handling. +* NeighId added to `truvari anno numneigh`, which works like bedtools cluster. +* The method af_calc now makes MAC/AC. +* Added 'partial' to `truvari anno remap`. +* Added `truvari anno svinfo`. +* Removed `truvari stats` as `truvari vcf2df` is better and began building [community-driven summaries](https://github.com/spiralgenetics/truvari/discussions/categories/vcf2df-recipes). +* Ubiquitous single version. +* Added a Dockerfile and instructions for making a Truvari docker container. +* Code and repository cleaning. +* Github actions for automated pylint, testing, and releases to pypi. +* Preserving per-version documentation from the wiki in `docs/`. + + +# Truvari 2.1 +*Jan 27, 2021* + +We've expanded and improved Truvari's [annotations](https://github.com/spiralgenetics/truvari/wiki/anno). We've added an [SV "collapsing" tool](https://github.com/spiralgenetics/truvari/wiki/collapse). And we've added a way to [turn VCFs into pandas DataFrames](https://github.com/spiralgenetics/truvari/wiki/truv2df) easily for downstream analysis/QC. + +# Truvari 2.0 +*May 14, 2020* + +After performing a drastic code refactor, we were able to create several helper methods from Truvari's core functionality around SV comparisons and VCF manipulations. This reusable code gave us an opportunity to create tools relevant for SV analysis. + +Truvari now contains multiple subcommands. In addition to the original benchmarking functionality (`truvari bench`), Truvari can generate SV relevant summary statistics, compute consistency of calls within VCFs, and we've begun to develop annotations for SVs. Details on these tools are on the [WIKI](https://github.com/spiralgenetics/truvari/wiki). + +We are committed to continually improving Truvari with the hopes of advancing the study and analysis of structural variation. + +# Truvari 1.3 +*September 25th, 2019* + +Truvari has some big changes. In order to keep up with the o deement of Python 2.7 https://pythonclock.org/ +We're now only supporting Python 3. + +Additionally, we now package Truvari so it and its dependencies can be installed directly. See Installation +below. This will enable us to refactor the code for easier maintenance and reusability. + +Finally, we now automatically report genotype comparisons in the summary stats. \ No newline at end of file diff --git a/docs/v4.3.1/anno.md b/docs/v4.3.1/anno.md new file mode 100644 index 0000000..5583590 --- /dev/null +++ b/docs/v4.3.1/anno.md @@ -0,0 +1,494 @@ + +Truvari annotations: +* [gcpct](anno#truvari-anno-gcpct) - GC Percent +* [gtcnt](anno#truvari-anno-gtcnt) - Genotype Counts +* [trf](anno#truvari-anno-trf) - Tandem Repeats +* [grm](anno#truvari-anno-grm) - Mappability +* [repmask](anno#truvari-anno-repmask) - Repeats +* [remap](anno#truvari-anno-remap) - Allele Remapping +* [hompct](anno#truvari-anno-hompct) - Homozygous Percent +* [numneigh](anno#truvari-anno-numneigh) - Number of Neighbors +* [svinfo](anno#truvari-anno-svinfo) - SVINFO Fields +* [bpovl](anno#truvari-anno-bpovl) - Annotation Intersection +* [density](anno#truvari-anno-density) - Call Density +* [dpcnt](anno#truvari-anno-dpcnt) - Depth (DP) and Alt-Depth (AD) Counts +* [lcr](anno#truvari-anno-lcr) - Low-complexity Regions +* [grpaf](anno#truvari-anno-grpaf) - Sample Group Allele-Frequency Annotations + +# truvari anno gcpct + +This will add an INFO tag `GCPCT` to each element in a VCF of the GC percent of the call's sequence. + +For deletions, this is the GC percent of the reference range of the call. For insertions, the ALT sequence is analyzed. +``` +usage: gcpct [-h] [-o OUTPUT] -r REFERENCE [input] + +Annotates GC content of SVs + +positional arguments: + input VCF to annotate (stdin) + +options: + -h, --help show this help message and exit + -o OUTPUT, --output OUTPUT + Output filename (stdout) + -r REFERENCE, --reference REFERENCE + Reference fasta +``` + +# truvari anno gtcnt +This will add an INFO tag `GTCNT` to each element in a VCF with the count of genotypes found across all samples. The value is a list of Counts of genotypes for the allele across all samples (UNK, REF, HET, HOM). This is most useful for pVCFs. + +``` +usage: gtcnt [-h] [-o OUTPUT] [input] + +Annotates GTCounts of alleles + +positional arguments: + input VCF to annotate (stdin) + +options: + -h, --help show this help message and exit + -o OUTPUT, --output OUTPUT + Output filename (stdout) +``` + +# truvari anno trf +Adds a tandem repeat annotation to sequence resolved Insertion/Deletion variants a VCF. + +### Annotations added +| Field Name | Description | +|------------|-------------------------------------------------------------| +| TRF | Entry hits a tandem repeat region | +| TRFdiff | ALT TR copy difference from reference | +| TRFrepeat | Repeat motif | +| TRFovl | Percent of ALT covered by TRF annotation | +| TRFstart | tart position of discovered repeat | +| TRFend | End position of discovered repeat | +| TRFperiod | eriod size of the repeat | +| TRFcopies | Number of copies aligned with the consensus pattern | +| TRFscore | Alignment score | +| TRFentropy | Entropy measure | +| TRFsim | Similarity of ALT sequence to generated motif faux sequence | + +### Details +Given a set of tandem repeat regions and a VCF, this annotate the tandem repeat motif and copy number change of insertions and deletions as expansions/contraction. The expected input catalog of tandem repeats is from a subset of columns in the Adotto TR catalog ([link](https://github.com/ACEnglish/adotto/blob/main/regions/DataDescription.md)). This file can be formatted for `truvari anno trf` via: +```bash +zcat adotto_TRregions_v1.1.bed.gz | cut -f1-3,18 | bgzip > anno.trf.bed.gz +tabix anno.trf.bed.gz +``` +For deletions, the tool simply finds the motif annotation with the highest overlap over the variant's boundaries. It then removes that sequence from the reference and calculates how many copies of the motif are removed with the formula `round(-(ovl_pct * svlen) / anno["period"], 1)`. If a deletion overlaps multiple motifs, the highest scoring motif is chosen based on higher reciprocal overlap percent first and TRF score second (see [code](https://github.com/ACEnglish/truvari/blob/2219f52850252c18dcd8c679da6644bb1cee5b68/truvari/annotations/trf.py#L29)]. + +For insertions, by default the tool first tries to estimate which motif is contained in the alternate sequence. For each overlapping annotation, the copy number difference of the motif is calculated via `copy_diff = len(entry.alts[0][1:]) / anno["period"]`. Next, a 'feaux sequence' is made from `copy_diff` number of the motif. If the sequence is above the `--motif-similarity` with the insertion sequence, that is considered the insertion's motif. If no estimate is above the `--motif-similarity`, the insertion is incorporated into the reference and TRF is run. If the discovered TRF hits match a motif in the tandem repeat regions file, that annotation is used. If the highest scoring TRF hit doesn't match the tandem repeats region file, the nominally de novo annotation is added to the insertion's vcf entry. + +``` +usage: trf [-h] -i INPUT [-o OUTPUT] [-e EXECUTABLE] [-T TRF_PARAMS] -r REPEATS -f REFERENCE [-s MOTIF_SIMILARITY] + [-m MIN_LENGTH] [-R] [--no-estimate] [-C CHUNK_SIZE] [-t THREADS] [--debug] + +Intersect vcf with reference tandem repeats and annotate +variants with the best fitting repeat motif and its copy number +relative to the reference + +options: + -h, --help show this help message and exit + -i INPUT, --input INPUT + VCF to annotate + -o OUTPUT, --output OUTPUT + Output filename (stdout) + -e EXECUTABLE, --executable EXECUTABLE + Path to tandem repeat finder (trf409.linux64) + -T TRF_PARAMS, --trf-params TRF_PARAMS + Default parameters to send to trf (3 7 7 80 5 40 500 -h -ngs) + -r REPEATS, --repeats REPEATS + Reference repeat annotations + -f REFERENCE, --reference REFERENCE + Reference fasta file + -s MOTIF_SIMILARITY, --motif-similarity MOTIF_SIMILARITY + Motif similarity threshold (0.9) + -m MIN_LENGTH, --min-length MIN_LENGTH + Minimum size of entry to annotate (50) + -R, --regions-only Only write variants within --repeats regions (False) + --no-estimate Skip INS estimation procedure and run everything through TRF. (False) + -C CHUNK_SIZE, --chunk-size CHUNK_SIZE + Size (in mbs) of reference chunks for parallelization (5) + -t THREADS, --threads THREADS + Number of threads to use (1) + --debug Verbose logging +``` + +# truvari anno grm + +For every SV, we create a kmer over the the upstream and downstream reference and alternate breakpoints. +We then remap that kmer to the reference genome and report alignment information. +This does not alter the VCF traditional annotations, but instead will create a pandas +DataFrame and save it to a joblib object. + +There are four queries made per-SV. For both reference (r), alternate (a) we create upstream (up) and downstream (dn) kmers. +So the columns are all prefixed with one of "rup_", "rdn_", "aup_", "adn_". + +In the alignment information per-query, there are three 'hit' counts: +- nhits : number of query hits +- dir_hits : direct strand hit count +- com_hits : compliment strand hit count + +The rest of the alignment information is reported by average (avg), maximum (max), and minimum (min) + +The suffixes are: +- q : mapping quality score of the hits +- ed : edit distance of the hits +- mat : number of matches +- mis : number of mismatches + +For example, "aup_avg_q", is the alternate's upstream breakend kmer's average mapping quality score. + +``` +usage: grm [-h] -i INPUT -r REFERENCE [-R REGIONS] [-o OUTPUT] [-k KMERSIZE] [-m MIN_SIZE] [-t THREADS] [--debug] + +Maps graph edge kmers with BWA to assess Graph Reference Mappability + +options: + -h, --help show this help message and exit + -i INPUT, --input INPUT + Input VCF + -r REFERENCE, --reference REFERENCE + BWA indexed reference + -R REGIONS, --regions REGIONS + Bed file of regions to parse (None) + -o OUTPUT, --output OUTPUT + Output dataframe (results.jl) + -k KMERSIZE, --kmersize KMERSIZE + Size of kmer to map (50) + -m MIN_SIZE, --min-size MIN_SIZE + Minimum size of variants to map (25) + -t THREADS, --threads THREADS + Number of threads (1) + --debug Verbose logging +``` + +# truvari anno repmask + +``` +usage: repmask [-h] -i INPUT [-o OUTPUT] [-e EXECUTABLE] [-m MIN_LENGTH] [-M MAX_LENGTH] [-t THRESHOLD] [-p PARAMS] [-T THREADS] + [--debug] + + Wrapper around RepeatMasker to annotate insertion sequences in a VCF + +options: + -h, --help show this help message and exit + -i INPUT, --input INPUT + VCF to annotate (None) + -o OUTPUT, --output OUTPUT + Output filename (/dev/stdout) + -e EXECUTABLE, --executable EXECUTABLE + Path to RepeatMasker (RepeatMasker) + -m MIN_LENGTH, --min-length MIN_LENGTH + Minimum size of entry to annotate (50) + -M MAX_LENGTH, --max-length MAX_LENGTH + Maximum size of entry to annotate (50000) + -t THRESHOLD, --threshold THRESHOLD + Threshold for pct of allele covered (0.8) + -p PARAMS, --params PARAMS + Default parameter string to send to RepeatMasker (-pa {threads} -qq -e hmmer -species human -lcambig + -nocut -div 50 -no_id -s {fasta}) + -T THREADS, --threads THREADS + Number of threads to use (1) + --debug Verbose logging +``` + +# truvari anno remap + +Taking the Allele’s sequence, remap it to the reference and annotate based on the closest alignment. + +![](https://github.com/spiralgenetics/truvari/blob/develop/imgs/remap_example.png) + +``` +usage: remap [-h] -r REFERENCE [-o OUTPUT] [-m MINLENGTH] [-t THRESHOLD] [-d DIST] [-H HITS] [--debug] [input] + +Remap VCF'S alleles sequence to the reference to annotate REMAP + +- novel : Allele has no hits in reference +- tandem : Allele's closest hit is within len(allele) bp of the SV's position +- interspersed : Allele's closest hit is not tandem +- partial : Allele only has partial hit(s) less than --threshold + +Which alleles and alignments to consider can be altered with: +- --minlength : minimum SV length to considred (50) +- --dist : For deletion SVs, do not consider alignments that hit within Nbp of the SV's position +(a.k.a. alignments back to the source sequence) (10) +- --threshold : Minimum percent of allele's sequence used by alignment to be considered (.8) + +positional arguments: + input Input VCF (/dev/stdin) + +options: + -h, --help show this help message and exit + -r REFERENCE, --reference REFERENCE + BWA indexed reference + -o OUTPUT, --output OUTPUT + Output VCF (/dev/stdout) + -m MINLENGTH, --minlength MINLENGTH + Smallest length of allele to remap (50) + -t THRESHOLD, --threshold THRESHOLD + Threshold for pct of allele covered to consider hit (0.8) + -d DIST, --dist DIST Minimum distance an alignment must be from a DEL's position to be considered (10)) + -H HITS, --hits HITS Report top hits as chr:start-end.pct (max 0) + --debug Verbose logging +``` +# truvari anno hompct + +``` +usage: hompct [-h] -i INPUT [-o OUTPUT] [-b BUFFER] [-m MINANNO] [-M MAXGT] [-c MINCOUNT] [--debug] + +Calcluate the the Hom / (Het + Hom) of variants in the region of SVs +Requires the VCF to contain SVs beside SNPs/Indels + +options: + -h, --help show this help message and exit + -i INPUT, --input INPUT + Compressed, indexed VCF to annotate + -o OUTPUT, --output OUTPUT + Output filename (stdout) + -b BUFFER, --buffer BUFFER + Number of base-pairs up/dn-stream to query (5000) + -m MINANNO, --minanno MINANNO + Minimum size of event to annotate (50) + -M MAXGT, --maxgt MAXGT + Largest event size to count for genotyping (1) + -c MINCOUNT, --mincount MINCOUNT + Minimum number of genotyping events to report HOMPCT (0) + --debug Verbose logging +``` + +# truvari anno numneigh + +``` +usage: numneigh [-h] [-o OUTPUT] [-r REFDIST] [-s SIZEMIN] [--passonly] [--debug] [input] + +For every call within size boundaries, +Add NumNeighbors info field of how many calls are within the distance +Add NeighId clustering field in the same chained neighborhood +For example, +:: + -- is a call, refdist is 2 + - - - - - - + nn: 1 2 1 0 1 1 + id: 0 0 0 1 2 2 + +positional arguments: + input VCF to annotate + +options: + -h, --help show this help message and exit + -o OUTPUT, --output OUTPUT + Output vcf (stdout) + -r REFDIST, --refdist REFDIST + Max reference location distance (1000) + -s SIZEMIN, --sizemin SIZEMIN + Minimum variant size to consider for annotation (50) + --passonly Only count calls with FILTER == PASS + --debug Verbose logging +``` + +# truvari anno svinfo + +Uses `truvari.entry_size` and `truvari.entry_variant_type` on entries >= `args.minsize` to add 'SVLEN' and ‘SVTYPE’ annotations to a VCF’s INFO. + +How SVLEN is determined: +- Starts by trying to use INFO/SVLEN +- If SVLEN is unavailable and ALT field is an SV (e.g. \, \, etc), use abs(vcf.start - vcf.end). The INFO/END tag needs to be available, especially for INS. +- Otherwise, return the size difference of the sequence resolved call using abs(len(vcf.REF) - len(str(vcf.ALT[0]))) + +How SVTYPE is determined: +- Starts by trying to use INFO/SVTYPE +- If SVTYPE is unavailable, infer if entry is a insertion or deletion by looking at the REF/ALT sequence size differences +- If REF/ALT sequences are not available, try to parse the \, \, etc from the ALT column. +- Otherwise, assume 'UNK' + +``` +usage: svinfo [-h] [-o OUTPUT] [-m MINSIZE] [input] + +Adds SVTYPE and SVLEN INFO fields + +positional arguments: + input VCF to annotate (stdin) + +options: + -h, --help show this help message and exit + -o OUTPUT, --output OUTPUT + Output filename (stdout) + -m MINSIZE, --minsize MINSIZE + Minimum size of entry to annotate (50) +``` + +# truvari anno bpovl + +After turning a tab-delimited annotation file into an IntervalTree, intersect the start/end and overlap of SVs. +The output is a light-weight pandas DataFrame saved with joblib. The columns in the output are: + +- vcf_key : Variant key from `truvari.entry_to_key` +- intersection : Type of intersection between the SV and the annotation + - start_bnd - SV's start breakpoints hits the annotation + - end_bnd - SV's end breakpoint hits the annotation + - overlaps - Annotation's start/end boundaries are completely within the SV + - contains - Annotation's start/end boundaries span the entire SV +- anno_key : Annotation file's line index + +The idea with this tool is to annotate variants against tab-delimited files, especially when there's a 1-to-N variant to annotations. This tool is useful when used in conjunction with `truvari vcf2df` and pandas DataFrames. + +For example, if we have a VCF of SVs and a GTF of genes/gene features from Ensmbl. Any SV may intersect multiple features, which doesn't lend itself well to directly annotating the VCF's INFO. After using `bpovl`, we'll use Truvari to convert the SVs to a DataFrame. + +```bash +truvari anno bpovl -i variants.vcf.gz -a genes.gtf.gz -o anno.jl -p gff +truvari vcf2df variants.vcf.gz variants.jl +``` + +We can then treat the files similar to a database and do queries and joins to report which variants intersect which annotations. + +```python +import joblib +from gtfparse import read_gtf +variants = joblib.load("variants.jl") +genes = read_gtf("genes.gtf.gz") +annos = joblib.load("anno.jl") +to_check = annos.iloc[0] + +print(to_check) +# vcf_key chr20:958486-958487.A +# intersection start_bnd +# anno_key 11 + +print(variants.loc[to_check['vcf_key']]) +# id None +# svtype INS +# ... etc + +print(annos.loc[to_check['anno_key']]) +# seqname chr20 +# source ensembl_havana +# feature exon +# start 958452 +# ... etc +``` + +Similar to tabix, `bpovl` has presets for known file types like bed and gff. But any tab-delimited file with sequence/chromosome, start position, and end position can be parsed. Just set the "Annotation File Arguments" to the 0-based column indexes. For example, a bed file +has arguments `-s 0 -b 1 -e 2 -c #`. + +``` +usage: bpovl [-h] -a ANNO -o OUTPUT [--sizemin SIZEMIN] [--spanmax SPANMAX] [-p {bed,gff}] [-c COMMENT] [-s SEQUENCE] [-b BEGIN] + [-e END] [-1] + [input] + +Creates intersection of features in an annotation file with SVs' breakpoints and overlap + +positional arguments: + input VCF to annotate (stdin) + +options: + -h, --help show this help message and exit + -a ANNO, --anno ANNO Tab-delimited annotation file + -o OUTPUT, --output OUTPUT + Output joblib DataFrame + --sizemin SIZEMIN Minimum size of variant to annotate (50) + --spanmax SPANMAX Maximum span of SVs to annotate (50000) + +Annotation File Arguments: + -p {bed,gff}, --preset {bed,gff} + Annotation format. This option overwrites -s, -b, -e, -c and -1 (None) + -c COMMENT, --comment COMMENT + Skip lines started with character. (#) + -s SEQUENCE, --sequence SEQUENCE + Column of sequence/chromosome name. (0) + -b BEGIN, --begin BEGIN + Column of start chromosomal position. (1) + -e END, --end END Column of end chromosomal position. (2) + -1, --one-based The position in the anno file is 1-based rather than 0-based. (False) +``` +# truvari anno density +Partitions a `--genome` into `--windowsize` regions and count how many variants overlap. Annotate +regions with no variants as 'sparse' and with greater than or equal to (mean + `--threshold` * standard +deviation) number of variants as 'dense'. Outputs a joblib DataFrame with columns +`chrom, start, end, count, anno`. + +``` +usage: density [-h] -g GENOME -o OUTPUT [-m MASK] [-w WINDOWSIZE] [-s STEPSIZE] [-t THRESHOLD] [input] + +Identify 'dense' and 'sparse' variant windows of the genome + +positional arguments: + input Input VCF (/dev/stdin) + +optional arguments: + -h, --help show this help message and exit + -g GENOME, --genome GENOME + Genome bed file + -o OUTPUT, --output OUTPUT + Output joblib DataFrame + -m MASK, --mask MASK Mask bed file + -w WINDOWSIZE, --windowsize WINDOWSIZE + Window size (10000) + -s STEPSIZE, --stepsize STEPSIZE + Window step size (10000) + -t THRESHOLD, --threshold THRESHOLD + std for identifying 'dense' regions (3) +``` + +# truvari anno dpcnt + +For multi-sample VCFs, it is often useful to have summarized depth (DP) information across samples per-variant. This adds a `INFO/DPCNT` with counts of how many samples have `FORMAT/DP` for each of the user-defined bins. Bins are incremented using `bisect` e.g. `pos = bisect.bisect(bins, dp); bins[pos] += 1; + +``` +usage: dpcnt [-h] [-b BINS] [--no-ad] [-p] [-o OUTPUT] [input] + +Quick utility to count how many samples have >= Nx coverage per-variant + +positional arguments: + input VCF to annotate (stdin) + +options: + -h, --help show this help message and exit + -b BINS, --bins BINS Coverage bins to bisect left the counts (0,5,10,15) + --no-ad Skip adding ADCNT bins + -p, --present Only count sites with present (non ./.) genotypes + -o OUTPUT, --output OUTPUT + Output filename (stdout) +``` + +# truvari anno lcr + +``` +usage: lcr [-h] [-o OUTPUT] [input] + +Annotate low complexity region entropy score for variants +Credit: https://jszym.com/blog/dna_protein_complexity/ + +positional arguments: + input VCF to annotate (stdin) + +options: + -h, --help show this help message and exit + -o OUTPUT, --output OUTPUT + Output filename (stdout) +``` + +# truvari anno grpaf + +Add INFO tags of allele frequency annotations for groups of samples. For every group in `--labels` tab-delimited file, calculate the AF,MAF,ExcHet,HWE,MAC,AC for the samples in the group. Adds INFO tags with suffix of the group identifier (e.g. `AF_EAS`). `--strict` will hard fail if there are samples in the `--labels` not present in the vcf header. + +``` +usage: grpaf [-h] [-o OUTPUT] -l LABELS [-t TAGS] [--strict] [--debug] [input] + +Add allele frequency annotations for subsets of samples + +positional arguments: + input VCF to annotate + +options: + -h, --help show this help message and exit + -o OUTPUT, --output OUTPUT + Output filename (stdout) + -l LABELS, --labels LABELS + Tab-delimited file of sample and group + -t TAGS, --tags TAGS Comma-separated list of tags to add from AF,MAF,ExcHet,HWE,MAC,AC,AN (all) + --strict Exit if sample listed in labels is not present in VCF (False) + --debug Verbose logging +``` \ No newline at end of file diff --git a/docs/v4.3.1/bench.md b/docs/v4.3.1/bench.md new file mode 100644 index 0000000..145bf97 --- /dev/null +++ b/docs/v4.3.1/bench.md @@ -0,0 +1,295 @@ + +Quick start +=========== +Run this command where base is your 'truth set' SVs and comp is the comparison set of SVs. +```bash +truvari bench -b base_calls.vcf -c comp_calls.vcf -o output_dir/ +``` + +Matching Parameters +=================== +Picking matching parameters can be more of an art than a science. It really depends on the precision of your callers and the tolerance you wish to allow them such that it is a fair comparison. + +For example, depth of coverage callers (such as CNVnator) will have very 'fuzzy' boundaries, and don't report the exact deleted sequence but only varying regions. So thresholds of `pctseq=0`, `pctsize=.5`, `pctovl=.5`, `refdist=1000` may seem fair. + +[BioGraph](https://github.com/spiralgenetics/biograph) and many long-read callers report precise breakpoints and full alternate allele sequences. When benchmarking those results, we want to ensure our accuracy by using the stricter default thresholds. + +If you're still having trouble picking thresholds, it may be beneficial to do a few runs of Truvari bench over different values. Start with the strict defaults and gradually increase the leniency. From there, you can look at the performance metrics and manually inspect differences between the runs to find out what level you find acceptable. Truvari is meant to be flexible for comparison. More importantly, Truvari helps one clearly report the thresholds used for reproducibility. + +Here is a rundown of each matching parameter. +| Parameter | Default | Definition | +|------------|---------|------------| +| refdist | 500 | Maximum distance comparison calls must be within from base call's start/end | +| pctseq | 0.7 | Edit distance ratio between the REF/ALT haplotype sequences of base and
comparison call. See "Comparing Sequences of Variants" below. | +| pctsize | 0.7 | Ratio of min(base_size, comp_size)/max(base_size, comp_size) | +| pctovl | 0.0 | Ratio of two calls' (overlapping bases)/(longest span) | +| typeignore | False | Types don't need to match to compare calls. | + +Below are matching parameter diagrams to illustrate (approximately) how they work. + +``` + █ = Deletion ^ = Insertion + +--refdist REFDIST (500) + Max reference location distance + + ACTGATCATGAT + |--████--| + █████ + + Calls are within reference distance of 2 + +--pctsize PCTSIZE (0.7) + Min pct allele size similarity + + ACTGATCATGA sizes + █████ -> 5bp + ████ -> 4bp + + variants have 0.8 size similarity + + +--pctovl PCTOVL (0.0) + Min pct reciprocal overlap + + ACTGATCATGA ranges + █████ [2,7) + ████ [4,8) + + variants have 0.6 reciprocial overlap + + +--pctseq PCTSEQ (0.7) + Min percent allele sequence similarity + + A-CTG-ACTG + ^ ^ haplotypes + | └ACTG -> CTGACTGA + └CTGA -> CTGACTGA + + haplotypes have 100% sequence similarity +``` + +Outputs +======= +Truvari bench writes the following files to the `--output` directory. +| File | Description | +|----------------------|---------------------------------------------| +| tp-base.vcf.gz | True positive calls form the base VCF | +| tp-comp.vcf.gz | True positive calls from the comparison VCF | +| fp.vcf.gz | False positive calls from comparison | +| fn.vcf.gz | False negative calls from base | +| summary.json | Json output of performance stats | +| params.json | Json output of parameters used | +| candidate.refine.bed | Bed file of regions for `refine` | +| log.txt | Run's log | + +summary.json +------------ +Stats generated by benchmarking are written to `summary.json`. + +| Metric | Definition | +|----------------|------------------------------------------------------------| +| TP-base | Number of matching calls from the base vcf | +| TP-comp | Number of matching calls from the comp vcf | +| FP | Number of non-matching calls from the comp vcf | +| FN | Number of non-matching calls from the base vcf | +| precision | TP-comp / (TP-comp + FP) | +| recall | TP-base / (TP-base + FN) | +| f1 | 2 * ((recall * precision) / (recall + precision)) | +| base cnt | Number of calls in the base vcf | +| comp cnt | Number of calls in the comp vcf | +| TP-comp_TP-gt | TP-comp with genotype match | +| TP-comp_FP-gt | TP-comp without genotype match | +| TP-base_TP-gt | TP-base with genotype match | +| TP-base_FP-gt | TP-base without genotype match | +| gt_concordance | TP-comp_TP-gt / (TP-comp_TP-gt + TP-comp_FP-gt) | +| gt_matrix | Base GT x Comp GT Matrix of all Base calls' best, TP match | +| weighted | Metrics weighed by variant sequence/size similarity | + +The `gt_matrix` is a table. For example: +```json +"gt_matrix": { + "(0, 1)": { + "(0, 1)": 500, + "(1, 1)": 10 + }, + "(1, 1)": { + "(1, 1)": 800, + "(0, 1)": 20 + } +} +``` +Represents -> +``` +comp (0,1) (1,1) +base +(0,1) 500 10 +(1,1) 20 800 +``` + +Added annotations +----------------- +The output vcfs are annotated with INFO fields and then sorted, compressed, and indexed inside of the output directory. + +| Anno | Definition | +|-------------------|-----------------------------------------------------------------------------------------------------------------| +| TruScore | Truvari score for similarity of match. `((pctseq + pctsize + pctovl) / 3 * 100)` | +| PctSeqSimilarity | Pct sequence similarity between this variant and its closest match | +| PctSizeSimilarity | Pct size similarity between this variant and its closest match | +| PctRecOverlap | Percent reciprocal overlap percent of the two calls | +| StartDistance | Distance of the base call's start from comparison call's start | +| EndDistance | Distance of the base call's end from comparison call's end | +| SizeDiff | Difference in size of base and comp calls | +| GTMatch | Base/comp calls' Genotypes match | +| MatchId | Id to help tie base/comp calls together {chunkid}.{baseid}.{compid} See [[MatchIds wiki\|MatchIds]] for details. | + + +Refining bench output +===================== +As described in the [[refine wiki|refine]], a limitation of Truvari bench is 1-to-1 variant comparison. However, `truvari refine` can harmonize the variants to give them more consistent representations. A bed file named `candidate.refine.bed` is created by `truvari bench` and holds a set of regions which may benefit from refinement. To use it, simply run +```bash +truvari bench -b base.vcf.gz -c comp.vcf.gz -o result/ +truvari refine --regions result/candidate.refine.bed \ + --reference reference.fasta \ + --recount --use-region-coords \ + result/ +``` +See [[refine wiki|refine]] for details. + +Comparing Sequences of Variants +=============================== + +Truvari has implemented two approaches to compare variant sequences. The default comparison is called 'unroll'. Optionally, a `--reference` can be provided and Truvari will use the reference context of a pair of variants for comparison. + +## Unroll +The method of giving a pair of calls the same reference context can be achieved using an 'unroll' method. For a formal description, see [this gist](https://gist.github.com/ACEnglish/1e7421c46ee10c71bee4c03982e5df6c). + +The main idea is that in order to move variants upstream/downstream, the reference sequence flanking the variant will need to be moved downstream/upstream respectively. Or, to say this another way, we can think of the alternate sequences as being circular instead of linear. This means that in order to move the variant e.g. 1bp downstream for an INS, we could remove the first base from the ALT and append it to the end. So in the 'ab' example used to describe "Reference context" below, we only need to unroll the insertion at a position by the distance between it and another variant e.g. the INS `ab` at POS 2 becomes identical to the INS `ba` at POS 1 by rolling `2-1 = 1` bases from the start to the end. + +This unroll method has a number of benefits and a couple of considerations, including: +* not needing a `--reference` for comparison, which saves I/O time +* increasing the number of correctly matching SVs +* decreasing the number of 'suspect' matches in smaller size regimes +* providing a simpler pattern between PctSizeSimilarity and PctSeqSimilarity + +## Reference context +For the reference context method, consider a hypothetical tandem repeat expansion of the reference sequence 'AB'. Here, we'll represent the 'insertion' sequence as lower-case 'ab', though it should be considered equivalent to 'AB'. Three equally valid descriptions of this +variant would be: + +```text +#POS INS Haplotype + 0 ab abAB + 1 ba AbaB + 2 ab ABab +``` + +Therefore, to compare the sequence similarity, Truvari builds the haplotypes over the range of a pair of calls' +`min(starts):max(ends)` before making the the sequence change introduced by the variants. In python, this line +looks like: + +``` python +hap1_seq = ref.get_seq(a1_chrom, start + 1, a1_start).seq + a1_seq + ref.get_seq(a1_chrom, a1_end + 1, end).seq +``` + +Where `a1_seq1` is the longer of the REF or ALT allele. + +## SVs without sequences + +If the base or comp vcfs do not have sequence resolved calls (e.g. ``, simply set `--pctseq=0` to turn off +sequence comparison. The `--reference` does not need to be provided when not using sequence comparison. If +`--pctseq != 0` and an unresolved SV is encountered, a warning will be raised and the variant will not be compared. + +Controlling the number of matches +================================= + +How many matches a variant is allowed to participate in is controlled by the `--pick` parameter. The available pickers are `single`, `ac`, and `multi`. + +* `single` (the default option) allows each variant to participate in up to one match. +* `ac` uses the genotype allele count to control how many matches a variant can have. This means a homozygous alternate variant can participate in two matches (its GT is 1/1 so AC=2). A heterozygous variant can only participate in one match (GT 0/1, AC=1). And, a homozygous reference variant cannot be matched. Note that missing genotypes are considered reference alleles and do not add to the AC e.g. (GT ./1, AC=1). +* `multi` variants can participate in all matches available. + +As an example, imagine we have three variants in a pVCF with two samples we want to compare. + +``` +CHROM POS ID REF ALT base comp +chr20 17785968 ID1 A ACGCGCGCGCG 1/1 1/0 +chr20 17785968 ID2 A ACGCGCGCGCGCG 0/0 0/1 +chr20 17785969 ID3 C CGCGCGCGCGCGC 0/0 1/1 +``` + +To compare samples inside the same vcf, we would use the command: +```bash +truvari bench -b input.vcf.gz -c input.vcf.gz -o output/ --bSample base --cSample comp --no-ref a +``` + +This VCF makes different results depending on the `--pick` parameter + +| Parameter | ID1 State | ID2 State | ID3 State | +|-----------|-----------|-----------|-----------| +| single | TP | FP | FP | +| ac | TP | TP | FP | +| multi | TP | TP | TP | + +--dup-to-ins +============ + +Most SV benchmarks only report DEL and INS SVTYPEs. The flag `--dup-to-ins` will interpret SVs with SVTYPE == DUP to SVTYPE == INS. Note that DUPs generally aren't sequence resolved (i.e. the ALT isn't a sequence) like INS. Therefore, `--dup-to-ins` typically should be used without sequence comparison via `--pctseq 0` + +Size filtering +============== + +`--sizemax` is the maximum size of a base or comparison call to be considered. + +`--sizemin` is the minimum size of a base call to be considered. + +`--sizefilt` is the minimum size of a comparison call that will be matched to base calls. It can +be less than `sizemin` for edge case variants. + +For example: Imagine `sizemin` is set at 50 and `sizefilt` at 30, and a 50bp base call is 98% similar to a 49bp comparison +call at the same position. + +These two calls could be considered matching. However, if we removed comparison calls less than `sizemin`, +we'd incorrectly classify the 50bp base call as a false negative. Instead, we allow comparison calls between `[sizefilt,sizemin)` to find matches. + +This has the side effect of artificially inflating specificity. For example, if that same 49bp call described +above were below the similarity threshold, it would not be classified as a FP since it is below the `sizemin` +threshold. So we're giving the call a better chance to be useful and less chance to be detrimental +to final statistics. + +Include Bed & VCF Header Contigs +================================ + +If an `--includebed` is provided, only base and comp calls contained within the defined regions are used +for comparison. This is similar to pre-filtering your base/comp calls using: + +```bash +(zgrep "#" my_calls.vcf.gz && bedtools intersect -u -a my_calls.vcf.gz -b include.bed) | bgzip > filtered.vcf.gz +``` + +with the exception that Truvari requires the start and the end to be contained in the same includebed region +whereas `bedtools intersect` does not. + +If an `--includebed` is not provided, the comparison is restricted to only the contigs present in the base VCF +header. Therefore, any comparison calls on contigs not in the base calls will not be counted toward summary +statistics and will not be present in any output vcfs. + +Extending an Include Bed +------------------------ +The option `--extend` extends the regions of interest (set in `--includebed` argument) by the given number of bases on each side, allowing base variants to match comparison variants that are just outside of the original region. If a comparison variant is in the extended regions it can potentially match a base variant that is in the original regions turning it to TP. Comparison variants in the extended regions that don't have a match are not counted as FP. This strategy is similar to the one implemented for size matching where only the base variants longer than sizemin (equal to 50 by default) are considered, but they are allowed to match shorter comparison variants sizefilt (30bp by default) or longer. + +See this [discussion](https://github.com/ACEnglish/truvari/discussions/99)for details. + +Methodology +=========== +Here is a high-level pseudocode description of the steps Truvari bench conducts to compare the two VCFs. +``` +* zip the Base and Comp calls together in sorted order +* create chunks of all calls overlapping within ±`--chunksize` basepairs +* make a |BaseCall| x |CompCall| match matrix for each chunk +* build a Match for each call pair in the chunk - annotate as TP if >= all thresholds +* if the chunk has no Base or Comp calls +** return them all as FNs/FPs +* use `--pick` method to sort and annotate variants with their best match +``` +![](https://github.com/acenglish/truvari/blob/develop/imgs/TruvariBenchMethod.png) \ No newline at end of file diff --git a/docs/v4.3.1/collapse.md b/docs/v4.3.1/collapse.md new file mode 100644 index 0000000..7ac22ad --- /dev/null +++ b/docs/v4.3.1/collapse.md @@ -0,0 +1,163 @@ +`collapse` is Truvari's approach to SV merging. After leveraging `bcftools` to merge VCFs, `truvari collapse` can then iterate over the calls and create clusters of SVs that match over the [provided thresholds](https://github.com/spiralgenetics/truvari/wiki/bench#matching-parameters). This is also useful when removing potentially redundant calls within a single sample. + +Example +======= +To start, we merge multiple VCFs (each with their own sample) and ensure there are no multi-allelic entries via: +```bash +bcftools merge -m none one.vcf.gz two.vcf.gz | bgzip > merge.vcf.gz +``` + +This will `paste` SAMPLE information between vcfs when calls have the exact same chrom, pos, ref, and alt. +For example, consider two vcfs: + + >> one.vcf: + chr1 1 ... GT 0/1 + chr1 5 ... GT 1/1 + >> two.vcf: + chr1 1 ... GT 1/1 + chr1 7 ... GT 0/1 + +`bcftools merge` creates: + + >> merge.vcf: + chr1 1 ... GT 0/1 1/1 + chr1 5 ... GT 1/1 ./. + chr1 7 ... GT ./. 0/1 + +This VCF can then be collapsed to allow 'fuzzier' matching than the exact merge just performed. + +```bash +truvari collapse -i merge.vcf.gz -o truvari_merge.vcf -c truvari_collapsed.vcf +``` + +For example, if we collapsed our example merge.vcf by matching any calls within 3bp, we'd create: + + >> truvari_merge.vcf + chr1 1 ... GT 0/1 1/1 + chr1 5 ... GT 1/1 0/1 + >> truvari_collapsed.vcf + chr1 7 ... GT ./. 0/1 + +--choose behavior +================= +When collapsing, the default `--choose` behavior is to take the `first` variant by position from a cluster to +be written to the output while the others will be placed in the collapsed output. +Other choosing options are `maxqual` (the call with the highest quality score) or `common` (the call with the highest minor allele count). + +Samples with no genotype information in the kept variant will be filled by the first +collapsed variant containing genotype information. + +--gt +==== +For some results, one may not want to collapse variants with conflicting genotypes from a single sample. With the `--gt all` parameter, variants which are present (non `0/0` or `./.`) in the same sample are not collapsed. With the `-gt het` parameter, only variants which are both heterozygous in a sample (e.g. `0/1` and `0/1`) are prevented from collapsing. The `--gt het` is useful for some SV callers which will redundantly call variants and typically genotype them all as `1/1`. + +--intra +======= +When a single sample is run through multiple SV callers, one may wish to consolidate those results. After the `bcftools merge` of the VCFs, there will be one SAMPLE column per-input. With `--intra`, collapse will consolidate the sample information so that only a single sample column is present in the output. Since the multiple callers may have different genotypes or other FORMAT fields with conflicting information, `--intra` takes the first column from the VCF, then second, etc. For example, if we have an entry with: +``` +FORMAT RESULT1 RESULT2 +GT:GQ:AD ./.:.:3,0 1/1:20:0,30 +``` +The `--intra` output would be: +``` +FORMAT RESULT1 +GT:GQ:AD 1/1:20:3,0 +``` +As you can see in this example, 1) The first sample name is the only one preserved. 2) conflicting FORMAT fields can be consolidated in a non-useful way (here the AD of `3,0` isn't informative to a `1/1` genotype). We're working to provide an API to help users write custom intra-sample consolidation scripts. + +--hap mode +========== +When using `--hap`, we assume phased variants from a single individual. Only the +single best matching call from the other haplotype will be collapsed, +and the consolidated genotype will become 1/1 + +For example, if we collapse anything at the same position: + + chr1 1 .. GT 0|1 + chr1 1 .. GT 1|0 + chr1 2 .. GT 1|0 + +will become: + + chr1 1 .. GT 1/1 + chr1 2 .. GT 1|0 + +--chain mode +============ +Normally, every variant in a set of variants that are collapsed together matches every other variant in the set. However, when using `--chain` mode, we allow 'transitive matching'. This means that all variants match to only at least one other variant in the set. In situations where a 'middle' variant has two matches that don't match each other, without `--chain` the locus will produce two variants whereas using `--chain` will produce one. +For example, if we have + + chr1 5 .. + chr1 7 .. + chr1 9 .. + +When we collapse anything within 2bp of each other, without `--chain`, we output: + + chr1 5 .. + chr1 9 .. + +With `--chain`, we would collapse `chr1 9` as well, producing + + chr1 5 .. + +Annotations +=========== +`collapse` produces two files. The output file has kept variants along with unanalyzed (< sizemin) variants. The collapsed file contains the variants that were collapsed into the kept variants. + +The output file has only two annotations added to the `INFO`. +- `CollapseId` - Identifier of the variant when comparing to the collapse outputs. +- `NumCollapsed` - Number of variants collapsed into this variant +- `NumConsolidated` - Number of samples' genotypes consolidated into this call's genotypes + +The collapsed file has all of the annotations added by [[bench|bench#definition-of-annotations-added-to-tp-vcfs]]. Note that `MatchId` is tied to the output file's `CollapseId`. See [MatchIds](https://github.com/spiralgenetics/truvari/wiki/MatchIds) for details. + +``` +usage: collapse [-h] -i INPUT [-o OUTPUT] [-c COLLAPSED_OUTPUT] [-f REFERENCE] [-k {first,maxqual,common}] [--debug] + [-r REFDIST] [-p PCTSIM] [-B MINHAPLEN] [-P PCTSIZE] [-O PCTOVL] [-t] [--use-lev] [--hap] [--chain] + [--no-consolidate] [--null-consolidate NULL_CONSOLIDATE] [-s SIZEMIN] [-S SIZEMAX] [--passonly] + +Structural variant collapser + +Will collapse all variants within sizemin/max that match over thresholds + +options: + -h, --help show this help message and exit + -i INPUT, --input INPUT + Comparison set of calls + -o OUTPUT, --output OUTPUT + Output vcf (stdout) + -c COLLAPSED_OUTPUT, --collapsed-output COLLAPSED_OUTPUT + Where collapsed variants are written (collapsed.vcf) + -f REFERENCE, --reference REFERENCE + Indexed fasta used to call variants + -k {first,maxqual,common}, --keep {first,maxqual,common} + When collapsing calls, which one to keep (first) + --debug Verbose logging + --hap Collapsing a single individual's haplotype resolved calls (False) + --chain Chain comparisons to extend possible collapsing (False) + --no-consolidate Skip consolidation of sample genotype fields (True) + --null-consolidate NULL_CONSOLIDATE + Comma separated list of FORMAT fields to consolidate into the kept entry by taking the first non-null + from all neighbors (None) + +Comparison Threshold Arguments: + -r REFDIST, --refdist REFDIST + Max reference location distance (500) + -p PCTSIM, --pctsim PCTSIM + Min percent allele sequence similarity. Set to 0 to ignore. (0.95) + -B MINHAPLEN, --minhaplen MINHAPLEN + Minimum haplotype sequence length to create (50) + -P PCTSIZE, --pctsize PCTSIZE + Min pct allele size similarity (minvarsize/maxvarsize) (0.95) + -O PCTOVL, --pctovl PCTOVL + Min pct reciprocal overlap (0.0) for DEL events + -t, --typeignore Variant types don't need to match to compare (False) + --use-lev Use the Levenshtein distance ratio instead of edlib editDistance ratio (False) + +Filtering Arguments: + -s SIZEMIN, --sizemin SIZEMIN + Minimum variant size to consider for comparison (50) + -S SIZEMAX, --sizemax SIZEMAX + Maximum variant size to consider for comparison (50000) + --passonly Only consider calls with FILTER == PASS +``` \ No newline at end of file diff --git a/docs/v4.3.1/consistency.md b/docs/v4.3.1/consistency.md new file mode 100644 index 0000000..17ba23b --- /dev/null +++ b/docs/v4.3.1/consistency.md @@ -0,0 +1,179 @@ + +In addition to looking at performance of a single set of variation against a baseline, one may wish to measure the consistency between multiple sets of variation. The tool `truvari consistency` can automatically create that result. + +Running +======= + +``` +usage: consistency [-h] [-d] [-j] [-o OUTPUT] VCFs [VCFs ...] + +Over multiple vcfs, calculate their intersection/consistency. + +Calls will match between VCFs if they have a matching key of: + CHROM POS ID REF ALT + +positional arguments: + VCFs VCFs to intersect + +optional arguments: + -h, --help show this help message and exit + -d, --no-dups Disallow duplicate SVs + -j, --json Output report in json format + -o OUTPUT, --output OUTPUT + Write tsv of variant keys and their flag +``` + +Example +======= + +```bash +truvari consistency fileA.vcf fileB.vcf fileC.vcf +``` + +Matching Entries +================ + +VCF entries will be considered matching if and only if they have an exact same key of `CHROM POS ID REF ALT`. Because of this stringency, it is recommend that you compare the tp-base.vcf or fn.vcf results from each individual VCF's Truvari output. The key and flags can be written with the `--output` option. + +Duplicates +========== + +If there are VCFs with duplicate keys, they are handled by appending a e.g. `.1` to the key. If you'd like to ignore duplicates, just add `--no-dups` + +Output Report +============= + +Below is an example report: + +```text +# +# Total 5534 calls across 3 VCFs +# +#File NumCalls +fileA.vcf 4706 +fileB.vcf 4827 +fileC.vcf 4882 +# +# Summary of consistency +# +#VCFs Calls Pct +3 3973 71.79% +2 935 16.90% +1 626 11.31% +# +# Breakdown of VCFs' consistency +# +#Group Total TotalPct PctOfFileCalls +111 3973 71.79% 84.42% 82.31% 81.38% +011 351 6.34% 7.27% 7.19% +101 308 5.57% 6.54% 6.31% +110 276 4.99% 5.86% 5.72% +001 250 4.52% 5.12% +010 227 4.10% 4.70% +100 149 2.69% 3.17% +``` + +At the top we see that we compared 5,534 unique variants between the 3 files, with fileC.vcf having the most calls at 4,882. + +The "Summary of consistency" shows us that 3,973 (71.79%) of all the calls are shared between the 3 VCFs, while 626 (11.31%) are only found in one of the VCFs. + +Reading the "Breakdown of VCFs' consistency", a `Group` is a unique key for presence (1) or absence (0) of a call within each of the listed `#Files`. For example: `Group 111` is calls present in all VCFs; `Group 011` is calls present in only the 2nd and 3rd VCFs (i.e. fileB.vcf and fileC.vcf). + +We see that `Group 101` has calls belonging to the 1st and 3rd `#Files` (i.e. fileA.vcf and fileC.vcf). This group has a total of 308 calls that intersect, or 5.57% of all calls in all VCFs. This 308 represents 6.54% of calls in fileA.vcf and 6.31% of calls in fileC.vcf. + +Finally, we see that fileA.vcf has the least amount of calls unique to it on the `Group 100` line. + +Json +==== +Below is a consistency report in json format. +```json +{ + "vcfs": [ + "repo_utils/test_files/variants/input1.vcf.gz", + "repo_utils/test_files/variants/input2.vcf.gz", + "repo_utils/test_files/variants/input3.vcf.gz" + ], + "total_calls": 3513, + "num_vcfs": 3, + "vcf_counts": { + "repo_utils/test_files/variants/input1.vcf.gz": 2151, + "repo_utils/test_files/variants/input2.vcf.gz": 1783, + "repo_utils/test_files/variants/input3.vcf.gz": 2065 + }, + "shared": [ + { + "vcf_count": 3, + "num_calls": 701, + "call_pct": 0.1995445488186735 + }, + { + "vcf_count": 2, + "num_calls": 1084, + "call_pct": 0.3085681753487048 + }, + { + "vcf_count": 1, + "num_calls": 1728, + "call_pct": 0.4918872758326217 + } + ], + "detailed": [ + { + "group": "111", + "total": 701, + "total_pct": 0.1995445488186735, + "repo_utils/test_files/variants/input1.vcf.gz": 0.32589493258949326, + "repo_utils/test_files/variants/input2.vcf.gz": 0.393157599551318, + "repo_utils/test_files/variants/input3.vcf.gz": 0.3394673123486683 + }, + { + "group": "001", + "total": 645, + "total_pct": 0.18360375747224594, + "repo_utils/test_files/variants/input1.vcf.gz": 0, + "repo_utils/test_files/variants/input2.vcf.gz": 0, + "repo_utils/test_files/variants/input3.vcf.gz": 0.31234866828087166 + }, + { + "group": "100", + "total": 598, + "total_pct": 0.17022487902077996, + "repo_utils/test_files/variants/input1.vcf.gz": 0.2780102278010228, + "repo_utils/test_files/variants/input2.vcf.gz": 0, + "repo_utils/test_files/variants/input3.vcf.gz": 0 + }, + { + "group": "101", + "total": 487, + "total_pct": 0.1386279533162539, + "repo_utils/test_files/variants/input1.vcf.gz": 0.22640632264063226, + "repo_utils/test_files/variants/input2.vcf.gz": 0, + "repo_utils/test_files/variants/input3.vcf.gz": 0.2358353510895884 + }, + { + "group": "010", + "total": 485, + "total_pct": 0.13805863933959578, + "repo_utils/test_files/variants/input1.vcf.gz": 0, + "repo_utils/test_files/variants/input2.vcf.gz": 0.27201346045989905, + "repo_utils/test_files/variants/input3.vcf.gz": 0 + }, + { + "group": "110", + "total": 365, + "total_pct": 0.10389980074010817, + "repo_utils/test_files/variants/input1.vcf.gz": 0.1696885169688517, + "repo_utils/test_files/variants/input2.vcf.gz": 0.2047111609646663, + "repo_utils/test_files/variants/input3.vcf.gz": 0 + }, + { + "group": "011", + "total": 232, + "total_pct": 0.06604042129234272, + "repo_utils/test_files/variants/input1.vcf.gz": 0, + "repo_utils/test_files/variants/input2.vcf.gz": 0.13011777902411667, + "repo_utils/test_files/variants/input3.vcf.gz": 0.11234866828087167 + } + ] +} +``` \ No newline at end of file diff --git a/docs/v4.3.1/divide.md b/docs/v4.3.1/divide.md new file mode 100644 index 0000000..3bbad3c --- /dev/null +++ b/docs/v4.3.1/divide.md @@ -0,0 +1,58 @@ +Divide a VCF into independent shards. + +Unfortunately, `pysam.VariantRecord` objects aren't pickle-able. This means that if we wanted to have Truvari leverage python's multiprocessing we'd need to make a custom VCF parser. However, the command `truvari divide` allows us to take an input VCF and divide it into multiple independent parts (or shards) which can the be processed over multiple processes. + +`truvari divide` works by parsing a VCF and splitting it into multiple, smaller sub-VCFs. If any variants are within `--buffer` base-pairs, they're output to the same sub-VCF. This allows variants in the same region which would need to be compared to one-another (see `--refdist`) to stay in the same sub-VCF. The `--min` parameter allows us to control the minimum number of variants per sub-VCF so that we don't make too many tiny VCFs. Once the sub-VCFs are created, we can process each independently through whatever truvari command. + +For example, let's say we want to run `truvari collapse` on a very large VCF with many variants and many samples. First, we divide the VCF: + +```bash +truvari divide big_input.vcf.gz sub_vcfs_directory/ +``` + +Inside of `sub_vcfs_directory/` we'll have multiple VCFs, which we can process with a simple bash script + +```bash +NJOBS=$(nproc) # use all processors by default +mkdir -p output_vcfs/ +mkdir -p collap_vcfs/ +mkdir -p logs/ + +for in_vcf in sub_vcfs_directory/*.vcf.gz +do + # Setup file names + base_name=$(basename $in_vcf) + base_name=${base_name%.vcf.gz} + output_name=output_vcfs/${base_name}.vcf + collap_name=collap_vcfs/${base_name}.vcf + log_name=logs/${base_name}.log + # Run the command and send it to the background + truvari collapse -i $in_vcf -o $output_name -c $collap_name -f reference.fa &> logs/${log_name}.log & + # If too many jobs are running, wait for one to finish + while [ $( jobs | wc -l ) -ge ${NJOBS} ] + do + sleep 5 + done +done +``` + +Obviously the logs and `while` loop are tricks for running on a single machine. If you have access to a cluster, I'm sure you can imagine how to create/submit the commands. + +``` +usage: divide [-h] [-b BUFFER] [-m MIN] [--no-compress] [-T THREADS] VCF DIR + +Divide a VCF into independent parts + +positional arguments: + VCF VCF to split + DIR Output directory to save parts + +options: + -h, --help show this help message and exit + -b BUFFER, --buffer BUFFER + Buffer to make mini-clusters (1000) + -m MIN, --min MIN Minimum number of entries per-vcf (100) + --no-compress Don't attempt to compress/index sub-VCFs + -T THREADS, --threads THREADS + Number of threads for compressing/indexing sub-VCFs (1) +``` \ No newline at end of file diff --git a/docs/v4.3.1/ga4gh.md b/docs/v4.3.1/ga4gh.md new file mode 100644 index 0000000..864c24c --- /dev/null +++ b/docs/v4.3.1/ga4gh.md @@ -0,0 +1,3 @@ +The ga4gh command will consolidate tp-base/fn and tp-comp/fp VCF files into a single truth and query vcf, respectively. Additional tags are added that are compliant with the [ga4gh intermediates](https://github.com/ga4gh/benchmarking-tools/blob/master/doc/ref-impl/intermediate.md). + +When consolidating a result which has `truvari refine`, the `truvari ga4gh --with-refine` flag will extract original variants from unharmonized regions and altered variants from harmonized regions inside the `phab_bench/` sub-directory. \ No newline at end of file diff --git a/docs/v4.3.1/phab.md b/docs/v4.3.1/phab.md new file mode 100644 index 0000000..bdcaec9 --- /dev/null +++ b/docs/v4.3.1/phab.md @@ -0,0 +1,157 @@ +Introduction +------------ + +Truvari's comparison engine can match variants using a wide range of thresholds. However, some alleles can produce radically different variant representations. We could dramatically lower our thresholds to identify the match, but this would cause variants from unidentical alleles to be falsely matched. + +This problem is easiest to conceptualize in the case of 'split' variants: imagine a pipeline calls a single 100bp DEL that can also be represented as two 50bp DELs. To match these variants, we would need to loosen our thresholds to `--pick multi --pctsim 0.50 --pctsize 0.50`. Plus, these thresholds leave no margin for error. If the variant caller erroneously deleted an extra base to make a 101bp DEL we would have to lower our thresholds even further. These thresholds are already too low because there's plenty of distinct alleles with >= 50% homology. + +So how do we deal with inconsistent representations? In an ideal world, we would simply get rid of them by harmonizing the variants. This is the aim of `truvari phab` + +`truvari phab` is designed to remove variant representation inconsistencies through harmonization. By reconstructing haplotypes from variants, running multiple-sequence alignment of the haplotypes along with the reference, and then recalling variants, we expect to remove discordance between variant representations and simplify the work required to perform variant comparison. + +Requirements +------------ +Since `truvari phab` uses mafft v7.505 via a command-line call, it expects it to be in the environment path. Download mafft and have its executable available in the `$PATH` [mafft](https://mafft.cbrc.jp/alignment/software/) + +Alternatively, you can use the Truvari [Docker container](Development#docker) which already has mafft ready for use. + +Also, you can use wave front aligner (pyWFA) or partial order alignment (pyabpoa). While wfa is the fastest approach, it will independently align haplotypes and therefore may produce less parsimonous aligments. And while poa is more accurate than wfa and faster than mafft, it is less accurate than mafft. + +Example +------- +As an example, we'll use Truvari's test files in `repo_utils/test_files/phab*` which were created from real data over a tandem repeat at GRCh38 chr1:26399065-26401053 and translated to a small test genome with coordinates chr1:1-1988. + +* `phab_base.vcf.gz` - an 86 sample squared-off pVCF +* `phab_comp.vcf.gz` - a single sample's VCF +* `phab_ref.fa` - a subset of the GRCh38 reference + +This dataset is interesting because the `HG002` sample in `phab_base.vcf.gz` uses the same sequencing experiment ([HPRC](https://github.com/human-pangenomics/HPP_Year1_Assemblies)) as the sample `syndip` in `phab_comp.vcf.gz`, but processed with a different pipeline. And as we will see, the pipeline can make all the difference. + +To start, let's use `truvari bench` to see how similar the variant calls are in this region. +```bash +truvari bench --base phab_base.vcf.gz \ + --comp phab_comp.vcf.gz \ + --sizemin 1 --sizefilt 1 \ + --bSample HG002 \ + --cSample syndip \ + --no-ref a \ + --output initial_bench +``` +This will compare all variants greater than 1bp ( `-S 1 -s 1` which includes SNPs) from the `HG002` sample to the `syndip` sample. We're also excluding any uncalled or reference homozygous sites with `--no-ref a`. The report in `initial_bench/summary.txt` shows: +```json +{ + "TP-base": 5, + "TP-comp": 5, + "FP": 2, + "FN": 22, + "precision": 0.7142857142857143, + "recall": 0.18518518518518517, + "f1": 0.2941176470588235, +} +``` + +These variants are pretty poorly matched, especially considering the `HG002` and `syndip` samples are using the same sequencing experiment. We can also inspect the `initial_bench/fn.vcf.gz` and see a lot of these discordant calls are concentrated in a 200bp window. Let's use `truvari phab` to harmonize the variants in this region. +```bash +truvari phab --base phab_base.vcf.gz \ + --comp phab_comp.vcf.gz \ + --bSample HG002 \ + --cSample syndip \ + --reference phab_ref.fa \ + --region chr1:700-900 \ + -o harmonized.vcf.gz +``` + +In our `harmonized.vcf.gz` we can see there are now only 9 variants. Let's run `truvari bench` again on the output to see how well the variants match after being harmonized. + +```bash +truvari bench -b harmonized.vcf.gz \ + -c harmonized.vcf.gz \ + -S 1 -s 1 \ + --no-ref a \ + --bSample HG002 \ + --cSample syndip \ + -o harmonized_bench/ +``` +Looking at `harmonized_bench/summary.txt` shows: +```json +{ + "TP-base": 8, + "TP-comp": 8, + "FP": 0, + "FN": 0, + "precision": 1.0, + "recall": 1.0, + "f1": 1.0, +} +``` +Now there is no difference between our two sets of variants in this region. + +For this variant call-set, `truvri phab` makes `truvari bench` overkill since the variants create identical haplotypes. In fact, we can benchmark simply by counting the genotypes. +```bash +$ bcftools query -f "[%GT ]\n" phab_result/output.vcf.gz | sort | uniq -c + 1 0/1 1/0 + 1 1/0 0/1 + 6 1/1 1/1 +``` +(We can ignore the phasing differences (`0/1` vs. `1/0`). These pipelines reported the parental alleles in a different order) + +MSA +--- + +If you read the `truvari phab --help` , you may have noticed that the `--comp` VCF is optional. This is by design so that we can also harmonize the variants inside a single VCF. By performing a multiple-sequence alignment across samples, we can better represent variation across a population. To see this in action, let's run `phab` on all 86 samples in the `repo_utils/test_files/phab_base.vcf.gz` +```bash +truvari phab -b phab_base.vcf.gz \ + -f phab_ref.fa \ + -r chr1:700-900 \ + -o msa_example.vcf.gz +``` + +As a simple check, we can count the number of variants before/after `phab`: +```bash +bcftools view -r chr1:700-900 phab_base.vcf.gz | grep -vc "#" +bcftools view -r chr1:700-900 msa_example.vcf.gz | grep -vc "#" +``` +The `160` original variants given to `phab` became just `60`. + +Better yet, these fewer variants occur on fewer positions: +```bash + +bcftools query -r chr1:700-900 -f "%POS\n" phab_base.vcf.gz | sort | uniq | wc -l +bcftools query -r chr1:700-900 -f "%POS\n" msa_example.vcf.gz | sort | uniq | wc -l +``` +This returns that the variants were over `98` positions but now sit at just `16` + +We can also observe changes in the allele frequency after running `phab`: +```bash +bcftools +fill-tags -r chr1:700-900 phab_base.vcf.gz | bcftools query -f "%AC\n" | sort -n | uniq -c +bcftools +fill-tags -r chr1:700-900 msa_example.vcf.gz | bcftools query -f "%AC\n" | sort -n | uniq -c +``` +The allele-count (AC) shows a 15% reduction in singletons and removal of all variants with an AF > 0.50 which would have suggested the reference holds a minor allele. +```txt + original phab + # AC # AC + 39 1 33 1 + 18 2 4 2 + 3 3 2 3 + 3 4 2 4 + 2 5 1 5 + ... + 3 69 1 35 + 1 89 1 40 + 8 109 1 53 + 1 132 1 56 + 1 150 1 81 +``` + +(TODO: pull the adotto TR region annotations and run this example through `truvari anno trf`. I bet we'll get a nice spectrum of copy-diff of the same motif in the `phab` calls.) + +`--align` +========= +By default, `phab` will make the haplotypes and use an external call `mafft` to perform a multiple sequence alignment between them and the reference to harmonize the variants. While this is the most accurate alignment technique, it isn't fast. If you're willing to sacrifice some accuracy for a huge speed increase, you can use `--align wfa`, which also doesn't require an external tool. Another option is `--align poa` which performs a partial order alignment which is faster than mafft but less accurate and slower than wfa but more accurate. However, `poa` appears to be non-deterministic which is not ideal for some benchmarking purposes. + +Limitations +----------- +* Creating and aligning haplotypes is impractical for very long sequences and maybe practically impossible for entire human chromosomes. Therefore, `truvari phab` is recommended to only be run on sub-regions. +* By giving the variants new representations, variant counts will likely change. +* Early testing on `phab` is on phased variants. While it can run on unphased variants, we can't yet recommend it. If regions contain unphased Hets or overlapping variants, it becomes more difficult to build a consensus sequence. So you can try out unphased variants, but proceed with caution. + diff --git a/docs/v4.3.1/refine.md b/docs/v4.3.1/refine.md new file mode 100644 index 0000000..eb2ef07 --- /dev/null +++ b/docs/v4.3.1/refine.md @@ -0,0 +1,142 @@ +As described in the [[phab|phab]] documentation, a constraint on Truvari `bench` finding matches is that there needs to be some consistency in how the variants are represented. To help automate the process of running Truvari `phab` on a benchmarking result and recomputing benchmarking performance on harmonized variants, we present the tool `refine`. + +Quick Start +=========== + +Basic +----- +After making a `bench` result: +```bash +truvari bench -b base.vcf.gz -c comp.vcf.gz -o result/ +``` +Use `refine` on the `result/` +```bash +truvari refine -r subset.bed -f ref.fa result/ +``` + +The regions spanned by `subset.bed` should be shorter and focused around the breakpoints of putative FNs/FPs. Haplotypes from these boundaries are fed into a realignment procedure which can take an extremely long time on e.g entire chromosomes. Also, the genotypes within these regions must be phased. + +Whole genome +------------ +After making a `bench` result: +```bash +truvari bench -b base.vcf.gz -c comp.vcf.gz --includebed hc_regions.bed -o result/ +``` +Use `refine` on the `result/` analyzing only the regions with putative FP/FN that would benefit from harmonization +```bash +truvari refine -R -U -f ref.fa --regions result/candidate.refine.bed result/ +``` + +Tandem Repeats +-------------- +For benchmarks such as the [GIAB TR](https://www.biorxiv.org/content/10.1101/2023.10.29.564632v1), a TR caller may analyze a different subset of regions. In order to avoid unnecessarily penalizing the performance with FNs from unanalyzed regions: + +```bash +truvari bench -b base.vcf.gz -c comp.vcf.gz --includebed hc_regions.bed -o result/ +``` + +Use `refine` on the `result/` analyzing only the `hc_regions.bed` covered by the TR caller's `tool_regions.bed` +``` +truvari refine -f ref.fa --regions tool_regions.bed result/ +``` + +Output +====== +* `refine.variant_summary.json` - result of re-evaluating calls within the specified regions. Same structure as [[summary.json|bench#summaryjson]] +* `refine.regions.txt` - Tab-delimited file with per-region variant counts +* `refine.region_summary.json` - Per-region performance metrics +* `phab_bench/` - Bench results on the subset of variants harmonized + +To see an example output, look at [test data](https://github.com/ACEnglish/truvari/tree/develop/answer_key/refine/refine_output_one) + +Using `refine.regions.txt` +========================== +| Column | Description | +| ----------------- | --------------------------------------- | +| chrom | Region's chromosome | +| start | Region's start | +| end | Region's end | +| in_tpbase | Input's True Positive base count | +| in_tp | Input's True Positive comparison count | +| in_fp | Input's false positive count | +| in_fn | Input's false negative count | +| refined | Boolean for if region was re-evaluated | +| out_tpbase | Output's true positive base count | +| out_tp | Output's true positive comparison count | +| out_fn | Outputs false positive count | +| out_fp | Output's false negative count | +| state | True/False state of the region | + + +Performance by Regions +====================== + +Because `truvari phab` can alter variant counts during harmonization, one may wish to assess the performance on a per-region basis rather than the per-variant basis. In the `refine.regions.txt`, a column `state` will have a TP/FN/FP value as defined by the following rules: + +```python +false_pos = (data['out_fp'] != 0) +false_neg = (data['out_fn'] != 0) +any_false = false_pos | false_neg + +true_positives = (data['out_tp'] != 0) & (data['out_tpbase'] != 0) & ~any_false + +true_negatives = (data[['out_tpbase', 'out_tp', 'out_fn', 'out_fp']] == 0).all(axis=1) + +baseP = (data['out_tpbase'] != 0) | (data['out_fn'] != 0) +compP = (data['out_tp'] != 0) | (data['out_fp'] != 0) +``` + +This logic has two edge cases to consider. 1) a region with at least one false-positive and one false-negative will be counted as both a false-positive and a false-negative. 2) Regions within `--refdist` may experience 'variant bleed' where they e.g. have an out_tp, but no other variants because a neighboring region actually contains the the corresponding `out_tpbase`. For the first case, we simply count the region twice and set its state in `refine.regions.txt` to "FP,FN". For the second case, we set the state to 'UNK' and ignore it when calculating the region summary. Future versions may figure out exactly how to handle (prevent?) 'UNK' regions. + +These by-region state counts are summarized and written to `refine.region_summary.json`. The definition of metrics inside this json are: +| Key | Definition | Formula | +|--------|----------------------------------------------|---------------------------------| +| TP | True Positive region count | | +| TN | True Negative region count | | +| FP | False Positive region count | | +| FN | False Negative region count | | +| base P | Regions with base variant(s) | | +| base N | Regions without base variant(s) | | +| comp P | Regions with comparison variant(s) | | +| comp N | Regions without comparison variant(s) | | +| PPV | Positive Predictive Value (a.k.a. precision) | TP / comp P | +| TPR | True Positive Rate (a.k.a. recall) | TP / base P | +| TNR | True Negative Rate (a.k.a. specificity) | TN / base N | +| NPV | Negative Predictive Value | TN / comp N | +| ACC | Accuracy | (TP + TN) / (base P + base N) | +| BA | Balanced Accuracy | (TPR + TNR) / 2 | +| F1 | f1 score | 2 * ((PPV * TPR) / (PPV + TPR)) | +| UND | Regions without an undetermined state | | + +Even though PPV is synonymous with precision, we use these abbreviated names when dealing with per-region performance in order to help users differentiate from the by-variant performance reports. + +`--align` +========= +By default, Truvari will make the haplotypes and use an external call `mafft` to perform a multiple sequence alignment between them and the reference to harmonize the variants. While this is the most accurate alignment technique, it isn't fast. If you're willing to sacrifice some accuracy for a huge speed increase, you can use `--align wfa`, which also doesn't require an external tool. Another option is `--align poa` which performs a partial order alignment which is faster than mafft but less accurate and slower than wfa but more accurate. However, `poa` appears to be non-deterministic which is not ideal for some benchmarking purposes. + +`--use-original-vcfs` +===================== + +By default, `refine` will use the base/comparison variants from the `bench` results `tp-base.vcf.gz`, `fn.vcf.gz`, `tp-comp.vcf.gz`, and `fp.vcf.gz` as input for `phab`. However, this contains a filtered subset of variants originally provided to `bench` since it removes variants e.g. below `--sizemin` or not `--passonly`. + +With the `--use-original-vcfs` parameter, all of the original calls from the input vcfs are fetched. This parameter is useful in recovering matches in situations when variants in one call set are split into two variants which are smaller than the minimum size analyzed by `bench`. For example, imagine a base VCF with a 20bp DEL, a comp VCF with two 10bp DEL, and `bench --sizemin 20` was used. `--use-original-vcfs` will consider the two 10bp comp variants during phab harmonization with the 20bp base DEL. + + +`--regions` +=========== + +This parameter specifies which regions to re-evaluate. If this is not provided, the original `bench` result's `--includebed` is used. If both `--regions` and `--includebed` are provided, the `--includebed` is subset to only those intersecting `--regions`. + +This parameter is helpful for cases when the `--includebed` is not the same set of regions that a caller analyzes. For example, if a TR caller only discovers short tandem repeats (STR), but a benchmark has TRs of all lengths, it isn't useful to benchmark against the non-STR variants. Therefore, you can run `bench` on the full benchmark's regions (`--includebed`), and automatically subset to only the regions analyzed by the caller with `refine --regions`. + +Note that the larger these regions are the slower MAFFT (used by `phab`) will run. Also, when performing the intersection as described above, there may be edge effects in the reported `refine.variant_summary.json`. For example, if a `--region` partially overlaps an `--includebed` region, you may not be analyzing a subset of calls looked at during the original `bench` run. Therefore, the `*summary.json` should be compared with caution. + +`--use-region-coords` +===================== + +When intersecting `--includebed` with `--regions`, use `--regions` coordinates. By default, `refine` will prefer the `--includebed` coordinates. However, the region's coordinates should be used when using the `candidates.refine.bed` to limit analysis to only the regions with putative FP/FN that would benefit from harmonization - for example, when performing whole genome benchmarking. + +`--reference` +============= + +By default, the reference is pulled from the original `bench` result's `params.json`. If a reference wasn't used with `bench`, it must be specified with `refine` as it's used by `phab` to realign variants. \ No newline at end of file diff --git a/docs/v4.3.1/segment.md b/docs/v4.3.1/segment.md new file mode 100644 index 0000000..953e582 --- /dev/null +++ b/docs/v4.3.1/segment.md @@ -0,0 +1,18 @@ +Segmentation: Normalization of SVs into disjointed genomic regions + +For SVs with a span in the genome (currently only DELs), split the overlaps into disjointed regions. This is an experimental tool that explores the possibility of assisting SV association analysis. + +This tool adds an INFO field `SEGCNT` which holds the number of original SVs that overlap the newly reported region. It also adds a FORMAT field `SEG`, which is the 'allele coverage' per-sample. For example, if a sample has two overlapping heterozygous deletions, the shared region will have `SEG=2`. If the two deletions were homozygous then `SEG=4`. + +In the below example, we have three deletions found across three samples. + +![](https://github.com/spiralgenetics/truvari/blob/develop/imgs/segment_example.png) + +The `segment` added annotations for the regions would then be: +| Region | INFO/SEGCNT | S1/SEG | S2/SEG | S3/SEG | +|--------|-------------|--------|--------|--------| +| A | 1 | 2 | 0 | 0 | +| B | 2 | 2 | 1 | 0 | +| C | 3 | 2 | 2 | 2 | +| D | 2 | 2 | 1 | 0 | +| E | 1 | 0 | 1 | 0 | \ No newline at end of file diff --git a/docs/v4.3.1/stratify.md b/docs/v4.3.1/stratify.md new file mode 100644 index 0000000..353dcab --- /dev/null +++ b/docs/v4.3.1/stratify.md @@ -0,0 +1,58 @@ +`stratify` is a helper utility for counting variants within bed regions which is essentially the same as running `bedtools intersect -c`. When working with benchmarking results, there are are four vcfs to count (tp-base, tp-comp, fn, fp). Instead of running bedtools four times and collating the results, `stratify` can be given a single `bench` result directory to generate the counts. + +For example: +```bash +$ truvari stratify input.bed bench/ +chrom start end tpbase tp fn fp +chr20 280300 280900 0 0 0 0 +chr20 100000 200000 1 1 0 0 +chr20 642000 642350 1 1 2 1 +``` + +The output from this can then be parsed to generate more details: + +```python +import pandas as pd +import truvari + +df = pd.read_csv("stratify.output.txt", sep='\t') + +# If the input.bed didn't have a header and so we couldn't use the `--header` parameter, we need to name columns +df.columns = ['chrom', 'start', 'end', 'tpbase', 'tp', 'fn', 'fp'] + +# Create the precision, recall, and f1 for each row +metrics = df[["tpbase", "tp", "fn", "fp"]].apply((lambda x: truvari.performance_metrics(*x)), axis=1) + +# metrics is now a DataFrame with a single column of tuples, lets separate them into columns +metrics = pd.DataFrame(metrics.to_list(), columns=["precision", "recall", "f1"]) + +# Extend the dataframe's columns +df = df.join(metrics) +df.head() +``` +Which gives the result: +``` + chrom start end tpbase tp fn fp precision recall f1 +0 chr20 135221 239308 1 1 0 0 1.00 1.00 1.000000 +1 chr20 260797 465632 3 3 3 1 0.75 0.50 0.600000 +2 chr20 465866 622410 1 1 0 0 1.00 1.00 1.000000 +3 chr20 623134 655257 1 1 3 1 0.50 0.25 0.333333 +4 chr20 708338 732041 1 1 1 0 1.00 0.50 0.666667 +``` + +``` +usage: stratify [-h] [-o OUT] [--header] [-w] [--debug] BED VCF + +Count variants per-region in vcf + +positional arguments: + BED Regions to process + VCF Truvari bench result directory or a single VCF + +optional arguments: + -h, --help show this help message and exit + -o OUT, --output OUT Output bed-like file + --header Input regions have header to preserve in output + -w, --within Only count variants contained completely within region boundaries + --debug Verbose logging +``` \ No newline at end of file diff --git a/docs/v4.3.1/vcf2df.md b/docs/v4.3.1/vcf2df.md new file mode 100644 index 0000000..14d9f06 --- /dev/null +++ b/docs/v4.3.1/vcf2df.md @@ -0,0 +1,81 @@ +We enjoy using [pandas](https://pandas.pydata.org/)/[seaborn](https://seaborn.pydata.org/) for python plotting, so we've made the command `truvari vcf2df`. This will turn a VCF into a pandas DataFrame and save it to a file using joblib. The resulting DataFrame will always have the columns: +* chrom: variant chromosome +* start: 0-based start from pysam.VariantRecord.start +* end: 0-based end from pysam.VariantRecord.stop +* id : VCF column ID +* svtype : SVTYPE as determined by `truvari.entry_variant_type` +* svlen : SVLEN as determined by `truvari.entry_size` +* szbin : SVLEN's size bin as determined by `truvari.get_sizebin` +* qual : VCF column QUAL +* filter : VCF column FILTER +* is_pass : boolean of if the filter is empty or PASS + +Optionally, `vcf2df` can attempt to pull `INFO` and `FORMAT` fields from the VCF and put each field into the DataFrame as a new column. For FORMAT fields, the VCF header definition's `Number` is considered and multiple columns may be added. For example, the `AD` field, typically holding Allele Depth has `Number=A`, indicating that there will be one value for each allele. Truvari assumes that all VCFs hold one allele per-line, so there are only 2 alleles described per-line, the reference and alternate allele. Therefore, two columns are added to the DataFrame, `AD_ref` and `AD_alt` corresponding to the 0th and 1st values from the AD field's list of values. Similarity, for PL (genotype likelihood) with `Number=G`, there's three values and columns are created named `PL_ref`, `PL_het`, `PL_hom`. + +After you've created your benchmarking results with `truvari bench`, you'll often want to plot different views of your results. `vcf2df --bench-dir` can parse a truvari output directory's multiple VCF files and add a 'state' column +* state : The truvari state assigned to the variant + * tpbase : Parsed from the tp-base.vcf + * tp : Parsed from the tp-comp.vcf + * fp : Parsed from the fp.vcf + * fn : Parsed from the fn.vcf + +The created DataFrame is saved into a joblib file, which can then be plotted as simply as: +```python +import joblib +import seaborn as sb +import matplotlib.pyplot as plt + +data = joblib.load("test.jl") +p = sb.countplot(data=data[data["state"] == 'tp'], x="szbin", hue="svtype", hue_order=["DEL", "INS"]) +plt.xticks(rotation=45, ha='right') +p.set(title="True Positives by svtype and szbin") +``` +![](https://github.com/spiralgenetics/truvari/blob/develop/imgs/truv2df_example.png) + +This enables concatenation of Truvari results across multiple benchmarking experiments for advanced comparison. For example, imagine there's multiple parameters used for SV discovery over multiple samples. After running `truvari bench` on each of the results with the output directories named to `params/sample/` and each converted to DataFrames with `truvari vcf2df`, we can expand/concatenate the saved joblib DataFrames with: + +```python +import glob +import joblib +import pandas as pd + +files = glob.glob("*/*/data.jl") +dfs = [] +for f in files: + params, sample, frame = f.split('/') + d = joblib.load(f) + d["params"] = params + d["sample"] = sample + dfs.append(d) +df = pd.concat(dfs) +joblib.dump(df, "results.jl") +``` + +To facilitate range queries, PyRanges is helpful. `vcf2df` results can be parsed quickly by pyranges with the command: +```python +result = pyranges.PyRanges(df.rename(columns={'chrom':"Chromosome", "start":"Start", "end":"End"})) +``` + +``` +usage: vcf2df [-h] [-b] [-i] [-f] [-s SAMPLE] [-n] [-S] [-c LVL] [--debug] VCF JL + +Takes a vcf and creates a data frame. Can parse a bench output directory + +positional arguments: + VCF VCF to parse + JL Output joblib to save + +optional arguments: + -h, --help show this help message and exit + -b, --bench-dir Input is a truvari bench directory + -i, --info Attempt to put the INFO fields into the dataframe + -f, --format Attempt to put the FORMAT fileds into the dataframe + -s SAMPLE, --sample SAMPLE + SAMPLE name to parse when building columns for --format + -n, --no-prefix Don't prepend sample name to format columns + -S, --skip-compression + Skip the attempt to optimize the dataframe's size + -c LVL, --compress LVL + Compression level for joblib 0-9 (3) + --debug Verbose logging +``` \ No newline at end of file