From 37ac52b1d2db473b4a207e27033e1776f66f09bd Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Armin=20To=CC=88pfer?= Date: Wed, 26 Jan 2022 16:05:45 +0100 Subject: [PATCH] Version 6.3.0 initial docs --- docs/changelog.md | 12 +++++- docs/faq/accuracy-vs-passes.md | 55 ++++++++++++------------- docs/faq/bam-output.md | 30 +++++++------- docs/faq/bioconda-binary.md | 4 +- docs/faq/chemistry.md | 32 +++++++------- docs/faq/kinetics.md | 25 ++++++----- docs/faq/low-complexity.md | 26 +++++------- docs/faq/missing-adapters.md | 28 +++++++++++++ docs/faq/mode-heteroduplex-filtering.md | 28 ++++++++----- docs/index.md | 2 +- 10 files changed, 141 insertions(+), 101 deletions(-) create mode 100644 docs/faq/missing-adapters.md diff --git a/docs/changelog.md b/docs/changelog.md index 8e5ac07d..e0cf5b39 100644 --- a/docs/changelog.md +++ b/docs/changelog.md @@ -6,8 +6,18 @@ nav_order: 99 # Version changelog -**6.2.0** +**6.3.0** * Upcoming SMRT Link release + * Heteroduplex finder + * Change file output, single file with multiple read groups + * Add HD detection for substitutions differences between strands + * Add missing adapter tags `ma` and `ac` + * Adhere to latest [PacBio BAM spec](https://pacbiofileformats.readthedocs.io/) + * Add `--subsample-clr-perc` and `--subsample-clr-file` to store a percentage of productive ZMWs as subreads + * Add `--fastq` as an additional output file + +6.2.0 + * SMRT Link v10.2 release * Improved low-complexity handling, better runtime and lower memory usage * Improved BAM merge step, up to 5x faster * Improved compute run time diff --git a/docs/faq/accuracy-vs-passes.md b/docs/faq/accuracy-vs-passes.md index 9e96adaa..a55c5135 100644 --- a/docs/faq/accuracy-vs-passes.md +++ b/docs/faq/accuracy-vs-passes.md @@ -5,24 +5,23 @@ title: Accuracy vs. passes --- ## What impacts the number and quality of HiFi reads that are generated? -The longer the polymerase read gets, more passes of the SMRTbell -are produced and consequently more evidence is accumulated per molecule. -This increase in evidence translates into higher consensus accuracy, as -depicted in the following plot: +The longer the polymerase read gets, more passes of the SMRTbell are produced +and consequently more evidence is accumulated per molecule. This increase in +evidence translates into higher consensus accuracy, as depicted in the following +plot:

## How is number of passes computed? -Each read is annotated with a `np` tag that contains the number of -full-length subreads used for polishing. Full-length subreads are flanked by -adapters and thus cover the full insert. -Since the first version of _ccs_, number of passes has only accounted for -full-length subreads. In version v3.3.0 windowing has been added, which -takes the minimum number of full-length subreads across all windows. -Starting with version v4.0.0, minimum has been replaced with mode to get a -better representation across all windows. Only subreads that pass the subread -length filter (please see next FAQ about filters) and were not dropped during -polishing are counted. +Each read is annotated with a `np` tag that contains the number of full-length +subreads used for polishing. Full-length subreads are flanked by adapters and +thus cover the full insert. Since the first version of _ccs_, number of passes +has only accounted for full-length subreads. In version v3.3.0 windowing has +been added, which takes the minimum number of full-length subreads across all +windows. Starting with version v4.0.0, minimum has been replaced with mode to +get a better representation across all windows. Only subreads that pass the +subread length filter (please see next FAQ about filters) and were not dropped +during polishing are counted. Similarly, the tag `ec` reports effective coverage, the average subread coverage across all windows. This metric includes all subreads, independent of being @@ -30,24 +29,24 @@ full- or partial-length subreads, that pass length filters and did not fail during polishing. In most cases `ec` will be roughly `np + 1`. ## Why do I get more yield if I increase `--min-passes`? -For versions newer than 3.0.0 and older than 4.2.0, we required that after -draft generation, at least `--min-passes` subreads map back to the draft. -Imagine the following scenario, a ZMW with 10 subreads generates a draft to which -only a single subread aligns. This draft is of low quality and does not -represent the ZMW, yet if you ask for `--min-passes 1`, this low-quality draft -is being used. Starting with version 4.2.0, we switch to an additional -percentage threshold of more than 50% aligning subreads to avoid this problem. -This fixes the majority of discrepancies for fewer than three passes. +For versions newer than 3.0.0 and older than 4.2.0, we required that after draft +generation, at least `--min-passes` subreads map back to the draft. Imagine the +following scenario, a ZMW with 10 subreads generates a draft to which only a +single subread aligns. This draft is of low quality and does not represent the +ZMW, yet if you ask for `--min-passes 1`, this low-quality draft is being used. +Starting with version 4.2.0, we switch to an additional percentage threshold of +more than 50% aligning subreads to avoid this problem. This fixes the majority +of discrepancies for fewer than three passes. Why do we have this problem at all, shouldn't the draft stage be robust enough? -Robustness comes with inherent speed trade-offs. We have a cascade of different draft -generators, from very fast and unstable to slow and robust. If a ZMW fails +Robustness comes with inherent speed trade-offs. We have a cascade of different +draft generators, from very fast and unstable to slow and robust. If a ZMW fails to generate a draft for a fast generator, it falls back multiple times until it reaches the slower and more robust generator. This approach is still much faster than always relying on the robust generator. ## Is there an upper limit on number of passes used? -Per default, _ccs_ uses at most the top 60 full-length passes after sorting -by median length. -Beyond this threshold, it has been shown that quality does not improve. -You can change this limit with `--top-passes`, whereas `0` means unlimited. +Per default, _ccs_ uses at most the top 60 full-length passes after sorting by +median length. Beyond this threshold, it has been shown that quality does not +improve. You can change this limit with `--top-passes`, whereas `0` means +unlimited. diff --git a/docs/faq/bam-output.md b/docs/faq/bam-output.md index 876d5c16..1cedeb1a 100644 --- a/docs/faq/bam-output.md +++ b/docs/faq/bam-output.md @@ -6,20 +6,22 @@ title: BAM output ## What BAM tags are generated? -| Tag | Type | Description | -|:----:|:-----:|--------------------------------------------------------------------------------------------| -| `ec` | `f` | [Effective coverage](/faq/accuracy-vs-passes#how-is-number-of-passes-computed) | -| `fi` | `B,C` | [Forward IPD (codec V1)](/faq/kinetics) | -| `fn` | `i` | [Forward number of complete passes (zero or more)](/faq/kinetics) | -| `fp` | `B,C` | [Forward PulseWidth (codec V1)](/faq/kinetics) | -| `np` | `i` | [Number of full-length subreads](/faq/accuracy-vs-passes#how-is-number-of-passes-computed) | -| `ri` | `B,C` | [Reverse IPD (codec V1)](/faq/kinetics) | -| `rn` | `i` | [Reverse number of complete passes (zero or more)](/faq/kinetics) | -| `rp` | `B,C` | [Reverse PulseWidth (codec V1)](/faq/kinetics) | -| `rq` | `f` | [Predicted average read accuracy](/how-does-ccs-work#9-qv-calculation) | -| `sn` | `B,f` | Signal-to-noise ratios for each nucleotide | -| `zm` | `i` | ZMW hole number | -| `RG` | `z` | Read group | +| Tag | Type | Description | +| :---: | :---: | ------------------------------------------------------------------------------------------ | +| `ac` | `B,i` | [Detected and missing adapter counts](/faq/missing-adapters) | +| `ec` | `f` | [Effective coverage](/faq/accuracy-vs-passes#how-is-number-of-passes-computed) | +| `fi` | `B,C` | [Forward IPD (codec V1)](/faq/kinetics) | +| `fn` | `i` | [Forward number of complete passes (zero or more)](/faq/kinetics) | +| `fp` | `B,C` | [Forward PulseWidth (codec V1)](/faq/kinetics) | +| `ma` | `i` | [Missing adapters bitmask](/faq/missing-adapters) | +| `np` | `i` | [Number of full-length subreads](/faq/accuracy-vs-passes#how-is-number-of-passes-computed) | +| `ri` | `B,C` | [Reverse IPD (codec V1)](/faq/kinetics) | +| `rn` | `i` | [Reverse number of complete passes (zero or more)](/faq/kinetics) | +| `rp` | `B,C` | [Reverse PulseWidth (codec V1)](/faq/kinetics) | +| `rq` | `f` | [Predicted average read accuracy](/how-does-ccs-work#9-qv-calculation) | +| `sn` | `B,f` | Signal-to-noise ratios for each nucleotide | +| `zm` | `i` | ZMW hole number | +| `RG` | `z` | Read group | ## How does the output BAM file size scale with yield? diff --git a/docs/faq/bioconda-binary.md b/docs/faq/bioconda-binary.md index 72de741a..8c940c2d 100644 --- a/docs/faq/bioconda-binary.md +++ b/docs/faq/bioconda-binary.md @@ -21,5 +21,5 @@ not bundle a newer `glibc`. Please use this alternative binary. For _ccs_, we offer two binaries in bioconda: - * `ccs`, statically links `glibc` v2.32 and `mimalloc` v1.3.0. - * `ccs-alt`, was build by dynamically linking `glibc` v2.12, but statically links `mimalloc` v1.3.0. + * `ccs`, statically links `glibc` v2.33 and `mimalloc` v1.6.3. + * `ccs-alt`, was build by dynamically linking `glibc` v2.17, but statically links `mimalloc` v1.6.3. diff --git a/docs/faq/chemistry.md b/docs/faq/chemistry.md index b3bd5f8d..bba103d2 100644 --- a/docs/faq/chemistry.md +++ b/docs/faq/chemistry.md @@ -7,22 +7,22 @@ title: Chemistry ## Help! I am getting "Unsupported ..."! If you encounter the error `Unsupported chemistries found: (...)` or `unsupported sequencing chemistry combination`, your _ccs_ binaries do not -support the used sequencing chemistry kit, from here on referred to as "chemistry". -This may be because we removed support of an older chemistry or your binary predates -release of the used chemistry. -This is unlikely to happen with _ccs_ from SMRT Link installations, as SMRT Link -is able to automatically update and install new chemistries. -Thus, the easiest solution is to always use _ccs_ from the SMRT Link version that -shipped with the release of the sequencing chemistry kit. - -**Old chemistries:** -With _ccs_ 4.0.0, we have removed support for the last RSII chemistry `P6-C4`. -The only option is to downgrade _ccs_ with `conda install pbccs==3.4`. - -**New chemistries:** -It might happen that your _ccs_ version predates the sequencing chemistry kit. -To fix this, install the latest version of _ccs_ with `conda update --all`. -If you are an early access user, follow the [monkey patch tutorial](/faq/chemistry#monkey-patch-ccs-to-support-additional-sequencing-chemistry-kits). +support the used sequencing chemistry kit, from here on referred to as +"chemistry". This may be because we removed support of an older chemistry or +your binary predates release of the used chemistry. This is unlikely to happen +with _ccs_ from SMRT Link installations, as SMRT Link is able to automatically +update and install new chemistries. Thus, the easiest solution is to always use +_ccs_ from the SMRT Link version that shipped with the release of the sequencing +chemistry kit. + +**Old chemistries:** With _ccs_ 4.0.0, we have removed support for the last RSII +chemistry `P6-C4`. The only option is to downgrade _ccs_ with `conda install +pbccs==3.4`. + +**New chemistries:** It might happen that your _ccs_ version predates the +sequencing chemistry kit. To fix this, install the latest version of _ccs_ with +`conda update --all`. If you are an early access user, follow the [monkey patch +tutorial](/faq/chemistry#monkey-patch-ccs-to-support-additional-sequencing-chemistry-kits). ## Monkey patch _ccs_ to support additional sequencing chemistry kits Please create a directory that is used to inject new chemistry information diff --git a/docs/faq/kinetics.md b/docs/faq/kinetics.md index 20dac740..fb078af0 100644 --- a/docs/faq/kinetics.md +++ b/docs/faq/kinetics.md @@ -6,20 +6,19 @@ title: Kinetics ## Is it possible to use HiFi reads to call base modifications? Base modifications can be inferred from per-base pulse width (PW) and -inter-pulse duration (IPD) kinetics. -Running _ccs_ with `--hifi-kinetics` generates averaged kinetic information -for polished reads, independently for both strands of the insert. -Forward is defined with respect to the orientation represented in ``SEQ`` and -is considered to be the native orientation. As with other PacBio-specific -tags, aligners will not re-orient these fields. +inter-pulse duration (IPD) kinetics. Running _ccs_ with `--hifi-kinetics` +generates averaged kinetic information for polished reads, independently for +both strands of the insert. Forward is defined with respect to the orientation +represented in ``SEQ`` and is considered to be the native orientation. As with +other PacBio-specific tags, aligners will not re-orient these fields. -Minor cases exist where a certain orientation may get filtered out entirely -from a ZMW, preventing valid values from being passed for that record. In -these cases, empty lists will be passed for the respective record/orientation -and number of passes will be set to zero. +Minor cases exist where a certain orientation may get filtered out entirely from +a ZMW, preventing valid values from being passed for that record. In these +cases, empty lists will be passed for the respective record/orientation and +number of passes will be set to zero. In order to facilitate the use of HiFi reads with base modifications workflows, we have added an executable in pbbam called `ccs-kinetics-bystrandify` which -creates a pseudo `--by-strand` BAM with corresponding `pw` and `ip` tags -that imitates a normal, unaligned subreads BAM. You can install pbbam from -Bioconda by calling `conda install pbbam`. +creates a pseudo `--by-strand` BAM with corresponding `pw` and `ip` tags that +imitates a normal, unaligned subreads BAM. You can install pbbam from Bioconda +by calling `conda install pbbam`. diff --git a/docs/faq/low-complexity.md b/docs/faq/low-complexity.md index 10bd88c7..d873616a 100644 --- a/docs/faq/low-complexity.md +++ b/docs/faq/low-complexity.md @@ -5,18 +5,14 @@ title: Low complexity --- ## Does _ccs_ dislike low-complexity regions? -Low-complexity comes in many shapes and forms. -A particular challenge for _ccs_ are highly enriched tandem repeats, like -hundreds of copies of `AGGGGT`. -Prior _ccs_ v5.0, inserts with many copies of a small repeat likely not generate -a consensus sequence. -Since _ccs_ v5.0, every ZMW is tested if it contains a tandem repeat -of length `--min-tandem-repeat-length 1000`. -For this, we use [symmetric DUST](https://doi.org/10.1089/cmb.2006.13.1028) -and in particular the [sdust](https://github.com/lh3/sdust) implementation, -but slightly modified. -If a ZMW is flagged as a tandem repeat, internally `--disable-heuristics` -is activated for only this ZMW, and various filters that are known to exclude -low-complexity sequences are disabled. -This recovers most of the low-complexity consensus sequences, without impacting -run time performance. +Low-complexity comes in many shapes and forms. A particular challenge for _ccs_ +are highly enriched tandem repeats, like hundreds of copies of `AGGGGT`. Prior +_ccs_ v5.0, inserts with many copies of a small repeat likely not generate a +consensus sequence. Since _ccs_ v5.0, every ZMW is tested if it contains a +tandem repeat of length `--min-tandem-repeat-length 1000`. For this, we use +[symmetric DUST](https://doi.org/10.1089/cmb.2006.13.1028) and in particular the +[sdust](https://github.com/lh3/sdust) implementation, but slightly modified. If +a ZMW is flagged as a tandem repeat, internally `--disable-heuristics` is +activated for only this ZMW, and various filters that are known to exclude +low-complexity sequences are disabled. This recovers most of the low-complexity +consensus sequences, without impacting run time performance. diff --git a/docs/faq/missing-adapters.md b/docs/faq/missing-adapters.md new file mode 100644 index 00000000..24670c28 --- /dev/null +++ b/docs/faq/missing-adapters.md @@ -0,0 +1,28 @@ +--- +layout: default +parent: FAQ +title: Missing adapters +--- + +## What are missing adapter tags `ma` and `ac`? + +The `ma` and `ac` tags indicate whether the molecule that produces a CCS +read is missing a SMRTbell adapter on its left/start or right/end. The tags +produced by _ccs_ v6.3.0 and newer are based on the `ADAPTER_BEFORE_BAD` and +`ADAPTER_AFTER_BAD` information in the subread `cx` tag. + +### Tag `ac` + +Array containing four counts, in order: +- *detected* adapters on left/start +- *missing* adapters on left/start +- *detected* adapters on right/end +- *missing* adapter on right/end + +### Tag `ma` + +Bitmask storing if an adapter is missing on either side of the +molecule. A value of 0 indicates neither end has a confirmed +missing adapter. +- `0x1` if adapter is missing on left/start +- `0x2` if adapter is missing on right/end diff --git a/docs/faq/mode-heteroduplex-filtering.md b/docs/faq/mode-heteroduplex-filtering.md index 8e82de89..96db7e94 100644 --- a/docs/faq/mode-heteroduplex-filtering.md +++ b/docs/faq/mode-heteroduplex-filtering.md @@ -1,30 +1,35 @@ --- layout: default parent: FAQ -title: Mode --split-heteroduplex +title: Heteroduplex finder --- # Attention: This is an early access feature! ## What is heteroduplex filtering? Starting with _ccs_ v6.2.0, single-strand artifacts, such as insertions larger -than 20 bases, do not necessarily have to be filtered out. Using `--split-heteroduplexes`, -_ccs_ is able to split ZMW on-the-fly after detecting such heteroduplex and +than 20 bases, do not necessarily have to be filtered out. In addition, with +_ccs_ v6.3.0, substitution differences between strands can be detected. +Using `--hd-finder`, _ccs_ is able to split ZMW on-the-fly after detecting such heteroduplex and process each strand separately. As a consequence, _ccs_ has to distinguish between double-stranded (DS) and single-stranded (DS) ZMWs and their consensus reads. Implications: - * Single-strand reads are stored in an extra file + * The BAM output file will have three read groups instead of one * Summary logs report double-strand and single-strand metrics * `ccs_reports.txt` file contains two columns, double-strand and single-strand reads -We are currently investigating how reliable we can detect indel and SNV -heteroduplexes and might add those to strand-aware splitting in future versions. +### Additional read groups in BAM +The BAM file contains two different kinds of reads, single-strand and +double-strand reads. Single-strand reads follow the by-strand scheme with `/fwd` +and `/rev` name suffixes and _ccs_ generates up to two single-strand reads per +ZMW. Double-strand reads have no special distinguishment. Each of the three types +of stranded reads have their own read groups. Single-stranded reads have an +additional field in the `DS` tag of the read group. Simplified example -### Additional `*.stranded.bam` file -The file `outputPrefix.stranded.bam` contains all single-strand reads. Read names -follow the by-strand scheme with `/fwd` and `/rev` suffixes. There are up to two -reads per split ZMW. + @RG ID:793f140b PL:PACBIO DS:READTYPE=CCS;STRAND=FORWARD <- single-strand reads /fwd + @RG ID:36fc54d5 PL:PACBIO DS:READTYPE=CCS;STRAND=REVERSE <- single-strand reads /rev + @RG ID:5d30364d PL:PACBIO DS:READTYPE=CCS <- double-strand reads ### Summary logs At the end of each execution, _ccs_ reports for `--log-level INFO` a summary. @@ -62,7 +67,8 @@ HiFi Avg QV : 30.2 Typical content of the strand-aware `ccs_reports.txt` file. Contrary to the default output, this file does not report numbers in ZMWs, but actual DS and SS reads. Accounting in SS ZMWs is not possible, as one strand might fail and the -other succeed. +other succeed. The percentage of the `Inputs` is with respect to the number of +ZMWs, all other percentages are with repect to reads in their column. ``` Double-Strand Reads Single-Strand Reads diff --git a/docs/index.md b/docs/index.md index d6479565..781fd4b9 100644 --- a/docs/index.md +++ b/docs/index.md @@ -31,7 +31,7 @@ Please refer to our [official pbbioconda page](https://github.com/PacificBioscie for information on Installation, Support, License, Copyright, and Disclaimer. ## Latest Version -Version **6.2.0**: [Full changelog here](/changelog) +Version **6.3.0**: [Full changelog here](/changelog) ## Open position [**Apply now!**](https://www.thegravityapp.com/shared/job?clientId=8a7882525cf735a1015d1e0affa16ff0&id=8a7887a878f52449017961e2b27a1844&u=1629497634&v=9&token=eyJ1aWQiOjMzMDQ0LCJwcm92aWRlciI6ImJvdW5jZSIsInR5cGUiOiJlbWFpbCJ9.tkiUIP-M0EtiHBfAg07lTu4Hlwc)