Skip to content

Commit

Permalink
Version 6.3.0 initial docs
Browse files Browse the repository at this point in the history
  • Loading branch information
armintoepfer committed Jan 26, 2022
1 parent db520e8 commit 37ac52b
Show file tree
Hide file tree
Showing 10 changed files with 141 additions and 101 deletions.
12 changes: 11 additions & 1 deletion docs/changelog.md
Original file line number Diff line number Diff line change
Expand Up @@ -6,8 +6,18 @@ nav_order: 99

# Version changelog

**6.2.0**
**6.3.0**
* Upcoming SMRT Link release
* Heteroduplex finder
* Change file output, single file with multiple read groups
* Add HD detection for substitutions differences between strands
* Add missing adapter tags `ma` and `ac`
* Adhere to latest [PacBio BAM spec](https://pacbiofileformats.readthedocs.io/)
* Add `--subsample-clr-perc` and `--subsample-clr-file` to store a percentage of productive ZMWs as subreads
* Add `--fastq` as an additional output file

6.2.0
* SMRT Link v10.2 release
* Improved low-complexity handling, better runtime and lower memory usage
* Improved BAM merge step, up to 5x faster
* Improved compute run time
Expand Down
55 changes: 27 additions & 28 deletions docs/faq/accuracy-vs-passes.md
Original file line number Diff line number Diff line change
Expand Up @@ -5,49 +5,48 @@ title: Accuracy vs. passes
---

## What impacts the number and quality of HiFi reads that are generated?
The longer the polymerase read gets, more passes of the SMRTbell
are produced and consequently more evidence is accumulated per molecule.
This increase in evidence translates into higher consensus accuracy, as
depicted in the following plot:
The longer the polymerase read gets, more passes of the SMRTbell are produced
and consequently more evidence is accumulated per molecule. This increase in
evidence translates into higher consensus accuracy, as depicted in the following
plot:

<p align="center"><img width="600px" src="../img/ccs-acc.png"/></p>

## How is number of passes computed?
Each read is annotated with a `np` tag that contains the number of
full-length subreads used for polishing. Full-length subreads are flanked by
adapters and thus cover the full insert.
Since the first version of _ccs_, number of passes has only accounted for
full-length subreads. In version v3.3.0 windowing has been added, which
takes the minimum number of full-length subreads across all windows.
Starting with version v4.0.0, minimum has been replaced with mode to get a
better representation across all windows. Only subreads that pass the subread
length filter (please see next FAQ about filters) and were not dropped during
polishing are counted.
Each read is annotated with a `np` tag that contains the number of full-length
subreads used for polishing. Full-length subreads are flanked by adapters and
thus cover the full insert. Since the first version of _ccs_, number of passes
has only accounted for full-length subreads. In version v3.3.0 windowing has
been added, which takes the minimum number of full-length subreads across all
windows. Starting with version v4.0.0, minimum has been replaced with mode to
get a better representation across all windows. Only subreads that pass the
subread length filter (please see next FAQ about filters) and were not dropped
during polishing are counted.

Similarly, the tag `ec` reports effective coverage, the average subread coverage
across all windows. This metric includes all subreads, independent of being
full- or partial-length subreads, that pass length filters and did not fail
during polishing. In most cases `ec` will be roughly `np + 1`.

## Why do I get more yield if I increase `--min-passes`?
For versions newer than 3.0.0 and older than 4.2.0, we required that after
draft generation, at least `--min-passes` subreads map back to the draft.
Imagine the following scenario, a ZMW with 10 subreads generates a draft to which
only a single subread aligns. This draft is of low quality and does not
represent the ZMW, yet if you ask for `--min-passes 1`, this low-quality draft
is being used. Starting with version 4.2.0, we switch to an additional
percentage threshold of more than 50% aligning subreads to avoid this problem.
This fixes the majority of discrepancies for fewer than three passes.
For versions newer than 3.0.0 and older than 4.2.0, we required that after draft
generation, at least `--min-passes` subreads map back to the draft. Imagine the
following scenario, a ZMW with 10 subreads generates a draft to which only a
single subread aligns. This draft is of low quality and does not represent the
ZMW, yet if you ask for `--min-passes 1`, this low-quality draft is being used.
Starting with version 4.2.0, we switch to an additional percentage threshold of
more than 50% aligning subreads to avoid this problem. This fixes the majority
of discrepancies for fewer than three passes.

Why do we have this problem at all, shouldn't the draft stage be robust enough?
Robustness comes with inherent speed trade-offs. We have a cascade of different draft
generators, from very fast and unstable to slow and robust. If a ZMW fails
Robustness comes with inherent speed trade-offs. We have a cascade of different
draft generators, from very fast and unstable to slow and robust. If a ZMW fails
to generate a draft for a fast generator, it falls back multiple times until it
reaches the slower and more robust generator. This approach is still much faster
than always relying on the robust generator.

## Is there an upper limit on number of passes used?
Per default, _ccs_ uses at most the top 60 full-length passes after sorting
by median length.
Beyond this threshold, it has been shown that quality does not improve.
You can change this limit with `--top-passes`, whereas `0` means unlimited.
Per default, _ccs_ uses at most the top 60 full-length passes after sorting by
median length. Beyond this threshold, it has been shown that quality does not
improve. You can change this limit with `--top-passes`, whereas `0` means
unlimited.
30 changes: 16 additions & 14 deletions docs/faq/bam-output.md
Original file line number Diff line number Diff line change
Expand Up @@ -6,20 +6,22 @@ title: BAM output

## What BAM tags are generated?

| Tag | Type | Description |
|:----:|:-----:|--------------------------------------------------------------------------------------------|
| `ec` | `f` | [Effective coverage](/faq/accuracy-vs-passes#how-is-number-of-passes-computed) |
| `fi` | `B,C` | [Forward IPD (codec V1)](/faq/kinetics) |
| `fn` | `i` | [Forward number of complete passes (zero or more)](/faq/kinetics) |
| `fp` | `B,C` | [Forward PulseWidth (codec V1)](/faq/kinetics) |
| `np` | `i` | [Number of full-length subreads](/faq/accuracy-vs-passes#how-is-number-of-passes-computed) |
| `ri` | `B,C` | [Reverse IPD (codec V1)](/faq/kinetics) |
| `rn` | `i` | [Reverse number of complete passes (zero or more)](/faq/kinetics) |
| `rp` | `B,C` | [Reverse PulseWidth (codec V1)](/faq/kinetics) |
| `rq` | `f` | [Predicted average read accuracy](/how-does-ccs-work#9-qv-calculation) |
| `sn` | `B,f` | Signal-to-noise ratios for each nucleotide |
| `zm` | `i` | ZMW hole number |
| `RG` | `z` | Read group |
| Tag | Type | Description |
| :---: | :---: | ------------------------------------------------------------------------------------------ |
| `ac` | `B,i` | [Detected and missing adapter counts](/faq/missing-adapters) |
| `ec` | `f` | [Effective coverage](/faq/accuracy-vs-passes#how-is-number-of-passes-computed) |
| `fi` | `B,C` | [Forward IPD (codec V1)](/faq/kinetics) |
| `fn` | `i` | [Forward number of complete passes (zero or more)](/faq/kinetics) |
| `fp` | `B,C` | [Forward PulseWidth (codec V1)](/faq/kinetics) |
| `ma` | `i` | [Missing adapters bitmask](/faq/missing-adapters) |
| `np` | `i` | [Number of full-length subreads](/faq/accuracy-vs-passes#how-is-number-of-passes-computed) |
| `ri` | `B,C` | [Reverse IPD (codec V1)](/faq/kinetics) |
| `rn` | `i` | [Reverse number of complete passes (zero or more)](/faq/kinetics) |
| `rp` | `B,C` | [Reverse PulseWidth (codec V1)](/faq/kinetics) |
| `rq` | `f` | [Predicted average read accuracy](/how-does-ccs-work#9-qv-calculation) |
| `sn` | `B,f` | Signal-to-noise ratios for each nucleotide |
| `zm` | `i` | ZMW hole number |
| `RG` | `z` | Read group |


## How does the output BAM file size scale with yield?
Expand Down
4 changes: 2 additions & 2 deletions docs/faq/bioconda-binary.md
Original file line number Diff line number Diff line change
Expand Up @@ -21,5 +21,5 @@ not bundle a newer `glibc`. Please use this alternative binary.

For _ccs_, we offer two binaries in bioconda:

* `ccs`, statically links `glibc` v2.32 and `mimalloc` v1.3.0.
* `ccs-alt`, was build by dynamically linking `glibc` v2.12, but statically links `mimalloc` v1.3.0.
* `ccs`, statically links `glibc` v2.33 and `mimalloc` v1.6.3.
* `ccs-alt`, was build by dynamically linking `glibc` v2.17, but statically links `mimalloc` v1.6.3.
32 changes: 16 additions & 16 deletions docs/faq/chemistry.md
Original file line number Diff line number Diff line change
Expand Up @@ -7,22 +7,22 @@ title: Chemistry
## Help! I am getting "Unsupported ..."!
If you encounter the error `Unsupported chemistries found: (...)` or
`unsupported sequencing chemistry combination`, your _ccs_ binaries do not
support the used sequencing chemistry kit, from here on referred to as "chemistry".
This may be because we removed support of an older chemistry or your binary predates
release of the used chemistry.
This is unlikely to happen with _ccs_ from SMRT Link installations, as SMRT Link
is able to automatically update and install new chemistries.
Thus, the easiest solution is to always use _ccs_ from the SMRT Link version that
shipped with the release of the sequencing chemistry kit.

**Old chemistries:**
With _ccs_ 4.0.0, we have removed support for the last RSII chemistry `P6-C4`.
The only option is to downgrade _ccs_ with `conda install pbccs==3.4`.

**New chemistries:**
It might happen that your _ccs_ version predates the sequencing chemistry kit.
To fix this, install the latest version of _ccs_ with `conda update --all`.
If you are an early access user, follow the [monkey patch tutorial](/faq/chemistry#monkey-patch-ccs-to-support-additional-sequencing-chemistry-kits).
support the used sequencing chemistry kit, from here on referred to as
"chemistry". This may be because we removed support of an older chemistry or
your binary predates release of the used chemistry. This is unlikely to happen
with _ccs_ from SMRT Link installations, as SMRT Link is able to automatically
update and install new chemistries. Thus, the easiest solution is to always use
_ccs_ from the SMRT Link version that shipped with the release of the sequencing
chemistry kit.

**Old chemistries:** With _ccs_ 4.0.0, we have removed support for the last RSII
chemistry `P6-C4`. The only option is to downgrade _ccs_ with `conda install
pbccs==3.4`.

**New chemistries:** It might happen that your _ccs_ version predates the
sequencing chemistry kit. To fix this, install the latest version of _ccs_ with
`conda update --all`. If you are an early access user, follow the [monkey patch
tutorial](/faq/chemistry#monkey-patch-ccs-to-support-additional-sequencing-chemistry-kits).

## Monkey patch _ccs_ to support additional sequencing chemistry kits
Please create a directory that is used to inject new chemistry information
Expand Down
25 changes: 12 additions & 13 deletions docs/faq/kinetics.md
Original file line number Diff line number Diff line change
Expand Up @@ -6,20 +6,19 @@ title: Kinetics

## Is it possible to use HiFi reads to call base modifications?
Base modifications can be inferred from per-base pulse width (PW) and
inter-pulse duration (IPD) kinetics.
Running _ccs_ with `--hifi-kinetics` generates averaged kinetic information
for polished reads, independently for both strands of the insert.
Forward is defined with respect to the orientation represented in ``SEQ`` and
is considered to be the native orientation. As with other PacBio-specific
tags, aligners will not re-orient these fields.
inter-pulse duration (IPD) kinetics. Running _ccs_ with `--hifi-kinetics`
generates averaged kinetic information for polished reads, independently for
both strands of the insert. Forward is defined with respect to the orientation
represented in ``SEQ`` and is considered to be the native orientation. As with
other PacBio-specific tags, aligners will not re-orient these fields.

Minor cases exist where a certain orientation may get filtered out entirely
from a ZMW, preventing valid values from being passed for that record. In
these cases, empty lists will be passed for the respective record/orientation
and number of passes will be set to zero.
Minor cases exist where a certain orientation may get filtered out entirely from
a ZMW, preventing valid values from being passed for that record. In these
cases, empty lists will be passed for the respective record/orientation and
number of passes will be set to zero.

In order to facilitate the use of HiFi reads with base modifications workflows,
we have added an executable in pbbam called `ccs-kinetics-bystrandify` which
creates a pseudo `--by-strand` BAM with corresponding `pw` and `ip` tags
that imitates a normal, unaligned subreads BAM. You can install pbbam from
Bioconda by calling `conda install pbbam`.
creates a pseudo `--by-strand` BAM with corresponding `pw` and `ip` tags that
imitates a normal, unaligned subreads BAM. You can install pbbam from Bioconda
by calling `conda install pbbam`.
26 changes: 11 additions & 15 deletions docs/faq/low-complexity.md
Original file line number Diff line number Diff line change
Expand Up @@ -5,18 +5,14 @@ title: Low complexity
---

## Does _ccs_ dislike low-complexity regions?
Low-complexity comes in many shapes and forms.
A particular challenge for _ccs_ are highly enriched tandem repeats, like
hundreds of copies of `AGGGGT`.
Prior _ccs_ v5.0, inserts with many copies of a small repeat likely not generate
a consensus sequence.
Since _ccs_ v5.0, every ZMW is tested if it contains a tandem repeat
of length `--min-tandem-repeat-length 1000`.
For this, we use [symmetric DUST](https://doi.org/10.1089/cmb.2006.13.1028)
and in particular the [sdust](https://github.com/lh3/sdust) implementation,
but slightly modified.
If a ZMW is flagged as a tandem repeat, internally `--disable-heuristics`
is activated for only this ZMW, and various filters that are known to exclude
low-complexity sequences are disabled.
This recovers most of the low-complexity consensus sequences, without impacting
run time performance.
Low-complexity comes in many shapes and forms. A particular challenge for _ccs_
are highly enriched tandem repeats, like hundreds of copies of `AGGGGT`. Prior
_ccs_ v5.0, inserts with many copies of a small repeat likely not generate a
consensus sequence. Since _ccs_ v5.0, every ZMW is tested if it contains a
tandem repeat of length `--min-tandem-repeat-length 1000`. For this, we use
[symmetric DUST](https://doi.org/10.1089/cmb.2006.13.1028) and in particular the
[sdust](https://github.com/lh3/sdust) implementation, but slightly modified. If
a ZMW is flagged as a tandem repeat, internally `--disable-heuristics` is
activated for only this ZMW, and various filters that are known to exclude
low-complexity sequences are disabled. This recovers most of the low-complexity
consensus sequences, without impacting run time performance.
28 changes: 28 additions & 0 deletions docs/faq/missing-adapters.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,28 @@
---
layout: default
parent: FAQ
title: Missing adapters
---

## What are missing adapter tags `ma` and `ac`?

The `ma` and `ac` tags indicate whether the molecule that produces a CCS
read is missing a SMRTbell adapter on its left/start or right/end. The tags
produced by _ccs_ v6.3.0 and newer are based on the `ADAPTER_BEFORE_BAD` and
`ADAPTER_AFTER_BAD` information in the subread `cx` tag.

### Tag `ac`

Array containing four counts, in order:
- *detected* adapters on left/start
- *missing* adapters on left/start
- *detected* adapters on right/end
- *missing* adapter on right/end

### Tag `ma`

Bitmask storing if an adapter is missing on either side of the
molecule. A value of 0 indicates neither end has a confirmed
missing adapter.
- `0x1` if adapter is missing on left/start
- `0x2` if adapter is missing on right/end
28 changes: 17 additions & 11 deletions docs/faq/mode-heteroduplex-filtering.md
Original file line number Diff line number Diff line change
@@ -1,30 +1,35 @@
---
layout: default
parent: FAQ
title: Mode --split-heteroduplex
title: Heteroduplex finder
---

# Attention: This is an early access feature!

## What is heteroduplex filtering?
Starting with _ccs_ v6.2.0, single-strand artifacts, such as insertions larger
than 20 bases, do not necessarily have to be filtered out. Using `--split-heteroduplexes`,
_ccs_ is able to split ZMW on-the-fly after detecting such heteroduplex and
than 20 bases, do not necessarily have to be filtered out. In addition, with
_ccs_ v6.3.0, substitution differences between strands can be detected.
Using `--hd-finder`, _ccs_ is able to split ZMW on-the-fly after detecting such heteroduplex and
process each strand separately. As a consequence, _ccs_ has to distinguish between
double-stranded (DS) and single-stranded (DS) ZMWs and their consensus reads.
Implications:

* Single-strand reads are stored in an extra file
* The BAM output file will have three read groups instead of one
* Summary logs report double-strand and single-strand metrics
* `ccs_reports.txt` file contains two columns, double-strand and single-strand reads

We are currently investigating how reliable we can detect indel and SNV
heteroduplexes and might add those to strand-aware splitting in future versions.
### Additional read groups in BAM
The BAM file contains two different kinds of reads, single-strand and
double-strand reads. Single-strand reads follow the by-strand scheme with `/fwd`
and `/rev` name suffixes and _ccs_ generates up to two single-strand reads per
ZMW. Double-strand reads have no special distinguishment. Each of the three types
of stranded reads have their own read groups. Single-stranded reads have an
additional field in the `DS` tag of the read group. Simplified example

### Additional `*.stranded.bam` file
The file `outputPrefix.stranded.bam` contains all single-strand reads. Read names
follow the by-strand scheme with `/fwd` and `/rev` suffixes. There are up to two
reads per split ZMW.
@RG ID:793f140b PL:PACBIO DS:READTYPE=CCS;STRAND=FORWARD <- single-strand reads /fwd
@RG ID:36fc54d5 PL:PACBIO DS:READTYPE=CCS;STRAND=REVERSE <- single-strand reads /rev
@RG ID:5d30364d PL:PACBIO DS:READTYPE=CCS <- double-strand reads

### Summary logs
At the end of each execution, _ccs_ reports for `--log-level INFO` a summary.
Expand Down Expand Up @@ -62,7 +67,8 @@ HiFi Avg QV : 30.2
Typical content of the strand-aware `ccs_reports.txt` file. Contrary to the
default output, this file does not report numbers in ZMWs, but actual DS and SS
reads. Accounting in SS ZMWs is not possible, as one strand might fail and the
other succeed.
other succeed. The percentage of the `Inputs` is with respect to the number of
ZMWs, all other percentages are with repect to reads in their column.

```
Double-Strand Reads Single-Strand Reads
Expand Down
2 changes: 1 addition & 1 deletion docs/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -31,7 +31,7 @@ Please refer to our [official pbbioconda page](https://github.com/PacificBioscie
for information on Installation, Support, License, Copyright, and Disclaimer.

## Latest Version
Version **6.2.0**: [Full changelog here](/changelog)
Version **6.3.0**: [Full changelog here](/changelog)

## Open position
[**Apply now!**](https://www.thegravityapp.com/shared/job?clientId=8a7882525cf735a1015d1e0affa16ff0&id=8a7887a878f52449017961e2b27a1844&u=1629497634&v=9&token=eyJ1aWQiOjMzMDQ0LCJwcm92aWRlciI6ImJvdW5jZSIsInR5cGUiOiJlbWFpbCJ9.tkiUIP-M0EtiHBfAg07lTu4Hlwc)
Expand Down

0 comments on commit 37ac52b

Please sign in to comment.