Developer Note: Tracking Reads through Grouping and Duplex Consensus Calling

Nils Homer edited this page Apr 29, 2024 · 5 revisions

The following is meant as a developer note on some conventions that relate raw reads to molecular identifiers, and single strand consensus reads to duplex consensus reads.

Please see the CallDuplexConsensusReads tool for additional information.

Top and Bottom Strand for Raw Reads

GroupReadsByUmi will assign the same molecular ID to raw reads from the same source molecule, with a trailing /A or /B indicating which "strand" they belong to (top or bottom, AB or BA). By convention, the /A raw reads are those where the 5' unclipped position of read one (of the pair) is less than or equal to the 5' unclipped position of read two (of the pair). The 5' unclipped position is relative to sequencing order, not the strand of the reference genome.

For example, given the following read pairs:

x: R1----------------->    <-------------------R2
y: R2----------------->    <-------------------R1
z: R1----------------->
     <-----------------R2

x would be given /A, y would be given /B, and z would be given /A (even though R1 and R2 are fully overlapped, R1's 5' end is earlier).
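The convention above can be sketched in a few lines of Python. This is an illustration of the rule as stated in this note, not the GroupReadsByUmi implementation: the Read class, its field names, and the strand_suffix helper are hypothetical, and the sketch assumes both reads map to the same reference sequence and that coordinates already include clipped bases.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Read:
    """Hypothetical minimal read record; positions include clipped bases."""
    unclipped_start: int  # leftmost reference position
    unclipped_end: int    # rightmost reference position
    reverse: bool         # True if the read maps to the reverse strand

    def five_prime(self) -> int:
        # The 5' end follows sequencing order: a reverse-strand read
        # starts sequencing at its rightmost reference position.
        return self.unclipped_end if self.reverse else self.unclipped_start


def strand_suffix(r1: Read, r2: Read) -> str:
    """Return "/A" if R1's 5' unclipped position is <= R2's, else "/B"."""
    return "/A" if r1.five_prime() <= r2.five_prime() else "/B"


# The three pairs from the diagram, with made-up coordinates.
x = strand_suffix(Read(100, 150, False), Read(200, 250, True))
y = strand_suffix(Read(200, 250, True), Read(100, 150, False))
z = strand_suffix(Read(100, 150, False), Read(100, 150, True))
```

With these coordinates, x and z evaluate to "/A" and y evaluates to "/B", matching the assignments described above.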

Top and Bottom Strand Single-Strand Reads Relative to Duplex Consensus Reads

CallDuplexConsensusReads will write single-strand information into SAM attributes for each duplex consensus read (see Consensus Tags). The choice of which single-strand consensus information is stored in the "AB" and "BA" tags is determined as follows:

  1. If both strands generated a single-strand consensus, then the information for the raw reads with the trailing /A in their molecular identifier will be in the "AB" tags, while the information for the raw reads with the trailing /B in their molecular identifier will be in the "BA" tags.
  2. If only one of the two strands creates a consensus (for example, because no raw reads were present for the other strand), then the "AB" tags will contain the information for the single-strand consensus that was present, while the "BA" tags will contain only "per-read" tags.

This also means that the sequence of the duplex consensus will have the same "strand" as the "AB" single-strand consensus.
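The two rules above amount to a small selection step, sketched below. The function name assign_ab_ba and the use of None for a missing single-strand consensus are assumptions made for illustration; this is not the CallDuplexConsensusReads code, and the sketch does not model the per-read "BA" tags that are still written in the single-strand case.

```python
from typing import Optional, Tuple, TypeVar

T = TypeVar("T")


def assign_ab_ba(a: Optional[T], b: Optional[T]) -> Tuple[Optional[T], Optional[T]]:
    """Choose which single-strand consensus feeds the "AB" and "BA" tags.

    `a` is the consensus built from the /A raw reads, `b` the one built
    from the /B raw reads; None means that strand produced no consensus
    (a hypothetical representation chosen for this sketch).
    """
    if a is not None and b is not None:
        # Rule 1: both strands present, so /A maps to AB and /B maps to BA.
        return a, b
    # Rule 2: whichever strand is present fills the AB slot; BA has no
    # single-strand consensus of its own.
    present = a if a is not None else b
    return present, None


both = assign_ab_ba("consensus-from-A", "consensus-from-B")
only_b = assign_ab_ba(None, "consensus-from-B")
```

Here both is ("consensus-from-A", "consensus-from-B"), while only_b is ("consensus-from-B", None): the /B consensus is reported in the "AB" slot because it was the only one present.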

Consensus Calling Tags

The table below lists the SAM tags for single-strand ("AB" and "BA") and duplex ("Final") consensus reads, when available.

Value                  AB   BA   Final
per-read-depth         aD   bD   cD
per-read-min-depth     aM   bM   cM
per-read-error-rate    aE   bE   cE
per-base-depth         ad   bd   cd
per-base-error-count   ae   be   ce
per-base-bases         ac   bc   bases
per-base-quals         aq   bq   quals

The second letter in the tag is lower case if the value is per-base and upper case if it is per-read. Please see the CallDuplexConsensusReads tool and the Consensus Tags source code for more information.
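As a quick illustration of the naming convention, the sketch below decodes a two-letter tag into its source column and scope. The describe_tag helper and the SOURCES mapping are hypothetical and derived only from the table above; they are not part of fgbio.

```python
from typing import Tuple

# First letter of the tag, mapped to the table column it belongs to.
SOURCES = {"a": "AB", "b": "BA", "c": "Final"}


def describe_tag(tag: str) -> Tuple[str, str]:
    """Return (source, scope) for a two-letter consensus tag such as "aD"."""
    if len(tag) != 2 or tag[0] not in SOURCES or not tag[1].isalpha():
        raise ValueError(f"not a recognized consensus tag: {tag!r}")
    # Lower-case second letter means per-base, upper-case means per-read.
    scope = "per-base" if tag[1].islower() else "per-read"
    return SOURCES[tag[0]], scope


ab_depth = describe_tag("aD")
ba_base_depth = describe_tag("bd")
final_errors = describe_tag("ce")
```

For example, "aD" decodes to ("AB", "per-read") and "bd" to ("BA", "per-base").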