General metrics

Clay McLeod edited this page Sep 24, 2022 · 10 revisions

The General Metrics facet reports general statistics about the records contained within the file. The report is delivered under the general key within the results.json file. You can examine the output of the general facet using jq:

cat results.json | jq .general

Outputs

This facet has the following top-level keys:

| Key | Description |
| --- | --- |
| records | Record counts: total, unmapped, and duplicate records; the designation of records (primary, secondary, supplementary); paired, read one, and read two records; properly paired records; singleton records; and records whose mate is mapped to a different sequence (both unfiltered and high-quality). |
| cigar | Pileups of Cigar operation counts for both read ones and read twos. |
| summary | Summary metrics for this facet, including the duplicate record percentage, the unmapped record percentage, and the percentage of records whose mate is mapped to another sequence (both unfiltered and high-quality). |

Records

This section of the general metrics comprises multiple general counting metrics regarding records. Most of these counts are produced by iterating over the records and tallying those that have particular flags set, which is similar to what the samtools flagstat command reports. The current set of record metrics includes:

Unconstrained metrics

  • Total (total). The total number of records within the file.
  • Unmapped (unmapped). The total number of records marked as unmapped (0x4) within the file.
  • Duplicate (duplicate). The number of records marked as duplicate (0x400) within the file.
  • Designation (designation). The number of primary, secondary, and supplementary records in the file respectively.
    • If a read is marked as secondary (0x100), then the read is counted as secondary.
    • Else, if a read is marked as supplementary (0x800), then the read is counted as supplementary.
    • Else, the read is counted as primary.

Primary-only metrics

Past this point, only records designated as primary are counted towards the following metrics.

  • Primary mapped (primary_mapped). The number of records that are counted as primary and are marked as mapped (!0x4).
  • Primary duplicate (primary_duplicate). The number of records that are counted as primary and are marked as duplicate (0x400).

Primary and segmented-only metrics

Past this point, only records that are designated as primary and marked as segmented (0x01) are counted towards the following metrics.

  • Paired (paired). The number of records that are designated as primary and marked as segmented (0x01).
  • Read 1 (read_1). The number of records that are designated as primary, marked as segmented (0x01), and marked as being the first record within a segment (0x40).
  • Read 2 (read_2). The number of records that are designated as primary, marked as segmented (0x01), and marked as being the last record within a segment (0x80).

Primary, segmented, and mapped-only metrics

Past this point, only records that are designated as primary, marked as segmented (0x01), and marked as mapped (!0x04) are counted towards the following metrics.

  • Proper pair (proper_pair). The number of records that are designated as primary, marked as segmented (0x01), marked as mapped (!0x04), and properly aligned (0x2).
  • Singleton (singleton). The number of records that are designated as primary, marked as segmented (0x01), marked as mapped (!0x04), and marked as mate is unmapped (0x08).

Primary, segmented, mapped, and mate is mapped-only metrics

Past this point, only records that are designated as primary, marked as segmented (0x01), marked as mapped (!0x04), and whose mate is marked as mapped (!0x08) are counted towards the following metrics.

  • Mate mapped (mate_mapped). The number of records that are designated as primary, marked as segmented (0x01), marked as mapped (!0x04), and marked as mate is mapped (!0x08).

Primary, segmented, mapped, mate is mapped, and mate is mapped to a different sequence-only metrics

Past this point, only records that are designated as primary, marked as segmented (0x01), marked as mapped (!0x04), whose mate is marked as mapped (!0x08), and whose mate is mapped to a different sequence are counted towards the following metrics.

  • Mate mapped with reference sequence mismatch (mate_reference_sequence_id_mismatch). The number of records that are designated as primary, marked as segmented (0x01), marked as mapped (!0x04), and marked as mate is mapped (!0x08), but where the sequence id that the mate is mapped to differs from that of the record being examined.

  • Mate mapped with reference sequence mismatch (high-quality) (mate_reference_sequence_id_mismatch_hq). The number of records that are designated as primary, marked as segmented (0x01), marked as mapped (!0x04), and marked as mate is mapped (!0x08), but where the sequence id that the mate is mapped to differs from that of the record being examined and the mapping quality of the current record is greater than 5.
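The nested filters described in this section can be sketched as a single pass in which each "past this point" boundary becomes an early `continue`. This is a minimal illustration under stated assumptions, not the tool's actual code: the function name `count_records` and the `(flag, ref_id, mate_ref_id, mapq)` tuple layout are hypothetical stand-ins for a real alignment record.

```python
from collections import Counter

# SAM flag bits referenced in this section.
SEGMENTED, PROPER_PAIR, UNMAPPED, MATE_UNMAPPED = 0x1, 0x2, 0x4, 0x8
FIRST, LAST, SECONDARY = 0x40, 0x80, 0x100
DUPLICATE, SUPPLEMENTARY = 0x400, 0x800


def count_records(records):
    """Tally record metrics from (flag, ref_id, mate_ref_id, mapq) tuples."""
    c = Counter()
    for flag, ref_id, mate_ref_id, mapq in records:
        # Unconstrained metrics.
        c["total"] += 1
        if flag & UNMAPPED:
            c["unmapped"] += 1
        if flag & DUPLICATE:
            c["duplicate"] += 1
        if flag & SECONDARY:
            c["secondary"] += 1
            continue
        if flag & SUPPLEMENTARY:
            c["supplementary"] += 1
            continue
        # Primary-only metrics.
        c["primary"] += 1
        if not flag & UNMAPPED:
            c["primary_mapped"] += 1
        if flag & DUPLICATE:
            c["primary_duplicate"] += 1
        if not flag & SEGMENTED:
            continue
        # Primary and segmented-only metrics.
        c["paired"] += 1
        if flag & FIRST:
            c["read_1"] += 1
        if flag & LAST:
            c["read_2"] += 1
        if flag & UNMAPPED:
            continue
        # Primary, segmented, and mapped-only metrics.
        if flag & PROPER_PAIR:
            c["proper_pair"] += 1
        if flag & MATE_UNMAPPED:
            c["singleton"] += 1
            continue
        # Mate is mapped as well.
        c["mate_mapped"] += 1
        if ref_id != mate_ref_id:
            c["mate_reference_sequence_id_mismatch"] += 1
            if mapq > 5:
                c["mate_reference_sequence_id_mismatch_hq"] += 1
    return c


# Made-up records for illustration only.
example = [
    (0x1 | 0x2 | 0x40, 0, 0, 60),  # proper pair, mate on the same sequence
    (0x1 | 0x80, 0, 1, 30),        # mate on a different sequence, MAPQ > 5
    (0x1 | 0x40 | 0x8, 0, -1, 60), # singleton: mate is unmapped
    (0x100, 0, 0, 0),              # secondary: skipped after designation
]
counts = count_records(example)
print(counts["paired"], counts["singleton"], counts["mate_mapped"])
```

Writing the filters as early exits keeps each metric's preconditions implicit in its position in the loop, mirroring how the section headings narrow the set of records.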

Cigar

Cigar metrics are generally pileups of Cigar operations for every record in the file.

  • Read one cigar ops (read_one_cigar_ops). Pileup of the Cigar operations for all read ones.
  • Read two cigar ops (read_two_cigar_ops). Pileup of the Cigar operations for all read twos.
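As a rough illustration of what a Cigar pileup might look like, the sketch below tallies how often each operation kind appears across a set of Cigar strings. It assumes the pileup counts occurrences of each operation rather than summing operation lengths; that is an assumption, since this page does not specify which is reported. The name `cigar_pileup` is hypothetical.

```python
import re
from collections import Counter

# One Cigar element: a length followed by an operation code from the SAM spec.
CIGAR_ELEMENT = re.compile(r"(\d+)([MIDNSHP=X])")


def cigar_pileup(cigars):
    """Count occurrences of each Cigar operation kind (assumed semantics)."""
    pileup = Counter()
    for cigar in cigars:
        for _length, op in CIGAR_ELEMENT.findall(cigar):
            pileup[op] += 1
    return pileup


# Made-up Cigar strings for illustration only.
read_one_cigar_ops = cigar_pileup(["10M2I5M", "3S12M"])
print(dict(read_one_cigar_ops))
```

In practice one such tally would be kept for read ones and another for read twos, selecting on the 0x40 and 0x80 flags respectively.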

Summary

Summary metrics are generally percentages that are of interest to users. The current set of summary metrics includes:

  • Duplication percentage (duplication_pct). The percentage of records that are marked as duplicate (0x400) in the file.
  • Unmapped percentage (unmapped_pct). The percentage of records that are marked as unmapped (0x04) in the file. This also allows one to trivially calculate the mapped percentage.
  • Mate reference sequence mismatch percentage (mate_reference_sequence_id_mismatch_pct). The number of records counted as "Mate mapped with reference sequence mismatch" divided by the total number of records in the file, expressed as a percentage.
  • Mate reference sequence mismatch percentage (high-quality) (mate_reference_sequence_id_mismatch_hq_pct). The number of records counted as "Mate mapped with reference sequence mismatch (high-quality)" divided by the total number of records in the file, expressed as a percentage.
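The summary percentages can be derived from the record counts roughly as follows. This sketch assumes every percentage uses the total record count as its denominator. The page states this explicitly only for the two mismatch percentages, so treat it as an assumption for the duplication and unmapped percentages. The helper names `pct` and `summarize`, and returning 0.0 for an empty file, are illustrative choices.

```python
def pct(part: int, total: int) -> float:
    """Return part / total as a percentage, or 0.0 when total is zero."""
    return 100.0 * part / total if total else 0.0


def summarize(records: dict) -> dict:
    """Compute summary percentages from a dict of record counts."""
    total = records["total"]
    return {
        "duplication_pct": pct(records["duplicate"], total),
        "unmapped_pct": pct(records["unmapped"], total),
        "mate_reference_sequence_id_mismatch_pct": pct(
            records["mate_reference_sequence_id_mismatch"], total
        ),
        "mate_reference_sequence_id_mismatch_hq_pct": pct(
            records["mate_reference_sequence_id_mismatch_hq"], total
        ),
    }


# Made-up counts for illustration only.
summary = summarize({
    "total": 200,
    "duplicate": 20,
    "unmapped": 10,
    "mate_reference_sequence_id_mismatch": 4,
    "mate_reference_sequence_id_mismatch_hq": 2,
})
mapped_pct = 100.0 - summary["unmapped_pct"]
print(summary["duplication_pct"], mapped_pct)
```

The last line shows the mapped percentage obtained as 100 minus the unmapped percentage.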