
GC Content metrics

Clay McLeod edited this page Sep 24, 2022 · 4 revisions

The GC Content metrics facet reports statistics on the GC content of records within the file. The report is delivered under the gc_content key within the results.json file. You can examine the output of this facet with jq:

cat results.json | jq .gc_content

Overview

A histogram representing 0% to 100% GC content per record is initialized with a counter per percentage point (all bins start at zero). For every record in the file, the following happens:

  • If the record is marked as a duplicate record (0x0400) or a secondary record (0x0100), the ignored_flags counter is incremented by one and the record is ignored. Note that unmapped records are considered here, as we want to include any non-mapped records that might be introduced due to contamination.
  • If the sequence is too short (< 100 nucleobases), including it could bias the GC content distribution. Thus, the ignored_too_short counter is incremented by one and the record is ignored.
  • A random selection of 100 nucleobases is taken from the record for evaluation. The GC content of that selection is calculated as a percentage from 0% to 100%, and the respective bin within the histogram is incremented by one. Further, the processed counter is incremented by one.
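The per-record steps above can be sketched in Python. This is a minimal illustration, not the tool's actual implementation: the names new_metrics and process_record are hypothetical, and the sketch assumes that the "random selection" is a contiguous 100-base window starting at a random offset, which the text above does not specify.

```python
import random

DUPLICATE = 0x0400  # SAM flag: PCR or optical duplicate
SECONDARY = 0x0100  # SAM flag: secondary alignment
WINDOW = 100        # nucleobases evaluated per record


def new_metrics():
    """Create empty counters plus a histogram with one bin per percentage point."""
    return {
        "histogram": [0] * 101,  # bins for 0% through 100%
        "ignored_flags": 0,
        "ignored_too_short": 0,
        "processed": 0,
    }


def process_record(metrics, flags, sequence, rng=random):
    """Apply the filtering and binning steps to a single record."""
    # Duplicate and secondary records are ignored; unmapped records are not.
    if flags & (DUPLICATE | SECONDARY):
        metrics["ignored_flags"] += 1
        return
    # Records shorter than the window are ignored.
    if len(sequence) < WINDOW:
        metrics["ignored_too_short"] += 1
        return
    # Assumption: sample one contiguous window at a random offset.
    start = rng.randint(0, len(sequence) - WINDOW)
    window = sequence[start:start + WINDOW].upper()
    gc = sum(1 for base in window if base in "GC")
    pct = round(100 * gc / WINDOW)
    metrics["histogram"][pct] += 1
    metrics["processed"] += 1
```

Because the window is exactly 100 bases long, the GC count within it maps directly onto a percentage bin, which may be why the threshold and the sample size are both 100.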

Outputs

This facet has the following top-level keys:

Key          Description
histogram    Contains a histogram representing the number of records that have 0% to 100% GC content.
records      Contains metrics related to simple record counting for this facet. Includes details on how many records were processed versus how many were ignored and for what reason.
nucleobases  Contains metrics related to simple nucleobase counting for this facet.
summary      Contains summary statistics regarding this QC facet, most notably the mean GC content for this file.

Histogram

As described above, the histogram spans a range of 0% to 100% GC content for a particular record. The number within each bin represents the number of records that (a) passed the filtering criteria outlined above and (b) had a GC content of that particular percentage.
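As a rough illustration of how the histogram can be read, the Python sketch below derives a few descriptive values from it. It assumes the histogram is available as a plain list of 101 counts indexed by percentage point; the actual layout inside results.json may differ, and histogram_stats is a hypothetical helper name. The mean computed here is over the sampled selections, so it need not match the value reported in the summary.

```python
def histogram_stats(histogram):
    """Summarize a list of per-percentage-point record counts (index = GC %)."""
    total = sum(histogram)
    if total == 0:
        return None  # no records passed the filters
    # Weighted mean of the bin positions.
    mean = sum(pct * count for pct, count in enumerate(histogram)) / total
    # Bin holding the most records (lowest percentage wins ties).
    mode = max(range(len(histogram)), key=lambda pct: histogram[pct])
    return {"records": total, "mean_pct": mean, "mode_pct": mode}
```

The sum of all bins should equal the number of processed records, since each record that passes filtering increments exactly one bin.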

Records

The records field contains metrics regarding how many records were processed, how many records were ignored, and the reason for each ignored record.

Nucleobases

The nucleobases field counts up G/C, A/T, and other nucleobases contained within the file. This is used in the final determination for the mean GC content of the file.

Summary

Contains summary statistics for the file:

  • GC Content Percentage (gc_content_pct). Mean GC content for all records contained within this file.
  • Records ignored because of flags percentage (ignored_flags_pct). The percentage of records that were filtered because of disqualifying flags.
  • Records ignored because they were too short percentage (ignored_too_short_pct). The percentage of records that were ignored because the length of the read was too short.
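One plausible way to derive these three values from the counters is sketched below in Python. The formulas are assumptions, not taken from the tool: the sketch treats the GC percentage as G/C bases over all counted bases, and the two ignored percentages as fractions of all records seen (processed plus ignored). The function and parameter names are hypothetical.

```python
def summarize(gc, at, other, processed, ignored_flags, ignored_too_short):
    """Compute summary percentages from nucleobase and record counters."""
    total_bases = gc + at + other
    total_records = processed + ignored_flags + ignored_too_short

    def pct(part, whole):
        # Guard against empty input rather than dividing by zero.
        return 100.0 * part / whole if whole else 0.0

    return {
        "gc_content_pct": pct(gc, total_bases),
        "ignored_flags_pct": pct(ignored_flags, total_records),
        "ignored_too_short_pct": pct(ignored_too_short, total_records),
    }
```

For example, summarize(gc=40, at=55, other=5, processed=80, ignored_flags=15, ignored_too_short=5) yields a gc_content_pct of 40.0 under these assumptions.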