FCS GX output

Eric Tvedte edited this page Apr 17, 2024 · 2 revisions

Expected outputs for fcs.py screen genome:

  • FCS-GX summary report (printed to console): Information about the FCS-GX run, including source genome taxonomy information and summary of identified contaminants.
  • FCS-GX action report *.fcs_gx_report.txt: Final contamination report with contaminant cleaning actions. Interpreted by fcs.py clean genome to separate cleaned sequences from contaminants.
  • FCS-GX taxonomy report *.taxonomy.rpt: Intermediate report with the taxonomies assigned to individual sequences.

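Expected outputs for fcs.py clean genome:

  • FCS-GX cleaning report (printed to console): Summary of the cleaning actions applied.
  • Separated cleaned and contaminated sequences clean.fasta and contam.fasta: The cleaned sequences and the contaminant sequences removed from them.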

FCS-GX summary report

A successful fcs.py screen genome run will print the parameters of the run, sequence masking progress, and a contamination summary report:

-----------------------------------------------------------------------------

tax-id    : 4932
fasta     : /sample-volume/FCS_combo_test.fa
size      : 12.18 MiB
split-fa  : True
BLAST-div : budding yeasts
gx-div    : fung:budding yeasts
w/same-tax: True
bin-dir   : /app/bin
gx-db     : /app/db/gxdb/gxdb/all.gxi
gx-ver    : Nov 27 2023 12:29:26; git:v0.5.0
output    : /output-volume//FCS_combo_test.4932.taxonomy.rpt

-----------------------------------------------------------------------------

Collecting masking statistics...
Collected masking stats:  0.0125624 Gbp; 3.36762s; 3.73035 Mbp/s. Baseline: 1.04906

Processed 420 queries, 12.5732Mbp in 4.35433s. (2.88751Mbp/s); num-jobs:120
Species                    : None
Asserted div               : fung:budding yeasts
Inferred primary-divs      : ['fung:budding yeasts', 'fung:ascomycetes']
Corrected primary-divs     : ['fung:budding yeasts', 'fung:ascomycetes']
Putative contaminant divs  : ['prok:g-proteobacteria', 'anml:primates']
Aggregate coverage         : 100%
Minimum contam. coverage   : 20%

-----------------------------------------------------------------------------

fcs_gx_report.txt contamination summary:
----------------------------------------
                                seqs      bases
                               ----- ----------
TOTAL                            405     404339
-----                          ----- ----------
prok:g-proteobacteria            202     201923
anml:primates                    201     200894
virs:eukaryotic viruses            1       1000
anml:nematodes                     1        522

-----------------------------------------------------------------------------

fcs_gx_report.txt action summary:
---------------------------------
                                seqs      bases
                               ----- ----------
TOTAL                            405     404339
-----                          ----- ----------
EXCLUDE                          401     400522
FIX                                2       1922
TRIM                               2       1895

-----------------------------------------------------------------------------

FCS-GX action report

A final report of recommended actions from FCS-GX is provided in the file <out-basename>.fcs_gx_report.txt.

The following table lists column numbers (first column) with the corresponding column headers (second column) and an example value for each (third column):

1:      seq_id        seq_00019 
2:      start_pos     1 
3:      end_pos       1000 
4:      seq_len       1000 
5:      action        EXCLUDE 
6:      div           anml:primates 
7:      agg_cont_cov  100 
8:      top_tax_name  Homo sapiens
  • Column 1: A seq-id (sequence ID) for a whole sequence, as found in the input FASTA.

  • Columns 2 and 3: Start and end coordinates for the identified contamination. If only a portion of the sequence is identified as contaminant, these values indicate the range that should be removed.

  • Column 4: Length of the entire sequence in Column 1. Only a portion may be identified as contaminant, according to the start_pos and end_pos columns.

  • Column 5: The recommended action. Action values are as follows:

    • EXCLUDE: Remove the entire sequence.
    • TRIM: Remove the contaminant range at the beginning or end of the sequence. GenBank generally requires that sequences do not start or end with Ns, so the recommended course of action is to trim off contaminant sequences rather than mask them.
    • FIX: If a contaminant range is found in the middle of a sequence, it should either be hardmasked (converted to Ns) or split into two new sequences if it suggests misassembly.
    • REVIEW: Additional sequences that may be contaminants but with lower signal. In many cases, these should also be treated as contaminant and dropped. The indicated range may be whole or part (i.e. treat as EXCLUDE, FIX, or TRIM).
    • REVIEW_RARE: This category reports prokaryote assemblies contaminated with sequences from other prokaryotes if the total sequence length is under 1% of the genome length. Our analyses indicate that most of this is real contamination, but reporting prokaryote-in-prokaryote contamination is a new feature for GenBank submission and we are therefore phasing in the reporting of such issues. You may adjust the total sequence length threshold by setting an environment variable GX_ACTION_REPORT_PA_SAME_KINGDOM_THRESHOLD=<some number> in the env.txt file explained in the Environment Variables section above.
    • INFO: Chimeras involving sequences that are known to be integrated into host genomes. Currently only Wolbachia integrations into insect genomes are assigned this category. Make a Feature Request Issue if you would like us to consider other cases for this assignment.
  • Column 6: The taxonomic division assigned to the contaminant sequence by FCS-GX.

  • Column 7: The percentage alignment coverage for the contaminant in the range indicated by the start_pos and end_pos columns. If the range is composed of multiple contigs separated by gaps, these gaps are ignored for computing coverage. Low coverage values often indicate contamination with novel organisms (e.g., novel genera of bacteria). FCS-GX is tuned for high specificity, even in cases of low reported coverage values. See Known Issues for more details.

  • Column 8: The taxonomic name of the top contaminant organism identified by FCS-GX.

    ⚠️ Due to limitations in taxonomic representation for some groups, the actual contaminating organism may belong to a different species or even a different genus from the reported top organism.

⚠️ See the rules under Separated cleaned and contaminated sequences for how users can modify the action report.

Interpreting Outputs

The following steps will help you parse/interpret the fcs_gx_report.txt output:

Retrieve accessions marked as EXCLUDE:

grep -w EXCLUDE GCA_000222875.2.fcs_gx_report.txt

Retrieve the total number of accessions marked as EXCLUDE:

grep -w EXCLUDE GCA_000222875.2.fcs_gx_report.txt | cut -f 1 | sort -u | wc -l

Retrieve the total number of base pairs marked as EXCLUDE:

grep -w EXCLUDE GCA_000222875.2.fcs_gx_report.txt | awk '{sum+=$3-$2+1}END{print sum}'

You can also replace "EXCLUDE" in the above commands with "FIX", "REVIEW", or "TRIM" to get the corresponding values for those actions.
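
The per-action commands above can also be combined into a single pass over the report. The following is a minimal sketch, assuming the report is tab-delimited, that header lines start with `#`, and that the columns follow the order described above. The function name `summarize_actions` is hypothetical and not part of FCS-GX. Unlike the `sort -u` command above, it counts report rows rather than unique seq-ids.

```shell
# Hypothetical helper (not part of FCS-GX): tally rows and bases per action.
# Assumes a tab-delimited report whose header lines start with '#'.
summarize_actions() {
    awk -F'\t' '
        /^#/ { next }                        # skip header lines
        NF >= 5 {
            rows[$5]++                       # report rows per action (column 5)
            bases[$5] += $3 - $2 + 1         # range length from start_pos and end_pos
        }
        END {
            for (a in rows) printf "%s\t%d\t%d\n", a, rows[a], bases[a]
        }
    ' "$1" | sort
}

# Usage: summarize_actions GCA_000222875.2.fcs_gx_report.txt
```

Each output line holds the action, the number of report rows, and the number of bases in the reported ranges, separated by tabs.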

FCS-GX taxonomy report

The initial report from FCS-GX is provided in the file <out-basename>.taxonomy.rpt.

The following table lists column numbers (first column) with the corresponding column headers (second column) and an example value for each (third column):

1:      #seq-id          seq_00002
2:      seq-len          813242
3:      (xp,lc,co,n)-len 15154,4875,9700,0,0
4:      cvg-by-all       813224
5:      sep1             |
6:      tax-name-1       Saccharomyces cerevisiae
7:      tax-id-1         559292
8:      div-1            fung:budding yeasts
9:      cvg-by-div-1     813224
10:     cvg-by-tax-1     813212
11:     score-1          6379
12:     sep2             |
13:     tax-id-2         27292
14:     div-2            fung:budding yeasts
15:     cvg-by-div-2     813224
16:     cvg-by-tax-2     810666
17:     score-2          6362
18:     sep3             |
19:     tax-id-3         483514
20:     div-3            fung:ascomycetes
21:     cvg-by-div-3     52115
22:     cvg-by-tax-3     6495
23:     score-3          167
24:     sep4             |
25:     tax-id-4         2767002
26:     div-4            fung:chytrids
27:     cvg-by-div-4     9900
28:     cvg-by-tax-4     4270
29:     score-4          162
30:     sep5             |
31:     reserved         n/a
32:     result           primary-div
33:     div              fung:budding yeasts
34:     div_pct_cvg      100
  • Column 1: A seq-id (sequence ID). This can be in the following formats:

    • A whole sequence with a hit to a taxonomic division.

      #seq-id
      OU830638.1
      
    • A sequence split on runs of 10+Ns. The seq-id includes the start and end for each split range of the sequence formatted as ~start..end.

      #seq-id
      CH476754.1~1..212539
      CH476754.1~212640..216643
      CH476754.1~218504..255730
      
    • A sequence with hits to multiple taxonomic divisions, making it a putative chimeric sequence. The seq-id includes the start and end for each chimeric range of the sequence formatted as ~~start..end.

      #seq-id
      CR382124.1~~1164..1687942
      CR382124.1~~1694735..1696001
      
    • A split sequence that is also chimeric. The seq-id includes ~start..end~~substart..subend where the subranges are relative to the starting coordinate of the split sequence.

      #seq-id
      UYJD01000002.1~1709646..1813733~~5112..84751
      UYJD01000002.1~1709646..1813733~~100474..101416
      
  • Columns 2 and 3: The seq-len (sequence length) and (xp,lc,co,n)-len (masked length), giving the length of the sequence (whole, split, or chimeric) and the masked length of the sequence, respectively. The masked length is a comma-separated tuple of the lengths masked on four tracks: transposons (xp), low-complexity (lc), highly-conserved regions (co), Ns (n).

  • Column 4: The cvg-by-all representing the total alignment length found from all sequences in the FCS-GX database.

  • Columns 6-30: The alignment information for a maximum of four sets of tax-ids along with their divisions. FCS-GX prints the taxonomic name (see column 6) for the first set. It also prints the tax-id, division (derived from the “BLAST name” divisions in taxonomy), total alignment length from the division hits or just the specified tax-id, and a score for all four sets. FCS-GX returns information for a maximum of two tax-ids from the same division.

  • Column 31: reserved column

  • Column 32: FCS-GX result. This result can be any one of the following:

    Result                      Description
    primary-div                 sequence belongs to division of the input tax-id
    primary-div(virus)          prokaryote viruses in prokaryotes and integrated eukaryote viruses are treated as belonging to the source genome
    contaminant                 sequence identified as a contaminant
    contaminant(human)          contaminant identified as likely human from special thresholding parameters
    contaminant(virus)          cross-superkingdom virus hits, or non-chimeric virus sequences in eukaryotes
    contaminant(cross-kingdom)  cross-kingdom contaminants with higher score threshold requirements for reporting
    contaminant(cross-div)      same-kingdom, cross-division contaminants with higher score threshold requirements for reporting
    same-kingdom-chimeric       same-kingdom chimera sequences <10 kbp
    transposon                  inconclusive due to high identified transposon content
    repeat                      inconclusive because the sequence is highly repeat-specific
    low-coverage                inconclusive due to low coverage
    inconclusive                inconclusive for other reasons
  • Column 33: The taxonomic division assigned to the sequence by FCS-GX.

  • Column 34: The percentage alignment coverage for the sequence in the taxonomic division.
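
The seq-id formats listed under Column 1 can be decomposed into a base sequence ID and a coordinate range. The following is a minimal sketch, assuming coordinates are 1-based and inclusive, so that a chimeric subrange of a split sequence maps to whole-sequence coordinates as split start + substart - 1. That coordinate convention and the function name `parse_seq_id` are assumptions, not part of FCS-GX.

```shell
# Hypothetical helper (not part of FCS-GX): decompose column-1 seq-ids.
# Prints: base seq-id, kind, start, stop (tab-separated; "." for whole sequences).
# Assumption: coordinates are 1-based and inclusive.
parse_seq_id() {
    awk -F'\t' -v OFS='\t' '
        /^#/ || NF == 0 { next }             # skip header and blank lines
        {
            id = $1; kind = "whole"; start = "."; stop = "."; offset = 0
            rel = ""
            chim = index(id, "~~")           # chimeric subrange marker
            if (chim > 0) {
                rel = substr(id, chim + 2)
                id = substr(id, 1, chim - 1)
            }
            tilde = index(id, "~")           # split-range marker
            if (tilde > 0) {
                split(substr(id, tilde + 1), a, /\.\./)
                id = substr(id, 1, tilde - 1)
                kind = "split"; start = a[1]; stop = a[2]
                offset = a[1] - 1            # shift for subranges relative to the split start
            }
            if (rel != "") {
                split(rel, b, /\.\./)
                kind = (kind == "split") ? "split+chimeric" : "chimeric"
                start = offset + b[1]; stop = offset + b[2]
            }
            print id, kind, start, stop
        }
    ' "$1"
}

# Usage: parse_seq_id GCA_000006565.2.taxonomy.rpt
```

The `~~` marker is checked before the single `~` so that a purely chimeric seq-id is not mistaken for a split one.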

Interpreting Outputs

The following steps will help you parse/interpret the taxonomy.rpt output:

  1. Retrieve a list of sequences with at least one contaminant identifier (including chimeras):
cat GCA_000006565.2.taxonomy.rpt | awk -v FS='\t' -v OFS='\t' '$32~/contaminant/{print $1}' |  cut -d '~' -f 1 | uniq
  2. Retrieve the FCS-GX output for all sequences with a mix of contaminant and primary-div ranges (putative chimeric sequences):
cat GCA_000006565.2.taxonomy.rpt | grep primary-div | cut -f 1 | cut -d "~" -f 1 | fgrep -f - GCA_000006565.2.taxonomy.rpt | grep contaminant | cut -f 1 | cut -d "~" -f 1 | fgrep -f - GCA_000006565.2.taxonomy.rpt
  3. Calculate the percentage of the total genome length classified as primary-div:
cat GCA_000006565.2.taxonomy.rpt | awk -v FS='\t' -v OFS='\t' '($32 == "primary-div"){ pr_div_len += $2 }; 1{ tot_len += $2 } END{ print pr_div_len/tot_len*100 }'
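
To see how the whole report breaks down across the result values in column 32, the rows can be tallied in one pass. The following is a minimal sketch, assuming the report is tab-delimited with the result in column 32 and the sequence length in column 2, as described above, and that header lines start with `#`. The function name `summarize_results` is hypothetical and not part of FCS-GX.

```shell
# Hypothetical helper (not part of FCS-GX): tally rows and total length per result.
# Assumes a tab-delimited taxonomy.rpt with the result in column 32.
summarize_results() {
    awk -F'\t' '
        /^#/ || NF < 32 { next }             # skip header lines and short rows
        {
            rows[$32]++                      # rows per result value
            len[$32] += $2                   # summed seq-len per result value
        }
        END {
            for (r in rows) printf "%s\t%d\t%d\n", r, rows[r], len[r]
        }
    ' "$1" | sort
}

# Usage: summarize_results GCA_000006565.2.taxonomy.rpt
```

Results such as primary-div and primary-div(virus) are reported on separate lines because the tally uses the exact value in column 32.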

FCS-GX cleaning report

A successful fcs.py clean genome run will print the summary of cleaning actions:

Applied 405 actions; 402417 bps dropped; 1922 bps hardmasked.
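
As a quick consistency check, the totals in this line can be compared with the action summary printed by fcs.py screen genome in the example above. The sketch below assumes that the dropped bases are the sum of the EXCLUDE and TRIM bases and that the hardmasked bases are the FIX bases, which matches the example numbers. The variable names are illustrative.

```shell
# Values copied from the example action summary (EXCLUDE, TRIM, FIX bases).
exclude_bases=400522
trim_bases=1895
fix_bases=1922

# Assumption: dropped = EXCLUDE + TRIM; hardmasked = FIX.
dropped=$((exclude_bases + trim_bases))
echo "dropped=${dropped} hardmasked=${fix_bases}"
```

With the example values this reproduces the 402417 bps dropped and 1922 bps hardmasked reported above.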

Separated cleaned and contaminated sequences

fcs.py clean genome performs the following actions to separate "clean" from "contaminated" sequences:

  • EXCLUDE : whole sequences are removed from clean.fasta and sent to contam.fasta
  • TRIM : the contaminant range at the beginning or end of the sequence is removed from clean.fasta and sent to contam.fasta
  • FIX : the internal contamination range defined by start_pos and end_pos is hardmasked in clean.fasta and sent to contam.fasta
  • REVIEW/REVIEW_RARE/INFO : no action taken. Manual review is recommended, followed by conversion to EXCLUDE/TRIM/FIX where appropriate for cleaning
  • SPLIT : the sequence in clean.fasta is split at the internal contamination range defined by start_pos and end_pos. This action is not assigned automatically by fcs.py screen genome; the user must substitute it for FIX where appropriate
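
Substituting SPLIT for FIX means editing the action column of the action report before running fcs.py clean genome. The following is a minimal sketch, assuming the report is tab-delimited with the seq-id in column 1 and the action in column 5, as described for the action report. The function name `fix_to_split` is hypothetical and not part of FCS-GX.

```shell
# Hypothetical helper (not part of FCS-GX): change FIX to SPLIT for one sequence.
# $1 = action report, $2 = seq_id whose FIX rows should become SPLIT.
# Writes the edited report to stdout; all other rows pass through unchanged.
fix_to_split() {
    awk -F'\t' -v OFS='\t' -v target="$2" '
        !/^#/ && $1 == target && $5 == "FIX" { $5 = "SPLIT" }
        { print }
    ' "$1"
}

# Usage: fix_to_split in.fcs_gx_report.txt seq_00042 > edited.fcs_gx_report.txt
```

Writing to a new file rather than editing in place keeps the original report available for comparison. The file names and seq-id in the usage line are placeholders.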