FCS GX output
Expected outputs for fcs.py screen genome:
- FCS-GX summary report (printed to console): Information about the FCS-GX run, including source genome taxonomy information and summary of identified contaminants.
- FCS-GX action report (*.fcs_gx_report.txt): Final contamination report with contaminant cleaning actions. Interpreted by fcs.py clean genome to separate cleaned sequences from contaminants.
- FCS-GX taxonomy report (*.taxonomy.rpt): Intermediate report with the taxonomies assigned to individual sequences.
Expected outputs for fcs.py clean genome:
- FCS-GX cleaning report (printed to console): Information about the contamination cleaning actions taken.
- Separated cleaned and contaminated sequences: Two FASTA files, one containing the cleaned sequences and one containing the contaminant sequences.
A successful fcs.py screen genome run will print the parameters of the run, sequence masking progress, and a contamination summary report:
-----------------------------------------------------------------------------
tax-id : 4932
fasta : /sample-volume/FCS_combo_test.fa
size : 12.18 MiB
split-fa : True
BLAST-div : budding yeasts
gx-div : fung:budding yeasts
w/same-tax: True
bin-dir : /app/bin
gx-db : /app/db/gxdb/gxdb/all.gxi
gx-ver : Nov 27 2023 12:29:26; git:v0.5.0
output : /output-volume//FCS_combo_test.4932.taxonomy.rpt
-----------------------------------------------------------------------------
Collecting masking statistics...
Collected masking stats: 0.0125624 Gbp; 3.36762s; 3.73035 Mbp/s. Baseline: 1.04906
Processed 420 queries, 12.5732Mbp in 4.35433s. (2.88751Mbp/s); num-jobs:120
Species : None
Asserted div : fung:budding yeasts
Inferred primary-divs : ['fung:budding yeasts', 'fung:ascomycetes']
Corrected primary-divs : ['fung:budding yeasts', 'fung:ascomycetes']
Putative contaminant divs : ['prok:g-proteobacteria', 'anml:primates']
Aggregate coverage : 100%
Minimum contam. coverage : 20%
-----------------------------------------------------------------------------
fcs_gx_report.txt contamination summary:
----------------------------------------
seqs bases
----- ----------
TOTAL 405 404339
----- ----- ----------
prok:g-proteobacteria 202 201923
anml:primates 201 200894
virs:eukaryotic viruses 1 1000
anml:nematodes 1 522
-----------------------------------------------------------------------------
fcs_gx_report.txt action summary:
---------------------------------
seqs bases
----- ----------
TOTAL 405 404339
----- ----- ----------
EXCLUDE 401 400522
FIX 2 1922
TRIM 2 1895
-----------------------------------------------------------------------------
A final report of recommended actions from FCS-GX is provided in the file <out-basename>.fcs_gx_report.txt.
The following table lists the column numbers (first column), the corresponding column headers (second column), and an example value for each (third column):
1: seq_id seq_00019
2: start_pos 1
3: end_pos 1000
4: seq_len 1000
5: action EXCLUDE
6: div anml:primates
7: agg_cont_cov 100
8: top_tax_name Homo sapiens
- Column 1: A seq-id (sequence ID) for a whole sequence, as found in the input FASTA.
- Columns 2 and 3: Start and end coordinates for the identified contamination. If only a portion of the sequence is identified as contaminant, these values indicate the range that should be removed.
- Column 4: Length of the entire sequence in Column 1. Only a portion may be identified as contaminant, according to the start_pos and end_pos columns.
- Column 5: The recommended action. Action values are as follows:
  - EXCLUDE: Remove the entire sequence.
  - TRIM: Remove the contaminant range at the beginning or end of the sequence. GenBank generally requires that sequences do not start or end with Ns, so the recommended course of action is to trim off contaminant sequences.
  - FIX: If a contaminant range is found in the middle of a sequence, it should either be hardmasked (converted to Ns) or split into two new sequences if it suggests misassembly.
  - REVIEW: Additional sequences that may be contaminants but with lower signal. In many cases, these should also be treated as contaminant and dropped. The indicated range may be whole or part (i.e. treat as EXCLUDE, FIX, or TRIM).
  - REVIEW_RARE: This category reports prokaryote assemblies contaminated with sequences from other prokaryotes if the total sequence length is under 1% of the genome length. Our analyses indicate that most of this is real contamination, but reporting prokaryote-in-prokaryote contamination is a new feature for GenBank submission and we are therefore phasing in the reporting of such issues. You may adjust the total sequence length threshold by setting the environment variable GX_ACTION_REPORT_PA_SAME_KINGDOM_THRESHOLD=<some number> in the env.txt file explained in the Environment Variables section above.
  - INFO: Chimeras involving sequences that are known to be integrated into host genomes. Currently only Wolbachia integrations into insect genomes are assigned this category. Make a Feature Request Issue if you would like us to consider other cases for this assignment.
- Column 6: The taxonomic division assigned to the contaminant sequence by FCS-GX.
- Column 7: The percentage alignment coverage for the contaminant in the range indicated by the start_pos and end_pos columns. If the range is composed of multiple contigs separated by gaps, these gaps are ignored for computing coverage. Low coverage values often indicate contamination with novel organisms (e.g., novel genera of bacteria). FCS-GX is tuned for high specificity, even in cases of low reported coverage values. See Known Issues for more details.
- Column 8: The taxonomic name of the top contaminant organism identified by FCS-GX.
⚠️ Due to limitations in taxonomic representation for some groups, the actual contaminating species may be a different species, or even belong to a different genus, from the reported top organism.
The following steps will help you parse/interpret the fcs_gx_report.txt output:
Retrieve accessions marked as EXCLUDE:
grep -w EXCLUDE GCA_000222875.2.fcs_gx_report.txt
Retrieve the total number of accessions marked as EXCLUDE:
grep -w EXCLUDE GCA_000222875.2.fcs_gx_report.txt | cut -f 1 | sort -u | wc -l
Retrieve the total number of base pairs marked as EXCLUDE:
grep -w EXCLUDE GCA_000222875.2.fcs_gx_report.txt | awk '{sum+=$3-$2+1}END{print sum}'
You can also replace "EXCLUDE" in the above commands with "FIX", "REVIEW", or "TRIM" to get the corresponding values for those actions.
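The per-action counts can also be gathered in one pass. The following is a minimal Python sketch, not part of FCS-GX itself. It assumes the report is tab-separated with the eight columns described above, that header or comment lines start with "#", and that start_pos and end_pos are inclusive (as in the awk command above). The function name summarize_actions and the second example row are hypothetical.

```python
from collections import defaultdict


def summarize_actions(report_lines):
    """Return {action: (unique_seq_count, contaminant_bases)} from report rows."""
    seq_ids = defaultdict(set)
    bases = defaultdict(int)
    for line in report_lines:
        line = line.rstrip("\n")
        # Assumption: header and comment lines start with '#'.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        seq_id, start, end, action = fields[0], int(fields[1]), int(fields[2]), fields[4]
        seq_ids[action].add(seq_id)
        bases[action] += end - start + 1  # inclusive coordinates
    return {action: (len(seq_ids[action]), bases[action]) for action in seq_ids}


# The first row mirrors the example row in the column table; the second is made up.
example_rows = [
    "seq_00019\t1\t1000\t1000\tEXCLUDE\tanml:primates\t100\tHomo sapiens",
    "seq_demo\t1\t250\t5000\tTRIM\tprok:g-proteobacteria\t90\tEscherichia coli",
]
for action, (n_seqs, n_bases) in sorted(summarize_actions(example_rows).items()):
    print(action, n_seqs, n_bases)
```

To run it on a real report, pass an open file handle instead of the example list, for instance summarize_actions(open("GCA_000222875.2.fcs_gx_report.txt")).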
The initial report from FCS-GX is provided in the file <out-basename>.taxonomy.rpt.
The following table lists the column numbers (first column), the corresponding column headers (second column), and an example value for each (third column):
1: #seq-id seq_00002
2: seq-len 813242
3: (xp,lc,co,n)-len 15154,4875,9700,0,0
4: cvg-by-all 813224
5: sep1 |
6: tax-name-1 Saccharomyces cerevisiae
7: tax-id-1 559292
8: div-1 fung:budding yeasts
9: cvg-by-div-1 813224
10: cvg-by-tax-1 813212
11: score-1 6379
12: sep2 |
13: tax-id-2 27292
14: div-2 fung:budding yeasts
15: cvg-by-div-2 813224
16: cvg-by-tax-2 810666
17: score-2 6362
18: sep3 |
19: tax-id-3 483514
20: div-3 fung:ascomycetes
21: cvg-by-div-3 52115
22: cvg-by-tax-3 6495
23: score-3 167
24: sep4 |
25: tax-id-4 2767002
26: div-4 fung:chytrids
27: cvg-by-div-4 9900
28: cvg-by-tax-4 4270
29: score-4 162
30: sep5 |
31: reserved n/a
32: result primary-div
33: div fung:budding yeasts
34: div_pct_cvg 100
- Column 1: A seq-id (sequence ID). This can be in the following formats:
  - A whole sequence with a hit to a taxonomic division.
    #seq-id OU830638.1
  - A sequence split on runs of 10+Ns. The seq-id includes the start and end for each split range of the sequence formatted as ~start..end.
    #seq-id CH476754.1~1..212539 CH476754.1~212640..216643 CH476754.1~218504..255730
  - A sequence with hits to multiple taxonomic divisions, making it a putative chimeric sequence. The seq-id includes the start and end for each chimeric range of the sequence formatted as ~~start..end.
    #seq-id CR382124.1~~1164..1687942 CR382124.1~~1694735..1696001
  - A split sequence that is also chimeric. The seq-id includes ~start..end~~substart..subend where the subranges are relative to the starting coordinate of the split sequence.
    #seq-id UYJD01000002.1~1709646..1813733~~5112..84751 UYJD01000002.1~1709646..1813733~~100474..101416
Columns 2 and 3: The seq-len (sequence length) and masked-len (masked length) representing the length of the sequence (whole, split, or chimeric) and the masked length of the sequence, respectively. The masked length is a comma-separated tuple corresponding to regions masked on four tracks: transposons (xp), low-complexity (lc), highly-conserved regions (co), Ns (n).
-
Column 4: The cvg-by-all representing the total alignment length found from all sequences in the FCS-GX database.
-
Columns 6-30: The alignment information for a maximum of four sets of tax-ids along with their divisions. FCS-GX prints the taxonomic name (see column 6) for the first set. It also prints the tax-id, division (derived from the “BLAST name” divisions in taxonomy), total alignment length from the division hits or just the specified tax-id, and a score for all four sets. FCS-GX returns information for a maximum of two tax-ids from the same division.
-
Column 31: reserved column
-
Column 32: FCS-GX result. This result can be any one of the following:
Result Description primary-div sequence belongs to division of the input tax-id primary-div(virus) prokaryote viruses in prokaryotes and integrated eukaryote viruses are treated as belonging to the source genome contaminant sequence identified as a contaminant contaminant(human) contaminant identified as likely human from special thresholding parameters contaminant(virus) cross-superkingdom virus hits, or non-chimeric virus sequences in eukaryotes contaminant(cross-kingdom) cross-kingdom contaminants with higher score threshold requirements for reporting contaminant(cross-div) same-kingdom, cross-division contaminants with higher score threshold requirements for reporting same-kingdom-chimeric same-kingdom chimera sequences <10 kbp transposon inconclusive due to high identified transposon content repeat inconclusive because the sequence is highly repeat-specific low-coverage inconclusive due to low coverage inconclusive inconclusive for other reasons -
Column 33: The taxonomic division assigned to the sequence by FCS-GX.
-
Column 34: The percentage alignment coverage for the sequence in the taxonomic division.
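The seq-id formats described for Column 1 can be unpacked programmatically. The sketch below is an illustration only and is not taken from the FCS-GX code. The names SeqRange, parse_seq_id and absolute_range are hypothetical. It also assumes that the chimeric subrange offsets are 1-based relative to the split start, so that an offset of 1 maps to the first base of the split range; verify this against your own data before relying on the computed coordinates.

```python
import re
from typing import NamedTuple, Optional, Tuple


class SeqRange(NamedTuple):
    base_id: str                          # sequence ID as found in the input FASTA
    split: Optional[Tuple[int, int]]      # from ~start..end, if present
    chimeric: Optional[Tuple[int, int]]   # from ~~start..end, if present


_SEQ_ID_RE = re.compile(
    r"^(?P<base>[^~]+)"
    r"(?:~(?P<s1>\d+)\.\.(?P<e1>\d+))?"     # optional split range
    r"(?:~~(?P<s2>\d+)\.\.(?P<e2>\d+))?$"   # optional chimeric range
)


def parse_seq_id(seq_id: str) -> SeqRange:
    """Split a taxonomy.rpt seq-id into its base ID and optional ranges."""
    match = _SEQ_ID_RE.match(seq_id)
    if match is None:
        raise ValueError(f"unrecognized seq-id format: {seq_id!r}")
    split = (int(match["s1"]), int(match["e1"])) if match["s1"] else None
    chimeric = (int(match["s2"]), int(match["e2"])) if match["s2"] else None
    return SeqRange(match["base"], split, chimeric)


def absolute_range(parsed: SeqRange) -> Optional[Tuple[int, int]]:
    """Return coordinates on the whole sequence, or None for a whole-sequence row.

    Assumption: subranges of a split sequence are 1-based offsets from the
    split start.
    """
    if parsed.split and parsed.chimeric:
        offset = parsed.split[0] - 1
        return (parsed.chimeric[0] + offset, parsed.chimeric[1] + offset)
    return parsed.chimeric or parsed.split


for example in (
    "OU830638.1",
    "CH476754.1~212640..216643",
    "CR382124.1~~1164..1687942",
    "UYJD01000002.1~1709646..1813733~~5112..84751",
):
    parsed = parse_seq_id(example)
    print(parsed.base_id, absolute_range(parsed))
```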
The following steps will help you parse/interpret the taxonomy.rpt output:
- Retrieve a list of sequences with at least one contaminant identifier (including chimeras):
cat GCA_000006565.2.taxonomy.rpt | awk -v FS='\t' -v OFS='\t' '$32~/contaminant/{print $1}' | cut -d '~' -f 1 | uniq
- Retrieve the FCS-GX output for all sequences with a mix of contaminant and primary-div ranges (putative chimeric sequences):
cat GCA_000006565.2.taxonomy.rpt | grep primary-div | cut -f 1 | cut -d "~" -f 1 | fgrep -f - GCA_000006565.2.taxonomy.rpt | grep contaminant | cut -f 1 | cut -d "~" -f 1 | fgrep -f - GCA_000006565.2.taxonomy.rpt
- Calculate the percentage of the total genome length classified as primary-div:
cat GCA_000006565.2.taxonomy.rpt | awk -v FS='\t' -v OFS='\t' '($32 == "primary-div"){ pr_div_len += $2 }; 1{ tot_len += $2 } END{ print pr_div_len/tot_len*100 }'
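The putative-chimera lookup above can also be done by grouping rows on the base sequence ID and comparing exact IDs, which avoids partial matches between similar accessions. This is a minimal Python sketch under the assumptions that the report is tab-separated, that header lines start with "#" (as in the #seq-id header), and that the result is in column 32. The function names are hypothetical.

```python
from collections import defaultdict


def results_by_sequence(rpt_lines):
    """Map each base sequence ID to the set of column-32 results seen for it."""
    results = defaultdict(set)
    for line in rpt_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip header and comment lines
        fields = line.split("\t")
        if len(fields) < 32:
            continue  # not a full data row
        base_id = fields[0].split("~", 1)[0]  # drop ~start..end / ~~start..end
        results[base_id].add(fields[31])
    return results


def putative_chimeras(results):
    """Return base IDs that have both primary-div and contaminant ranges."""
    return sorted(
        base_id
        for base_id, seen in results.items()
        if any(r.startswith("primary-div") for r in seen)
        and any(r.startswith("contaminant") for r in seen)
    )
```

For example, putative_chimeras(results_by_sequence(open("GCA_000006565.2.taxonomy.rpt"))) returns the sorted list of base sequence IDs whose ranges carry both kinds of result.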
A successful fcs.py clean genome run will print the summary of cleaning actions:
Applied 405 actions; 402417 bps dropped; 1922 bps hardmasked.
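In this example the cleaning summary can be cross-checked against the action summary printed by the screen step: 400522 (EXCLUDE) + 1895 (TRIM) = 402417 bps dropped, and the 1922 FIX bps are hardmasked. The sketch below parses the summary line and repeats that check. The regular expression is inferred from this single example line, so treat the format as an assumption. The function name is hypothetical.

```python
import re

# Pattern inferred from the one example line above (an assumption, not a spec).
_SUMMARY_RE = re.compile(
    r"Applied (\d+) actions; (\d+) bps dropped; (\d+) bps hardmasked\."
)


def parse_clean_summary(line):
    """Extract the three counts from the clean genome summary line."""
    match = _SUMMARY_RE.search(line)
    if match is None:
        raise ValueError(f"unrecognized summary line: {line!r}")
    actions, dropped, hardmasked = (int(value) for value in match.groups())
    return {"actions": actions, "dropped": dropped, "hardmasked": hardmasked}


# Base counts copied from the action summary in the screen output above.
action_bases = {"EXCLUDE": 400522, "FIX": 1922, "TRIM": 1895}

summary = parse_clean_summary(
    "Applied 405 actions; 402417 bps dropped; 1922 bps hardmasked."
)
print(summary["dropped"] == action_bases["EXCLUDE"] + action_bases["TRIM"])
print(summary["hardmasked"] == action_bases["FIX"])
```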
fcs.py clean genome performs the following actions to separate "clean" from "contaminated" sequences:
- EXCLUDE: whole sequences are removed from clean.fasta and sent to contam.fasta
- TRIM: the beginning or end of the sequence is removed from clean.fasta and sent to contam.fasta
- FIX: the internal contamination range, from start_pos to end_pos, is masked in clean.fasta and sent to contam.fasta
- REVIEW/REVIEW_RARE/INFO: no action taken. Manual review is recommended, followed by conversion to EXCLUDE/TRIM/FIX where appropriate for cleaning
- SPLIT: the sequence in clean.fasta is split at the internal contamination range, from start_pos to end_pos. This action is not assigned automatically by fcs.py screen genome and must be substituted by the user for FIX ranges where appropriate
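To make the effect of each action concrete, the following Python sketch applies a single action to a single sequence string. It is an illustration of the behavior described in the list above, not the fcs.py implementation. It assumes 1-based inclusive coordinates (consistent with the example row where start_pos 1 and end_pos 1000 cover a 1000 bp sequence) and does not handle multiple actions on the same sequence. The function name apply_action is hypothetical.

```python
def apply_action(seq, action, start, end):
    """Return (clean_pieces, contaminant) after applying one action.

    start and end are assumed to be 1-based and inclusive.
    clean_pieces is a list because SPLIT can yield two sequences.
    """
    lo, hi = start - 1, end          # convert to a Python slice
    contaminant = seq[lo:hi]
    if action == "EXCLUDE":
        return [], seq               # the whole sequence is removed
    if action == "TRIM":
        if start != 1 and end != len(seq):
            raise ValueError("TRIM range must touch one end of the sequence")
        return [seq[:lo] + seq[hi:]], contaminant
    if action == "FIX":
        masked = seq[:lo] + "N" * (hi - lo) + seq[hi:]
        return [masked], contaminant  # internal range is hardmasked
    if action == "SPLIT":
        pieces = [piece for piece in (seq[:lo], seq[hi:]) if piece]
        return pieces, contaminant
    if action in {"REVIEW", "REVIEW_RARE", "INFO"}:
        return [seq], ""             # no action taken
    raise ValueError(f"unknown action: {action}")


toy = "AAAACCCCGGGG"
for action, start, end in (("TRIM", 1, 4), ("FIX", 5, 8), ("SPLIT", 5, 8)):
    print(action, apply_action(toy, action, start, end))
```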
Please create an Issue if you encounter any problems.
For all other questions or comments, please contact us at refseq-support@nlm.nih.gov