15. Benchmarking Gene Detection through Expansion vs. DiscoVary
Warning: this benchmark was conducted using version 1.0 of the software, and results might change slightly if repeated using more recent versions of lsaBGC.
Using the dataset of 132 M. luteus samples described in Salamzade et al. 2022, which were absent from GTDB R202 and recently uploaded to NCBI by our lab, we assessed the overlap in orthologs detected by lsaBGC-AutoExpansion.py from assemblies vs. lsaBGC-DiscoVary.py from raw sequencing reads.
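The kind of overlap comparison described above can be sketched as a set comparison over (sample, homolog group) instances. This is only an illustrative sketch, not code from lsaBGC: the function name `compare_detections`, the tuple representation, and the toy identifiers are all assumptions made for the example.

```python
def compare_detections(expansion, discovary):
    """Compare two sets of (sample, homolog_group) instances.

    Hypothetical helper for illustration; not part of lsaBGC.
    """
    shared = expansion & discovary
    only_expansion = expansion - discovary
    only_discovary = discovary - expansion
    # Percentage of DiscoVary-reported instances that Expansion did not find.
    missed_pct = 100.0 * len(only_discovary) / len(discovary) if discovary else 0.0
    return {
        "shared": len(shared),
        "only_expansion": len(only_expansion),
        "only_discovary": len(only_discovary),
        "missed_by_expansion_pct": missed_pct,
    }


# Toy data (made up): instances called from assemblies vs. from reads.
expansion = {("S1", "HG_1"), ("S1", "HG_2"), ("S2", "HG_1")}
discovary = {("S1", "HG_1"), ("S2", "HG_1"), ("S2", "HG_3")}

result = compare_detections(expansion, discovary)
print(result)
```

Treating each detection as a (sample, homolog group) pair keeps the comparison at the level of instances rather than distinct homolog groups, which is the unit the percentages in this benchmark appear to use.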
Every homolog group identified by lsaBGC-AutoExpansion.py in the draft assemblies for the 132 samples was also reported as present under the loose criteria for gene presence applied in the initial parsing of Bowtie2 alignment results (90% coverage at 1X depth). However, the final set of homolog groups deemed present by lsaBGC-DiscoVary.py excludes homolog groups that are potentially "troublesome" for metagenomic analysis (e.g. multi-copy or associated with mobile genetic elements), because without syntenic information we cannot infer whether, for example, a given transposase is associated with the focal GCF. As a result, only 66.3% of the total homolog group instances identified by lsaBGC-Expansion.py were used by lsaBGC-DiscoVary.py for allelic phasing and reporting of potentially novel SNVs. Conversely, considering the homolog groups found only by lsaBGC-DiscoVary.py and used in the final set of analyses, lsaBGC-Expansion.py misses only 4.1% of the 9,896 homolog groups reported as present by lsaBGC-DiscoVary.py.
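As a rough illustration of the loose presence criterion mentioned above, the sketch below reads "90% coverage at 1X depth" as: at least 90% of a gene's positions are covered by one or more reads. That reading, the function name `is_present`, and the per-base depth lists are assumptions for the example; the actual parsing of Bowtie2 results in lsaBGC-DiscoVary.py may differ.

```python
def is_present(depths, min_depth=1, min_breadth=0.90):
    """Return True if enough positions reach the minimum read depth.

    depths: per-base read depths along a gene (hypothetical input format).
    """
    if not depths:
        return False
    covered = sum(1 for d in depths if d >= min_depth)
    return covered / len(depths) >= min_breadth


# Toy per-base depths (made up) for two 10 bp "genes".
gene_a = [3, 2, 0, 1, 1, 4, 2, 1, 1, 5]  # 9 of 10 positions covered
gene_b = [0, 0, 1, 1, 1, 1, 1, 1, 1, 1]  # 8 of 10 positions covered

print(is_present(gene_a), is_present(gene_b))
```

A threshold on breadth at 1X is deliberately permissive: it asks only that reads touch most of the gene, not that coverage is deep enough for confident variant calling.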
Of the 406 homolog group instances missed by lsaBGC-Expansion.py but detected by lsaBGC-DiscoVary.py, a more detailed follow-up analysis found:
- For 205 instances, no homology was detected between the homolog group consensus sequence and the relevant sample's predicted proteome, suggesting the homolog group was not properly assembled.
- Of the remaining 201 instances with homology detected in the sample's predicted proteome, only 5 corresponded to cases where the closest match to the homolog group was found on the same scaffold as other homologs associated with the same GCF. This suggests that lsaBGC-Expansion.py failed to detect these instances either because they were not syntenically close to other genes or because assembly fragmentation resulted in scaffolds corresponding to blocks of BGCs too small to be detected by our criteria in lsaBGC-Expansion.py (e.g. there must be at least 3 homolog groups per reported segment/scaffold).
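To illustrate how a minimum of 3 homolog groups per scaffold can cause hits on fragmented assemblies to be dropped, here is a minimal sketch. It models only that one example criterion; the function name `scaffolds_passing`, the input format, and the toy identifiers are assumptions, and lsaBGC-Expansion.py applies additional criteria not shown here.

```python
from collections import defaultdict


def scaffolds_passing(hits, min_groups=3):
    """Return scaffolds carrying at least `min_groups` distinct homolog groups.

    hits: iterable of (scaffold, homolog_group) pairs (hypothetical format).
    """
    groups_by_scaffold = defaultdict(set)
    for scaffold, homolog_group in hits:
        groups_by_scaffold[scaffold].add(homolog_group)
    return {s for s, groups in groups_by_scaffold.items() if len(groups) >= min_groups}


# Toy data (made up): a BGC split across two scaffolds by fragmentation.
hits = [
    ("scaffold_1", "HG_1"),
    ("scaffold_1", "HG_2"),
    ("scaffold_1", "HG_3"),
    ("scaffold_2", "HG_4"),
    ("scaffold_2", "HG_5"),
]

passing = scaffolds_passing(hits)
print(passing)
```

In this toy case the two homolog groups on scaffold_2 would go unreported even though the genes are present in the assembly, which mirrors the explanation given for instances detected only from reads.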
Resulting data from this benchmarking can be found in this Google spreadsheet.