
domain prediction

Jorge edited this page Dec 19, 2021 · 1 revision

Domain prediction

The protein sequences from all the BGCs found in the input files are stored in a FASTA file (<bgc.fasta>). BiG-SCAPE then uses the hmmscan tool from the HMMER suite to predict domains against the Pfam database. The actual command used is:

hmmscan --cpu 0 --domtblout <bgc.domtable> --cut_tc <path-to-PfamA.hmm> <bgc.fasta>

where the --cut_tc option, according to the official HMMER (3.1b2) documentation:

Use[s] the TC (trusted cutoff) bit score thresholds in the model to set per-sequence (TC1) and per-domain (TC2) reporting and inclusion thresholds. TC thresholds are generally considered to be the score of the lowest-scoring known true positive that is above all known false positives.
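
The hmmscan invocation above could be assembled and launched from Python roughly as follows. This is a minimal sketch, not BiG-SCAPE's own code: the function names `build_hmmscan_command` and `run_hmmscan` are hypothetical, and the sketch assumes that hmmscan is on the PATH and that the Pfam-A HMM file has already been prepared with hmmpress.

```python
import subprocess


def build_hmmscan_command(pfam_hmm, fasta, domtable, cpus=0):
    """Return the hmmscan argument list described in the text.

    Hypothetical helper: the name and signature are assumptions made
    for illustration, only the flags come from the documented command.
    """
    return [
        "hmmscan",
        "--cpu", str(cpus),        # number of worker threads
        "--domtblout", domtable,   # per-domain tabular output file
        "--cut_tc",                # use Pfam trusted cutoffs
        pfam_hmm,
        fasta,
    ]


def run_hmmscan(pfam_hmm, fasta, domtable, cpus=0):
    """Run hmmscan, raising CalledProcessError if it exits non-zero."""
    cmd = build_hmmscan_command(pfam_hmm, fasta, domtable, cpus)
    # The human-readable report on stdout is discarded; only the
    # --domtblout file is needed downstream.
    subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with paths that contain spaces.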

The coordinates used for extracting and handling the domain sequences are the envelope coordinates. Again, from the HMMER guide:

(“env from” and “env to”) define the envelope of the domain’s location on the target sequence. The envelope is almost always a little wider than what HMMER chooses to show as a reasonably confident alignment. As mentioned earlier, the envelope represents a subsequence that encompasses most of the posterior probability for a given homologous domain, even if precise endpoints are only fuzzily inferrable.
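
As an illustration of how the envelope coordinates might be read back, the sketch below parses the whitespace-delimited --domtblout format and slices the domain out of the protein sequence. It is an assumption-laden example rather than BiG-SCAPE's parser: `DomainHit`, `parse_domtable` and `extract_domain` are hypothetical names. The column positions follow the HMMER domain table layout, in which (for hmmscan) the target is the Pfam model, the query is the protein, column 14 holds the per-domain bit score, and columns 20 and 21 hold "env from" and "env to" as 1-based inclusive positions.

```python
from typing import Iterable, List, NamedTuple


class DomainHit(NamedTuple):
    pfam_name: str   # target name (the Pfam model, for hmmscan)
    protein: str     # query name (the protein sequence)
    score: float     # per-domain bit score
    env_from: int    # envelope start, 1-based inclusive
    env_to: int      # envelope end, 1-based inclusive


def parse_domtable(lines: Iterable[str]) -> List[DomainHit]:
    """Parse --domtblout lines, skipping comments and blank lines."""
    hits = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        # The last column is a free-text description that may contain
        # spaces, so split only the first 22 fields off.
        f = line.split(maxsplit=22)
        hits.append(DomainHit(f[0], f[3], float(f[13]), int(f[19]), int(f[20])))
    return hits


def extract_domain(sequence: str, hit: DomainHit) -> str:
    """Return the envelope subsequence of a protein for one hit."""
    # Convert the 1-based inclusive envelope to a Python slice.
    return sequence[hit.env_from - 1 : hit.env_to]
```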

After domain prediction, a filtering step discards overlapping domains based on the per-domain score. Pairs of domains within the same CDS are compared, and filtering is triggered if the overlap fraction of either domain (i.e. overlap in amino acids / domain length) is higher than overlap_cutoff, which is set by the --domain_overlap_cutoff parameter and is 0.1 by default. Of the two overlapping domains, the one with the higher per-domain score is kept.
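
The overlap rule can be sketched as follows. This is a greedy variant written for illustration and is not necessarily the exact procedure BiG-SCAPE implements: domains are visited from highest to lowest score, and a domain is dropped if it overlaps an already-kept domain of the same CDS by more than the cutoff. The `Domain` type and both function names are hypothetical, and coordinates are assumed to be 1-based and inclusive.

```python
from typing import List, NamedTuple


class Domain(NamedTuple):
    cds: str      # identifier of the CDS the domain was found in
    name: str     # domain (Pfam model) name
    start: int    # envelope start, 1-based inclusive
    end: int      # envelope end, 1-based inclusive
    score: float  # per-domain bit score


def overlap_fraction(a: Domain, b: Domain) -> float:
    """Largest of (overlap / len(a)) and (overlap / len(b))."""
    overlap = min(a.end, b.end) - max(a.start, b.start) + 1
    if overlap <= 0:
        return 0.0
    # Dividing by the shorter domain gives the larger of the two fractions,
    # so the test fires if either domain exceeds the cutoff.
    shorter = min(a.end - a.start + 1, b.end - b.start + 1)
    return overlap / shorter


def filter_overlapping(domains: List[Domain], overlap_cutoff: float = 0.1) -> List[Domain]:
    """Keep the higher-scoring domain of each overlapping pair (greedy)."""
    kept: List[Domain] = []
    for dom in sorted(domains, key=lambda d: d.score, reverse=True):
        clashes = any(
            k.cds == dom.cds and overlap_fraction(k, dom) > overlap_cutoff
            for k in kept
        )
        if not clashes:
            kept.append(dom)
    # Return the survivors in positional order within each CDS.
    return sorted(kept, key=lambda d: (d.cds, d.start))
```

Visiting domains in score order makes the result independent of the order in which hits appear in the input, which a plain pairwise pass does not guarantee.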
