
"Can't find a matching hmm library in the database!" when running --query mode from antiSMASH v6 results #67

Open
jolespin opened this issue Mar 24, 2023 · 10 comments

Comments

@jolespin

I'm running version v1.1.1. This issue is an extension of #66.

(VEBA-biosynthetic_env) [jespinoz@exp-15-15 Test]$ bigslice --query  test_output/biosynthetic/intermediate/1__antismash/SRR17458614__CONCOCT__P.2__9/ --query_name SRR17458614__CONCOCT__P.2__9 --program_db_folder /expanse/projects/jcl110/db/veba/VDB_v4/Annotate/BiG-SLiCE/ bigslice_output
pid 1058573's current affinity list: 110
pid 1058573's new affinity list: 110
pid 1058570's current affinity list: 110
pid 1058570's new affinity list: 110
Fetching run details...
Can't find a matching hmm library in the database!
BiG-SLiCE run failed.

Attaching my antiSMASH v6 directory.

SRR17458614__CONCOCT__P.2__9.zip

@jolespin
Author

Just checking in about this. Do you have any suggestions?

@sunitj

sunitj commented Aug 11, 2023

bump
@jolespin did you ever figure out the issue?

@jolespin
Author

No, I haven't been able to get past it, so I haven't been able to use the package. I've posted my data if you want to try it out.

@PannyYi

PannyYi commented Apr 15, 2024

@jolespin did you ever figure out the issue? I encountered it too.

@brilliant2643

I encountered the same issue; does anybody know how to solve this problem?

I used the latest software (v2.0.0) and the latest database for BiG-SLiCE, and I tried putting all the database files into one directory, but it didn't work.
My command is: bigslice --query test/ --n_rank 1 output/ --program_db_folder databases/bigslice-db/bigslice-models -t 20. I also tried: bigslice --query test/ --n_rank 1 test/output/ --program_db_folder databases/bigslice-db/bigslice-models -t 20, but neither worked.

@jolespin
Author

jolespin commented May 22, 2024

@brilliant2643 @PannyYi I was going to incorporate it into my VEBA package but was never able to fix this issue. I gave up on this software in the interim, but I look forward to using/incorporating it once this issue is resolved.

I developed some de novo clustering for antiSMASH BGCs in the VEBA biosynthetic module. Some of the scripts are standalone too.

You can use it like:

# Syntax 1
biosynthetic.py --from_antismash /path/to/antismash_parent_directory -o veba_output/biosynthetic

# Syntax 2
veba --module biosynthetic --params "--from_antismash /path/to/antismash_parent_directory -o veba_output/biosynthetic"

The /path/to/antismash_parent_directory directory looks like this:

antismash_output/
antismash_output/genome_1/[antismash_results_gbk_files]
antismash_output/genome_2/[antismash_results_gbk_files]
antismash_output/genome_.../[antismash_results_gbk_files]
antismash_output/genome_n/[antismash_results_gbk_files]

You can also run antiSMASH with it instead of providing antiSMASH results (check the docs). I'm not sure if it works with your results, but if it does, you will end up with the following files:

  • bgc_clusters.tsv - BGC to BGC nucleotide cluster
  • bgc_protocluster-types.tsv.gz - Summary of BGCs detected organized by type. Also includes summary of BGCs that are NOT on contig edge.
  • bgcs.representative_sequences.fasta.gz - Full length BGC nucleotide cluster representatives
  • component_clusters.tsv - BGC protein to BGC protein cluster
  • components.representative_sequences.faa.gz - BGC protein cluster representatives
  • fasta/[id_genome].faa/fasta.gz - BGC sequences in protein and nucleotide space
  • genbanks/[id_genome]/*.gbk - Genbank formatted antiSMASH results
  • homology.tsv.gz - Diamond results for MIBiG and VFDB
  • identifier_mapping.bgcs.tsv.gz - All of the BGCs in tabular format organized by genome, contig, region, and gene.
  • identifier_mapping.components.tsv.gz - All of the BGC components (i.e., genes in BGC) in tabular format organized by genome, contig, region, and gene.
  • krona.html - HTML showing Krona plot for number of BGCs per protocluster-type.
  • krona.tsv - Data to produce Krona plot
  • prevalence_tables/bgcs.tsv.gz - Genome vs. BGC nucleotide cluster prevalence table
  • prevalence_tables/components.tsv.gz - Genome vs. BGC protein cluster prevalence table

I typically use the prevalence_tables/components.tsv.gz with Jaccard distance and hierarchical clustering, depending on how many BGCs I have; if there are too many, I'll use another clustering algorithm that supports boolean distance metrics like Jaccard.
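As a rough illustration of that workflow (the toy matrix, genome names, and the 0.6 cut threshold below are made up, not VEBA's actual output or recommended settings), a boolean genome-by-component prevalence table can be clustered with Jaccard distance and average-linkage hierarchical clustering like this:

```python
# Hypothetical sketch: cluster genomes by shared BGC protein clusters
# using Jaccard distance + average-linkage hierarchical clustering.
import pandas as pd
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

# Toy boolean prevalence matrix (rows: genomes, cols: BGC protein clusters),
# standing in for prevalence_tables/components.tsv.gz
df = pd.DataFrame(
    [[1, 0, 1, 1],
     [1, 1, 1, 0],
     [0, 1, 0, 1]],
    index=["genome_1", "genome_2", "genome_3"],
).astype(bool)

# Pairwise Jaccard distances between genomes
dist = pdist(df.values, metric="jaccard")

# Average-linkage tree, cut at an arbitrary distance threshold
Z = linkage(dist, method="average")
labels = fcluster(Z, t=0.6, criterion="distance")
print(dict(zip(df.index, labels)))
```

With a real table you would load it via pd.read_csv(..., sep="\t", index_col=0) first; the threshold t is something to tune per dataset.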

Hope this helps. If you want to read more about the methodology, check out the preprint on bioRxiv. The peer-reviewed paper should be coming out soon; it's in the final stages of review right now.

If you want something similar to BiG-SLiCE, I think you can use the BIRCH algorithm in scikit-learn, but I'm not sure exactly how BiG-SLiCE's backend is implemented.
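For reference, a minimal BIRCH run in scikit-learn looks like the sketch below. The feature matrix here is synthetic random data, purely for illustration; this is not how BiG-SLiCE builds its BGC feature vectors, and scikit-learn's Birch works in Euclidean space, not Jaccard.

```python
# Hypothetical sketch: BIRCH clustering on toy "BGC feature vectors".
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)
# Two well-separated synthetic groups (rows: BGCs, cols: features)
X = np.vstack([
    rng.normal(0.0, 0.1, size=(10, 5)),
    rng.normal(5.0, 0.1, size=(10, 5)),
])

# threshold bounds the radius of BIRCH subclusters;
# n_clusters runs a final global clustering over the subcluster centroids
model = Birch(threshold=0.5, n_clusters=2)
labels = model.fit_predict(X)
print(labels)
```

BIRCH's appeal for this use case is that it builds a compact clustering-feature tree in one pass, so it scales to large BGC collections without holding a full distance matrix.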

@brilliant2643


Thanks a lot! I will try it later!

@shuklagyanesh21

shuklagyanesh21 commented Nov 1, 2024

bigslice --query input_bigslice/dataset_1 --query_name scp_87 --n_ranks 7 --program_db_folder ~/Tools/anaconda3/bin/bigslice-models/ ./output_folder

This still gives the same error:

Can't find a matching hmm library in the database!
BiG-SLiCE run failed.

Has anyone figured out the reason?

@ZhangZF1102

All right, after many attempts, I found that this issue can be solved by using BiG-SLiCE v1.1.

@ZhangZF1102


However, note that BiG-SLiCE v1.1 can only recognize results from antiSMASH v6.
