convertalis: corrupted unsorted chunks #372

Closed
nick-youngblut opened this issue Nov 11, 2020 · 11 comments

@nick-youngblut

It appears that convertalis fails if the mmseqs search hits database is empty or very small. The hits database is only ~7 KB, so it might be completely empty. The output I'm getting:

convertalis --threads 4 --format-mode 0 --format-output query,target,evalue,pident,alnlen,tlen /ebio/abt3_scratch/nyoungblut/Struo2_255873462447/UniRef50_clst0.9/mmseqs_search/seqs17_db /ebio/abt3_scratch/nyoungblut/Struo2_255873462447/mmseqs_search_db/db /ebio/abt3_scratch/nyoungblut/Struo2_255873462447/UniRef50_clst0.9/mmseqs_search/hits_seqs17_db /ebio/abt3_scratch/nyoungblut/Struo2_255873462447/UniRef50_clst0.9/mmseqs_search/hits_seqs17.tsv

MMseqs Version:        	12.113e3
Substitution matrix    	nucl:nucleotide.out,aa:blosum62.out
Alignment format       	0
Format alignment output	query,target,evalue,pident,alnlen,tlen
Translation table      	1
Gap open cost          	nucl:5,aa:11
Gap extension cost     	nucl:2,aa:1
Database output        	false
Preload mode           	0
Search type            	0
Threads                	4
Compressed             	0
Verbosity              	3

[============================================================Invalid database read for database data file=/ebio/abt3_scratch/nyoungblut/Struo2_255873462447/UniRef50_clst0.9/mmseqs_search/seqs17_db_h, database index=/ebio/abt3_scratch/nyoungblut/Struo2_255873462447/UniRef50_clst0.9/mmseqs_search/seqs17_db_h.index
Invalid database read for database data file=/ebio/abt3_scratch/nyoungblut/Struo2_255873462447/UniRef50_clst0.9/mmseqs_search/seqs17_db_h, database index=/ebio/abt3_scratch/nyoungblut/Struo2_255873462447/UniRef50_clst0.9/mmseqs_search/seqs17_db_h.index
Invalid database read for database data file=/ebio/abt3_scratch/nyoungblut/Struo2_255873462447/UniRef50_clst0.9/mmseqs_search/seqs17_db_h, database index=/ebio/abt3_scratch/nyoungblut/Struo2_255873462447/UniRef50_clst0.9/mmseqs_search/seqs17_db_h.index
Invalid database read for database data file=/ebio/abt3_scratch/nyoungblut/Struo2_255873462447/UniRef50_clst0.9/mmseqs_search/seqs17_db_h, database index=/ebio/abt3_scratch/nyoungblut/Struo2_255873462447/UniRef50_clst0.9/mmseqs_search/seqs17_db_h.index
getData: local id (4294967295) >= db size (6526)
getData: local id (4294967295) >= db size (6526)
getData: local id (4294967295) >= db size (6526)
getData: local id (4294967295) >= db size (6526)
free(): corrupted unsorted chunks

It would be nice if convertalis exited gracefully if the database is empty. Is there even a way to check whether any mmseqs database is empty?

mmseqs version: 12.113e3 (h2d02072_0 bioconda)
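
As an aside, one rough way to check whether an mmseqs database is empty is to look at its .index file, which (assuming the standard MMseqs2 layout) has one tab-separated line per entry: key, data offset, data size. A minimal sketch for the result database above:

# number of entries in the result database (one index line per entry)
wc -l < hits_seqs17_db.index

# total size of the stored result data; only a byte or two per entry
# means essentially every query produced zero hits
awk -F'\t' '{sum += $3} END {print NR, "entries,", sum, "bytes of result data"}' hits_seqs17_db.index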

@milot-mirdita
Member

I tried to reproduce the issue with a completely empty database, but it didn't crash. Could you upload the hits_seqs17_db database?

@nick-youngblut
Author

Thanks for checking so quickly! Attached is the hits_seqs17_db file (and its associated files).

files.zip

@milot-mirdita
Member

Something might be wrong with the header database. hits_seqs17_db has 7002 entries, and I assume the query sequence database also has this number of sequences, but the header database seems to have only 6526 entries (db size (6526)).
How was the query database created?
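
If it helps to verify on your end, those entry counts can be read straight off the .index files (one line per entry); for a consistent run all of these should match:

wc -l seqs17_db.index seqs17_db_h.index hits_seqs17_db.index
# the sequence DB, its _h header DB, and the result DB should all report
# the same count; a mismatch here would explain the "Invalid database read"
# and getData errors above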

@nick-youngblut
Author

It was created with createdb. The query db is attached.

files.zip

@milot-mirdita
Member

Okay, some part of the puzzle is still missing. What was the search command? 7002 entries don't make much sense.

@nick-youngblut
Author

The command was:

mmseqs search --threads 8 -e 1e-3 \
  --max-accept 1 --max-seqs 100 -s 6 \
  --num-iterations 2   --split 0 --split-memory-limit 44G  \
  seqs17_db target_db hits_seqs17_db      mmseqs_search_TMP17

@milot-mirdita
Member

I think I also need the full output of the search. The issue does not seem to be in convertalis but somewhere in the search.

@nick-youngblut
Author

Here's the output from that search job:

align /ebio/abt3_scratch/nyoungblut/Struo2_255873462447/UniRef50_clst0.9/mmseqs_search/seqs17_db /ebio/abt3_scratch/nyoungblut/Struo2_255873462447/mmseqs_search_db/db /ebio/abt3_scratch/nyoungblut/Struo2_255873462447/UniRef50_clst0.9/mmseqs_search_TMP17/874358861699530798/pref_0 /ebio/abt3_scratch/nyoungblut/Struo2_255873462447/UniRef50_clst0.9/mmseqs_search_TMP17/874358861699530798/aln_0 --sub-mat nucl:nucleotide.out,aa:blosum62.out -a 1 --alignment-mode 2 --wrapped-scoring 0 -e 0.001 --min-seq-id 0 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0 --cov-mode 0 --max-seq-len 65535 --comp-bias-corr 1 --realign 1 --max-rejected 2147483647 --max-accept 1 --add-self-matches 0 --db-load-mode 0 --pca 1 --pcb 1.5 --score-bias 0 --gap-open nucl:5,aa:11 --gap-extend nucl:2,aa:1 --zdrop 40 --threads 8 --compressed 0 -v 3

Compute score only
Query database size: 6526 type: Aminoacid
Target database size: 41195879 type: Aminoacid
Calculation of alignments
[=================================================================] 7.00K 0s 8ms
Time for merging to aln_0: 0h 0m 0s 9ms

0 alignments calculated.
0 sequence pairs passed the thresholds (-nan of overall calculated).
0.000000 hits per query sequence.
Time for processing: 0h 0m 3s 593ms
align /ebio/abt3_scratch/nyoungblut/Struo2_255873462447/UniRef50_clst0.9/mmseqs_search_TMP17/874358861699530798/profile_0 /ebio/abt3_scratch/nyoungblut/Struo2_255873462447/mmseqs_search_db/db /ebio/abt3_scratch/nyoungblut/Struo2_255873462447/UniRef50_clst0.9/mmseqs_search_TMP17/874358861699530798/pref_1 /ebio/abt3_scratch/nyoungblut/Struo2_255873462447/UniRef50_clst0.9/mmseqs_search_TMP17/874358861699530798/aln_tmp_1 --sub-mat nucl:nucleotide.out,aa:blosum62.out -a 1 --alignment-mode 2 --wrapped-scoring 0 -e 0.001 --min-seq-id 0 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0 --cov-mode 0 --max-seq-len 65535 --comp-bias-corr 1 --realign 0 --max-rejected 2147483647 --max-accept 1 --add-self-matches 0 --db-load-mode 0 --pca 1 --pcb 1.5 --score-bias 0 --gap-open nucl:5,aa:11 --gap-extend nucl:2,aa:1 --zdrop 40 --threads 8 --compressed 0 -v 3

Compute score, coverage and sequence identity
Query database size: 7002 type: Profile
Target database size: 41195879 type: Aminoacid
Calculation of alignments
[=================================================================] 7.00K 0s 20ms
Time for merging to aln_tmp_1: 0h 0m 0s 10ms

0 alignments calculated.
0 sequence pairs passed the thresholds (-nan of overall calculated).
0.000000 hits per query sequence.
Time for processing: 0h 0m 4s 529ms
mergedbs /ebio/abt3_scratch/nyoungblut/Struo2_255873462447/UniRef50_clst0.9/mmseqs_search_TMP17/874358861699530798/profile_0 /ebio/abt3_scratch/nyoungblut/Struo2_255873462447/UniRef50_clst0.9/mmseqs_search/hits_seqs17_db /ebio/abt3_scratch/nyoungblut/Struo2_255873462447/UniRef50_clst0.9/mmseqs_search_TMP17/874358861699530798/aln_0 /ebio/abt3_scratch/nyoungblut/Struo2_255873462447/UniRef50_clst0.9/mmseqs_search_TMP17/874358861699530798/aln_tmp_1

Merging the results to /ebio/abt3_scratch/nyoungblut/Struo2_255873462447/UniRef50_clst0.9/mmseqs_search/hits_seqs17_db
Time for merging to hits_seqs17_db: 0h 0m 0s 2ms
Time for processing: 0h 0m 0s 19ms
rmdb /ebio/abt3_scratch/nyoungblut/Struo2_255873462447/UniRef50_clst0.9/mmseqs_search_TMP17/874358861699530798/aln_0

Time for processing: 0h 0m 0s 1ms
rmdb /ebio/abt3_scratch/nyoungblut/Struo2_255873462447/UniRef50_clst0.9/mmseqs_search_TMP17/874358861699530798/aln_tmp_1

Time for processing: 0h 0m 0s 1ms

@milot-mirdita
Member

Do you also have the output from the previous steps? Alternatively, clear the temp directory and rerun the command.

At this point the profiles already contain over 7k entries for some reason (the 7.00K in the progress lines above).
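
Concretely, something along these lines (reusing your search command from above):

rm -rf mmseqs_search_TMP17   # drop the stale temporary directory
mmseqs search --threads 8 -e 1e-3 \
  --max-accept 1 --max-seqs 100 -s 6 \
  --num-iterations 2 --split 0 --split-memory-limit 44G \
  seqs17_db target_db hits_seqs17_db mmseqs_search_TMP17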

@nick-youngblut
Author

Yeah, maybe it's due to a stale temp directory. I'm going to use --remove-tmp-files 1 from now on.
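
For reference, that would just be the same search call with the flag added, e.g.:

mmseqs search --threads 8 -e 1e-3 --max-accept 1 --max-seqs 100 -s 6 --num-iterations 2 --split 0 --split-memory-limit 44G --remove-tmp-files 1 seqs17_db target_db hits_seqs17_db mmseqs_search_TMP17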

@milot-mirdita
Member

I added something that should hopefully prevent this from occurring in the future. It should create a new subdirectory in the tmp folder if any input has changed in the meantime.
