Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mmseqs easy-search: Results differ slighty based on used cores #277

Closed
bastian-wur opened this issue Feb 21, 2020 · 23 comments
Closed

mmseqs easy-search: Results differ slighty based on used cores #277

bastian-wur opened this issue Feb 21, 2020 · 23 comments

Comments

@bastian-wur
Copy link

bastian-wur commented Feb 21, 2020

Expected Behavior

You run the same command with 1, 2 or more cores, and the outputfile is the same.

Current Behavior

The output file slightly differs.

For a small benchmark, I ran blast, diamond and mmseqs with 1, 2 and 3 cores. While blast and diamond produce files with identical MD5 values, mmseqs does not. I checked further if maybe the order of the output is different, but the actual output is slightly different.
In 2 cases I get 162855 lines, in one case 162854 lines. In all combinations 162850 lines are identical, the remaining 4-5 lines are not.
All these hits are in the lower identity range.

Steps to Reproduce (for bugs)

Run a search with 1 or 2 cores.
EDIT: I can't provide the used database, since it's a non-public in-house database.
EDIT2: Does not always seem to happen though :/

Your Environment

MMseqs2 Version: 10.6d92c
Intel(R) Xeon(R) CPU E5-2609 v3 @ 1.90GHz
Ubuntu 18.04.4 LTS

@milot-mirdita
Copy link
Member

milot-mirdita commented Feb 21, 2020

Only the order results in one query block is conserved. Block of different queries will be more or less random due to the order in which they are processed multithreaded.

You can ensure a consistent order of the result file by using search and then convertalis --threads 1 instead of easy-search.

However, a different number of result lines seems like a bug. Would you mind trying if this bug was resolved in the meantime in the newest release 11?

Can you reproduce the bug with the two FASTA files in the example folder (QUERY.fasta and DB.fasta)?

By the way, if you want a set of stickers (see https://twitter.com/thesteinegger/status/1201076220957315074), send me your address to milot at mirdita de.

@bastian-wur
Copy link
Author

lol nice offer with the stickers, I might email you :D.

I can NOT reproduce this with the example data. In my run, this only appeared to be a minor impact on the result (5 out of 160k), so I think the example data is not big enough.

It gets worse though.
I ran the exact same command (besides output and temp folders) again.
The result on 1 core is reproducable between the first and second run.
The result on 2 cores is NOT reproducable between the first and second run, and is not the same as for 1 core.
The same for 3 cores.
The amount of differences is again minor (<10 in 160k), but that it's not reproducable is really worrysome :/.

@milot-mirdita
Copy link
Member

Does the issue also happen in the newest release?
Could you capture the full log of the different runs and upload those?

Sending us the data via email or similar is also not possible?

@bastian-wur
Copy link
Author

Let me see later if I can install the latest release (currently I have it via conda, no clue if that's updated yet).
I don't think I can send the data to you. The database includes unpublished data from various collaboration partners, so not really an option :/.

I'll make another run and capture the output, the log is not a problem luckily.

@bastian-wur
Copy link
Author

Okay, the standard output is here https://gist.github.com/bastian-wur/1b978870a88c3ead69f51e77e65b4696
You can see that "sequence pairs passed the thresholds" is different for all 3 of them.

Maybe of importance, not sure: The used database is pretty redundant. Not sure if there are many identical matches in there, but many which are pretty close to each other.

@milot-mirdita
Copy link
Member

Conda should be up to date, so you should be able to just update the package.

@milot-mirdita
Copy link
Member

Could you send me the intermediate/result files? These do not contain neither sequences nor identifiers.

THREADS=1
mmseqs easy-search /home/bastian/data/benchmark/Dyadobacter_sp__SG02_1855291.fasta /home/bastian/data/benchmark/DF_PP_full100.fa.fixed.mmseqsdb /home/bastian/data/benchmark/mmseqs_test_again/Dyadobacter_sp__SG02_1855291_vs_DF_PP_full100.fa.fixed.1core.mmseqsresult "/home/bastian/data/benchmark/mmseqs_test_again/tmp${THREADS}/" --remove-tmp-files false  --threads "${THREADS}"
tar czvf "mmseqs-intermediate-threads${THREADS}.tar.gz" /home/bastian/data/benchmark/mmseqs_test_again/tmp${THREADS}/latest/result* /home/bastian/data/benchmark/mmseqs_test_again/tmp${THREADS}/latest/search_tmp 

@bastian-wur
Copy link
Author

bastian-wur commented Feb 24, 2020

I'm running this right now.

I updated to the latest issue, and that did not fix the problem :/ .

@bastian-wur
Copy link
Author

@milot-mirdita
Copy link
Member

Thank you that was very helpful.
We will have to investigate whats going wrong, I have a few suspects.

@milot-mirdita
Copy link
Member

Could you also please rerun createindex and post the full output?

@bastian-wur
Copy link
Author

With the current version?

createindex DF_PP_full100.fa.fixed.mmseqsdb tmp 

MMseqs Version:                 11.e1a1c

createindex DF_PP_full100.fa.fixed.mmseqsdb tmp 

MMseqs Version:                 11.e1a1c

indexdb DF_PP_full100.fa.fixed.mmseqsdb DF_PP_full100.fa.fixed.mmseqsdb --seed-sub-mat nucl:nucleotide.out,aa:VTML80.out -k 0 --alph-size 21 --comp-bias-corr 1 --max-seq-len 65535 --max-seqs 300 --mask 1 --mask-lower-case 0 --spaced-kmer-mode 1 -s 7.5 --k-score 0 --check-compatible 0 --search-type 0 --split 0 --split-memory-limit 0 -v 3 --threads 12 

Estimated memory consumption: 8G
Write VERSION (0)
Write META (1)
Write SCOREMATRIX3MER (4)
Write SCOREMATRIX2MER (3)
Write SCOREMATRIXNAME (2)
Write SPACEDPATTERN (23)
Write DBR1INDEX (5)
Write DBR1DATA (6)
Write HDR1INDEX (18)
Write HDR1DATA (19)
Write GENERATOR (22)
Index table: counting k-mers
[=================================================================] 100.00% 1.85M 1m 9s 45ms       
Index table: Masked residues: 9292781
Index table: fill
[=================================================================] 100.00% 1.85M 41s 694ms       
Index statistics
Entries:          932473123
DB size:          5823 MB                                                                                                                                                                                                                                                                                                                                                 
Avg k-mer size:   14.569893                                                                                                                                                                                                                                                                                                                                               
Top 10 k-mers                                                                                                                                                                                                                                                                                                                                                             
    RDWGRP      39946                                                                                                                                                                                                                                                                                                                                                     
    CFVLGR      32839                                                                                                                                                                                                                                                                                                                                                     
    CVRNGS      31368                                                                                                                                                                                                                                                                                                                                                     
    ILTECV      26854                                                                                                                                                                                                                                                                                                                                                     
    APSDIR      20240                                                                                                                                                                                                                                                                                                                                                     
    ESCCGT      19857                                                                                                                                                                                                                                                                                                                                                     
    DGAGDT      18909                                                                                                                                                                                                                                                                                                                                                     
    YQAGTT      18726                                                                                                                                                                                                                                                                                                                                                     
    SSCDAW      18697                                                                                                                                                                                                                                                                                                                                                     
    PNEGGV      18425                                                                                                                                                                                                                                                                                                                                                     
Write ENTRIES (9)                                                                                                                                                                                                                                                                                                                                                         
Write ENTRIESOFFSETS (10)                                                                                                                                                                                                                                                                                                                                                 
Write SEQINDEXDATASIZE (15)                                                                                                                                                                                                                                                                                                                                               
Write SEQINDEXSEQOFFSET (16)                                                                                                                                                                                                                                                                                                                                              
Write SEQINDEXDATA (14)                                                                                                                                                                                                                                                                                                                                                   
Write ENTRIESNUM (12)                                                                                                                                                                                                                                                                                                                                                     
Write SEQCOUNT (13)                                                                                                                                                                                                                                                                                                                                                       
Time for merging to DF_PP_full100.fa.fixed.mmseqsdb.idx: 0h 0m 0s 0ms ```

@milot-mirdita
Copy link
Member

I have problems constructing a redundant database that reproduces this issue.

Something is odd with this part of the output:

320.495474 k-mers per position
1748306 DB matches per sequence
674 overflows

Can you share some approximate composition/construction of the database so I can try to make something from public data?

I improve the reliability of your own search, I would recommend to cluster your database at a high seq.id./coverage to collapse some of the redundancy. This will speed up the search and probably also not go into the possibly problematic code branch I suspect is causing the issue.

@bastian-wur
Copy link
Author

What info would you need?
I just heard from a colleague that you must know the DB, it's the CAZy internal database. We want to replace the blast search for the initial step of our curation with something faster, so that's why we testing mmseqs right now.

@milot-mirdita
Copy link
Member

Thank you, that seems to be the hint I've needed.
I downloaded the CAZymes database from dbCAN and I can get that one to overflow too. Maybe I can reproduce the issue with that database.

@milot-mirdita
Copy link
Member

Amazing, the prefilter result changes with the number of threads. I think we should actually be able to investigate the issue now.

@bastian-wur
Copy link
Author

Ah, phew, that's good to hear :).

@milot-mirdita
Copy link
Member

Commit 14014cd should be stable and safe to use. Could you please test if the instability was resolved on your dataset?

You will find precompiled binaries here (once the CI builds them).

@bastian-wur
Copy link
Author

I just ran the same search as previously, and now the output is identical 👍 .

I am just for the fun of it also running the reserve search (database against genome) to see what happens, but that is still running, and I'll probably not be able to check it in the coming days, therefore already reporting the first part.

@milot-mirdita
Copy link
Member

Especially for this search you might want to run the new exhaustive search mode available through the --slice-search parameter.

MMseqs2 has a limitation on the number of reported prefiltering hits (by default 300 with --max-seqs). Increasing this parameter could explode the disk space use though.

We developed this exhaustive search mode to be more efficient with disk use, but trading off some runtime for that.

@bastian-wur
Copy link
Author

Sorry, as mentioned, I am having some time issues right now.
I checked the search in the reverse way, it's also identical.

I don't understand what you want me to test right now. I can't find the --slice-search parameter anywhere, or maybe I'm looking wrong. If you could clarify, then I'll test this too.

@milot-mirdita
Copy link
Member

I don't know your exact use-case and if this is actually important for you, I am just warning you that how you are currently using MMseqs2, might be a bit problematic with this specific target database. The results you are already getting with your current usage might already by completely fine for your current use-case. I will just explain what you could do if you might want more "complete" results. Furthermore, this is unrelated to the stability problem (that one should now be solved).

The very redundant target database you are using is a bit of a weak point in MMseqs2.
The MMseqs2 prefiltering module reports all the best double-consecutive-kmer matches either down to the --min-ungapped-score threshold or as many as --max-seqs allows.

The --min-ungapped-score criterion removes many hits that could never reach a good E-value anyway.

The --max-seqs parameters prevents a explosion of disk space while accepting the risk of losing some redundant hits. This is not that much of a problem since we normally search against the representatives of a clustered database, which are sufficiently dissimilar to not enter this case.

Now you have a target database with many very similar sequence and will run into this case. The effect is that a good (maybe even the best) hit in the target database might not be found, since its outside the limit given by --max-seqs.

We have a different search mode that accepts some inefficiency, while dealing with correctly with very redundant database and you can access this mode by passing the --slice-search parameter to either easy-search or search.

@bastian-wur
Copy link
Author

Ah, now I understand, thanks.

The --slice-search parameter does not show up in the help of easy-search, so I was confused.
I'll give it a try :).

And issue closed, I'll send an email regarding the stickers :D.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants