mmseqs easy-search: Results differ slightly based on used cores #277
Comments
Only the order of results within one query block is conserved. Blocks of different queries will appear in a more or less random order, depending on the order in which the threads process them. You can ensure a consistent order of the result file by using However, a different number of result lines seems like a bug. Would you mind checking whether this bug has been resolved in the meantime in the newest release 11? Can you reproduce the bug with the two FASTA files in the example folder (QUERY.fasta and DB.fasta)? By the way, if you want a set of stickers (see https://twitter.com/thesteinegger/status/1201076220957315074), send me your address to milot at mirdita de. |
lol nice offer with the stickers, I might email you :D. I can NOT reproduce this with the example data. In my run, this only had a minor impact on the result (5 out of 160k lines), so I think the example data is not big enough. It gets worse though. |
Does the issue also happen in the newest release? Sending us the data via email or similar is also not possible? |
Let me see later if I can install the latest release (currently I have it via conda, no clue if that's updated yet). I'll make another run and capture the output, the log is not a problem luckily. |
Okay, the standard output is here https://gist.github.com/bastian-wur/1b978870a88c3ead69f51e77e65b4696 Maybe of importance, not sure: The used database is pretty redundant. Not sure if there are many identical matches in there, but many which are pretty close to each other. |
Conda should be up to date, so you should be able to just update the package. |
Could you send me the intermediate/result files? These contain neither sequences nor identifiers.
|
I'm running this right now. I updated to the latest release, and that did not fix the problem :/ . |
Thank you that was very helpful. |
Could you also please rerun |
With the current version?
|
I have problems constructing a redundant database that reproduces this issue. Something is odd with this part of the output:
Can you share the approximate composition/construction of the database so I can try to build something similar from public data? To improve the reliability of your own search, I would recommend clustering your database at a high seq.id./coverage to collapse some of the redundancy. This will speed up the search and will probably also avoid the possibly problematic code branch I suspect is causing the issue. |
What info would you need? |
Thank you, that seems to be the hint I've needed. |
Amazing, the prefilter result changes with the number of threads. I think we should actually be able to investigate the issue now. |
Ah, phew, that's good to hear :). |
I just ran the same search as previously, and now the output is identical 👍 . Just for the fun of it, I am also running the reverse search (database against genome) to see what happens, but that is still running, and I'll probably not be able to check it in the coming days, so I'm already reporting the first part. |
Especially for this search you might want to run the new exhaustive search mode available through the MMseqs2 has a limitation on the number of reported prefiltering hits (by default 300 with We developed this exhaustive search mode to be more efficient with disk use, but trading off some runtime for that. |
Sorry, as mentioned, I am having some time issues right now. I don't understand what you want me to test right now. I can't find the --slice-search parameter anywhere, or maybe I'm looking wrong. If you could clarify, then I'll test this too. |
I don't know your exact use case or whether this is actually important for you. I am just warning you that the way you are currently using MMseqs2 might be a bit problematic with this specific target database. The results you are getting with your current usage might already be completely fine for your current use case. I will just explain what you could do if you want more "complete" results. Furthermore, this is unrelated to the stability problem (that one should now be solved). The very redundant target database you are using is a bit of a weak point in MMseqs2. The The Now you have a target database with many very similar sequences and will run into this case. The effect is that a good (maybe even the best) hit in the target database might not be found, since it's outside the limit given by We have a different search mode that accepts some inefficiency while dealing correctly with very redundant databases, and you can access this mode by passing the |
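The effect described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not MMseqs2's actual implementation: the function `prefilter_then_align`, the tuple layout and all scores are hypothetical. It assumes a first stage that keeps only the top `max_hits` candidates by a cheap prefilter score, followed by a second stage that picks the best alignment among the candidates that survived.

```python
def prefilter_then_align(candidates, max_hits):
    """Toy two-stage search (hypothetical, not MMseqs2 code).

    candidates: list of (target_id, prefilter_score, alignment_score).
    Returns the target_id with the best alignment score among the
    candidates that survive the prefilter cut-off.
    """
    # Stage 1: keep only the top `max_hits` candidates by prefilter score.
    # Ties are broken by target id so the toy example is deterministic.
    kept = sorted(candidates, key=lambda c: (-c[1], c[0]))[:max_hits]
    # Stage 2: among the survivors, report the best alignment.
    best = max(kept, key=lambda c: c[2])
    return best[0]


# A redundant toy database: five near-identical targets that all score
# slightly higher in the prefilter than the target with the best alignment.
candidates = [(f"dup{i:03d}", 50, 80) for i in range(5)]
candidates.append(("best", 49, 95))

hit_with_small_limit = prefilter_then_align(candidates, max_hits=3)
hit_with_large_limit = prefilter_then_align(candidates, max_hits=10)

print(hit_with_small_limit)  # the best target was cut off by the limit
print(hit_with_large_limit)  # the best target survives the prefilter
```

With the small limit, the redundant targets fill every available slot and the best-aligning target never reaches the second stage. With a limit larger than the number of redundant targets, it is found.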
Ah, now I understand, thanks. The --slice-search parameter does not show up in the help of easy-search, so I was confused. And issue closed, I'll send an email regarding the stickers :D. |
Expected Behavior
You run the same command with 1, 2 or more cores, and the output file is the same.
Current Behavior
The output file slightly differs.
For a small benchmark, I ran blast, diamond and mmseqs with 1, 2 and 3 cores. While blast and diamond produce files with identical MD5 values, mmseqs does not. I checked further if maybe the order of the output is different, but the actual output is slightly different.
In 2 cases I get 162855 lines, in one case 162854 lines. In all combinations 162850 lines are identical, the remaining 4-5 lines are not.
All these hits are in the lower identity range.
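A check like the one described above, whether two result files differ in content and not only in line order, could be sketched as follows. This is a minimal sketch, not the script that was actually used: the function names and the toy lines are made up for illustration, and it assumes plain-text result files with one hit per line.

```python
from collections import Counter


def compare_result_lines(lines_a, lines_b):
    """Compare two iterables of result lines as multisets, ignoring order.

    Returns (number of shared lines, lines only in a, lines only in b).
    """
    count_a = Counter(line.rstrip("\n") for line in lines_a)
    count_b = Counter(line.rstrip("\n") for line in lines_b)
    shared = count_a & count_b   # multiset intersection
    only_a = count_a - count_b   # lines missing from b
    only_b = count_b - count_a   # lines missing from a
    return (
        sum(shared.values()),
        sorted(only_a.elements()),
        sorted(only_b.elements()),
    )


def compare_result_files(path_a, path_b):
    """Same comparison, reading the lines from two files on disk."""
    with open(path_a) as file_a, open(path_b) as file_b:
        return compare_result_lines(file_a, file_b)


# Toy data standing in for the results of a 1-core and a 2-core run.
run_1_core = ["q1\tt1\t98.0", "q2\tt5\t31.2", "q2\tt7\t30.1"]
run_2_cores = ["q2\tt5\t31.2", "q1\tt1\t98.0", "q2\tt9\t29.8"]

shared, only_1, only_2 = compare_result_lines(run_1_core, run_2_cores)
print(shared)   # number of lines present in both runs
print(only_1)   # lines that appear only in the 1-core run
print(only_2)   # lines that appear only in the 2-core run
```

Comparing multisets instead of MD5 sums separates a harmless reordering (everything shared, nothing unique) from a real difference in the reported hits.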
Steps to Reproduce (for bugs)
Run a search with 1 or 2 cores.
EDIT: I can't provide the used database, since it's a non-public in-house database.
EDIT2: Does not always seem to happen though :/
Your Environment
MMseqs2 Version: 10.6d92c
Intel(R) Xeon(R) CPU E5-2609 v3 @ 1.90GHz
Ubuntu 18.04.4 LTS