Benchmarks of LCA functions #65
Comments
Exceptions are expensive. Have you tried not using them? I'm having a really hard time understanding what these functions are doing. Why is a tree necessary? I don't fully understand the data types here, but assuming taxa = [['a', 'b', 'c', 'd'],
['a', 'b', 'c', 'e'],
['a', 'b', 'x']] then why not do something like:
It's also not obvious why a
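For fixed-rank lineages such as the `taxa` example above, one way to sketch the idea is to take the longest common prefix of the lineages. This is only an illustration under that assumption; `lca_by_prefix` is a hypothetical helper, not a Woltka function and not necessarily the snippet originally proposed.

```python
def lca_by_prefix(taxa):
    """Return the longest common prefix of a list of lineages.

    Hypothetical helper: assumes every lineage lists ranks in the same
    order, from the root down.
    """
    prefix = []
    # zip(*taxa) walks the lineages rank by rank and stops at the shortest one
    for ranks in zip(*taxa):
        if len(set(ranks)) != 1:
            break  # lineages diverge at this rank
        prefix.append(ranks[0])
    return prefix


taxa = [['a', 'b', 'c', 'd'],
        ['a', 'b', 'c', 'e'],
        ['a', 'b', 'x']]
print(lca_by_prefix(taxa))  # ['a', 'b']
```

The last element of the returned prefix is the LCA. This only works when position in the list corresponds to rank, which is the assumption being questioned in the replies.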
@wasade: Thank you for examining my code and proposing your suggestion! Very helpful! The variable Now assume we did
Therefore, the length of a lineage cannot be used to determine LCA. We once discussed this question in your office. I was suggesting that flexible levels are preferable for network-like classification systems as well as some non-taxonomy systems. That being said, Woltka's design is compatible with fixed lineages. When using the same GG-style taxonomy file as input, SHOGUN and Woltka free-rank classification produce exactly the same result. Finally, the cost of exceptions: I also have the same impression that exceptions are expensive. Therefore wherever it is possible to replace To my knowledge, Python's After some intense optimization efforts, the runtime of Woltka has shrunk significantly in the upgrade branch, and the runtimes of no classification (gOTU) and free-rank classification are now very close. Therefore I think that further optimizing this part is probably not the most urgent task.
Okay, thanks @qiyunzhu. These functions are very difficult to follow... It may be advantageous to improve the readability through judicious comments. It looks like what may help is reversing try/excepts are faster than a conditional if the exception block is very infrequently encountered. It seems surprising that these are infrequent here given the code. They also have the effect of complicating the interpretation of the code.
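The trade-off between `try`/`except` and a conditional can be measured directly. Below is a minimal sketch, assuming a made-up 20-element lineage and two hypothetical helpers (`index_try`, `index_check`); it is not Woltka code, and the timings will vary by machine and Python version.

```python
import timeit

# Made-up lineage, used only for timing
lineage = [f"taxon{i}" for i in range(20)]


def index_try(lst, item):
    """Look up an index, relying on the exception for the miss case."""
    try:
        return lst.index(item)
    except ValueError:
        return None


def index_check(lst, item):
    """Look up an index, testing membership first (scans the list twice on a hit)."""
    return lst.index(item) if item in lst else None


# Time both helpers for a frequent-hit and a frequent-miss scenario
for label, item in (("hit", "taxon10"), ("miss", "absent")):
    for func in (index_try, index_check):
        elapsed = timeit.timeit(lambda: func(lineage, item), number=50_000)
        print(f"{label:4} {func.__name__:12} {elapsed:.3f} s")
```

Running both the "hit" and "miss" cases shows how the ranking depends on how often the exception is actually raised.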
@wasade Thank you! Your suggestion of reversing the lineage sounds interesting. Let me give it a try. I benchmarked the program on a set of SHOGUN / Bowtie2 alignment files generated from CAMISIM metagenomes against the WoL database. I think this is close to the most realistic scenario. I can add more details to docstrings and comments to make the logic clear for those important functions.
Following @wasade's suggestions in PR #50 as well as other thoughts, I tested multiple options of the `find_lca` function. Benchmarks were performed on a Bowtie2 alignment file of 100,000 lines against the WoL database.
Summary: It appears that the original version is almost the best. No option I could think of notably improved performance. This includes the dict solution (f3), which appears to be the slowest, likely due to the overhead of converting the lineage list into a dict. Therefore, in the end I didn't make any change to the function.
Original version:
Alternative way of separating a random element from the remaining elements in a frozenset:
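A minimal sketch of two ways this separation could be written, assuming a non-empty frozenset. The helper names are hypothetical and are not taken from Woltka.

```python
def split_unpack(taxa):
    """Separate one arbitrary element using starred unpacking.

    The remaining elements are materialized as a list.
    """
    first, *rest = taxa
    return first, rest


def split_iter(taxa):
    """Separate one arbitrary element using an explicit iterator.

    The remaining elements stay in the (lazy) iterator, so no copy is made.
    """
    it = iter(taxa)
    first = next(it)
    return first, it


taxa = frozenset({'a', 'b', 'c'})
first, rest = split_unpack(taxa)
print(first, sorted(rest))
```

Which element comes out first is arbitrary because sets are unordered, which is fine for LCA since every element has to be visited anyway.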
Use index range instead of slice for list:
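As a generic illustration of this idea (an assumption about the intent, not the benchmarked code), scanning part of a list by index avoids the copy that a slice creates. Both helpers below are hypothetical.

```python
def first_shared_slice(lineage, start, targets):
    """Scan lineage[start:] for the first taxon in targets (slice copies the tail)."""
    for taxon in lineage[start:]:
        if taxon in targets:
            return taxon
    return None


def first_shared_range(lineage, start, targets):
    """Same scan using an index range, so no temporary list is created."""
    for i in range(start, len(lineage)):
        taxon = lineage[i]
        if taxon in targets:
            return taxon
    return None


lineage = ['a', 'b', 'c', 'd']
print(first_shared_range(lineage, 1, {'c', 'd'}))  # c
```

For short lineages the copy is cheap, so any gain from the index-range form may be offset by the extra indexing done in Python bytecode.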
Convert list to a dict of taxon to index to accelerate lookup:
Note: I benchmarked several options for converting a list to a dict:
- v1: `{taxon: i for i, taxon in enumerate(lst)}`: 78.6 ms ± 480 µs
- v2: `dict(zip(subs, range(len(subs))))`: 50 ms ± 179 µs
- v3: `dict(map(reversed, enumerate(lst)))`: 250 ms ± 2.51 ms

So I chose v2.
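A comparison of this kind can be reproduced with `timeit`. The sketch below assumes a made-up list of 1,000 taxon names; the list size and repeat count are arbitrary choices, so absolute numbers will differ from those quoted above.

```python
import timeit

# Made-up input: 1,000 unique taxon names
lst = [f"taxon{i}" for i in range(1000)]


def v1(lst):
    """Dict comprehension over enumerate."""
    return {taxon: i for i, taxon in enumerate(lst)}


def v2(lst):
    """zip the list with a range of indices."""
    return dict(zip(lst, range(len(lst))))


def v3(lst):
    """Reverse each (index, taxon) pair produced by enumerate."""
    return dict(map(reversed, enumerate(lst)))


# All three variants should build the same taxon-to-index mapping
assert v1(lst) == v2(lst) == v3(lst)

for func in (v1, v2, v3):
    elapsed = timeit.timeit(lambda: func(lst), number=200)
    print(f"{func.__name__}: {elapsed:.4f} s for 200 runs")
```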
Use a loop instead of `list.index` to avoid error raising:
Use list comprehension to replace loop:
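These last two variants can be sketched as follows. This is a minimal illustration with hypothetical helper names, not the benchmarked functions; both return `None` instead of raising `ValueError` when the item is absent.

```python
def index_loop(lst, item):
    """Find the index of item with an explicit loop; no exception on a miss."""
    for i, taxon in enumerate(lst):
        if taxon == item:
            return i
    return None


def index_comprehension(lst, item):
    """Same lookup with a list comprehension (always scans the whole list)."""
    hits = [i for i, taxon in enumerate(lst) if taxon == item]
    return hits[0] if hits else None


lineage = ['a', 'b', 'c', 'd']
print(index_loop(lineage, 'c'))           # 2
print(index_comprehension(lineage, 'x'))  # None
```

Note that `list.index` runs its scan in C, so a Python-level loop may well be slower on hits even though it avoids the exception on misses.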