Restructured coverage functions #1
Conversation
Fixed typo
Upgraded to version 0.1.3
woltka/coverage.py
Outdated

    for sample, (_, subque) in rmaps.items():
        for subjects in subque:
            for subject, ranges in subjects.items():
                cover = add_cover((sample, subject), [0])
What is the meaning of [0]? Is this empty coverage? Wouldn't that be []?
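Reading the diff, `[0]` does not appear to be empty coverage but a running counter seated at index 0, with the range data appended after it. A minimal sketch of that apparent layout (`add_cover` reconstructed here as a `setdefault` wrapper; names and shapes are inferred from the diff, not confirmed):

```python
# Hypothetical reconstruction of the per-(sample, subject) cover layout
# implied by the diff: index 0 holds the count of ranges appended since
# the last compression; indices 1+ hold the ranges themselves.
covers = {}

def add_cover(key, default):
    """Return the cover list for key, creating it with default if absent."""
    return covers.setdefault(key, default)

cover = add_cover(("S1", "G1"), [0])
cover[0] += 2                    # two ranges are about to be appended
cover.extend([(5, 9), (12, 20)])  # hypothetical range endpoints
```

Under this reading, `[0]` is a freshly initialized cover whose counter is zero and whose range tail is empty, which is why it is not simply `[]`.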
woltka/coverage.py
Outdated

    cover = add_cover((sample, subject), [0])
    count = cover[0] + len(ranges)
    if count >= chunk:
        cover[:] = [0] + merge_ranges(cover[1:] + ranges)
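For context, a generic sketch of what an interval-merging helper like `merge_ranges` typically does: sort the ranges and fuse any that overlap or touch. This is a common pattern, not necessarily woltka's exact implementation (its adjacency handling may differ):

```python
def merge_ranges(ranges):
    """Merge overlapping or adjacent (start, end) ranges into a sorted,
    non-overlapping list. Generic sketch; details may differ in woltka."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1] + 1:
            # Overlaps or touches the previous range: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

For example, `merge_ranges([(5, 9), (1, 4), (8, 12)])` collapses all three into `[(1, 12)]`, which is why periodic compression keeps the cover list small.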
I don't understand the [0] here.
I believe that this syntax will make a copy of the merge_ranges result list that would not be made by cover[:] = merge_ranges(cover[1:] + ranges), but I can't say for sure that Python won't optimize that away.
woltka/coverage.py
Outdated

    cover[0] = count
    cover.extend(ranges)
I believe this syntax would avoid the copy I complain about above.
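The distinction the reviewer is drawing can be demonstrated directly: slice assignment keeps the same list object but first builds a temporary list (here from the `[0] + merged` concatenation) and then copies its elements in, whereas index assignment plus `extend` mutate in place with no intermediate concatenation. A small self-contained illustration:

```python
# Slice assignment vs. in-place update on a cover-style list.
cover = [0, (1, 4)]
before = id(cover)

# Slice assignment: `[0] + merged` builds a throwaway list first,
# whose elements are then copied into cover.
merged = [(1, 9)]
cover[:] = [0] + merged
assert id(cover) == before  # same object, elements replaced

# In-place update: no temporary concatenation is built.
cover[0] = 3
cover.extend([(12, 20)])
assert cover == [3, (1, 9), (12, 20)]
```

CPython does not optimize the temporary away; the concatenated list is materialized before the slice assignment copies from it.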
Comments added. I generally approve the changes, though I'd want the merge algorithm to be credited to me, since it is nearly identical to the zebra filter implementation, and the creation of the new file hides that it stems from my contribution.
I think storing [items_added, range, range, range...] is bad practice; the (named)tuple (items_added, [range, range, range]) would be better, but that isn't my call to make.
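The suggested structure could look roughly like this (a hedged sketch of the suggestion as stated, not the code that was eventually merged): the counter and the ranges live in separate, clearly named fields instead of sharing one list.

```python
from collections import namedtuple

# Hypothetical shape of the suggested (named)tuple structure.
Cover = namedtuple("Cover", ["count", "ranges"])

cover = Cover(count=0, ranges=[])
cover.ranges.extend([(5, 9), (12, 20)])      # the list field stays mutable
cover = cover._replace(count=cover.count + 2)  # the counter does not
assert cover == Cover(2, [(5, 9), (12, 20)])
```

Note that a namedtuple's fields are immutable, so updating the counter requires `_replace`; a plain `(count, list)` tuple or a small class would trade that friction differently.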
Between the chunking of operations and the storage of closures to avoid function lookups, there seems to be a large focus on low-level optimization. That screams to me that this is written in the wrong programming language. Will this all need to be ported to C/C++ at some point?
Hello @dhakim87 Thank you for your review! I have implemented the data structure you suggested, which is neater and retains the same performance. I have also credited you in the docstring. Please kindly review. Thanks!

There is no plan to port to C/C++. I agree that porting would significantly boost performance. However, I am not very familiar with C/C++, and likely will not be, given the time commitment of my other responsibilities. Meanwhile, continuous development and maintenance are important for the life cycle of the software tool. Therefore, I tend to stick to Python. There has been an ongoing discussion with several collaborators about improving the performance of Woltka; they have expertise in computer science, especially hardware acceleration.
Here are some benchmarks of the chunk size (autocompress step size):
Therefore, 5000 may be the best, but this is a small-scale test, so I would keep the current value of 10000 until there is more evidence. I don't think any constant value will work best for all sample-subject pairs. The ideal solution would be an adaptive gradient method that dynamically adjusts the merging rate according to the number of ranges merged in the past several rounds.
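The adaptive idea described above could be sketched roughly as follows. This is entirely hypothetical (no such code exists in the PR): grow the compression interval when merging removes few ranges, and shrink it when merging removes many.

```python
def next_chunk(chunk, n_before, n_after, lo=1000, hi=100000):
    """Hypothetical adaptive step for the autocompress chunk size.

    n_before / n_after are the range counts before and after a merge.
    If compression barely shrank the list, compress less often (bigger
    chunk); if it shrank it a lot, compress more often (smaller chunk).
    """
    shrink = 1 - n_after / n_before  # fraction of ranges removed
    if shrink < 0.25:
        chunk *= 2       # merging is unproductive; defer it
    elif shrink > 0.75:
        chunk //= 2      # merging pays off; do it sooner
    return max(lo, min(hi, chunk))
```

The bounds and thresholds here are placeholders; a real tuner would likely smooth over several rounds rather than react to one.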
@dhakim87 I have found another significant optimization: instead of storing range data as a list of (start, end) tuples, store the endpoints in a single flat, interleaved list. This optimization roughly halved memory (373228 to 192644), and it also reduced runtime slightly (0:49.06 to 0:46.99). Note that these numbers are for the entire Woltka run. I plan to wait until you merge this PR, then submit another PR for you to review. What do you think?
Okay, I'll merge this PR now. I think an interleaved start/end list is a fine optimization.
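The interleaved layout trades a list of small tuples for one flat list of endpoints, removing the per-tuple object overhead that likely accounts for the memory savings reported above. A rough, hedged illustration (exact byte counts vary by Python build):

```python
import sys

# Same three ranges, two layouts.
tupled = [(5, 9), (12, 20), (30, 41)]     # list of (start, end) tuples
flat = [5, 9, 12, 20, 30, 41]             # flat interleaved start/end list

# Each tuple is a separate heap object with its own header; the flat
# list avoids all of those objects entirely.
tuple_overhead = sum(sys.getsizeof(t) for t in tupled)
assert tuple_overhead > 0  # the per-tuple cost the flat layout drops

# Reading range i from the flat layout:
i = 1
start, end = flat[2 * i], flat[2 * i + 1]
assert (start, end) == tupled[i]
```

The trade-off is that consumers must index in stride-2 pairs instead of unpacking tuples, which is why it warranted its own follow-up PR.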
Hello @dhakim87 I have carefully read and tested your code, and modified it based on my understanding. The performance of the code has been significantly improved. For a small test run:
Modifications include:
- Added `range_mapper`, which is slightly modified from `plain_mapper` by returning range information. The old `plain_mapper` remains the same. This ensures that when the user chooses not to report coverage, the program performance remains the same.
- Replaced the `SortedRangeList` class with functions, to save class instantiation and member access overhead.

However, there is still big room for improving the performance, considering that the analysis takes only 20 sec if one chooses not to report coverage. I will explore further tweaks.
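The relationship between the two mappers can be sketched as follows. This is purely illustrative: the names match the discussion, but the signatures and yielded shapes are hypothetical, not woltka's actual API.

```python
# Hypothetical sketch: range_mapper mirrors plain_mapper but
# additionally yields the aligned coordinate range on each subject.
def plain_mapper(alignments):
    for query, subject, start, end in alignments:
        yield query, subject

def range_mapper(alignments):
    # Same iteration; only the yielded payload differs, so users who
    # skip coverage reporting (and thus use plain_mapper) pay no cost.
    for query, subject, start, end in alignments:
        yield query, subject, (start, end)

hits = [("q1", "g1", 5, 9), ("q2", "g1", 12, 20)]
assert list(plain_mapper(hits)) == [("q1", "g1"), ("q2", "g1")]
assert list(range_mapper(hits))[0] == ("q1", "g1", (5, 9))
```

Keeping the two as separate functions, rather than one function with a flag checked per record, matches the stated goal of leaving the no-coverage path untouched.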
I will appreciate your feedback! Once this PR is merged into your repo, your PR to my repo will be automatically updated, and I will review it accordingly.