
Restructured coverage functions #1

Merged · 11 commits · Sep 1, 2021

Conversation

@qiyunzhu (Author) commented:

Hello @dhakim87, I have carefully read and tested your code, and modified it based on my understanding. The performance has improved significantly. For a small test run:

  • Old: 1:01.56 elapsed, 620644 maxresident
  • New: 0:49.26 elapsed, 372700 maxresident

Modifications include:

  • Decided that the ordinal mapper does not need a separate coverage calculation, because it already performs coordinate operations. This simplifies the logic.
  • Created a new mapper, range_mapper, which is a slight modification of plain_mapper that additionally returns range information. The original plain_mapper remains unchanged. This ensures that when the user chooses not to report coverage, program performance is unaffected.
  • Replaced the SortedRangeList class with plain functions, saving class-instantiation and member-access overhead (a brief sketch follows this list).
  • Changed the autocompress frequency from 100k to 10k. I haven't thoroughly benchmarked this parameter, though.
  • Made multiple minor tweaks to improve performance.
  • Reformatted code (e.g., variable names) to ensure a consistent coding style.
  • Added unit tests. Coverage remains at 100%.
  • Merged v0.1.3 code.
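
Purely as an illustration of the class-to-function change mentioned above (names are hypothetical and do not correspond to the actual Woltka code), replacing a small class with module-level functions over plain lists removes per-object instantiation and attribute lookups from the hot loop:

# Hypothetical sketch -- not the real Woltka code.

# Class-based: one object per (sample, subject) pair, attribute access per call.
class SortedRangeList:
    def __init__(self):
        self.ranges = []              # list of (start, end) tuples

    def add_range(self, start, end):
        self.ranges.append((start, end))

# Function-based: plain lists stored in a dict, no instantiation overhead.
def add_range(ranges, start, end):
    """Append a (start, end) tuple to a plain list of ranges."""
    ranges.append((start, end))

covers = {}                           # (sample, subject) -> list of ranges
add_range(covers.setdefault(('S1', 'G1'), []), 100, 250)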

However, there is still considerable room for improving performance, considering that the analysis takes only about 20 seconds when coverage reporting is disabled. I will explore further tweaks.

I will appreciate your feedback! Once this PR is merged into your repo, your PR to my repo will be updated automatically, and I will review it accordingly.

for sample, (_, subque) in rmaps.items():
    for subjects in subque:
        for subject, ranges in subjects.items():
            cover = add_cover((sample, subject), [0])
dhakim87 (Owner):

What is the meaning of [0]? Is this empty coverage? Wouldn't that be []?

cover = add_cover((sample, subject), [0])
count = cover[0] + len(ranges)
if count >= chunk:
    cover[:] = [0] + merge_ranges(cover[1:] + ranges)
dhakim87 (Owner):

I don't understand the [0] here.

dhakim87 (Owner):

I believe that this syntax will make a copy of the merge_ranges result list that would not be made by cover[:] = merge_ranges(cover[1:] + ranges), but I can't say for sure that Python won't optimize that away.

Comment on lines 101 to 102
cover[0] = count
cover.extend(ranges)
dhakim87 (Owner):

I believe this syntax would avoid the copy I complained about above.
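
To summarize the thread above in one self-contained snippet: below is a minimal sketch of the [items_added, range, range, ...] convention being discussed, with a simplified stand-in for merge_ranges and with add_cover assumed to be a locally bound dict.setdefault; the real code may differ in details.

def merge_ranges(ranges):
    """Simplified stand-in: merge overlapping/adjacent (start, end) tuples."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

chunk = 10000                          # autocompress threshold
covers = {}
add_cover = covers.setdefault          # assumed local binding of the dict method

# cover = [count_since_last_merge, (start, end), (start, end), ...]
cover = add_cover(('S1', 'G1'), [0])   # [0] means zero ranges added so far
ranges = [(5, 20), (18, 30)]
count = cover[0] + len(ranges)
if count >= chunk:
    # builds a new merged list, then copies it into cover via slice assignment
    cover[:] = [0] + merge_ranges(cover[1:] + ranges)
else:
    # in-place update, avoiding the intermediate copy discussed above
    cover[0] = count
    cover.extend(ranges)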

@dhakim87 (Owner) left a comment:

Comments added. I generally approve the changes, though I'd want the merge algorithm to be credited to me, since it is nearly identical to the zebra filter implementation, and the creation of the new file obscures that it stems from my contribution.

I think storing [items_added, range, range, range...] is bad practice; the (named)tuple (items_added, [range, range, range]) would be better, but that isn't my call to make.

Between the chunking of operations and the storage of closures to avoid function lookups, there seems to be a large focus on low-level optimization. That screams to me that this is written in the wrong programming language. Will this all need to be ported to C/C++ at some point?
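
For readers unfamiliar with the micro-optimization referred to here, "avoiding function lookups" typically means binding a method to a local name once, before a hot loop, so CPython does not repeat the attribute lookup on every iteration. A small, self-contained illustration (not Woltka code):

import timeit

def with_lookup(n=100_000):
    d = {}
    for i in range(n):
        d.setdefault(i % 100, []).append(i)    # attribute lookup every iteration

def with_local(n=100_000):
    d = {}
    add = d.setdefault                         # method bound once, before the loop
    for i in range(n):
        add(i % 100, []).append(i)

print(timeit.timeit(with_lookup, number=20))
print(timeit.timeit(with_local, number=20))    # typically a bit faster on CPython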

@qiyunzhu (Author):

Hello @dhakim87, thank you for your review! I have implemented the data structure you suggested, which is neater and retains the same performance. I have also credited you in the docstring. Please kindly review. Thanks!

There is no plan to port to C/C++. I agree that doing so would significantly boost performance. However, I am not as familiar with C/C++, and likely will not become so given the time commitment of my other responsibilities. Meanwhile, continuous development and maintenance are important for the life cycle of a software tool. Therefore, I tend to stick to Python.

There has been an ongoing discussion with several collaborators about improving the performance of Woltka. They have expertise in computer science, especially hardware acceleration.

@qiyunzhu (Author):

Here are some benchmarks of the chunk size (autocompress step size):

 Chunk    Runtime    Memory (maxresident)
  1000    0:58.17    260764
  2000    0:52.82    282048
  5000    0:49.19    329532
 10000    0:49.57    372804
 20000    0:50.49    429624
 50000    0:53.74    538292
100000    0:57.23    654132

Therefore, 5000 may be the best choice, but this is a small-scale test, so I will keep the current value of 10000 until there is more evidence.

I think that no constant value will work best for all sample-subject pairs. The ideal solution would be an adaptive gradient method that dynamically adjusts the merging rate according to the number of ranges merged in the past several rounds.
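
This adaptive approach is not implemented in the PR; the following is only a rough sketch of one possible scheme, with hypothetical names, where the threshold shrinks when the last merge removed many ranges and grows when it removed few:

def adapt_chunk(chunk, n_before, n_after, low=1000, high=100_000):
    """Adjust the autocompress threshold based on the payoff of the last merge.

    n_before / n_after: numbers of ranges before and after the merge.
    If merging collapsed many ranges, merge sooner next time (smaller chunk);
    if it collapsed few, merge later (larger chunk).
    """
    shrink = 1 - n_after / n_before if n_before else 0.0
    if shrink > 0.5:
        chunk = max(low, chunk // 2)
    elif shrink < 0.1:
        chunk = min(high, chunk * 2)
    return chunk

# example: a merge that collapsed 10000 ranges into 1500 suggests merging sooner
print(adapt_chunk(10000, 10000, 1500))   # -> 5000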

@qiyunzhu (Author):

@dhakim87 I have found another significant optimization: instead of storing range data as a list of tuples, [(start, end), (start, end), (start, end)...], I flattened this structure into [start, end, start, end, start, end...]. This saves the compute overhead and memory cost of creating many tuples. When the ranges need to be merged, the flat list is zipped up in pairs.
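
For clarity, here is a small sketch of what "zipped up in pairs" can look like for the flat representation (a simplified merge, not the exact Woltka implementation):

def merge_flat_ranges(flat):
    """Merge a flat, interleaved [start, end, start, end, ...] list of ranges."""
    it = iter(flat)
    pairs = sorted(zip(it, it))          # re-pair consecutive values into (start, end)
    merged = []
    for start, end in pairs:
        if merged and start <= merged[-1] + 1:
            merged[-1] = max(merged[-1], end)    # extend the previous range's end
        else:
            merged.extend((start, end))          # start a new range
    return merged

print(merge_flat_ranges([5, 20, 18, 30, 100, 120]))   # -> [5, 30, 100, 120]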

This optimization reduced memory by roughly half (373228 to 192644 maxresident), and it also reduced runtime slightly (0:49.06 to 0:46.99). Note that these numbers are for the entire Woltka run.

I plan to wait until you merge this PR, then submit another PR for you to review. What do you think?

@dhakim87 (Owner) commented Sep 1, 2021:

Okay, I'll merge this PR now.

I think an interleaved start/end list is a fine optimization.

@dhakim87 dhakim87 merged commit c0a6861 into dhakim87:st_dh_coverage Sep 1, 2021