
Restructured coverage functions #1

Merged · 11 commits · Sep 1, 2021

Conversation

@qiyunzhu (Author) commented:

Hello @dhakim87, I have carefully read and tested your code, and modified it based on my understanding. The performance has improved significantly. For a small test run:

  • Old: 1:01.56 elapsed, 620644 maxresident
  • New: 0:49.26 elapsed, 372700 maxresident

Modifications include:

  • Decided that the ordinal mapper does not need a separate coverage calculation, because it already performs coordinate operations. This simplifies the logic.
  • Created a new mapper, range_mapper, which is a slight modification of plain_mapper that additionally returns range information. The original plain_mapper remains unchanged. This ensures that when the user chooses not to report coverage, program performance is unaffected.
  • Replaced the SortedRangeList class with plain functions, saving class-instantiation and member-access overhead (a brief sketch follows this list).
  • Changed the autocompress frequency from 100k to 10k. I haven't thoroughly benchmarked this parameter, though.
  • Made multiple minor tweaks to improve performance.
  • Reformatted code (e.g., variable names) to ensure a consistent coding style.
  • Added unit tests. Coverage remains at 100%.
  • Merged v0.1.3 code.
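
Purely as an illustration of the class-to-function change mentioned above (names are hypothetical and do not correspond to the actual Woltka code), replacing a small class with module-level functions over plain lists removes per-object instantiation and attribute lookups from the hot loop:

# Hypothetical sketch -- not the real Woltka code.

# Class-based: one object per (sample, subject) pair, attribute access per call.
class SortedRangeList:
    def __init__(self):
        self.ranges = []              # list of (start, end) tuples

    def add_range(self, start, end):
        self.ranges.append((start, end))

# Function-based: plain lists stored in a dict, no instantiation overhead.
def add_range(ranges, start, end):
    """Append a (start, end) tuple to a plain list of ranges."""
    ranges.append((start, end))

covers = {}                           # (sample, subject) -> list of ranges
add_range(covers.setdefault(('S1', 'G1'), []), 100, 250)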

However, there is still considerable room for improving performance, considering that the analysis takes only about 20 seconds when coverage reporting is disabled. I will explore further tweaks.

I will appreciate your feedback! Once this PR is merged into your repo, your PR to my repo will be updated automatically, and I will review it accordingly.

for sample, (_, subque) in rmaps.items():
    for subjects in subque:
        for subject, ranges in subjects.items():
            cover = add_cover((sample, subject), [0])
dhakim87 (Owner):

What is the meaning of [0]? Is this empty coverage? Wouldn't that be []?

cover = add_cover((sample, subject), [0])
count = cover[0] + len(ranges)
if count >= chunk:
    cover[:] = [0] + merge_ranges(cover[1:] + ranges)
dhakim87 (Owner):

I don't understand the [0] here.

dhakim87 (Owner):

I believe that this syntax will make a copy of the merge_ranges result list that would not be made by cover[:] = merge_ranges(cover[1:] + ranges), but I can't say for sure that Python won't optimize that away.

Comment on lines 101 to 102
cover[0] = count
cover.extend(ranges)
dhakim87 (Owner):

I believe this syntax would avoid the copy I complained about above.
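
To summarize the thread above in one self-contained snippet: below is a minimal sketch of the [items_added, range, range, ...] convention being discussed, with a simplified stand-in for merge_ranges and with add_cover assumed to be a locally bound dict.setdefault; the real code may differ in details.

def merge_ranges(ranges):
    """Simplified stand-in: merge overlapping/adjacent (start, end) tuples."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

chunk = 10000                          # autocompress threshold
covers = {}
add_cover = covers.setdefault          # assumed local binding of the dict method

# cover = [count_since_last_merge, (start, end), (start, end), ...]
cover = add_cover(('S1', 'G1'), [0])   # [0] means zero ranges added so far
ranges = [(5, 20), (18, 30)]
count = cover[0] + len(ranges)
if count >= chunk:
    # builds a new merged list, then copies it into cover via slice assignment
    cover[:] = [0] + merge_ranges(cover[1:] + ranges)
else:
    # in-place update, avoiding the intermediate copy discussed above
    cover[0] = count
    cover.extend(ranges)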

@dhakim87 (Owner) left a comment:

Comments added. I generally approve the changes, though I'd want the merge algorithm to be credited to me, since it is nearly identical to the zebra filter implementation, and the creation of the new file obscures that it stems from my contribution.

I think storing [items_added, range, range, range...] is bad practice; the (named)tuple (items_added, [range, range, range]) would be better, but that isn't my call to make.

Between the chunking of operations and the storage of closures to avoid function lookups, there seems to be a large focus on low-level optimization. That screams to me that this is written in the wrong programming language. Will this all need to be ported to C/C++ at some point?
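
For readers unfamiliar with the micro-optimization referred to here, "avoiding function lookups" typically means binding a method to a local name once, before a hot loop, so CPython does not repeat the attribute lookup on every iteration. A small, self-contained illustration (not Woltka code):

import timeit

def with_lookup(n=100_000):
    d = {}
    for i in range(n):
        d.setdefault(i % 100, []).append(i)    # attribute lookup every iteration

def with_local(n=100_000):
    d = {}
    add = d.setdefault                         # method bound once, before the loop
    for i in range(n):
        add(i % 100, []).append(i)

print(timeit.timeit(with_lookup, number=20))
print(timeit.timeit(with_local, number=20))    # typically a bit faster on CPython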

@qiyunzhu (Author):

Hello @dhakim87, thank you for your review! I have implemented the data structure you suggested, which is neater and retains the same performance. I have also credited you in the docstring. Please kindly review. Thanks!

There is no plan to port to C/C++. I agree that doing so would significantly boost performance. However, I am not as familiar with C/C++, and likely will not become so given the time commitment of my other responsibilities. Meanwhile, continuous development and maintenance are important for the life cycle of a software tool. Therefore, I tend to stick to Python.

There has been an ongoing discussion with several collaborators about improving the performance of Woltka. They have expertise in computer science, especially hardware acceleration.

@qiyunzhu (Author):

Here are some benchmarks of the chunk size (autocompress step size):

 Chunk    Runtime    Memory (maxresident)
  1000    0:58.17    260764
  2000    0:52.82    282048
  5000    0:49.19    329532
 10000    0:49.57    372804
 20000    0:50.49    429624
 50000    0:53.74    538292
100000    0:57.23    654132

Therefore, 5000 may be the best choice, but this is a small-scale test, so I will keep the current value of 10000 until there is more evidence.

I think that no constant value will work best for all sample-subject pairs. The ideal solution would be an adaptive gradient method that dynamically adjusts the merging rate according to the number of ranges merged in the past several rounds.
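
This adaptive approach is not implemented in the PR; the following is only a rough sketch of one possible scheme, with hypothetical names, where the threshold shrinks when the last merge removed many ranges and grows when it removed few:

def adapt_chunk(chunk, n_before, n_after, low=1000, high=100_000):
    """Adjust the autocompress threshold based on the payoff of the last merge.

    n_before / n_after: numbers of ranges before and after the merge.
    If merging collapsed many ranges, merge sooner next time (smaller chunk);
    if it collapsed few, merge later (larger chunk).
    """
    shrink = 1 - n_after / n_before if n_before else 0.0
    if shrink > 0.5:
        chunk = max(low, chunk // 2)
    elif shrink < 0.1:
        chunk = min(high, chunk * 2)
    return chunk

# example: a merge that collapsed 10000 ranges into 1500 suggests merging sooner
print(adapt_chunk(10000, 10000, 1500))   # -> 5000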

@qiyunzhu (Author):

@dhakim87 I have found another significant optimization: instead of storing range data as a list of tuples, [(start, end), (start, end), (start, end)...], I flattened this structure into [start, end, start, end, start, end...]. This saves the compute overhead and memory cost of creating many tuples. When the ranges need to be merged, the flat list is zipped up in pairs.
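
For clarity, here is a small sketch of what "zipped up in pairs" can look like for the flat representation (a simplified merge, not the exact Woltka implementation):

def merge_flat_ranges(flat):
    """Merge a flat, interleaved [start, end, start, end, ...] list of ranges."""
    it = iter(flat)
    pairs = sorted(zip(it, it))          # re-pair consecutive values into (start, end)
    merged = []
    for start, end in pairs:
        if merged and start <= merged[-1] + 1:
            merged[-1] = max(merged[-1], end)    # extend the previous range's end
        else:
            merged.extend((start, end))          # start a new range
    return merged

print(merge_flat_ranges([5, 20, 18, 30, 100, 120]))   # -> [5, 30, 100, 120]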

This optimization reduced memory by roughly half (373228 to 192644 maxresident), and it also reduced runtime slightly (0:49.06 to 0:46.99). Note that these numbers are for the entire Woltka run.

I plan to wait until you merge this PR, then submit another PR for you to review. What do you think?

@dhakim87 (Owner) commented Sep 1, 2021:

Okay, I'll merge this PR now.

I think an interleaved start/end list is a fine optimization.

@dhakim87 dhakim87 merged commit c0a6861 into dhakim87:st_dh_coverage Sep 1, 2021