-
Notifications
You must be signed in to change notification settings - Fork 295
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement optimized median filter #862
Conversation
Cool! |
|
Does this return identical results to old approach? |
Yup. All normalize-by-median tests pass with the new function patched in. |
jenkins, test this please you useless drunk |
jenkins test this please and give it a pass or the floggings will continue |
@ctb @mr-c @luizirber okey now pls review |
Please leave for me to merge - tnx. |
@ctb I keep expecting that I've overlooked something really obvious and there's no way a 20% performance increase would be so easy, but shrug |
return NULL; | ||
} | ||
|
||
if (counting->filter_on_median(long_str, cutoff)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add { and } - single-line if statements are dangerous ;).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ping.
A few comments -- the code didn't return identical results on data/100k-filtered.fa, due to a round-off error. See rounding comment on hashtable.cc. Please add tests for single k-mer, med < cutoff; single k-mer, med > cutoff; odd # k-mers, med < cutoff; odd # of k-mers; med > cutoff; and any other edge cases you can think of (that's all I got). The homopolymer run tests are not sufficient ;). You should enforce comparison with the old function, too, to make sure they both return the same results. can we implement the old function without the sort? it seems fairly straightforward to do so although there would still be a performance hit. This shouldn't be your job to do, but can you create a few issues around (a) improving old function, and (b) replacing old function with this function in places where it would work? tnx! |
@ctb can do on the extra tests and whatnot. We definitely can't do the old function without the sort -- the key difference here is that the actual median isn't calculated, only that it's greater than some value. There are definitely more efficient median algorithms if we want to replace the naive sort-based one we're using now thoug |
On Mon, Mar 09, 2015 at 01:15:38PM -0700, Camille Scott wrote:
ok, let me know
sure, we should look at the current code and think about where we might |
Stop being useless Jenkins, and dammit Jenkins, test this please! |
@ctb, ready for re-review |
hi.consume(seq) | ||
|
||
med, _, _ = hi.get_median_count(seq) | ||
assert hi.filter_on_median(seq, 4) is (med >= C) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
replace 4 with C!
Two questions for consideration cc @mr-c --
|
+1 for storing known-good output. How big, when compressed? Worse case we keep it elsewhere and let Jenkins deal with it. |
Very well -- I'll add a comparison test to known output of the mentioned file and change the function name. |
Hrmm.. that would nearly double the distribution size.. On Thu, Apr 2, 2015 at 4:38 AM Camille Scott notifications@github.com
|
We need a separate repo to store larger pieces of test data and have tests that run only on Jenkins by default. @ctb What do you think? |
Would something like git's LFS work? https://github.com/blog/1986-announcing-git-large-file-storage-lfs from what I can tell it's essentially git-annex, though I have no experience with either. |
@bocajnotnef In this case the file is 5Megabytes so GitHub LFS wouldn't be needed. I don't want to make the git clone of the main repository larger than necessary. Another option is to add the 5M file but exclude it from packaging and adjust the test to skip if the file isn't present. |
On Thu, Apr 09, 2015 at 08:55:01AM -0700, Michael R. Crusoe wrote:
What about a 'khmer-data' repository? Although that seems kind of silly. |
On Thu, Apr 09, 2015 at 08:19:19AM -0700, Michael R. Crusoe wrote:
Since the data is mostly going to be read-only, it seems silly to have a full |
How about we get the optimization merged for now, then open an issue for supporting github LFS or some other variety of data hosting to be addressed in the future? |
+1 On Mon, Apr 13, 2015 at 6:33 PM Camille Scott notifications@github.com
|
|
@camillescott - @mr-c is going to address data size problem for test by putting that elsewhere, but other than that, all you need to do is make it mergable, I believe. |
It can go into https://github.com/dib-lab/khmer-testdata ; just commit directly |
ping @camillescott |
@mr-c need access to khmer-testdata repo |
Jenkins, test this please |
There is so much winning in this PR, @camillescott. |
Suggest squashing as in #660. |
… observe that checking if the median of a set is greater than some cutoff is equivalent to checking if more than half the elements of that set are greater than some cutoff. The latter avoids doing a lookup for every kmer every time, and avoids a costly sort. On a small dataset (1m ecoli reads), this was an 18% performance improvement. Implements the median_at_least function in C++ land, exposes it in CPython, and updates normalize-by-median.py.
c24578f
to
3dd1a9f
Compare
Jenkins, test this please |
@ctb ready for final review / merge. the known_good test has been marked as known_failing for now, as there is more work that needs to be done to configure jenkins to use the khmer-testdata repo. Once that's set up, we can just remove that attribute. |
also pleeeeeeeeeease merge this before merging any other PR's in, it's all squashed and pretty and I've had to update and tweak little things on this branch way too many times :'( |
LGTM; nice work. |
Yay, thanks! |
HashIntoType kmer = kmers.next(); | ||
if (this->get_count(kmer) >= cutoff) { | ||
++num_cutoff_kmers; | ||
if (num_cutoff_kmers >= min_req) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Random thought @camillescott - what if you did a for or a while loop up until min_req, and only then checked the 'if'? Obviously if num_kmers < min_req there's no point in checking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
:cries: that's probably fair, though I imagine the compiler + cpu does a fair job with it's branch prediction in this scenario
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On Tue, Jun 09, 2015 at 10:20:17AM -0700, Camille Scott wrote:
@@ -232,6 +232,26 @@ void Hashtable::get_median_count(const std::string &s,
median = counts[counts.size() / 2]; // rounds down
}+//
+// Optimized filter function for normalize-by-median
+//
+bool Hashtable::median_at_least(const std::string &s,
unsigned int cutoff) {
- KMerIterator kmers(s.c_str(), _ksize);
- unsigned int min_req = 0.5 + float(s.size() - _ksize + 1) / 2;
- unsigned int num_cutoff_kmers = 0;
- while(!kmers.done()) {
HashIntoType kmer = kmers.next();
if (this->get_count(kmer) >= cutoff) {
++num_cutoff_kmers;
if (num_cutoff_kmers >= min_req) {
:cries: that's probably fair, though I imagine the compiler + cpu does a fair job with it's branch prediction in this scenario
:)
Easy optimization to improve normalize-by-median: it's simple to observe that checking if the median of a set is greater than some cutoff is equivalent to checking if more than half the elements of that set are greater than some cutoff. The latter avoids doing a lookup for every kmer every time, and avoids a costly sort. On a small dataset (1m ecoli reads), this was an 18% performance improvement. Not bad for 5 minutes!
TODO: Add a couple tests