
Implement optimized median filter #862

Merged 1 commit on Jun 9, 2015

Conversation

camillescott (Member)

Easy optimization to improve normalize-by-median: checking whether the median of a set meets some cutoff is equivalent to checking whether at least half (rounded up) of the elements of that set meet the cutoff. The latter avoids doing a lookup for every k-mer every time, and avoids a costly sort. On a small dataset (1M E. coli reads), this was an 18% performance improvement. Not bad for 5 minutes!

TODO: Add a couple tests
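The equivalence the description relies on can be sketched in pure Python (an illustrative stand-in for the C++ implementation; the helper below is mine, not this PR's API):

```python
def median_at_least(counts, cutoff):
    """True iff the upper median of counts is >= cutoff.

    Equivalent to sorted(counts)[len(counts) // 2] >= cutoff, but with
    no sort and no mandatory full pass: the upper median meets the
    cutoff exactly when at least half (rounded up) of the values do,
    so the loop can exit as soon as that many qualifying values
    have been seen.
    """
    min_req = (len(counts) + 1) // 2  # ceil(n / 2)
    num_cutoff = 0
    for c in counts:
        if c >= cutoff:
            num_cutoff += 1
            if num_cutoff >= min_req:
                return True
    return False
```

For k-mer counts this replaces "collect every count, sort, index the middle" with a single early-exiting pass, which is where the measured speedup comes from.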

mr-c (Contributor) commented Mar 8, 2015

Cool!

camillescott (Member, Author)

  • Is it mergeable?
  • Did it pass the tests?
  • If it introduces new functionality in scripts/ is it tested?
    Check for code coverage with make clean diff-cover
  • Is it well formatted? Look at make pep8, make diff_pylint_report,
    make cppcheck, and make doc output. Use make format and manual
    fixing as needed.
  • Did it change the command-line interface? Only additions are allowed
    without a major version increment. Changing file formats also requires a
    major version number increment.
  • Is it documented in the ChangeLog?
    http://en.wikipedia.org/wiki/Changelog#Format
  • Was a spellchecker run on the source code and documentation after
    changes were made?

camillescott (Member, Author)

ready for review @ctb @mr-c

ctb (Member) commented Mar 8, 2015

Does this return identical results to old approach?

camillescott (Member, Author)

Yup. All normalize-by-median tests pass with the new function patched in.

camillescott (Member, Author)

jenkins, test this please you useless drunk

camillescott (Member, Author)

jenkins test this please and give it a pass or the floggings will continue

camillescott (Member, Author)

@ctb @mr-c @luizirber okey now pls review

ctb (Member) commented Mar 9, 2015

Please leave for me to merge - tnx.

camillescott (Member, Author)

@ctb I keep expecting that I've overlooked something really obvious and there's no way a 20% performance increase would be so easy, but shrug

return NULL;
}

if (counting->filter_on_median(long_str, cutoff))
Review comment (Member):

Please add { and } - single-line if statements are dangerous ;).

Review comment (Member):

Ping.

ctb (Member) commented Mar 9, 2015

A few comments --

the code didn't return identical results on data/100k-filtered.fa, due to a round-off error. See rounding comment on hashtable.cc.

Please add tests for: single k-mer, med < cutoff; single k-mer, med > cutoff; odd # of k-mers, med < cutoff; odd # of k-mers, med > cutoff; and any other edge cases you can think of (that's all I got). The homopolymer run tests are not sufficient ;). You should enforce comparison with the old function, too, to make sure they both return the same results.

can we implement the old function without the sort? it seems fairly straightforward to do so although there would still be a performance hit. This shouldn't be your job to do, but can you create a few issues around (a) improving old function, and (b) replacing old function with this function in places where it would work?

tnx!

camillescott (Member, Author)

@ctb can do on the extra tests and whatnot.

We definitely can't do the old function without the sort -- the key difference here is that the actual median isn't calculated, only whether it's greater than some value. There are definitely more efficient median algorithms if we want to replace the naive sort-based one we're using now, though.
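One such algorithm is quickselect, which finds the exact median in expected linear time without fully sorting. A minimal Python sketch (not part of this PR; the function name is mine):

```python
import random

def quickselect_median(counts):
    """Expected-O(n) upper median, matching sorted(counts)[len(counts) // 2]."""
    k = len(counts) // 2  # 0-based rank of the upper median
    xs = list(counts)
    while True:
        pivot = random.choice(xs)
        lows = [x for x in xs if x < pivot]
        pivots = [x for x in xs if x == pivot]
        if k < len(lows):
            xs = lows                        # median is among the smaller values
        elif k < len(lows) + len(pivots):
            return pivot                     # median equals the pivot
        else:
            k -= len(lows) + len(pivots)     # median is among the larger values
            xs = [x for x in xs if x > pivot]
```

This would be a drop-in for get_median_count's sort when the exact value is still needed, as opposed to the threshold-only check this PR adds.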

ctb (Member) commented Mar 11, 2015

On Mon, Mar 09, 2015 at 01:15:38PM -0700, Camille Scott wrote:

> @ctb can do on the extra tests and whatnot.

ok, let me know

> We definitely can't do the old function without the sort -- the key difference here is that the actual median isn't calculated, only that it's greater than some value. There are definitely more efficient median algorithms if we want to replace the naive sort-based one we're using now, though.

sure, we should look at the current code and think about where we might optimize.

camillescott (Member, Author)

Stop being useless Jenkins, and dammit Jenkins, test this please!

camillescott (Member, Author)

@ctb, ready for re-review

hi.consume(seq)

med, _, _ = hi.get_median_count(seq)
assert hi.filter_on_median(seq, 4) is (med >= C)
Review comment (Member):

replace 4 with C!

ctb (Member) commented Mar 31, 2015

Two questions for consideration cc @mr-c --

  • should we store the results of normalize-by-median on (say) 100k-filtered.fa, as a test for consistency going forward?
  • filter_on_median is an unsatisfying function name, since it doesn't actually filter anything. Would 'median_greater_than' or something like that be clearer? Suggestions welcome.

mr-c (Contributor) commented Mar 31, 2015

+1 for storing known-good output. How big is it when compressed? Worst case, we keep it elsewhere and let Jenkins deal with it.

camillescott (Member, Author)

Very well -- I'll add a comparison test to known output of the mentioned file and change the function name.

camillescott (Member, Author)

@ctb @mr-c okay, I added the new test -- it adds ~5 MB to the repo. Is that manageable?

mr-c (Contributor) commented Apr 3, 2015

Hrmm.. that would nearly double the distribution size..


camillescott (Member, Author)

@ctb @mr-c further thoughts on this??

mr-c (Contributor) commented Apr 9, 2015

We need a separate repo to store larger pieces of test data and have tests that run only on Jenkins by default. @ctb What do you think?

SensibleSalmon (Contributor)

Would something like git's LFS work?

https://github.com/blog/1986-announcing-git-large-file-storage-lfs

From what I can tell it's essentially git-annex, though I have no experience with either.

mr-c (Contributor) commented Apr 9, 2015

@bocajnotnef In this case the file is 5 megabytes, so GitHub LFS wouldn't be needed. I don't want to make the git clone of the main repository larger than necessary. Another option is to add the 5 MB file but exclude it from packaging and adjust the test to skip if the file isn't present.

ctb (Member) commented Apr 12, 2015

On Thu, Apr 09, 2015 at 08:55:01AM -0700, Michael R. Crusoe wrote:

> @bocajnotnef In this case the file is 5 megabytes, so GitHub LFS wouldn't be needed. I don't want to make the git clone of the main repository larger than necessary. Another option is to add the 5 MB file but exclude it from packaging and adjust the test to skip if the file isn't present.

What about a 'khmer-data' repository? Although that seems kind of silly.

ctb (Member) commented Apr 12, 2015

On Thu, Apr 09, 2015 at 08:19:19AM -0700, Michael R. Crusoe wrote:

> We need a separate repo to store larger pieces of test data and have tests that run only on Jenkins by default. @ctb What do you think?

Since the data is mostly going to be read-only, it seems silly to have a full repo. Maybe their new data extension is the way to go for that? And have a khmer-data repo that stores the pointers?

camillescott (Member, Author)

How about we get the optimization merged for now, then open an issue for supporting github LFS or some other variety of data hosting to be addressed in the future?

mr-c (Contributor) commented Apr 13, 2015

+1


ctb (Member) commented Apr 16, 2015

  • create an issue
  • make mergable

@mr-c mr-c added this to the 1.4+ milestone May 13, 2015
ctb (Member) commented May 20, 2015

@camillescott - @mr-c is going to address the test data size problem by putting that elsewhere, but other than that, all you need to do is make it mergeable, I believe.

mr-c (Contributor) commented May 20, 2015

It can go into https://github.com/dib-lab/khmer-testdata ; just commit directly

@ctb ctb modified the milestones: 2.0, 1.4+ May 31, 2015
ctb (Member) commented Jun 6, 2015

ping @camillescott

camillescott (Member, Author)

@mr-c need access to khmer-testdata repo

camillescott (Member, Author)

Jenkins, test this please

blahah commented Jun 8, 2015

There is so much winning in this PR, @camillescott.

ctb (Member) commented Jun 8, 2015

Suggest squashing as in #660.

… observe that checking if the median of a set is greater than some cutoff is equivalent to checking if more than half the elements of that set are greater than some cutoff. The latter avoids doing a lookup for every kmer every time, and avoids a costly sort. On a small dataset (1m ecoli reads), this was an 18% performance improvement.

Implements the median_at_least function in C++ land, exposes it in CPython, and updates normalize-by-median.py.
@camillescott camillescott force-pushed the optimization/median_filter branch from c24578f to 3dd1a9f Compare June 9, 2015 03:05
camillescott (Member, Author)

Jenkins, test this please

camillescott (Member, Author)

@ctb ready for final review / merge. The known_good test has been marked known_failing for now, as more work is needed to configure Jenkins to use the khmer-testdata repo. Once that's set up, we can just remove that attribute.

camillescott (Member, Author)

also pleeeeeeeeeease merge this before merging any other PRs, it's all squashed and pretty and I've had to update and tweak little things on this branch way too many times :'(

ctb (Member) commented Jun 9, 2015

LGTM; nice work.

ctb added a commit that referenced this pull request Jun 9, 2015
@ctb ctb merged commit 8a826ab into master Jun 9, 2015
@ctb ctb deleted the optimization/median_filter branch June 9, 2015 10:27
camillescott (Member, Author)

Yay, thanks!

HashIntoType kmer = kmers.next();
if (this->get_count(kmer) >= cutoff) {
++num_cutoff_kmers;
if (num_cutoff_kmers >= min_req) {
Review comment (Member):

Random thought @camillescott - what if you did a for or a while loop up until min_req, and only then checked the 'if'? Obviously if num_kmers < min_req there's no point in checking.

Review comment (Member, Author):

:cries: that's probably fair, though I imagine the compiler + CPU does a fair job with its branch prediction in this scenario
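The restructuring suggested in the review comment above might look like the following Python sketch (the actual code is C++; the function name is illustrative): process the first min_req counts without testing the threshold, since an early exit is impossible before that many values have been seen.

```python
def median_at_least_two_phase(counts, cutoff):
    """Same result as the early-exit version, but skips the threshold
    test for the first min_req values, where it can never succeed."""
    min_req = (len(counts) + 1) // 2
    num = 0
    for c in counts[:min_req]:   # phase 1: no early-exit check possible yet
        if c >= cutoff:
            num += 1
    if num >= min_req:           # earliest point success is possible
        return True
    for c in counts[min_req:]:   # phase 2: each hit may cross the threshold
        if c >= cutoff:
            num += 1
            if num >= min_req:
                return True
    return False
```

Whether this actually beats the branch predictor on real reads would need measuring, as the exchange above suggests.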

Review comment (Member):

On Tue, Jun 09, 2015 at 10:20:17AM -0700, Camille Scott wrote:

> @@ -232,6 +232,26 @@ void Hashtable::get_median_count(const std::string &s,
>         median = counts[counts.size() / 2]; // rounds down
>     }
>
>     //
>     // Optimized filter function for normalize-by-median
>     //
>     bool Hashtable::median_at_least(const std::string &s,
>                                     unsigned int cutoff) {
>         KMerIterator kmers(s.c_str(), _ksize);
>         unsigned int min_req = 0.5 + float(s.size() - _ksize + 1) / 2;
>         unsigned int num_cutoff_kmers = 0;
>         while (!kmers.done()) {
>             HashIntoType kmer = kmers.next();
>             if (this->get_count(kmer) >= cutoff) {
>                 ++num_cutoff_kmers;
>                 if (num_cutoff_kmers >= min_req) {
>
> :cries: that's probably fair, though I imagine the compiler + CPU does a fair job with its branch prediction in this scenario

:)
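On the rounding point raised earlier in the review: the C++ expression `0.5 + float(s.size() - _ksize + 1) / 2`, truncated by assignment to `unsigned int`, computes ceil(n/2) for n k-mers, which is exactly the number of qualifying counts needed for the upper median `counts[counts.size() / 2]` to meet the cutoff. A quick Python check of that arithmetic (mimicking the C++ truncation; the helper name is mine):

```python
def min_req_cpp(n):
    # mimics: unsigned int min_req = 0.5 + float(n) / 2;
    # assignment to unsigned int truncates toward zero, as int() does here
    return int(0.5 + n / 2.0)

# counts[n // 2] >= cutoff iff at least n - n // 2 == ceil(n / 2) values meet it
for n in range(1, 10000):
    assert min_req_cpp(n) == (n + 1) // 2 == n - n // 2
```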

@ctb ctb mentioned this pull request Jun 11, 2015