
Implement optimized median filter #862

Merged 1 commit on Jun 9, 2015

Conversation

camillescott (Member)

Easy optimization to improve normalize-by-median: checking whether the median of a set meets some cutoff is equivalent to checking whether at least half (rounded up) of the elements of that set meet the cutoff. The latter avoids doing a lookup for every k-mer every time, and avoids a costly sort. On a small dataset (1M E. coli reads), this was an 18% performance improvement. Not bad for 5 minutes!

TODO: Add a couple tests
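The equivalence the description relies on can be sketched in pure Python (an illustrative stand-in for the C++ implementation; the helper below is mine, not this PR's API):

```python
def median_at_least(counts, cutoff):
    """True iff the upper median of counts is >= cutoff.

    Equivalent to sorted(counts)[len(counts) // 2] >= cutoff, but with
    no sort and no mandatory full pass: the upper median meets the
    cutoff exactly when at least half (rounded up) of the values do,
    so the loop can exit as soon as that many qualifying values
    have been seen.
    """
    min_req = (len(counts) + 1) // 2  # ceil(n / 2)
    num_cutoff = 0
    for c in counts:
        if c >= cutoff:
            num_cutoff += 1
            if num_cutoff >= min_req:
                return True
    return False
```

For k-mer counts this replaces "collect every count, sort, index the middle" with a single early-exiting pass, which is where the measured speedup comes from.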

mr-c (Contributor) commented Mar 8, 2015

Cool!

camillescott (Member, Author)

  • Is it mergeable?
  • Did it pass the tests?
  • If it introduces new functionality in scripts/ is it tested?
    Check for code coverage with make clean diff-cover
  • Is it well formatted? Look at make pep8, make diff_pylint_report,
    make cppcheck, and make doc output. Use make format and manual
    fixing as needed.
  • Did it change the command-line interface? Only additions are allowed
    without a major version increment. Changing file formats also requires a
    major version number increment.
  • Is it documented in the ChangeLog?
    http://en.wikipedia.org/wiki/Changelog#Format
  • Was a spellchecker run on the source code and documentation after
    changes were made?

camillescott (Member, Author)

ready for review @ctb @mr-c

ctb (Member) commented Mar 8, 2015

Does this return identical results to old approach?

camillescott (Member, Author)

Yup. All normalize-by-median tests pass with the new function patched in.

camillescott (Member, Author)

jenkins, test this please you useless drunk

camillescott (Member, Author)

jenkins test this please and give it a pass or the floggings will continue

camillescott (Member, Author)

@ctb @mr-c @luizirber okey now pls review

ctb (Member) commented Mar 9, 2015

Please leave for me to merge - tnx.

camillescott (Member, Author)

@ctb I keep expecting that I've overlooked something really obvious and there's no way a 20% performance increase would be so easy, but shrug

return NULL;
}

if (counting->filter_on_median(long_str, cutoff))
Review comment (Member):

Please add { and } - single-line if statements are dangerous ;).

Review comment (Member):

Ping.

ctb (Member) commented Mar 9, 2015

A few comments --

the code didn't return identical results on data/100k-filtered.fa, due to a round-off error. See rounding comment on hashtable.cc.

Please add tests for: single k-mer, med < cutoff; single k-mer, med > cutoff; odd # of k-mers, med < cutoff; odd # of k-mers, med > cutoff; and any other edge cases you can think of (that's all I got). The homopolymer run tests are not sufficient ;). You should enforce comparison with the old function, too, to make sure they both return the same results.

can we implement the old function without the sort? it seems fairly straightforward to do so although there would still be a performance hit. This shouldn't be your job to do, but can you create a few issues around (a) improving old function, and (b) replacing old function with this function in places where it would work?

tnx!

camillescott (Member, Author)

@ctb can do on the extra tests and whatnot.

We definitely can't do the old function without the sort -- the key difference here is that the actual median isn't calculated, only whether it's greater than some value. There are definitely more efficient median algorithms if we want to replace the naive sort-based one we're using now, though.
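One such algorithm is quickselect, which finds the exact median in expected linear time without fully sorting. A minimal Python sketch (not part of this PR; the function name is mine):

```python
import random

def quickselect_median(counts):
    """Expected-O(n) upper median, matching sorted(counts)[len(counts) // 2]."""
    k = len(counts) // 2  # 0-based rank of the upper median
    xs = list(counts)
    while True:
        pivot = random.choice(xs)
        lows = [x for x in xs if x < pivot]
        pivots = [x for x in xs if x == pivot]
        if k < len(lows):
            xs = lows                        # median is among the smaller values
        elif k < len(lows) + len(pivots):
            return pivot                     # median equals the pivot
        else:
            k -= len(lows) + len(pivots)     # median is among the larger values
            xs = [x for x in xs if x > pivot]
```

This would be a drop-in for get_median_count's sort when the exact value is still needed, as opposed to the threshold-only check this PR adds.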

ctb (Member) commented Mar 11, 2015

On Mon, Mar 09, 2015 at 01:15:38PM -0700, Camille Scott wrote:

> @ctb can do on the extra tests and whatnot.

ok, let me know

> We definitely can't do the old function without the sort -- the key difference here is that the actual median isn't calculated, only that it's greater than some value. There are definitely more efficient median algorithms if we want to replace the naive sort-based one we're using now, though.

sure, we should look at the current code and think about where we might optimize.

camillescott (Member, Author)

Stop being useless Jenkins, and dammit Jenkins, test this please!

camillescott (Member, Author)

@ctb, ready for re-review

hi.consume(seq)

med, _, _ = hi.get_median_count(seq)
assert hi.filter_on_median(seq, 4) is (med >= C)
Review comment (Member):

replace 4 with C!

ctb (Member) commented Mar 31, 2015

Two questions for consideration cc @mr-c --

  • should we store the results of normalize-by-median on (say) 100k-filtered.fa, as a test for consistency going forward?
  • filter_on_median is an unsatisfying function name, since it doesn't actually filter anything. Would 'median_greater_than' or something like that be clearer? Suggestions welcome.

mr-c (Contributor) commented Mar 31, 2015

+1 for storing known-good output. How big is it when compressed? Worst case, we keep it elsewhere and let Jenkins deal with it.

camillescott (Member, Author)

Very well -- I'll add a comparison test to known output of the mentioned file and change the function name.

camillescott (Member, Author)

@ctb @mr-c okay, I added the new test -- it adds ~5 MB to the repo. Is that manageable?

mr-c (Contributor) commented Apr 3, 2015

Hrmm.. that would nearly double the distribution size..


camillescott (Member, Author)

@ctb @mr-c further thoughts on this??

mr-c (Contributor) commented Apr 9, 2015

We need a separate repo to store larger pieces of test data and have tests that run only on Jenkins by default. @ctb What do you think?

SensibleSalmon (Contributor)

Would something like git's LFS work?

https://github.com/blog/1986-announcing-git-large-file-storage-lfs

From what I can tell it's essentially git-annex, though I have no experience with either.

mr-c (Contributor) commented Apr 9, 2015

@bocajnotnef In this case the file is 5 megabytes, so GitHub LFS wouldn't be needed. I don't want to make the git clone of the main repository larger than necessary. Another option is to add the 5 MB file but exclude it from packaging and adjust the test to skip if the file isn't present.

ctb (Member) commented Apr 12, 2015

On Thu, Apr 09, 2015 at 08:55:01AM -0700, Michael R. Crusoe wrote:

> @bocajnotnef In this case the file is 5 megabytes, so GitHub LFS wouldn't be needed. I don't want to make the git clone of the main repository larger than necessary. Another option is to add the 5 MB file but exclude it from packaging and adjust the test to skip if the file isn't present.

What about a 'khmer-data' repository? Although that seems kind of silly.

ctb (Member) commented Apr 12, 2015

On Thu, Apr 09, 2015 at 08:19:19AM -0700, Michael R. Crusoe wrote:

> We need a separate repo to store larger pieces of test data and have tests that run only on Jenkins by default. @ctb What do you think?

Since the data is mostly going to be read-only, it seems silly to have a full repo. Maybe their new data extension is the way to go for that? And have a khmer-data repo that stores the pointers?

camillescott (Member, Author)

How about we get the optimization merged for now, then open an issue for supporting github LFS or some other variety of data hosting to be addressed in the future?

mr-c (Contributor) commented Apr 13, 2015

+1


ctb (Member) commented Apr 16, 2015

  • create an issue
  • make mergable

@mr-c mr-c added this to the 1.4+ milestone May 13, 2015
ctb (Member) commented May 20, 2015

@camillescott - @mr-c is going to address the test data size problem by putting that elsewhere, but other than that, all you need to do is make it mergeable, I believe.

mr-c (Contributor) commented May 20, 2015

It can go into https://github.com/dib-lab/khmer-testdata ; just commit directly

@ctb ctb modified the milestones: 2.0, 1.4+ May 31, 2015
ctb (Member) commented Jun 6, 2015

ping @camillescott

camillescott (Member, Author)

@mr-c need access to khmer-testdata repo

camillescott (Member, Author)

Jenkins, test this please

blahah commented Jun 8, 2015

There is so much winning in this PR, @camillescott.

ctb (Member) commented Jun 8, 2015

Suggest squashing as in #660.

… observe that checking if the median of a set is greater than some cutoff is equivalent to checking if more than half the elements of that set are greater than some cutoff. The latter avoids doing a lookup for every kmer every time, and avoids a costly sort. On a small dataset (1m ecoli reads), this was an 18% performance improvement.

Implements the median_at_least function in C++ land, exposes it in CPython, and updates normalize-by-median.py.
@camillescott camillescott force-pushed the optimization/median_filter branch from c24578f to 3dd1a9f Compare June 9, 2015 03:05
camillescott (Member, Author)

Jenkins, test this please

camillescott (Member, Author)

@ctb ready for final review / merge. The known_good test has been marked known_failing for now, as more work is needed to configure Jenkins to use the khmer-testdata repo. Once that's set up, we can just remove that attribute.

camillescott (Member, Author)

also pleeeeeeeeeease merge this before merging any other PRs, it's all squashed and pretty and I've had to update and tweak little things on this branch way too many times :'(

ctb (Member) commented Jun 9, 2015

LGTM; nice work.

ctb added a commit that referenced this pull request Jun 9, 2015
@ctb ctb merged commit 8a826ab into master Jun 9, 2015
@ctb ctb deleted the optimization/median_filter branch June 9, 2015 10:27
camillescott (Member, Author)

Yay, thanks!

HashIntoType kmer = kmers.next();
if (this->get_count(kmer) >= cutoff) {
++num_cutoff_kmers;
if (num_cutoff_kmers >= min_req) {
Review comment (Member):

Random thought @camillescott - what if you did a for or a while loop up until min_req, and only then checked the 'if'? Obviously if num_kmers < min_req there's no point in checking.

Review comment (Member, Author):

:cries: that's probably fair, though I imagine the compiler + CPU does a fair job with its branch prediction in this scenario
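The restructuring suggested in the review comment above might look like the following Python sketch (the actual code is C++; the function name is illustrative): process the first min_req counts without testing the threshold, since an early exit is impossible before that many values have been seen.

```python
def median_at_least_two_phase(counts, cutoff):
    """Same result as the early-exit version, but skips the threshold
    test for the first min_req values, where it can never succeed."""
    min_req = (len(counts) + 1) // 2
    num = 0
    for c in counts[:min_req]:   # phase 1: no early-exit check possible yet
        if c >= cutoff:
            num += 1
    if num >= min_req:           # earliest point success is possible
        return True
    for c in counts[min_req:]:   # phase 2: each hit may cross the threshold
        if c >= cutoff:
            num += 1
            if num >= min_req:
                return True
    return False
```

Whether this actually beats the branch predictor on real reads would need measuring, as the exchange above suggests.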

Review comment (Member):

On Tue, Jun 09, 2015 at 10:20:17AM -0700, Camille Scott wrote:

> @@ -232,6 +232,26 @@ void Hashtable::get_median_count(const std::string &s,
>         median = counts[counts.size() / 2]; // rounds down
>     }
>
>     //
>     // Optimized filter function for normalize-by-median
>     //
>     bool Hashtable::median_at_least(const std::string &s,
>                                     unsigned int cutoff) {
>         KMerIterator kmers(s.c_str(), _ksize);
>         unsigned int min_req = 0.5 + float(s.size() - _ksize + 1) / 2;
>         unsigned int num_cutoff_kmers = 0;
>         while (!kmers.done()) {
>             HashIntoType kmer = kmers.next();
>             if (this->get_count(kmer) >= cutoff) {
>                 ++num_cutoff_kmers;
>                 if (num_cutoff_kmers >= min_req) {
>
> :cries: that's probably fair, though I imagine the compiler + CPU does a fair job with its branch prediction in this scenario

:)
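On the rounding point raised earlier in the review: the C++ expression `0.5 + float(s.size() - _ksize + 1) / 2`, truncated by assignment to `unsigned int`, computes ceil(n/2) for n k-mers, which is exactly the number of qualifying counts needed for the upper median `counts[counts.size() / 2]` to meet the cutoff. A quick Python check of that arithmetic (mimicking the C++ truncation; the helper name is mine):

```python
def min_req_cpp(n):
    # mimics: unsigned int min_req = 0.5 + float(n) / 2;
    # assignment to unsigned int truncates toward zero, as int() does here
    return int(0.5 + n / 2.0)

# counts[n // 2] >= cutoff iff at least n - n // 2 == ceil(n / 2) values meet it
for n in range(1, 10000):
    assert min_req_cpp(n) == (n + 1) // 2 == n - n // 2
```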

@ctb ctb mentioned this pull request Jun 11, 2015