Initial implementation of read-only buffer access to raw tables #671

camillescott · 2014-11-24T00:48:23Z

This addresses #667: exposes the tables as a read-only buffer object. I've chosen read-only as there will be undefined behavior from writing directly to the tables outside of the count functions; of course, the idea is that you can still use the python-exposed hashtable object as you see fit, and the buffer view will update to reflect that.

For now, I've just exported a list of N buffers. Eventually I'd like this to properly define the buffer interface on the hashtable type object, exposing a single buffer or memoryview object with proper striding / offsets for the N-dimensional array, but... :effort:.

Some example usage:

In [1]: import khmer

In [2]: import numpy as np

In [3]: ht = khmer.new_counting_hash(20, 1e8, 4)

In [4]: tables = ht.get_raw_tables()

In [5]: tables[0]
Out[5]: <read-only buffer ptr 0x7ff6a7015010, size 100000007 at 0x7ff6acff2930>

In [6]: arr = np.frombuffer(tables[0], dtype=np.uint8)

In [7]: arr
Out[7]: array([0, 0, 0, ..., 0, 0, 0], dtype=uint8)

In [8]: arr.sum()
Out[8]: 0

In [9]: ht.consume_fasta('tests/test-data/test-reads.fa')
Out[9]: (25000, 1425000L)

In [10]: arr.sum()
Out[10]: 1424408
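
The read-only view behaviour shown in the session above can be reproduced without khmer. This is a minimal sketch, assuming a plain `bytearray` stands in for one raw table; the names `table`, `view`, and `arr` are illustrative and not part of khmer:

```python
import numpy as np

# Hypothetical stand-in for one raw table. In khmer the hashtable owns
# the real memory; here a bytearray plays that role.
table = bytearray(8)

# Export a read-only view, analogous to one entry of get_raw_tables().
# memoryview.toreadonly() needs Python 3.8 or newer.
view = memoryview(table).toreadonly()

# Zero-copy wrap; the array inherits the read-only flag from the buffer.
arr = np.frombuffer(view, dtype=np.uint8)

# Writing through the array is rejected...
try:
    arr[0] = 1
    write_rejected = False
except ValueError:
    write_rejected = True

# ...but updates made by the owner of the memory show up in the array.
table[3] = 7
print(write_rejected, arr[3])
```

The array shares memory with the stand-in table rather than copying it, which is why the later write to `table` is visible through `arr`.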

kdm9 · 2014-11-24T00:58:31Z

This is great! Thanks very much. Will have a go with this, and let you know how I get on! Thanks for the speedy implementation 😄

mr-c · 2014-11-24T01:03:58Z

+1

camillescott · 2014-11-24T01:26:37Z

No problem! Now it's actually tested as well :)

mr-c · 2014-12-14T22:58:17Z

@kdmurray91 Was this useful? Should it get merged in?

ctb · 2015-01-27T18:16:11Z

ping @kdmurray91

kdm9 · 2015-01-27T21:55:06Z

Apologies all, I've been on holiday and this must have been lost in a sea of notifications. This was very useful, and still is. Thanks very much @camillescott et al.

kdm9 · 2015-02-23T04:14:31Z

Is there anything that I need to do to get this merged?

mr-c · 2015-02-23T06:25:09Z

@camillescott claims that she'll be joining the sprint Monday. So hopefully she brings this up to speed for @kdmurray91

mr-c · 2015-03-05T22:05:09Z

ping @camillescott :-)

kdm9 · 2015-03-06T02:25:12Z

I swear I added a long comment to this PR yesterday, it appears to have disappeared. Is it just me?

mr-c · 2015-03-06T02:31:49Z

@kdmurray91 Last comment I see from you is February 22nd :-/

kdm9 · 2015-03-06T02:56:04Z

Ah crap :)

From memory, it went something like:

FYI, I've rebased on master here

I'd like to get this merged, and am hacking on the code that uses this actively, so consider me willing to do whatever it takes to get this a) stable, and more user-friendly if that's desired, and b) tested/documented/all the other things on the PR checklist.

To that end, I proposed making what @camillescott has done an internal feature, then using some wrapper in python land to expose this as either a single numpy array or just do what she has in her demo in the PR description. This could be done by wrapping (i.e. subclassing as is done with HLLCounter in __init__.py) the CPython CountingHash class in python land, with a python implementation of a method that does import numpy. This means that unless you call that method, you don't need numpy to use the package (I think). Thoughts?
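
A rough sketch of that wrapper idea, under stated assumptions: `_CountingHashBase` is a hypothetical stand-in for the CPython class (it only mimics `get_raw_tables()` with plain bytearrays), and the method name `tables_as_arrays` is made up for illustration. The point is only that `import numpy` sits inside the method body, so numpy is needed only by callers of that method:

```python
class _CountingHashBase:
    """Hypothetical stand-in for the CPython CountingHash type."""

    def __init__(self, table_sizes):
        # One byte-sized counter per slot, one table per requested size.
        self._tables = [bytearray(size) for size in table_sizes]

    def get_raw_tables(self):
        # Read-only views, one per table (Python 3.8+ for toreadonly()).
        return [memoryview(t).toreadonly() for t in self._tables]


class CountingHash(_CountingHashBase):
    """Python-level subclass that adds the numpy-facing method."""

    def tables_as_arrays(self):
        # Imported here, not at module level, so that importing the
        # package itself does not require numpy.
        import numpy as np
        return [np.frombuffer(buf, dtype=np.uint8)
                for buf in self.get_raw_tables()]


ht = CountingHash([11, 13])
arrays = ht.tables_as_arrays()
print([a.shape for a in arrays])
```

Returning a list of 1-D arrays rather than one 2-D array sidesteps the question of tables having different lengths.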

Also, let me know your thoughts on a read/write version of this too, as the algorithm I'm working on relies on adding/subtracting hashes (ignoring bigcount, which we're not using). It would be useful to have, and I am happy to work on ways of doing this safely under circumstances where it is safe (e.g., using the busywait locks and __sync_add_and_fetch to add the values), and on checking that we're not using bigcount. This would break things like _n_unique_kmers etc., unless we also added those.
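
For the arithmetic alone, a sketch of what adding and subtracting two tables could look like on copies of the data. It assumes 8-bit counters that saturate at 255 when bigcount is not in use; `MAX_COUNT`, `saturating_add`, and `saturating_sub` are illustrative names, not khmer API, and nothing here writes to the tables in place or deals with locking:

```python
import numpy as np

MAX_COUNT = 255  # assumed ceiling for 8-bit counters without bigcount


def saturating_add(a, b):
    # Widen first so the sum cannot wrap around, then clamp at the ceiling.
    wide = a.astype(np.uint16) + b.astype(np.uint16)
    return np.minimum(wide, MAX_COUNT).astype(np.uint8)


def saturating_sub(a, b):
    # Use a signed type so negative differences are visible, then clamp at 0.
    wide = a.astype(np.int16) - b.astype(np.int16)
    return np.maximum(wide, 0).astype(np.uint8)


a = np.array([0, 10, 250, 255], dtype=np.uint8)
b = np.array([1, 20, 10, 255], dtype=np.uint8)

total = saturating_add(a, b)
diff = saturating_sub(a, b)
print(total.tolist(), diff.tolist())
```

Plain `uint8` addition in numpy wraps around on overflow, which is why the sketch widens before adding.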

Anyway, that turned out to be more mental diarrhea than a comment, but that's my 0.000002 BTC, as the cool kids say.

Cheers,
K

mr-c · 2015-03-09T22:51:48Z

+1 for merging this w/o a script frontend as that does not commit us to anything.
+1 for iterating on top of that later :-)

I'm fine with experimental Python & C++ APIs. @camillescott or @kdmurray91 can you either update this PR with a filled out checklist or make a new PR?

kdm9 · 2015-03-09T23:06:24Z

Yup! Unless @camillescott rebases in the next hour, I'll make a new PR from my rebased branch. Or could I push to this branch?

On a side note, is there any merit to having the likes of a khmer.__future__ module to make experimental APIs more segregated?

kdm9 · 2015-03-09T23:07:17Z

Is it mergeable?
Did it pass the tests?
If it introduces new functionality in scripts/ is it tested?
Check for code coverage.
Is it well formatted? Look at make pep8, make diff_pylint_report, make cppcheck, and make doc output. Use make format and manual fixing as needed.
Did it change the command-line interface? Only additions are allowed without a major version increment. Changing file formats also requires a major version number increment.
Is it documented in the ChangeLog?
Was a spellchecker run on the source code and documentation after changes were made?

mr-c · 2015-03-09T23:13:22Z

The APIs are getting a slow overhaul and are subject to change at any time so there is no current need to segregate; thanks though!

kdm9 · 2015-03-09T23:20:40Z

OK, so to avoid nuking anything I've made a backup of this branch here:
https://github.com/kdmurray91/khmer/tree/upstream/feature/pymemoryview

and will have a go at pushing directly to this branch

kdm9 · 2015-03-10T01:43:32Z

What's the policy on cleaning formatting errors etc in places not touched by the changes in a PR?

E.g., there are some formatting issues in the HLL counter; I assume I leave those for another PR?

kdm9 · 2015-03-10T02:04:34Z

All ticked off!

mr-c · 2015-03-10T02:09:19Z

You're on the hook for lines of code you touch, and no other :-) I'm sure @luizirber will want to hear about any HLL issues.

mr-c · 2015-03-10T02:11:06Z

LGTM

mr-c · 2015-03-10T02:13:18Z

Share & enjoy!

camillescott · 2015-03-10T02:50:45Z

Thanks for putting in the work to make this mergeable @kdmurray91! And sorry for ducking out on it myself; got distracted with other things.

kdm9 · 2015-03-10T02:51:10Z

All good, thanks for your initial work 😄

camillescott mentioned this pull request Nov 24, 2014

RFC: Exposing the raw data in a count-min sketch/bloom filter via numpy? #667

Closed

mr-c added this to the 1.2+ milestone Dec 1, 2014

mr-c modified the milestones: unscheduled, 1.3+ Jan 19, 2015

kdm9 added a commit to kdm9/khmer that referenced this pull request Mar 9, 2015

make format the get_raw_tables changes from dib-lab#671

635aa28

kdm9 force-pushed the feature/pymemoryview branch from 5004c00 to 635aa28 Compare March 9, 2015 23:23

camillescott and others added 5 commits March 10, 2015 12:51

Initial implementation of read-only buffer access to raw tables

cc3546d

Remove some code cruft

0cdfeb1

Add tests for get_raw_tables()

bbad77b

make format the get_raw_tables changes from #671

80b8475

fix pylint complaints about new single-letter vars

3089a9f

kdm9 force-pushed the feature/pymemoryview branch from 2bcebc0 to 3089a9f Compare March 10, 2015 01:51

Document #671 in the changelog.

4d12776

mr-c added a commit that referenced this pull request Mar 10, 2015

Merge pull request #671 from ged-lab/feature/pymemoryview

a4c46e0

Initial implementation of read-only buffer access to raw tables

mr-c merged commit a4c46e0 into master Mar 10, 2015

mr-c deleted the feature/pymemoryview branch March 10, 2015 02:13

ctb mentioned this pull request Mar 11, 2015

test_get_raw_tables_view fails on Python 2.7.2 #868

Closed
