Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initial implementation of read-only buffer access to raw tables #671

Merged
merged 6 commits into from
Mar 10, 2015

Conversation

camillescott
Copy link
Member

This addresses #667: exposes the tables as a read-only buffer object. I've chosen read-only as there will be undefined behavior from writing directly to the tables outside of the count functions; of course, the idea is that you can still use the python-exposed hashtable object as you see fit, and the buffer view will update to reflect that.

For now, I've just exported a list of N buffers. Eventually I'd like this to properly define the buffer interface on the hashtable type object, exposing a single buffer or memoryview object with proper striding / offsets for the N dimension array, but... :effort:.

Some example usage:

In [1]: import khmer

In [2]: import numpy as np

In [3]: ht = khmer.new_counting_hash (20, 1e8, 4)

In [4]: tables = ht.get_raw_tables ()

In [5]: tables[0]
Out[5]: <read-only buffer ptr 0x7ff6a7015010, size 100000007 at 0x7ff6acff2930>

In [6]: arr = np.frombuffer (tables[0], dtype=np.uint8)

In [7]: arr
Out[7]: array([0, 0, 0, ..., 0, 0, 0], dtype=uint8)

In [8]: arr.sum()
Out[8]: 0

In [9]: ht.consume_fasta('tests/test-data/test-reads.fa')
Out[9]: (25000, 1425000L)

In [10]: arr.sum()
Out[10]: 1424408

@kdm9
Copy link
Contributor

kdm9 commented Nov 24, 2014

This is great! Thanks very much. Will have a go with this, and let you know how I get on! Thanks for the speedy implementation 😄

@mr-c
Copy link
Contributor

mr-c commented Nov 24, 2014

+1

@camillescott
Copy link
Member Author

No problem! Now it's actually tested as well :)

@mr-c mr-c added this to the 1.2+ milestone Dec 1, 2014
@mr-c
Copy link
Contributor

mr-c commented Dec 14, 2014

@kdmurray91 Was this useful? Should it get merged in?

@mr-c mr-c modified the milestones: unscheduled, 1.3+ Jan 19, 2015
@ctb
Copy link
Member

ctb commented Jan 27, 2015

ping @kdmurray91

@kdm9
Copy link
Contributor

kdm9 commented Jan 27, 2015

Apologies all, I've been on holiday and this must have been lost in a sea of notifications. This was very useful, and still is. Thanks very much @camillescott et al.

@kdm9
Copy link
Contributor

kdm9 commented Feb 23, 2015

Is there anything that I need to do to get this merged?

@mr-c
Copy link
Contributor

mr-c commented Feb 23, 2015

@camillescott claims that she'll be joining the sprint Monday. So hopefully she brings this up to speed for @kdmurray91

@mr-c
Copy link
Contributor

mr-c commented Mar 5, 2015

ping @camillescott :-)

@kdm9
Copy link
Contributor

kdm9 commented Mar 6, 2015

I swear I added a long comment to this PR yesterday, it appears to have disappeared. Is it just me?

@mr-c
Copy link
Contributor

mr-c commented Mar 6, 2015

@kdmurray91 Last comment I see from you is February 22nd :-/

@kdm9
Copy link
Contributor

kdm9 commented Mar 6, 2015

Ah crap :)

From memory, it went something like:

FYI, I've rebased on master here

I'd like to get this merged, and am hacking on the code that uses this actively, so consider me willing to do whatever it takes to get this, a) stable, and more user friendly if that's desired, and b) tested/documented/all the other things on the PR checklist.

To that end, I proposed making what @camillescott has done an internal feature, then using some wrapper in python land to expose this as either a single numpy array or just do what she has in her demo in the PR description. This could be done by wrapping (i.e. subclassing as is done with HLLCounter in __init__.py) the CPython CountingHash class in python land, with a python implementation of a method that does import numpy. This means that unless you call that method, you don't need numpy to use the package (I think). Thoughts?

Also, let me know your thoughts on a read/write version of this too, as algorithm I'm working on relies on adding/subtracting hashes (ignoring bigcount which we're not using). It would be useful to have, and am happy to work on ways of doing this safely under circumstances where it is safe (e.g., using the busywait locks and __sync_add_and_fetch to add the values). And checking that we're not using bigcount. And this would break things like _n_unique_kmers etc, unless we also added those.

Anyway, that turned out to be more mental diarrhea than a comment, but that's my 0.000002 BTC, as the cool kids say.

Cheers,
K

@mr-c
Copy link
Contributor

mr-c commented Mar 9, 2015

+1 for merging this w/o a script frontend as that does not commit us to anything.
+1 for iterating on top of that later :-)

I'm fine with experimental Python & C++ APIs. @camillescott or @kdmurray91 can you either update this PR with a filled out checklist or make a new PR?

@kdm9
Copy link
Contributor

kdm9 commented Mar 9, 2015

Yup! Unless @camillescott rebases in the next hour, I'll make a new PR from my rebased branch. Or could I push to this branch?

On a side note, is there any merit to having the likes of a khmer.__future__ module to make experimental APIs more segregated?

@kdm9
Copy link
Contributor

kdm9 commented Mar 9, 2015

  • Is it mergable?
  • Did it pass the tests?
  • If it introduces new functionality in scripts/ is it tested?
    Check for code coverage.
  • Is it well formatted? Look at make pep8, make diff_pylint_report,
    make cppcheck, and make doc output. Use make format and manual
    fixing as needed.
  • Did it change the command-line interface? Only additions are allowed
    without a major version increment. Changing file formats also requires a
    major version number increment.
  • Is it documented in the ChangeLog?
  • Was a spellchecker run on the source code and documentation after
    changes were made?

@mr-c
Copy link
Contributor

mr-c commented Mar 9, 2015

The APIs are getting a slow overhaul and are subject to change at any time so there is no current need to segregate; thanks though!

kdm9 added a commit to kdm9/khmer that referenced this pull request Mar 9, 2015
@kdm9
Copy link
Contributor

kdm9 commented Mar 9, 2015

OK, so to avoid nuking anything I've made a backup of this branch here:
https://github.com/kdmurray91/khmer/tree/upstream/feature/pymemoryview

and will have a go at pushing directly to this branch

@kdm9 kdm9 force-pushed the feature/pymemoryview branch from 5004c00 to 635aa28 Compare March 9, 2015 23:23
@kdm9
Copy link
Contributor

kdm9 commented Mar 10, 2015

What's the policy on cleaning formatting errors etc in places not touched by the changes in a PR?

E.g., there's some formatting issues in the HLL counter, I assume I leave those for another PR?

@kdm9 kdm9 force-pushed the feature/pymemoryview branch from 2bcebc0 to 3089a9f Compare March 10, 2015 01:51
@kdm9
Copy link
Contributor

kdm9 commented Mar 10, 2015

All ticked off!

@mr-c
Copy link
Contributor

mr-c commented Mar 10, 2015

You're on the hook for lines of code you touch, and no other :-) I'm sure @luizirber will want to hear about any HLL issues.

@mr-c
Copy link
Contributor

mr-c commented Mar 10, 2015

LGTM

mr-c added a commit that referenced this pull request Mar 10, 2015
Initial implementation of read-only buffer access to raw tables
@mr-c mr-c merged commit a4c46e0 into master Mar 10, 2015
@mr-c mr-c deleted the feature/pymemoryview branch March 10, 2015 02:13
@mr-c
Copy link
Contributor

mr-c commented Mar 10, 2015

Share & enjoy!

@camillescott
Copy link
Member Author

Thanks for putting in the work to make this mergeable @kdmurray91! And sorry for ducking out on it myself; got distracted with other things.

@kdm9
Copy link
Contributor

kdm9 commented Mar 10, 2015

All good, thanks for your initial work 😄

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants