RFC: Exposing the raw data in a count-min sketch/bloom filter via numpy? #667

kdm9 · 2014-11-20T05:27:35Z

Hi all,

Firstly, forgive me if this has been discussed elsewhere and my google-fu is lacking.

I've been hacking on various things around khmer that use the values of buckets in a count-min sketch. Currently, I don't see any simple way to go through the CountingHash's tables bin-wise. What do you think about making a transparent (possibly R/W) numpy interface to these tables? They would be passed by reference of course, which I think would involve creating a numpy C array struct and pointing its data member at the start of the table.

In asking this question, I am volunteering to do this 😄, however it might not happen very soon.

The API I'd propose would be:

ht.tables = (np.array([0,0,0,0,0], dtype=np.uint8),
             np.array([0,0,0,0,0,0,0], dtype=np.uint8))

for a CMS with 2 tables of size >= 5 (here, 5 and 7 bins)

Cheers,
K
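A minimal sketch of the "passed by reference" idea, not khmer code: the two ctypes arrays below are hypothetical stand-ins for C-allocated tables of 5 and 7 one-byte bins, and `np.ctypeslib.as_array` wraps that memory without copying it, so writes on either side are visible on the other.

```python
import ctypes

import numpy as np

# Hypothetical stand-ins for two C-allocated tables (5 and 7 one-byte bins).
c_tables = ((ctypes.c_uint8 * 5)(), (ctypes.c_uint8 * 7)())

# as_array wraps the existing memory; no bytes are copied.
tables = tuple(np.ctypeslib.as_array(t) for t in c_tables)

tables[0][2] = 7    # write through the numpy view
c_tables[1][4] = 3  # write on the "C" side

print(c_tables[0][2], tables[1][4])
```

A real implementation would build the array at the C level instead, but the sharing semantics would be the same.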


mr-c · 2014-11-20T13:14:11Z

Interesting idea! If you are volunteering I am not against it. Note: within the year we'd like to support variable bucket sizes to eliminate the need for bigcount or save lots of memory in the case of diginorm with a cutoff of 5.

Would NumPy be able to cope with variable bit width buckets?

ctb · 2014-11-20T13:19:03Z

On Thu, Nov 20, 2014 at 05:14:12AM -0800, Michael R. Crusoe wrote:

> Interesting idea! If you are volunteering I am not against it. Note: within the year we'd like to support variable bucket sizes to eliminate the need for bigcount or save lots of memory in the case of diginorm with a cutoff of 5.
>
> Would NumPy be able to cope with variable bit width buckets?

I'm +0 on supporting static views (i.e. return a copy). I'm -1 on supporting
dynamic views unless we have a strong, important use case.

That should make the variable bucket size situation easier :)

cheers,

--titus

C. Titus Brown, ctb@msu.edu

kdm9 · 2014-11-21T00:40:29Z

By variable widths, do you mean, e.g., uint8_t thru uint64_t and/or <8 bit width? If the former, I think numpy handles it fine. Otherwise, not so sure.

camillescott · 2014-11-21T01:19:14Z

Arbitrary widths would be preferable. For example, as few as 2 or 3 bits would be sufficient for most diginorm runs, which numpy doesn't really handle.

kdm9 · 2014-11-21T01:32:01Z

True. I guess you could just give the raw bins themselves, as uint64_t or whatever the container is for the bit array. I'm also thinking that this should be hashtable._tables, indicating the fact that this is an internal member.

Essentially, (and especially in light of Titus' latest blog post) I see this feature as a way of hacking on the internals of khmer without having to write C++, as I am now. This would allow more rapid prototyping and would make my general workflow of "Hey, I wonder what happens if..." a lot less painful (for better or worse).

kdm9 · 2014-11-21T01:34:41Z

One thing that may be possible is to pass by reference if the bit width is a multiple of 8, otherwise make a static, RO copy and zero-fill each bin to the next largest round bit width. That may be rather slow and inefficient, but again, you (A) have the choice of using a round number bit width whilst hacking, and (B) are hacking, so slowness is less of a concern.
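A rough sketch of the "zero-fill each bin to the next round width" fallback, under assumptions that are not khmer's: bins are packed contiguously, most significant bit first, and are at most 8 bits wide. The function name `unpack_bins` is hypothetical.

```python
import numpy as np


def unpack_bins(packed, bits, n_bins):
    """Expand n_bins packed counters of `bits` bits each into a read-only uint8 copy."""
    raw = np.frombuffer(packed, dtype=np.uint8)
    # One array element per bit, MSB first; drop the padding bits at the end.
    bit_stream = np.unpackbits(raw)[: bits * n_bins]
    # Place values for one bin, e.g. [4, 2, 1] for 3-bit bins.
    weights = 1 << np.arange(bits - 1, -1, -1)
    values = bit_stream.reshape(n_bins, bits) @ weights
    out = values.astype(np.uint8)
    out.setflags(write=False)  # static, read-only copy
    return out


# Four 3-bit bins holding 5, 2, 7, 0: bits 101 010 111 000, padded to two bytes.
packed = bytes([0b10101011, 0b10000000])
print(unpack_bins(packed, bits=3, n_bins=4))
```

This copies every bin, so it is only suitable for the hacking scenario described above.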

ctb · 2014-11-22T13:19:34Z

You know, it occurred to me that it should be pretty easy to expose the individual subtables from counting and hashbits directly to Python via the existing interfaces. That is, _counts[0] could be something that you request once you have a CountingHash object. Then you can access individual counts or bits directly via their index. Make any sense?

kdm9 · 2014-11-23T03:14:45Z

Yup, I initially considered that. However, I thought that having things as numpy arrays would allow you to do, e.g., np.sum(ht._tables[0] > 2) and similar calculations in a vectorized manner (at least in terms of writing code).

No idea what will be the best method, I may just have a go at implementing them all and seeing what works best. But that probably won't happen this year.
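For illustration, here is what the np.sum(ht._tables[0] > 2) style of query looks like on a plain array. The table contents are made up, and `table` is just a local stand-in for one exposed table.

```python
import numpy as np

# Made-up contents for one 7-bin table of 8-bit counters.
table = np.array([0, 3, 1, 5, 2, 4, 0], dtype=np.uint8)

n_above_cutoff = int(np.sum(table > 2))   # bins with a count above 2
n_occupied = int(np.count_nonzero(table))  # bins that have been hit at all
occupancy = n_occupied / table.size        # fraction of the table in use

print(n_above_cutoff, n_occupied, occupancy)
```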

camillescott · 2014-11-23T03:19:55Z

The only thing I would be worried about is making numpy an install requirement -- it's a hefty build for many platforms, even if a lot of users have it already.

With that in mind, a good middle ground might be to expose the tables as python buffers using the buffer interface. Then, a user who wants to use numpy can construct a numpy array efficiently using numpy.frombuffer.
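A small sketch of that pattern, using a `bytearray` as a hypothetical stand-in for a table object that implements the buffer interface. numpy.frombuffer builds the array on top of the exporter's memory instead of copying it.

```python
import numpy as np

# Stand-in for a table exposed through the buffer interface.
table = bytearray(7)

# The array is a view onto the bytearray's memory, not a copy.
view = np.frombuffer(table, dtype=np.uint8)

table[3] = 9  # update made by the owner of the buffer
print(view)   # the change shows up through the numpy view
```

The library itself never has to import numpy; only the user who wants arrays does.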

kdm9 · 2014-11-23T04:02:01Z

@camillescott Perfect!! I forgot you could do that. Thanks for the suggestion

camillescott · 2014-11-23T04:32:47Z

No problem!

I was thinking about messing with this tonight, but if it's something you'd like to go ahead with I'll leave you to your own devices :)

kdm9 · 2014-11-23T04:36:00Z

Don't wait for me, mess away! I'll lend a hand any way I can.

camillescott · 2014-11-24T00:51:52Z

@kdmurray91 I believe this should cover some of your needs for now: #671

It's read-only, which satisfies @ctb's concerns, but also avoids copying any memory.
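To illustrate how read-only and zero-copy can go together, here is a sketch that does not reflect the actual code in #671: a read-only `memoryview` over a stand-in `bytearray` (requires Python 3.8+ for `toreadonly`). numpy marks the resulting array as non-writeable, while updates made by the owner remain visible.

```python
import numpy as np

table = bytearray(5)                 # stand-in for a table owned by the library
ro = memoryview(table).toreadonly()  # read-only export of the same memory
counts = np.frombuffer(ro, dtype=np.uint8)

table[0] = 4  # the owner can still update the table

try:
    counts[0] = 1  # the consumer cannot write through the view
    write_blocked = False
except ValueError:
    write_blocked = True

print(counts[0], write_blocked)
```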

kdm9 · 2015-03-31T01:45:16Z

I'm happy for this to be closed now that #671 is merged :)

camillescott mentioned this issue Nov 24, 2014

Initial implementation of read-only buffer access to raw tables #671

Merged

mr-c closed this as completed Mar 31, 2015

betatim mentioned this issue Jan 31, 2017

Atomic nibble instead of mutex #1601

Open
