Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: Exposing the raw data in a count-min sketch/bloom filter via numpy? #667

Closed
kdm9 opened this issue Nov 20, 2014 · 14 comments
Closed

RFC: Exposing the raw data in a count-min sketch/bloom filter via numpy? #667

kdm9 opened this issue Nov 20, 2014 · 14 comments

Comments

@kdm9
Copy link
Contributor

kdm9 commented Nov 20, 2014

Hi all,

Firstly, forgive me if this has been discussed elsewhere and my google-fu is lacking.

I've been hacking with various things around khmer that use the values of buckets in a count min sketch. Currently, I don't see any simple way to go thru the CountingHash's tables bin-wise. What do you think about making a transparent (possible R/W) numpy interface to these tables. Passed by reference of course, which I think would involve creating a numpy C array struct and pointing its data member it at the start of the table.

In asking this question, I am volunteering to do this 😄, however it might not happen very soon.

The API I'd propose would be:

ht.tables = (np.array([0,0,0,0,0, dtype=uint8]),
             np.array([0,0,0,0,0,0,0, dtype=uint8]))

for a CMS of 2 tables >= 5

Cheers,
K

@mr-c
Copy link
Contributor

mr-c commented Nov 20, 2014

Interesting idea! If you are volunteering I am not against it. Note: within the year we'd like to support variable bucket sizes to eliminate the need for bigcount or save lots of memory in the case of diginorm with a cutoff of 5.

Would NumPy be able to cope with variable bit width buckets?

@ctb
Copy link
Member

ctb commented Nov 20, 2014

On Thu, Nov 20, 2014 at 05:14:12AM -0800, Michael R. Crusoe wrote:

Interesting idea! If you are volunteering I am not against it. Note: within the year we'd like to support variable bucket sizes to eliminate the need for bigcount or save lots of memory in the case of diginorm with a cutoff of 5.

Would NumPy be able to cope with variable bit width buckets?

I'm +0 on supporting static views (i.e. return a copy). I'm -1 on supporting
dynamic views unless we have a strong, important use case.

That should make the variable bucket size situation easier :)

cheers,

--titus

C. Titus Brown, ctb@msu.edu

@kdm9
Copy link
Contributor Author

kdm9 commented Nov 21, 2014

By variable widths, do you mean, e.g., uint8_t thru uint64_t and/or <8 bit width? If the former, I think numpy handles it fine. Otherwise, not so sure.

@camillescott
Copy link
Member

Arbitrary widths would be preferable. For example, as few as 2 or 3 bits would be sufficient for most diginorm runs, which numpy doesn't really handle.

@kdm9
Copy link
Contributor Author

kdm9 commented Nov 21, 2014

True. I guess you could just give the raw bins themselves, as uint64_t or whatever the container is for the bit array. I'm also thinking that this should be hashtable._tables, indicating the fact that this is an internal member.

Essentially, (and especially in light of Titus' latest blog post) I see this feature as a way of hacking on the internals of khmer without having to write C++, as I am now. This would allow more rapid prototyping and would make my general workflow of "Hey, I wonder what happens if..." a lot less painful (for better or worse).

@kdm9
Copy link
Contributor Author

kdm9 commented Nov 21, 2014

One thing that may be possible is to pass by reference if the bit width is a multiple of 8, otherwise make a static, RO copy and zero-fill each bin to the next largest round bit width. That may be rather slow and inefficient, but again, you (A) have the choice of using a round number bit width whilst hacking, and (B) are hacking, so slowness is less of a concern.

@ctb
Copy link
Member

ctb commented Nov 22, 2014

You know, it occurred to me that it should be pretty easy to expose the individual subtables from counting and hashbits directly to Python via the existing interfaces. That is, _counts[0] could be something that you request once you have a CountingHash object. Then you can access individual counts or bits directly via their index. Make any sense?

@kdm9
Copy link
Contributor Author

kdm9 commented Nov 23, 2014

Yup, I initially considered that, however I thought that having things as numpy arrays would allow you do do, e.g. np.sum(ht._tables[0] > 2) and similar calculations in a vectored manner (at least in terms of writing code).

No idea what will be the best method, I may just have a go at implementing them all and seeing what works best. But that probably won't happen this year.

@camillescott
Copy link
Member

The only thing I would be worried about is making numpy an install requirement -- it's a hefty build for many platforms, even if a lot of users have it already.

With that in mind, a good middle ground might be to expose the tables as python buffers using the buffer interface. Then, a user who wants to use numpy can construct a numpy array efficiently using numpy.frombuffer.

@kdm9
Copy link
Contributor Author

kdm9 commented Nov 23, 2014

@camillescott Perfect!! I forgot you could do that. Thanks for the suggestion

@camillescott
Copy link
Member

No problem!

I was thinking about messing with this tonight, but if it's something you'd like to go ahead with I'll leave you to your own devices :)

@kdm9
Copy link
Contributor Author

kdm9 commented Nov 23, 2014

Don't wait for me, mess away! I'll lend a hand any way I can.

@camillescott
Copy link
Member

@kdmurray91 I believe this should cover some of your needs for now: #671

It's read-only, which satisfies @ctb's concerns, but also avoids copying any memory.

@kdm9
Copy link
Contributor Author

kdm9 commented Mar 31, 2015

I'm happy for this to be closed now #671 is merged :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants