Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature Request: extract up to a full (underlying) word type from BitFieldVec #52

Open
rob-p opened this issue Sep 2, 2024 · 7 comments

Comments

@rob-p
Copy link

rob-p commented Sep 2, 2024

I was wondering if it would be possible to add a function to BitFieldVec that would allow extract multiple elements and returning them in the underlying word type. My specific use-case is this (but I think it's a common pattern).

I have a long string of DNA (alphabet of size 4) that I am packing into a BitFieldVec of width 2. Later, however, I want to extract entire k-mers (substrings of fixed length k). It would be great to be able to extract an entire k-mer at once, which for most common values of k (i.e. k <= 32) would either be entirely in one underlying u64, or would span at most 2. Right now, I have to extract the individual 2-bit characters one-by-one and build up the k-mer, which is quite inefficient.

I would propose something like get_slice(start, num_elem) or get_slice_in_word(start, num_elem), but am happy to leave the design aspects to your preference. Let me know if you have any questions or feedback about the request.

Thanks!
Rob

@RagnarGrootKoerkamp
Copy link
Contributor

For reference, I also added this to my own 'packed' wrapper type, which I think I created specifically because sux didn't have it:

https://github.com/RagnarGrootKoerkamp/minimizers/blob/master/src/par/packed.rs#L134

In the future, I would also like to have a to_u128 version of this, but not sure what's the best API there.

@vigna
Copy link
Owner

vigna commented Sep 3, 2024

What about getting an iterator, rather then a slice? Then you could collect it. I realize that the method you suggest is more convenient tho.

I think there's a way to get an unchecked iterator. So get_slice should be trivial to implement using that.

@rob-p
Copy link
Author

rob-p commented Sep 3, 2024

Hi Sebastiano,

Indeed, the solution I am currently implementing to obtain what I need given the existing interface is to call into_unchecked_iter_from starting at the beginning of the slice I need, and then to collect the next k items into a machine word, appropriately shifted. This certainly gives the correct answer, but I'm guessing it's substantially more overhead (I've not looked at the generated assembly yet). The reason is that in the original BitFieldVec, the items that constitute the underlying k-mer are already packed maximally into the storage. So, if the k-mer exists within a single storage unit (say a u64 if we're using that), then obtaining it is just a fetch, a shift and a mask. It's a bit more work if it spans two storage units, but not much. On the other hand, with the iterator approach, each 2-bit element has to be properly shifted into the output word as it is obtained. For longer k, this can be non-trivially more work than just pulling out the needed bits. I do suspect the compiler is clever enough to help some, and the iterator is pulling in adjacent elements so everything is cached. However, if this use-case is general enough, it may be worthwhile to have specific support for it?

--Rob

@vigna
Copy link
Owner

vigna commented Sep 3, 2024

Ah ok I mistakenly understood you wanted them unpacked.

I already have this code in Java (the sublist code of the list view of a bit vector). I think it shouldn't be too difficult to port it.

And yes, it can be significantly optimized.

Or, we can adapt Ragnar's code.

@vigna
Copy link
Owner

vigna commented Oct 27, 2024

Hey, is there still interest for this?

@RagnarGrootKoerkamp
Copy link
Contributor

I think I worked around it for now in my own code, but for the future, yes this is definitely still useful.

I think the API that Rob suggested, taking a start index and number of elements to cover, makes sense.
Then you can shift the resulting value so that it covers C * num_elements bits, starting at the lowest bit.

This is easy when the value fits in up to 7 bytes (or up to 58 bits, to be precise, or better: not 59, 61, 62, or 63 bits), since then a single unaligned read suffices. After that you may need two reads.

@rob-p
Copy link
Author

rob-p commented Oct 27, 2024

Yes, I would still be interested in this feature and think it would be valuable in several contexts.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants