Zero-Copy Serialization/Deserialization #5
rkyv looks good, and adding it as an optional dependency (because I am quite keen on keeping the crate zero-dependency) might be an option. I'll look into it further, because it seems like a nice addition. |
Well, if you're using the (de)serialization as a way to create a data-exchange/file format (think .jpg, not an application-instance-specific .dat), then that format will want to decide on some endianness. In my case it's a file format for knowledge graphs, with the added bonus that you can query it without having to build any indexes first, just mmap and go. So it's always going to be big-endian. The Stable Cross-Platform Database File section in the SQLite documentation is probably the best description of that use case. Avoiding breaking changes caused by the way rkyv stores things is also an argument for rolling our own framework-agnostic data layout. Edit: Btw, it's also completely fine if such a use case doesn't align with the project goals 😄 |
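To make the endianness point concrete, here is a minimal sketch of a fixed big-endian word layout that can be read in place. The helper names `words_to_be_bytes` and `read_be_word` are made up for illustration and are not part of this crate or of rkyv.

```rust
/// Encode words as big-endian bytes (hypothetical helper, not crate API).
fn words_to_be_bytes(words: &[u64]) -> Vec<u8> {
    words.iter().flat_map(|w| w.to_be_bytes()).collect()
}

/// Read the `index`-th big-endian u64 directly from a byte buffer,
/// e.g. a memory-mapped file, without decoding the rest of it.
/// Returns None if the word is not fully contained in the buffer.
fn read_be_word(bytes: &[u8], index: usize) -> Option<u64> {
    let start = index.checked_mul(8)?;
    let end = start.checked_add(8)?;
    let mut word = [0u8; 8];
    word.copy_from_slice(bytes.get(start..end)?);
    Some(u64::from_be_bytes(word))
}
```

Because the byte order is part of the format rather than of the host, the same file reads identically on little-endian and big-endian machines, at the cost of a byte swap per word on little-endian hosts.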
I pushed some changes to a new branch All functionality of As it stands, this constitutes a breaking change, because you now have to import the trait to access methods on |
Awesome, I'll check it out asap! |
I just started some work on a minimalist approach to this, using a combination of techniques from This would still require some form of archive format (i.e. a header with length and layer count), but most of the data (data/blocks/...) can just be sliced from the binary data source. The main reason I'm writing this comment though is that I've noticed that
pub struct WaveletMatrix {
    data: Box<[RsVec]>,
    bits_per_element: u16,
}
appears to have the information about the number of layers redundantly. Is this an atavism from a version in which data was only a single Really awesome implementation btw, I'm super impressed. It smells like you were able to justify working on it as part of your PhD 😆 |
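A rough sketch of the "small header, then slice the rest" idea mentioned above. The header layout here (two big-endian u64 fields, `payload_len` in bytes and `layers`) is purely an assumption for illustration; the thread does not fix a concrete format.

```rust
use std::convert::TryFrom;

/// Hypothetical fixed-size header: two big-endian u64 fields.
#[derive(Debug, PartialEq)]
pub struct ArchiveHeader {
    /// Size of the payload in bytes (assumed field).
    pub payload_len: u64,
    /// Number of layers stored in the payload (assumed field).
    pub layers: u64,
}

const HEADER_BYTES: usize = 16;

/// Decode one big-endian u64; the caller must pass exactly 8 bytes.
fn read_be_u64(bytes: &[u8]) -> u64 {
    let mut field = [0u8; 8];
    field.copy_from_slice(bytes);
    u64::from_be_bytes(field)
}

/// Parse the header and return it with the payload slice it describes.
/// The payload is borrowed from `bytes`, so nothing is copied.
pub fn split_archive(bytes: &[u8]) -> Option<(ArchiveHeader, &[u8])> {
    let head = bytes.get(..HEADER_BYTES)?;
    let header = ArchiveHeader {
        payload_len: read_be_u64(&head[..8]),
        layers: read_be_u64(&head[8..]),
    };
    let payload_len = usize::try_from(header.payload_len).ok()?;
    let end = HEADER_BYTES.checked_add(payload_len)?;
    let payload = bytes.get(HEADER_BYTES..end)?;
    Some((header, payload))
}
```

A truncated or inconsistent buffer yields `None` instead of a panic, which seems like the safer default when the bytes come from a file.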
*but use specific sizes in the structs 😆 Could we fixate the
const SUPER_BLOCK_SIZE: usize = 1 << 13; The The Lastly the fields Btw this should read /// The indices do not point into the bit-vector, but into the ~~select~~super-block vector.
right? |
No, the counter isn't reset between super-blocks, lest you'd need to generate a prefix-sum of all super-blocks before the query.
Yes. I'll add it to #9.
Yes
Can't you just add a value to the file header that tells the loading code how big the slice is? |
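To illustrate the first answer above (a counter that keeps running instead of being reset), here is a toy sketch with one cumulative counter per 64-bit word. `CumulativeRank` is a hypothetical stand-in and does not reflect the crate's actual block or super-block layout.

```rust
/// Toy rank index, NOT the crate's layout: `ones_before[i]` holds the
/// number of 1-bits in all words before word `i`, and is never reset.
struct CumulativeRank {
    words: Vec<u64>,
    ones_before: Vec<usize>,
}

impl CumulativeRank {
    fn new(words: Vec<u64>) -> Self {
        let mut ones_before = Vec::with_capacity(words.len());
        let mut running = 0usize;
        for word in &words {
            ones_before.push(running);
            running += word.count_ones() as usize;
        }
        Self { words, ones_before }
    }

    /// Number of 1-bits in positions [0, index).
    /// Panics if `index` is not below `words.len() * 64`.
    fn rank1(&self, index: usize) -> usize {
        let (word, bit) = (index / 64, index % 64);
        // Mask selecting the `bit` lowest bits of the word.
        let mask = (1u64 << bit) - 1;
        self.ones_before[word] + (self.words[word] & mask).count_ones() as usize
    }
}
```

Since each counter already includes everything before it, a query reads one counter and one word; with counters that restart at every boundary, the query would first have to sum the preceding counters.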
Ah gotcha, sorry
I'll create a PR
True, it's just something that needs to be stored per layer, so Btw which memory layout do you prefer? Storing the Block/SuperBlock/...s so that each level is stored consecutively |
I just misread your question 😅. I don't plan on adding slab allocation or anything, because this isn't a dynamic data structure, and thus it shouldn't be worth messing with allocation here. The only thing you should keep in mind is that I do plan on merging I also don't think changing the layout here gets any improvements in locality, because for that, multiple block-lists would have to share cache lines, which shouldn't happen for non-trivial vectors. So for simplicity, |
Gotcha 😄! I do see how that's potentially nice for |
Good point, that needs to be evaluated during implementation |
I did some thinking. Maybe it's not a good idea to try to provide the zero-copy serialization/deserialization as a pre-packaged, one-size-fits-all solution. Someone designing an on-disk file format will have thoughts and insights about the layout themselves. For example, if you store multiple wavelet matrices with the same alphabet but different orderings, you can share that information between them, whereas an implementation provided by us would have to replicate it in every serialised instance. So I think it might be better to make the internals part of the public interface, and expose constructors that can take something like the generic read-only I mean the cool thing about succinct vectors is that they are somewhat simple data structures, and they are somewhat canonical by construction (modulo hyperparameters like superblock ranges). |
so you suggest just a from-raw-parts layer, maybe some conversions with common libs and then let downstream crates handle it? |
Yeah exactly. Focusing on making the raw parts stable and |
Yeah, that actually sounds reasonable, and gets rid of a lot of inelegant decision-making. And the raw parts are pseudo-stable anyway, since I don't want to break serde compatibility between versions. This also means that it's probably possible to implement the necessary parts of this issue without a major version bump; they'll just break alongside everything else when I do one. |
Maybe.
pub struct RsVec {
    data: Vec<u64>,
    len: usize,
    blocks: Vec<BlockDescriptor>,
    super_blocks: Vec<SuperBlockDescriptor>,
    select_blocks: Vec<SelectSuperBlockDescriptor>,
    rank0: usize,
    rank1: usize,
}
Becomes
pub struct RsVec {
    len: usize,
    pub(crate) rank0: usize,
    pub(crate) rank1: usize,
    data: PackedSlice<u64>,
    blocks: PackedSlice<BlockDescriptor>,
    super_blocks: PackedSlice<SuperBlockDescriptor>,
    select_blocks: PackedSlice<SelectSuperBlockDescriptor>,
}
Or any other I think it should serialise similarly with Serde, but it would probably require a custom (de)serializer implementation that creates a byte owning |
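As a rough illustration of swapping `Vec` for a borrowed or byte-backed storage type and building from raw parts, here is a sketch that is generic over `AsRef<[u64]>`. This is a different, simpler technique than the `PackedSlice` fields above; `RawBits` and its `from_raw_parts` are hypothetical names, not the crate's `RsVec` API.

```rust
/// Hypothetical stand-in for a succinct vector whose storage is generic:
/// `S` can be an owned `Vec<u64>` or a `&[u64]` borrowed from a mapped file.
pub struct RawBits<S> {
    len: usize,
    data: S,
}

impl<S: AsRef<[u64]>> RawBits<S> {
    /// Build from raw parts, checking that every addressable bit is
    /// backed by a stored word. Returns None if `data` is too short.
    pub fn from_raw_parts(len: usize, data: S) -> Option<Self> {
        let needed_words = len / 64 + usize::from(len % 64 != 0);
        if data.as_ref().len() < needed_words {
            return None;
        }
        Some(Self { len, data })
    }

    /// Read the bit at `index`, or None if it is out of range.
    pub fn get(&self, index: usize) -> Option<bool> {
        if index >= self.len {
            return None;
        }
        let word = self.data.as_ref()[index / 64];
        Some((word >> (index % 64)) & 1 == 1)
    }
}
```

The query code is written once against `AsRef<[u64]>`, so the owned and the borrowed variant share it; the validation in the constructor is what keeps a from-raw-parts entry point safe to expose.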
Yeah I think that's fair. I did some preliminary benchmarks, and saw no impact on performance (the One interesting side effect, wrt. performance, of using the I also just realized that there is a third option of me implementing I've created a draft pull request to make discussions about the code easier at #14. ARM SIMD support is also on my wishlist but that's a different issue 😉
|
Wouldn't that induce |
Yes, definitely, but the way Since |
I just stumbled upon I've looked at their code and it's less flexible and more cumbersome than what we have imho, so it's also nice that we can make some improvements in this space. From Friday on I'll be in Sicily for two weeks to harvest some olives, so I might find the time to push this a bit 😁 |
So, since I've been doing some work here anyway, what's the status on this? I tend to agree with your last approach that we only expose some constructors and add some tags on the structures (guarded by a crate feature, obviously), and leave the details of serialization to downstream crates. |
It was relatively easy to replace the I just never got around to writing a de/serialization example itself (the epsilon part of the epsilon serialization), because integrating vers into my storage layer was a bit harder than anticipated, and then there were other things coming up at work. But the tests pass. The only things missing are constructors that allow creation from the byte types; I'll try to add those. |
The pointer-free nature of succinct data-structures makes them very amenable to (de)serialization by simply casting their memory to/from a bunch of bytes.
Not only would this remove most (de)serialization costs, it could also enable very fast and simple on-disk storage when combined with mmap.
One might want to implement this via rkyv, but simply providing a safe transmute to and from bytes::Bytes (with a potential rkyv implementation on top of that) might be the simpler, more agnostic solution.
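A minimal sketch of what such a checked cast could look like, using only std slices rather than `bytes::Bytes`. The function names are hypothetical, and the words are reinterpreted in the host's native byte order, so this alone does not give a portable file format.

```rust
use std::mem::{align_of, size_of, size_of_val};
use std::slice;

/// View a byte buffer as u64 words without copying.
/// Returns None if the buffer is misaligned or not a multiple of 8 bytes.
fn bytes_as_words(bytes: &[u8]) -> Option<&[u64]> {
    let aligned = (bytes.as_ptr() as usize) % align_of::<u64>() == 0;
    let whole = bytes.len() % size_of::<u64>() == 0;
    if !(aligned && whole) {
        return None;
    }
    // SAFETY: the pointer is aligned for u64, the word count covers exactly
    // `bytes.len()` bytes, every bit pattern is a valid u64, and the
    // returned slice borrows from `bytes`.
    Some(unsafe {
        slice::from_raw_parts(bytes.as_ptr() as *const u64, bytes.len() / size_of::<u64>())
    })
}

/// View u64 words as raw bytes without copying.
fn words_as_bytes(words: &[u64]) -> &[u8] {
    // SAFETY: u8 has alignment 1, u64 has no padding bytes, and the
    // byte length is exactly the size of the word slice.
    unsafe { slice::from_raw_parts(words.as_ptr() as *const u8, size_of_val(words)) }
}
```

The unsafe casts are confined to two small functions with explicit checks, so callers only see a safe interface that either returns a view or `None`.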