Remove the bytecode / interpreter machinery and use the serde API directly #118
Thanks for raising this issue :). The reasons are mostly historic ones. I agree that replacing the event stream abstraction with direct serialization makes sense. I simply did not get around to actually implementing it. I don't think there is a factor of 10x on the table, though. At most 4x - 5x (see benchmarks). I believe To explain the history: this crate is my first exposure of interacting with the |
Very clear, thanks! :)
Ah yes, we probably wouldn't get significantly better than arrow2-convert. I did get those gains compared to apache-avro, but that implementation was very far from optimal in many places. I didn't mean to suggest that we could expect the same gains here.
I implemented everything required to execute benchmarks (90 % sure the resulting buffers are filled correctly :D). The result is a speedup of approx. 2x (1.81 for complex types, 2.12 for simple types):
| label | time [ms] | arrow2_convert | serde_arrow_ng | serde_arrow | arrow |
|---|---|---|---|---|---|
| arrow2_convert | 695.25 | 1.00 | 0.40 | 0.22 | 0.14 |
| serde_arrow_ng | 1745.76 | 2.51 | 1.00 | 0.55 | 0.36 |
| serde_arrow | 3152.74 | 4.53 | 1.81 | 1.00 | 0.65 |
| arrow | 4873.73 | 7.01 | 2.79 | 1.55 | 1.00 |
primitives_serialize(1000000)
| label | time [ms] | arrow2_convert | serde_arrow_ng | serde_arrow | arrow |
|---|---|---|---|---|---|
| arrow2_convert | 136.05 | 1.00 | 0.47 | 0.22 | 0.10 |
| serde_arrow_ng | 290.84 | 2.14 | 1.00 | 0.47 | 0.21 |
| serde_arrow | 617.97 | 4.54 | 2.12 | 1.00 | 0.45 |
| arrow | 1373.00 | 10.09 | 4.72 | 2.22 | 1.00 |
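For reading the tables: each ratio cell appears to be the row's time divided by the column's time, so values above 1 mean the row is slower than the column. A minimal sketch of that arithmetic, using the timings reported above (the helper name `ratio` is made up for illustration):

```rust
/// Relative time of `row` versus `column`; values above 1.0 mean `row` is slower.
/// (Hypothetical helper, not part of any benchmark harness.)
fn ratio(row_ms: f64, column_ms: f64) -> f64 {
    row_ms / column_ms
}

fn main() {
    // First table: serde_arrow vs. serde_arrow_ng.
    println!("{:.2}", ratio(3152.74, 1745.76));
    // primitives_serialize(1000000): serde_arrow vs. serde_arrow_ng.
    println!("{:.2}", ratio(617.97, 290.84));
}
```

These two cells are the 1.81 and 2.12 speedups quoted above.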
The next steps will be to switch the internal test suite to the new serializer and then squash test errors one by one. I would expect progress over the next couple of weeks to be a bit slower though, due to time constraints. As a side note: I am really surprised how well the idea of the bytecode serializer holds up.
Amazing! Congratz!
Not sure how urgent this feature is for you. I definitely would like to touch up some more details before cutting a proper release. But I could do a pre-release for you to test, if that helps. Oh. And thanks again for pointing me in this direction. The result is now approx. 2x performance, -3000 lines of code, and, I believe, a much easier to understand impl :).
Your open-mindedness and reactivity made that an absolute pleasure 😊
Don't mind me, I'll just point to github in the meantime. 😊
I cut a pre-release after all for my own package using serde_arrow :)
First of all, thanks for this crate!
Secondly, sorry for opening an issue to ask a question - I couldn't find a "discussions" section.
Disclaimer: I don't know much about building arrow values but I do know serde very well - I'm considering using this crate for transcoding from avro to arrow but I'm wondering about the performance implications of the intermediate-event-stream representation.
Why the interpreter and event stream intermediate representation instead of just pushing into arrow vectors directly from your `impl Serializer`s? (Isn't it possible to "just push to arrow vectors"?) Is that a performance improvement? If so, why does that improve performance (compared to the alternative below)? Doesn't that prevent e.g. copying the `&str`s provided by the `Serialize` implementor directly into their appropriate buffer and instead require an extra allocation? (`Event::Str("1970-01-01T00:00:00Z").to_owned(),`)
Basically my question is: why not something along those lines:
(I've achieved x10 performance on avro serialization compared to the apache-avro implementation specifically by removing the intermediate representation they have, which is why at first glance I'm surprised by this, but I don't know arrow well.)
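A minimal sketch of the contrast being asked about, using only the standard library. The types `Event` and `Utf8Column` and the functions `via_events` and `direct` are hypothetical stand-ins, not serde's or serde_arrow's real API. The column mimics an Arrow-style string layout: one contiguous byte buffer plus offsets.

```rust
/// Hypothetical intermediate event: the borrowed string is copied into an
/// owned `String` before it ever reaches the column.
#[derive(Debug, PartialEq)]
enum Event {
    Str(String),
}

/// Hypothetical Arrow-style UTF-8 column: all values share one byte buffer,
/// and `offsets[i]..offsets[i + 1]` delimits value `i`.
#[derive(Debug, PartialEq)]
struct Utf8Column {
    offsets: Vec<i32>,
    data: Vec<u8>,
}

impl Utf8Column {
    fn new() -> Self {
        Utf8Column { offsets: vec![0], data: Vec::new() }
    }

    /// Copy the borrowed bytes straight into the value buffer.
    fn push(&mut self, value: &str) {
        self.data.extend_from_slice(value.as_bytes());
        self.offsets.push(self.data.len() as i32);
    }

    /// Read value `idx` back out of the shared buffer.
    fn get(&self, idx: usize) -> &str {
        let start = self.offsets[idx] as usize;
        let end = self.offsets[idx + 1] as usize;
        std::str::from_utf8(&self.data[start..end]).unwrap()
    }
}

/// Route 1: materialize events first (one extra allocation per string),
/// then interpret them into the column.
fn via_events(values: &[&str]) -> Utf8Column {
    let events: Vec<Event> = values.iter().map(|v| Event::Str(v.to_string())).collect();
    let mut column = Utf8Column::new();
    for event in &events {
        let Event::Str(s) = event;
        column.push(s);
    }
    column
}

/// Route 2: push each borrowed string directly into the column.
fn direct(values: &[&str]) -> Utf8Column {
    let mut column = Utf8Column::new();
    for v in values {
        column.push(v);
    }
    column
}

fn main() {
    let values = ["1970-01-01T00:00:00Z", "abc"];
    let column = direct(&values);
    println!("{} {}", column.get(0), column.get(1));
}
```

Both routes produce the same column. The difference is that the event route allocates a `String` per value and walks the data twice, which is the overhead the question is about.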
Thanks a lot ❤️