Add Avro Support #4886
I think having native Avro --> Arrow code would be good, as well as continuing to encourage additional use of Arrow in the ecosystem.
This would be awesome!
Awesome, happy to review any PR and help with tests.
I think this may have gotten bumped by other priorities; @tustvold plans to wrap up whatever his current state is while he works on other things.
I intend to keep working away at this in the background, but any help on reviews would be most appreciated.
FYI a duplicate issue also exists: #727
Today I learned that there is a version of the Avro reader/writer in arrow2:
Hello,
I think this means "reading Avro records into …"? Here is the way it is implemented in DataFusion: …
There are more sophisticated ways of implementing this feature, for example the tape-based methods of the JSON and CSV readers in this crate.
What I would personally recommend doing is: …
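For concreteness, the row-oriented approach described above (decode each record, push every field into its column's builder, then freeze the builders) might look like this minimal sketch; the `AvroRecord` struct and its fields are made up for illustration, standing in for whatever a real Avro decoder yields:

```rust
use std::sync::Arc;

use arrow_array::builder::{Int64Builder, StringBuilder};
use arrow_array::{ArrayRef, RecordBatch};
use arrow_schema::{ArrowError, DataType, Field, Schema};

/// Hypothetical decoded Avro record; a real reader would produce
/// something like apache_avro::types::Value instead.
struct AvroRecord {
    id: i64,
    name: Option<String>,
}

/// Row-at-a-time conversion: walk each record once, appending every
/// field to its column's builder, then freeze the builders into arrays.
fn records_to_batch(records: &[AvroRecord]) -> Result<RecordBatch, ArrowError> {
    let mut ids = Int64Builder::with_capacity(records.len());
    let mut names = StringBuilder::new();
    for rec in records {
        ids.append_value(rec.id);
        names.append_option(rec.name.as_deref());
    }
    let schema = Arc::new(Schema::new(vec![
        Field::new("id", DataType::Int64, false),
        Field::new("name", DataType::Utf8, true),
    ]));
    RecordBatch::try_new(
        schema,
        vec![Arc::new(ids.finish()) as ArrayRef, Arc::new(names.finish())],
    )
}
```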
I have most of the bits and pieces and hope to push something up in the coming week or two. The major missing thing at the moment is tests.
I had a brief look at https://github.com/Ten0/serde_avro_fast and I am not sure it would necessarily be a good fit for arrow-rs, as it appears to rely on knowing the data schema at compile time, but I could be completely off base here.
Heyy, thanks for the quick answers.
It doesn't 😊
That's an interesting idea. Typically the way to achieve performant decode is to decode the values for a column at a time, as this allows amortizing per-row overheads, reducing branch misses, etc. This would obviously not be possible with the serde model, which is inherently value-oriented, but it is also possible that the nature of the Avro encoding, which relies extensively on varint encoding, reduces the benefits of such a columnar approach. I'll try to get what I have polished up over the next few days, and we can compare benchmarks.
Oh that's interesting! I would have imagined that we would prepare all the vectors and push to each of them as we read each field. What are you referring to with regards to per-row overheads? (I'd like to read documentation on this topic; I'm familiar with branch prediction but not this.)
That being said, with Avro's encoding, where you have to fully deserialize each field of an object before you know where the next object starts, plus the block encoding with compression, it's very hard for me to imagine that reading the input several times to extract a single field each time would be the most performant approach. (But even if that were the case, it would look very close to driving the deserializer multiple times, just ignoring all the fields but one each time.)
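To make the sequential-access point concrete: the Avro spec encodes ints and longs as zig-zag varints (the same scheme protobuf uses), so a field's width is only known once it has been read. A minimal decoder sketch:

```rust
/// Decode one Avro long (zig-zag varint) from the front of `buf`,
/// returning the value and the number of bytes consumed. A reader must
/// decode (or at least varint-skip) every field before it knows where
/// the next field, let alone the next row, starts, which is what makes
/// Avro data inherently sequential.
fn read_long(buf: &[u8]) -> Option<(i64, usize)> {
    let mut value: u64 = 0;
    let mut shift = 0u32;
    for (i, &byte) in buf.iter().enumerate() {
        if shift >= 64 {
            return None; // malformed: varint too long
        }
        value |= u64::from(byte & 0x7F) << shift;
        if byte & 0x80 == 0 {
            // Undo zig-zag: 0 -> 0, 1 -> -1, 2 -> 1, 3 -> -2, ...
            let decoded = (value >> 1) as i64 ^ -((value & 1) as i64);
            return Some((decoded, i + 1));
        }
        shift += 7;
    }
    None // ran out of input mid-varint
}
```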
Wonderful! 😊
So IIUC the interface we'd want is basically something that enables converting from an arbitrary …
Yes, whilst this is more important for file formats like parquet that achieve much higher compression ratios than avro, having streaming iterators is pretty standard practice.
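As a sketch of what such a streaming interface could look like (illustrative names only, not an actual arrow-rs API):

```rust
use std::io::Read;

use arrow_array::RecordBatch;
use arrow_schema::ArrowError;

/// Illustrative only. A streaming reader wraps an arbitrary `Read`,
/// pulls blocks on demand, and yields one RecordBatch per `batch_size`
/// rows, so memory use is bounded by the batch size rather than by the
/// size of the file or stream.
struct AvroReader<R: Read> {
    input: R,
    batch_size: usize,
}

impl<R: Read> Iterator for AvroReader<R> {
    type Item = Result<RecordBatch, ArrowError>;

    fn next(&mut self) -> Option<Self::Item> {
        // Decode up to `self.batch_size` rows from `self.input` into
        // column builders; return None once the stream is exhausted.
        unimplemented!("sketch only")
    }
}
```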
You might also be interested in https://docs.rs/arrow-json/50.0.0/arrow_json/reader/struct.Decoder.html#method.serialize
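That method lets you push ordinary `serde::Serialize` values through the tape and pull RecordBatches back out; a minimal usage sketch (the `Row` struct and its fields are mine):

```rust
use std::sync::Arc;

use arrow_json::ReaderBuilder;
use arrow_schema::{DataType, Field, Schema};
use serde::Serialize;

#[derive(Serialize)]
struct Row {
    id: i64,
    name: String,
}

fn rows_to_batch(rows: &[Row]) -> Result<(), Box<dyn std::error::Error>> {
    let schema = Arc::new(Schema::new(vec![
        Field::new("id", DataType::Int64, false),
        Field::new("name", DataType::Utf8, false),
    ]));
    // Serialize any `impl Serialize` rows onto the tape, then flush the
    // accumulated columns out as a RecordBatch.
    let mut decoder = ReaderBuilder::new(schema).build_decoder()?;
    decoder.serialize(rows)?;
    let batch = decoder.flush()?.expect("rows were buffered");
    assert_eq!(batch.num_rows(), rows.len());
    Ok(())
}
```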
Converting between row-oriented and columnar formats is very fiddly, especially where they encode nullability differently 😅
The major downside of row-oriented approaches is that unless you know the types at compile time, or have a JIT, you are paying the overhead of type-dispatch for every field. The whole premise of vectorised execution is that by instead operating on columns, you can amortise this cost across thousands of values within a given column, as well as make it easier for the compiler to optimise the code.
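A toy illustration of the dispatch cost being described, with a dynamically typed row representation on one side and a typed column on the other:

```rust
/// A dynamically typed value, as produced by a row-oriented decoder
/// that does not know the schema's types at compile time.
enum Value {
    Int(i64),
    Str(String),
}

/// Row-oriented: the `match` (type dispatch) runs once per value.
fn sum_dynamic(rows: &[Vec<Value>], col: usize) -> i64 {
    rows.iter()
        .map(|row| match &row[col] {
            Value::Int(v) => *v,
            Value::Str(_) => 0,
        })
        .sum()
}

/// Columnar: the type is resolved once for the whole column, leaving a
/// tight loop the compiler can unroll and auto-vectorise.
fn sum_columnar(col: &[i64]) -> i64 {
    col.iter().sum()
}
```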
Thanks!
So I've asked and it turns out that it might indeed be sub-optimal 🫤
The benchmarks here seem to suggest that this may be a less efficient approach than the one I drafted at chmp/serde_arrow#118 (comment), but I will also have a look; maybe the performance loss comes from elsewhere 🤔
Here's a quick POC for full-featured Avro to Arrow using … It holds in <150 lines total ATM and successfully loads Avro object container files to Arrow.
Performance of serde_arrow should be very close to a zero-cost abstraction since chmp/serde_arrow#120. I'll probably PR that before benchmarks (if @chmp hasn't done it before 🚀)
So I've checked, and it adds significant intermediate representations in the "tape" thing. It seems pretty clear that this is indeed why it's so far behind in the benchmarks.
Sounds promising. I'm afraid I'm not likely to have time to take a look for the next week or so, but I will do so when able. I'm curious if you've tried comparing the performance of arrow-json vs serde_json + serde_transcode + serde_arrow. The reason I ask is that the motivation for the tape is to amortise dispatch overheads (among other things), so I am curious if that is being outweighed by other factors.
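For reference, the pipeline being asked about wires a serde Deserializer straight into a serde Serializer. A sketch using serde_json's serializer as a stand-in sink (I'm not assuming serde_arrow's exact API here):

```rust
// serde_transcode drives the serializer directly from the
// deserializer, with no intermediate serde_json::Value tree; a
// serde_arrow serializer would slot in where the JSON one sits.
fn transcode(input: &str) -> Result<String, Box<dyn std::error::Error>> {
    let mut de = serde_json::Deserializer::from_str(input);
    let mut out = Vec::new();
    let mut ser = serde_json::Serializer::new(&mut out);
    serde_transcode::transcode(&mut de, &mut ser)?;
    Ok(String::from_utf8(out)?)
}
```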
Just a quick comment: the performance of …
That's likely the result of #4861, which avoided needing to serialize numerics to strings. To be clear, the tape largely exists to serve the needs of the raw arrow JSON decoder. It was only later hooked up into serde because we could do so with little additional effort; only very basic effort has been made to optimise it for this use case. I am, however, very interested if a combination of serde_json and serde_arrow is competitive with the raw JSON implementation, as that would open up some very interesting possibilities.
There are some limited tests for …
First draft of the benchmark seems to show that the tape achieves ~ the same performance as serde_arrow when plugged into serde_json, but that wouldn't be true as soon as the BTreeMap lookups are removed, according to the …
But more importantly, the specialized implementation of JSON to Arrow that's done in arrow_json performs much better than going through serde, according to …
ATM it's using random inputs though, which basically contain only escaped strings, so that might be damaging the benchmark's relevance.
Updates the Implementation Status docs page to reflect that the Go implementation can read Avro files. For the Rust implementation, I inferred from [this PR](apache/arrow-rs#4886) and [this comment](apache/arrow-rs#5562 (comment)) that we should hold off on indicating that the Rust implementation can read Avro files. * GitHub Issue: #41386
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Avro is a widely used binary, row-oriented data encoding. It is very similar to protobuf, and has seen very wide adoption in the data ecosystem, especially for streaming workloads.
Describe the solution you'd like
A new arrow_avro crate will provide vectorised support for reading and writing Avro data. The APIs should be designed in such a way as to work for the various different container formats for Avro-encoded data, including single-object encoding, object container files, and messages, even if first-class support is not provided for all these framing mechanisms.
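One way to satisfy that requirement, sketched purely as an illustration rather than a committed design, is a push-based decoder in the style of arrow-json's Decoder, since raw bytes can be fed to it from any framing:

```rust
use arrow_array::RecordBatch;
use arrow_schema::ArrowError;

/// Illustrative sketch, not the actual crate design: a push-based
/// decoder is framing-agnostic, so object container files,
/// single-object encoding, and message-based framings can all be
/// layered on top of the same core.
pub trait AvroDecoder {
    /// Feed raw Avro-encoded bytes, returning how many were consumed.
    fn decode(&mut self, data: &[u8]) -> Result<usize, ArrowError>;

    /// Flush the rows decoded so far as a RecordBatch, if any.
    fn flush(&mut self) -> Result<Option<RecordBatch>, ArrowError>;
}
```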
Describe alternatives you've considered
Additional context
DataFusion has some Avro support; however, it is based on the row-based apache_avro crate and is therefore likely extremely sub-optimal.
FYI @Samrose-Ahmed @sarutak @devinjdangelo I intend to work on this, but any help with reviews / testing would be most welcome