Avro table provider #903

Closed
Igosuki opened this issue Aug 18, 2021 · 4 comments · Fixed by #910
Labels
enhancement New feature or request

Comments

@Igosuki
Contributor

Igosuki commented Aug 18, 2021

On a platform I work on, I decided to write Avro log files so that I could easily close binary files and append them to S3. I didn't want to bother transforming them to another format with Spark, which is the thing I wanted to drop in the first place, so I started writing what's required to read Avro as a data source in DataFusion.

Here is the branch on my fork (I merged the nested field PR into it, but that can be removed):
https://github.com/Igosuki/arrow-datafusion/tree/avro2_m

I converted all the Parquet test files to Avro and plan to add a test case for each of them.

My question: is Avro support desirable for DataFusion, or should I just make a sidecar crate on my own?

Describe alternatives you've considered
Transforming the data to JSON or Parquet to reuse the existing code.

Additional context
I'm new to the Arrow data types, and it's been a challenge to work out what to do with Avro union types that only express a nullable field. Ultimately I decided to turn them into nullable fields and drop the union, but I had to add special cases here and there because of that.
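
The nullable-union handling described above can be sketched roughly as follows. This is a minimal illustration only: it does not use the real Avro or Arrow crates, and `AvroType`, `Field`, and `to_field` are hypothetical stand-ins invented for this example. The code on the branch may well handle the case differently.

```rust
/// Hypothetical, simplified stand-in for an Avro schema node
/// (not the type from any real Avro crate).
#[derive(Debug, Clone, PartialEq)]
enum AvroType {
    Null,
    Long,
    Str,
    Union(Vec<AvroType>),
}

/// Hypothetical, simplified stand-in for an Arrow field:
/// a name, a data type label, and a nullability flag.
#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: String,
    nullable: bool,
}

/// Map an Avro type to a field. A union of the form ["null", T]
/// (in either order) is collapsed into a nullable field of type T.
/// Any other union shape, and a bare "null", returns None because
/// this sketch does not cover them.
fn to_field(name: &str, avro: &AvroType) -> Option<Field> {
    match avro {
        AvroType::Long => Some(Field {
            name: name.to_string(),
            data_type: "Int64".to_string(),
            nullable: false,
        }),
        AvroType::Str => Some(Field {
            name: name.to_string(),
            data_type: "Utf8".to_string(),
            nullable: false,
        }),
        AvroType::Null => None,
        AvroType::Union(variants) => {
            // Keep only the non-null branches of the union.
            let non_null: Vec<&AvroType> = variants
                .iter()
                .filter(|v| **v != AvroType::Null)
                .collect();
            let has_null = non_null.len() < variants.len();
            if has_null && non_null.len() == 1 {
                // Drop the union and mark the inner field nullable.
                let inner = to_field(name, non_null[0])?;
                Some(Field { nullable: true, ..inner })
            } else {
                None
            }
        }
    }
}
```

With these stand-ins, `to_field("ts", &AvroType::Union(vec![AvroType::Null, AvroType::Long]))` yields a field labelled `Int64` with `nullable` set to `true`, while a union of two non-null types is left unsupported.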

Igosuki added the enhancement label Aug 18, 2021
@Igosuki
Contributor Author

Igosuki commented Aug 18, 2021

Just tested my code on real Avro files I own and got 200 MB processed in 0.3 s (over a window function). DataFusion is the real deal!

@Dandandan
Contributor

I think Avro files as a source would be great to have in DataFusion 👍

@houqp
Member

houqp commented Aug 19, 2021

👍 from me as long as you can commit to maintaining the code :)

@Igosuki
Contributor Author

Igosuki commented Aug 20, 2021

#910


3 participants