Avro support #1844

Closed
gianm opened this issue Oct 21, 2015 · 3 comments

@gianm (Contributor) commented Oct 21, 2015

Should probably be an extension.

For realtime we need a ByteBufferInputRowParser (something similar to the ProtoBufInputRowParser, but for Avro).
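
A minimal sketch of what such a parser might look like, using Avro's GenericDatumReader to decode one record from the buffer. The class name and the exact wiring into ByteBufferInputRowParser are hypothetical; a real parser would also map the decoded record onto a Druid InputRow via a parseSpec:

```java
// Hedged sketch (class name hypothetical): an Avro analogue of
// ProtoBufInputRowParser. Assumes the reader schema is known up front;
// schema evolution would need a registry (see discussion below).
import java.io.IOException;
import java.nio.ByteBuffer;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.DecoderFactory;

public class AvroBinaryInputRowParser // would implement ByteBufferInputRowParser
{
  private final GenericDatumReader<GenericRecord> reader;

  public AvroBinaryInputRowParser(Schema schema)
  {
    this.reader = new GenericDatumReader<>(schema);
  }

  public GenericRecord parseToRecord(ByteBuffer input) throws IOException
  {
    // Copy the buffer's remaining bytes and read a single Avro record.
    byte[] bytes = new byte[input.remaining()];
    input.get(bytes);
    BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(bytes, null);
    return reader.read(null, decoder);
    // A real parser would then map the record's fields onto a Druid
    // InputRow (timestamp, dimensions, metrics) via the parseSpec.
  }
}
```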

For batch we need a recommended Avro-aware InputFormat and an InputRowParser that can read whatever type is returned by that InputFormat. I haven't used Avro before, so I'm not sure what the right choice of InputFormat is. AvroKeyInputFormat from https://avro.apache.org/docs/1.7.0/api/java/org/apache/avro/mapreduce/AvroKeyInputFormat.html seems like a possible candidate.
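
For reference, a hedged sketch of wiring AvroKeyInputFormat into a Hadoop job; mappers then receive (AvroKey&lt;GenericRecord&gt;, NullWritable) pairs. AvroJob.setInputKeySchema is the real Avro helper; the surrounding job class is just illustrative:

```java
// Illustrative sketch: configuring a Hadoop job to read Avro files
// with AvroKeyInputFormat.
import org.apache.avro.Schema;
import org.apache.avro.mapreduce.AvroJob;
import org.apache.avro.mapreduce.AvroKeyInputFormat;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class AvroBatchJobSketch
{
  public static Job configure(Configuration conf, Schema readerSchema) throws Exception
  {
    Job job = Job.getInstance(conf, "avro-ingest-sketch");
    job.setInputFormatClass(AvroKeyInputFormat.class);
    // Tell Avro which reader schema to use when deserializing the files.
    AvroJob.setInputKeySchema(job, readerSchema);
    // Mappers then receive (AvroKey<GenericRecord>, NullWritable) pairs,
    // and a Druid InputRowParser would turn each GenericRecord into a row.
    return job;
  }
}
```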

@himanshug (Contributor)
We have had Avro working for a while, but the code is not generic enough and is very specific to our schemas. In fact, it will not be possible to take a general Avro schema and convert it to a Druid row, because Avro supports many complex types, so we will have to compromise anyway. Also, it was written in the pre-druid-0.8.0 era, when it wasn't possible to have InputFormats that returned anything but Text records.
With druid-0.8.2, that limitation regarding Text records is gone. In my org, some people are working on the next generation of Druid's Avro integration.
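
To illustrate the compromise: Druid rows are flat maps of dimensions and metrics, while Avro allows nested records, arrays, maps, and unions, so any general converter has to flatten or stringify. A hypothetical helper (not anyone's actual code) might look like:

```java
// Hypothetical illustration of the compromise: primitives pass through,
// everything structured falls back to toString(), losing nesting.
import java.util.HashMap;
import java.util.Map;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;

public class AvroFlattener
{
  public static Map<String, Object> flatten(GenericRecord record)
  {
    Map<String, Object> event = new HashMap<>();
    for (Schema.Field field : record.getSchema().getFields()) {
      Object value = record.get(field.name());
      if (value == null) {
        continue;
      }
      // Records, arrays, maps, unions, fixed, bytes: no natural flat
      // representation, so stringify (this is the compromise).
      event.put(field.name(),
          value instanceof CharSequence || value instanceof Number
              ? value
              : value.toString());
    }
    return event;
  }
}
```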

@zhaown (Contributor) commented Oct 22, 2015

I'm using Avro with Druid in production. For batch indexing it's not complicated, building on @himanshug's #1472, and I'm using my own AvroValueInputFormat, which is the mirror of AvroKeyInputFormat.
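
For anyone curious what the mirror looks like: a hedged sketch, modeled on Avro's real AvroKeyInputFormat/AvroKeyRecordReader but with the key and value sides swapped. The class names mirror the Avro ones, but this is illustrative rather than the exact code:

```java
// Sketch of the "mirror": AvroKeyInputFormat yields (AvroKey<T>, NullWritable);
// this variant yields (NullWritable, AvroValue<T>) instead.
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.mapred.AvroValue;
import org.apache.avro.mapreduce.AvroJob;
import org.apache.avro.mapreduce.AvroRecordReaderBase;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class AvroValueInputFormat<T> extends FileInputFormat<NullWritable, AvroValue<T>>
{
  @Override
  public RecordReader<NullWritable, AvroValue<T>> createRecordReader(
      InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException
  {
    // Reuse the key-schema slot in the job config for the reader schema.
    Schema readerSchema = AvroJob.getInputKeySchema(context.getConfiguration());
    return new AvroValueRecordReader<>(readerSchema);
  }

  private static class AvroValueRecordReader<T>
      extends AvroRecordReaderBase<NullWritable, AvroValue<T>, T>
  {
    private final AvroValue<T> currentValue = new AvroValue<>(null);

    AvroValueRecordReader(Schema readerSchema)
    {
      super(readerSchema);
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException
    {
      boolean hasNext = super.nextKeyValue();
      currentValue.datum(getCurrentRecord());
      return hasNext;
    }

    @Override
    public NullWritable getCurrentKey()
    {
      return NullWritable.get();
    }

    @Override
    public AvroValue<T> getCurrentValue()
    {
      return currentValue;
    }
  }
}
```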

But for realtime indexing it's a bit more cumbersome, because you need a schema to deserialize an Avro object from the binary stream, and you don't want to send the schema with every serialized record to Kafka. So you need a schema registry; currently we are using schemarepo and the Camus schema registry client, the latter of which is not in Maven Central...
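
The registry-based decode path looks roughly like the sketch below. The SchemaRegistry interface is a hypothetical stand-in (the real schemarepo and Camus client APIs differ in detail), and the 4-byte schema-id wire format is an assumption:

```java
// Hedged sketch: each Kafka message carries a small schema id instead of
// the full schema; the id is resolved against a registry to get the
// writer schema, and Avro resolves writer -> reader during the read.
import java.io.IOException;
import java.nio.ByteBuffer;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.DecoderFactory;

public class RegistryAwareDecoder
{
  public interface SchemaRegistry // hypothetical stand-in for the registry client
  {
    Schema getById(int schemaId);
  }

  private final SchemaRegistry registry;
  private final Schema readerSchema;

  public RegistryAwareDecoder(SchemaRegistry registry, Schema readerSchema)
  {
    this.registry = registry;
    this.readerSchema = readerSchema;
  }

  public GenericRecord decode(ByteBuffer message) throws IOException
  {
    // Assumed wire format: a 4-byte schema id followed by the Avro binary body.
    int schemaId = message.getInt();
    Schema writerSchema = registry.getById(schemaId);

    byte[] body = new byte[message.remaining()];
    message.get(body);
    BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(body, null);
    return new GenericDatumReader<GenericRecord>(writerSchema, readerSchema)
        .read(null, decoder);
  }
}
```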

I'll try to clean up my code and submit a PR for this over the weekend if I get some time.

@zhaown (Contributor) commented Oct 25, 2015

Please check #1858
