
An EEL sink for Flume #168

Open
hannesmiller opened this issue Sep 20, 2016 · 4 comments

@hannesmiller
Contributor

An experimental Kite Dataset Sink exists for Flume 1.6.0 which, on the face of it, is capable of ingesting directly into Hive tables.

Headers are usually used for content-based routing and for multiplexing an event to different sinks.

  • When writing the custom EEL sink, one takes an event off the channel (queue), interprets the headers if necessary, and transforms the body (payload) into an EEL frame so that it can be passed directly into an EEL sink.
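A rough sketch of that per-event step in Java. `FlumeLikeEvent`, the `table` header, and the row type are simplified stand-ins invented for illustration, not the real Flume or EEL APIs:

```java
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Map;

// Simplified stand-in for Flume's Event: headers plus an opaque body.
class FlumeLikeEvent {
    final Map<String, String> headers;
    final byte[] body;
    FlumeLikeEvent(Map<String, String> headers, byte[] body) {
        this.headers = headers;
        this.body = body;
    }
}

class EelSinkSketch {
    // Take one event "off the channel": interpret a (hypothetical) routing
    // header, then transform the body into a row destined for an EEL frame.
    static List<String> toRow(FlumeLikeEvent event) {
        String table = event.headers.getOrDefault("table", "default");
        String payload = new String(event.body, StandardCharsets.UTF_8);
        return List.of(table, payload);
    }
}
```

The real sink would of course pull events from the Flume channel inside a transaction; this only shows the header-then-body transformation.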

See https://flume.apache.org/FlumeUserGuide.html#kite-dataset-sink

  • Now my guess, without looking at the source code, is that the Kite Dataset Sink extracts the byte stream from the body and deserialises it to an Avro GenericRecord, which in turn can be passed directly into the Kite write API.

I think for EEL we should do something similar:

  1. Deserialise the payload to a GenericRecord.
  2. Transform the GenericRecord to an EEL frame - note that from each GenericRecord you can ascertain the Avro schema, so it should be trivial to convert to a frame schema.
  3. Pass the frame to the EEL sink.
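Step 2 above can be sketched as follows; the `LinkedHashMap` is a stand-in for an Avro `GenericRecord` (a real GenericRecord carries its `Schema`, so the frame schema falls out of a walk over the record's fields, as mimicked here):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;

// Stand-in for a GenericRecord: an ordered map of field name -> value.
class RecordToFrame {
    static List<String> frameSchema(LinkedHashMap<String, Object> record) {
        return new ArrayList<>(record.keySet());   // field names -> frame schema
    }
    static List<Object> frameRow(LinkedHashMap<String, Object> record) {
        return new ArrayList<>(record.values());   // field values -> one frame row
    }
}
```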

There are various options for batching up events and sending securely over SSL - you could even send via Kafka to a Flume Kafka Source.

@sksamuel sksamuel added this to the 1.2 milestone Oct 2, 2016
@sksamuel sksamuel removed this from the 1.2 milestone Jan 14, 2017
@hannesmiller
Contributor Author

You can assign this one to me.

@sksamuel
Contributor

Do we still want a Flume connector? I think Flume is more suited to streaming data, running on a continuous basis, whereas EEL is very much batch based.

@hannesmiller
Contributor Author

I am fine with this... Yeah, I guess its main purpose is streaming, but I have used it for ingesting large batches of events (rows) into HDFS.

My initial thought is that you could have a scenario like this:

JdbcSource -> FlumeSink

  1. The EEL FlumeSink would accept rows, convert each row into a FlumeEvent, and send them on to a Flume agent.
  2. The FlumeSink wraps a FlumeClient - there are canned ones for AvroRpc, ThriftRpc, Kafka, JMS, etc. Note that events can be batched to mitigate the number of RPCs.
  3. The Flume agent itself can be configured to accept events on an AvroSource, which in turn routes them to one of its canned sinks, e.g. HdfsSink - you can even write your own custom EelSink.

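A minimal sketch of the batching in point 2, assuming a hypothetical `toBatches` helper inside the FlumeSink; each batch would then go out in a single RPC rather than one call per row:

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Sketch of the EEL FlumeSink side: each row becomes an event body, and
// events are grouped into fixed-size batches to mitigate the number of RPCs.
class FlumeSinkSketch {
    static List<List<byte[]>> toBatches(List<String> rows, int batchSize) {
        List<List<byte[]>> batches = new ArrayList<>();
        List<byte[]> current = new ArrayList<>();
        for (String row : rows) {
            current.add(row.getBytes(StandardCharsets.UTF_8)); // row -> event body
            if (current.size() == batchSize) {
                batches.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) {
            batches.add(current); // flush the final partial batch
        }
        return batches;
    }
}
```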
With Flume there's no need for Hadoop to be installed on the client machine - there are other features I haven't touched upon.

Kite has written an interceptor (Morphlines) which is invoked before events hit the sink... they have a bunch of modules and a DSL for transforming events.

@sksamuel
Contributor

OK, let's do a Flume sink.

@sksamuel sksamuel reopened this Jul 14, 2017
@sksamuel sksamuel added this to the 1.4 milestone Jul 14, 2017