[STREAMING] SPARK-1729. Make Flume pull data from source, rather than the current push model #807
Conversation
SPARK-1729. Make Flume pull data from source, rather than the current push model

Currently Spark uses Flume's internal Avro protocol to ingest data from Flume. If the executor running the receiver fails, it currently has to be restarted on the same node to be able to receive data. This commit adds a new Sink which can be deployed to a Flume agent. This sink can be polled by a new DStream that is also included in this commit. This model ensures that data can be pulled into Spark from Flume even if the receiver is restarted on a new node. It also allows the receiver to receive data on multiple threads for better performance.
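For context, here is a minimal sketch of how the pull-based stream described above is meant to be consumed from Spark Streaming. It assumes the FlumeUtils.createPollingStream API shape this work eventually exposes and a SparkSink already running inside the Flume agent; the host name and port are placeholders, not values from the patch.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

object FlumePollingExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("FlumePollingExample")
    val ssc = new StreamingContext(conf, Seconds(5))

    // The receiver polls the SparkSink deployed on the Flume agent instead of
    // having Flume push events to it, so it can be restarted on any node.
    val stream = FlumeUtils.createPollingStream(ssc, "flume-agent-host", 9999)
    stream.map(event => new String(event.event.getBody.array())).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```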
Can one of the admins verify this patch?
Update to the previous patch fixing some error cases and also excluding Netty dependencies. Also updated the unit tests.
Exclude IO Netty in the Flume sink.
Jenkins, test this please.
Merged build triggered.
Merged build started.
Merged build finished.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15074/
Not sure why the build failed with a dependency resolution issue. It seems to work when I run sbt assembly locally. Does the order of modules specified in the build spec matter? Any advice from someone more familiar with sbt?
@@ -0,0 +1,82 @@
<?xml version="1.0" encoding="UTF-8"?>
Pardon for jumping in with comments, but why a new module instead of external/flume?
The sink will be deployed to a Flume agent, not within the Spark application. Adding it in external/flume would require bundling all of the Spark dependencies with the jar, while keeping this module separate (it does not depend on the rest of Spark) lets the user simply deploy the jar to the Flume plugins directory. In fact, this module has no dependencies that Flume does not already pull in by default.
Thanks for the comments @srowen!
Removing previousArtifact from build spec, so that the build runs fine.
The latest commit should fix the build issue.
Updated Maven build to be equivalent to the sbt build.
Fix build with Maven.
I am going to update this to support polling from multiple Flume agents rather than just one.
Added support for polling several Flume agents from a single receiver.
The latest commit adds support for polling several Flume agents.
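A hedged sketch of what polling several agents from one receiver might look like, assuming the address-list overload of FlumeUtils.createPollingStream from the eventual public API; ssc is a StreamingContext as in the earlier sketch, and the host names are placeholders.

```scala
import java.net.InetSocketAddress
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.flume.FlumeUtils

// A single receiver pulls from SparkSinks running on two different Flume agents.
val addresses = Seq(
  new InetSocketAddress("flume-agent-1", 9988),
  new InetSocketAddress("flume-agent-2", 9988)
)
val stream = FlumeUtils.createPollingStream(ssc, addresses, StorageLevel.MEMORY_AND_DISK_SER_2)
```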
val client = SpecificRequestor.getClient(classOf[SparkFlumeProtocol.Callback], transceiver)
connectionBuilder += new FlumeConnection(transceiver, client)
})
connections = connectionBuilder.result()
connections could be built in a more Scala-like functional style. Something like this:
connections = addresses.map { host => ..... new FlumeConnection(....) }.toArray
No need for ArrayBuilder.
This ends up copying the data, but since it is a one-time cost I guess it is ok.
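Spelled out, the reviewer's suggestion would look roughly like the sketch below; it assumes addresses is a Seq[InetSocketAddress] and that FlumeConnection and SparkFlumeProtocol are the classes introduced by this patch.

```scala
import java.net.InetSocketAddress
import org.apache.avro.ipc.NettyTransceiver
import org.apache.avro.ipc.specific.SpecificRequestor

// Build the connections array directly with map instead of an ArrayBuilder.
connections = addresses.map { address =>
  val transceiver = new NettyTransceiver(address)
  val client = SpecificRequestor.getClient(classOf[SparkFlumeProtocol.Callback], transceiver)
  new FlumeConnection(transceiver, client)
}.toArray
```

As noted above, this copies the intermediate collection into an array, but since the connections are set up only once the extra copy is negligible.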
QA results for PR 807:
The binary compatibility test failed because it tried to compare the current version of flume-sink with a previous version (which does not exist), so we need to add an exclusion here. Let me figure this out with @pwendell.
Thanks @tdas! I was trying to figure it out when I saw the failure, but I can't see a place to add the exclusions.
Added sparkSink to the MiMa excludes. This should fix the Jenkins failure.
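For illustration, here is a hedged sketch of one way a module with no previously released artifact can be excluded from MiMa checks in an sbt build; key names vary across sbt-mima-plugin versions, and the exact place Spark hooks this in (SparkBuild.scala or MimaBuild.scala) may differ.

```scala
import sbt._
import com.typesafe.tools.mima.plugin.MimaKeys.previousArtifact

// No spark-streaming-flume-sink artifact was ever released, so there is
// nothing for MiMa to compare the new module against.
lazy val flumeSink = (project in file("external/flume-sink"))
  .settings(
    previousArtifact := None
  )
```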
QA tests have started for PR 807. This patch merges cleanly.
Yeah, I will merge this as soon as this passes.
QA results for PR 807:
QA tests have started for PR 807. This patch merges cleanly.
All right! Merging this!
QA results for PR 807:
QA tests have started for PR 807. This patch merges cleanly.
QA results for PR 807:
Conflicts: project/SparkBuild.scala
QA tests have started for PR 807. This patch merges cleanly.
QA results for PR 807:
SPARK-1729. Make Flume pull data from source, rather than the current push model

Currently Spark uses Flume's internal Avro Protocol to ingest data from Flume. If the executor running the receiver fails, it currently has to be restarted on the same node to be able to receive data. This commit adds a new Sink which can be deployed to a Flume agent. This sink can be polled by a new DStream that is also included in this commit. This model ensures that data can be pulled into Spark from Flume even if the receiver is restarted on a new node. This also allows the receiver to receive data on multiple threads for better performance.

Author: Hari Shreedharan <harishreedharan@gmail.com>
Author: Hari Shreedharan <hshreedharan@apache.org>
Author: Tathagata Das <tathagata.das1565@gmail.com>
Author: harishreedharan <hshreedharan@cloudera.com>

Closes apache#807 from harishreedharan/master and squashes the following commits:

e7f70a3 [Hari Shreedharan] Merge remote-tracking branch 'asf-git/master'
96cfb6f [Hari Shreedharan] Merge remote-tracking branch 'asf/master'
e48d785 [Hari Shreedharan] Documenting flume-sink being ignored for Mima checks.
5f212ce [Hari Shreedharan] Ignore Spark Sink from mima.
981bf62 [Hari Shreedharan] Merge remote-tracking branch 'asf/master'
7a1bc6e [Hari Shreedharan] Fix SparkBuild.scala
a082eb3 [Hari Shreedharan] Merge remote-tracking branch 'asf/master'
1f47364 [Hari Shreedharan] Minor fixes.
73d6f6d [Hari Shreedharan] Cleaned up tests a bit. Added some docs in multiple places.
65b76b4 [Hari Shreedharan] Fixing the unit test.
e59cc20 [Hari Shreedharan] Use SparkFlumeEvent instead of the new type. Also, Flume Polling Receiver now uses the store(ArrayBuffer) method.
f3c99d1 [Hari Shreedharan] Merge remote-tracking branch 'asf/master'
3572180 [Hari Shreedharan] Adding a license header, making Jenkins happy.
799509f [Hari Shreedharan] Fix a compile issue.
3c5194c [Hari Shreedharan] Merge remote-tracking branch 'asf/master'
d248d22 [harishreedharan] Merge pull request apache#1 from tdas/flume-polling
10b6214 [Tathagata Das] Changed public API, changed sink package, and added java unit test to make sure Java API is callable from Java.
1edc806 [Hari Shreedharan] SPARK-1729. Update logging in Spark Sink.
8c00289 [Hari Shreedharan] More debug messages
393bd94 [Hari Shreedharan] SPARK-1729. Use LinkedBlockingQueue instead of ArrayBuffer to keep track of connections.
120e2a1 [Hari Shreedharan] SPARK-1729. Some test changes and changes to utils classes.
9fd0da7 [Hari Shreedharan] SPARK-1729. Use foreach instead of map for all Options.
8136aa6 [Hari Shreedharan] Adding TransactionProcessor to map on returning batch of data
86aa274 [Hari Shreedharan] Merge remote-tracking branch 'asf/master'
205034d [Hari Shreedharan] Merging master in
4b0c7fc [Hari Shreedharan] FLUME-1729. New Flume-Spark integration.
bda01fc [Hari Shreedharan] FLUME-1729. Flume-Spark integration.
0d69604 [Hari Shreedharan] FLUME-1729. Better Flume-Spark integration.
3c23c18 [Hari Shreedharan] SPARK-1729. New Spark-Flume integration.
70bcc2a [Hari Shreedharan] SPARK-1729. New Flume-Spark integration.
d6fa3aa [Hari Shreedharan] SPARK-1729. New Flume-Spark integration.
e7da512 [Hari Shreedharan] SPARK-1729. Fixing import order
9741683 [Hari Shreedharan] SPARK-1729. Fixes based on review.
c604a3c [Hari Shreedharan] SPARK-1729. Optimize imports.
0f10788 [Hari Shreedharan] SPARK-1729. Make Flume pull data from source, rather than the current push model
87775aa [Hari Shreedharan] SPARK-1729. Make Flume pull data from source, rather than the current push model
8df37e4 [Hari Shreedharan] SPARK-1729. Make Flume pull data from source, rather than the current push model
03d6c1c [Hari Shreedharan] SPARK-1729. Make Flume pull data from source, rather than the current push model
08176ad [Hari Shreedharan] SPARK-1729. Make Flume pull data from source, rather than the current push model
d24d9d4 [Hari Shreedharan] SPARK-1729. Make Flume pull data from source, rather than the current push model
6d6776a [Hari Shreedharan] SPARK-1729. Make Flume pull data from source, rather than the current push model
### What changes were proposed in this pull request?

This PR removes the sbt-avro plugin dependency. In the current master, the SBT build depends on the plugin but it appears to be unused. Originally, the plugin was introduced for `flume-sink` in SPARK-1729 (#807), but `flume-sink` is no longer in the Spark repository. After SBT was upgraded to 1.x in SPARK-21708 (#29286), the `avroGenerate` part was introduced in `object SQL` in `SparkBuild.scala`. It's confusing, but I understand `Test / avroGenerate := (Compile / avroGenerate).value` is there to suppress sbt-avro for the `sql` sub-module. In fact, Test/compile will fail if `Test / avroGenerate := (Compile / avroGenerate).value` is commented out. The `sql` sub-module contains `parquet-compat.avpr` and `parquet-compat.avdl`, but according to `sql/core/src/test/README.md`, they are intended to be handled by `gen-avro.sh`. Also, in terms of the Maven build, there seems to be no definition to handle `*.avpr` or `*.avdl`. Based on the above, I think we can remove `sbt-avro`.

### Why are the changes needed?

If `sbt-avro` is really no longer used, it is confusing that `sbt-avro` related configurations remain in `SparkBuild.scala`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

GA.

Closes #33190 from sarutak/remove-avro-from-sbt.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

The same change was also cherry picked from commit 6c4616b and signed off by Dongjoon Hyun <dongjoon@apache.org>.
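For reference, here is a sketch of how the suppression setting quoted in that commit message sits inside an sbt settings sequence; the sbt-avro plugin's avroGenerate key is assumed to be in scope (its import path differs across plugin versions), so this is illustrative only, not copied from SparkBuild.scala.

```scala
import sbt._
import sbt.Keys._

// Reuse the Compile-scope Avro generation result for the Test configuration so
// the plugin does not attempt to generate sources for the sql test sources.
lazy val sqlAvroSettings: Seq[Setting[_]] = Seq(
  Test / avroGenerate := (Compile / avroGenerate).value
)
```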
SPARK-1729. Make Flume pull data from source, rather than the current push model
Currently Spark uses Flume's internal Avro Protocol to ingest data from Flume. If the executor running the
receiver fails, it currently has to be restarted on the same node to be able to receive data.
This commit adds a new Sink which can be deployed to a Flume agent. This sink can be polled by a new
DStream that is also included in this commit. This model ensures that data can be pulled into Spark from
Flume even if the receiver is restarted on a new node. This also allows the receiver to receive data on
multiple threads for better performance.