[SPARK-12177] [STREAMING] Update KafkaDStreams to new Kafka 0.9 Consumer API (Closes #10681) #10953
Conversation
Getting rid of an extra dependency
More changes
…ad of v09 to avoid confusion. Python work is being done in a separate branch github.com/markgrover/spark/tree/kafka09-integration-python
… the old api using 0.9.0.0 only.
@tdas @harishreedharan @koeninger would appreciate your review here. Thanks!
Test build #50220 has finished for PR 10953 at commit
Test build #50224 has finished for PR 10953 at commit
OK, I have no idea what MiMa is, but I will take a look, try to run the checks locally, and fix the issues.
MiMa is a binary compatibility checker. It's complaining that some changes you made caused the public APIs exposed in the compiled classes to change, meaning existing code compiled against the current Spark version might not run on the next Spark release. First I'd look at whether those changes are necessary; if they are, it might be ok to add exclusions, because we're being a bit lenient with API breakages in 2.0.
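For reference, adding an exclusion is a small entry in project/MimaExcludes.scala. The sketch below is illustrative only; the filter type and the fully qualified member name are placeholders, not the exact members changed in this PR.

```scala
// Sketch of a MimaExcludes.scala-style entry; names below are placeholders.
import com.typesafe.tools.mima.core._

object MimaExcludesSketch {
  val excludes = Seq(
    // Accept the reported binary break explicitly instead of failing the build
    ProblemFilters.exclude[MissingMethodProblem](
      "org.apache.spark.streaming.kafka.KafkaCluster.someRemovedMethod")
  )
}
```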
…nd to a private member so excludes are ok
Thanks Marcelo, I added excludes because I think it's ok to have such minor binary incompatibility in this case (Kafka 0.8 -> Kafka 0.9, Spark 1.6 -> Spark 2.0).
Test build #50332 has finished for PR 10953 at commit
 * NOT zookeeper servers, specified in host1:port1,host2:port2 form
 */
private[spark]
class NewKafkaCluster[K: ClassTag, V: ClassTag](val kafkaParams: Map[String, String])
This is another place where we can't use 'new' in the class name. See #9007, which recently got merged. We have to make this a public class because of valid use cases like https://issues.apache.org/jira/browse/SPARK-13106
Thanks, it did conflict with this PR, but I have merged it. I will now make the NewKafkaCluster class public as well, similar to that change.
Conflicts: external/kafka/src/main/scala/org/apache/spark/streaming/kafka/KafkaCluster.scala
…grover/spark into kafka09-integration-refactor-params
Test build #50946 has finished for PR 10953 at commit
Test build #51125 has finished for PR 10953 at commit
Test build #51193 has finished for PR 10953 at commit
Conflicts: project/MimaExcludes.scala
Test build #52280 has finished for PR 10953 at commit
Will fix shortly. I was building against a custom-built version of Kafka.
Test build #52322 has finished for PR 10953 at commit
Hmm, this is weird; these tests were passing before. Anyway, I was able to circumvent them by setting SPARK_DRIVER_MEMORY to a higher value on my local box. I'll kick off another test run to see if the failure was transient. If not, I'll have to see whether a bump is necessary here too, and why it is necessary now.
Jenkins, retest this please.
Test build #52352 has finished for PR 10953 at commit
Test build #52353 has finished for PR 10953 at commit
@markgrover Mind adding
Kafka 0.10.0 is out; any plans to merge this PR or update it to the newest Kafka?
I think this PR is fairly out of date regarding things like needing to cache consumers on executors, getPreferredLocations, and a separate subproject for kafka-0.8.
Yeah, I agree with @koeninger. This PR is pretty out of date; it makes sense to turn focus to Cody's PR #11863.
First off, many thanks and credit to @nikit-os for starting this work. I am building on his good work in #10681.
Here is some context: Kafka has a new consumer API starting with Apache Kafka 0.9 (Kafka 0.9.0.0, to be precise). This API supports security features like authentication (via Kerberos) and encryption (via SSL). The new API is in beta, but it is the only way we can integrate Spark and Kafka in a secure environment.
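For readers unfamiliar with it, here is a minimal standalone sketch of the new Kafka 0.9 consumer API referenced above (plain Kafka client usage, independent of the Spark wrappers in this PR). The broker addresses, group id, and topic name are placeholders.

```scala
import java.util.{Arrays, Properties}

import scala.collection.JavaConverters._

import org.apache.kafka.clients.consumer.KafkaConsumer

object NewConsumerSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    // bootstrap.servers points at Kafka brokers, NOT ZooKeeper
    props.put("bootstrap.servers", "host1:9092,host2:9092")
    props.put("group.id", "example-group")
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    // Security settings (e.g. security.protocol for SSL/SASL) would also go here;
    // enabling those is the main motivation for moving to this API.

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(Arrays.asList("test-topic"))
    try {
      val records = consumer.poll(1000L) // timeout in milliseconds
      for (record <- records.asScala) {
        println(s"${record.topic()}-${record.partition()}@${record.offset()}: ${record.value()}")
      }
    } finally {
      consumer.close()
    }
  }
}
```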
Kafka 0.9 ships with both the new and the old consumer API, so we can bump our Kafka dependency from 0.8.x to 0.9. However, the old consumer API in Kafka 0.9 is binary incompatible with the old consumer API in Kafka 0.8.x. There are also some minor API compatibility issues between the old API of Kafka 0.8.x and Kafka 0.9, mostly around the ZkClient and ZkUtils APIs. In almost all cases, users of the old API, whether on Kafka 0.8 or Kafka 0.9 with Spark, won't have to modify any code on their side. Only when they use the new consumer API from Kafka 0.9 will they have to modify their code.
In addition to supporting the new API from Kafka 0.9, we could support either the old API from Kafka 0.8 only (which works with Kafka 0.8 and Kafka 0.9 brokers) or the old API from Kafka 0.9 (which only works with Kafka 0.9 brokers). If we support the old API from 0.8, we complicate our integration: we would have to build two artifacts (kafka-assembly jars, that is), one for Kafka 0.8 and one for Kafka 0.9, and maintain a Maven profile for each. Alternatively, if we only support the old API from 0.9, users will have to upgrade their Kafka brokers to 0.9 when using Spark 2.0.
In fact, users should ideally upgrade their Kafka brokers before they upgrade Spark, since old 0.8.0 clients can work with the new 0.9.0 brokers. I am personally ok with requiring users to upgrade Kafka brokers to 0.9.0 since 1) this is going into Spark 2.0, 2) that's what the Kafka community is pushing for, and 3) I think the baggage of supporting Kafka 0.8's old consumer API might be too much to carry until another major release of Spark. Again, with Kafka 0.9's old consumer API, which we support, no user code needs to change; users do need to upgrade their brokers, though.
As far as the code in this PR goes, we decided to put the code for the new API in a new package (org.apache.spark.streaming.kafka.newapi), creating a separate Maven artifact, spark-streaming-kafka-newapi_2.10.
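For users, that separation would look roughly like the sbt coordinates below; the artifact names follow the naming described above, while the version shown is an assumption for illustration only.

```scala
// Hypothetical sbt coordinates; version is assumed, not from this PR.
libraryDependencies ++= Seq(
  // existing integration built on the old (Kafka 0.8-era) consumer API
  "org.apache.spark" % "spark-streaming-kafka_2.10" % "2.0.0-SNAPSHOT",
  // proposed new artifact for the Kafka 0.9 consumer API
  "org.apache.spark" % "spark-streaming-kafka-newapi_2.10" % "2.0.0-SNAPSHOT"
)
```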
In essence, this means there are two KafkaCluster classes, two KafkaUtils classes, etc. (similar to how we have NewHadoopRDD), because the two implementations don't share enough code to be merged. Where there was enough overlap, for example KafkaTestUtils, there is only one implementation.
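For contrast, here is a minimal sketch of the existing old-API entry point in org.apache.spark.streaming.kafka as it looks in Spark 1.6; the new-API package in this PR exposes a parallel KafkaUtils, whose exact signatures are not reproduced here. Broker list and topic are placeholders.

```scala
// Sketch of the existing old-API direct stream (Spark 1.6-era signature).
import kafka.serializer.StringDecoder

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object OldApiDirectStreamSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("old-api-direct-stream")
    val ssc = new StreamingContext(conf, Seconds(5))

    // The old API takes metadata.broker.list (brokers, not ZooKeeper)
    val kafkaParams = Map("metadata.broker.list" -> "host1:9092,host2:9092")
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("test-topic"))

    stream.map(_._2).print() // print message values only
    ssc.start()
    ssc.awaitTermination()
  }
}
```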
I'd love to get this reviewed. Thanks in advance!
The examples for the new API have some extra required parameters; I will change them shortly to have reasonable defaults and be optional.
Follow-on work: