[SPARK-19719][SS] Kafka writer for both structured streaming and batch queries #17043
Closed
Changes from all commits (48):
d371758  add kafka relation and refactor kafka source
b6c3055  update
4c81812  update
ab02a4c  single kafka provider for both stream and batch
e6b57ed  added uninterruptible thread version of kafka offset reader
ff94ed8  added uninterruptible thread version of kafka offset reader
f8fd34c  update tests
41271e2  resolve conflicts in KafakSource
74d96fc  update comments
d31fc81  address comments from @zsxwing
1db1649  update
3b0d48b  Merge branch 'master' of https://github.com/apache/spark into SPARK-1…
a5b0269  address comments from @zsxwing
c08c01f  late binding offsets
79d335e  update to late binding logic
a44b365  Merge branch 'SPARK-18682' into kafka-writer
51291e3  remove kafka log4j debug
b597cf1  remove kafka log4j debug
84b32c5  Merge branch 'SPARK-18682' into kafka-writer
f5ae301  update
2487a72  address comments from @zsxwing
789d3af  update
56a06e7  Merge branch 'SPARK-18682' into kafka-writer
e74473b  update
73df054  update
5b48fc6  address comments from @tdas
5776009  address feedback from @tdas and @sxwing
63d453f  update merge
3c4eecf  update
b0611e4  update
3c6a52b  update
8ba33a7  update
68a2a18  update
c8c38e1  update
71f8de0  update
c4c9395  address comments from @tdas
8f5da8b  update
c85b803  update
9d7a00d  update
66fa01b  update
129cfcd  update
67e3c06  update
e6b6dc1  revise exceptions and topic option
3981d7b  Merge branch 'master' of https://github.com/apache/spark into kafka-w…
b48f173  address comments from @tdas @zsxwing
2dd3ffb  update
b1d554a  address comments from @zsxwing
107e513  update
external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSink.scala (43 additions, 0 deletions)
```scala
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.spark.sql.kafka010

import java.{util => ju}

import org.apache.spark.internal.Logging
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.execution.streaming.Sink

private[kafka010] class KafkaSink(
    sqlContext: SQLContext,
    executorKafkaParams: ju.Map[String, Object],
    topic: Option[String]) extends Sink with Logging {
  @volatile private var latestBatchId = -1L

  override def toString(): String = "KafkaSink"

  override def addBatch(batchId: Long, data: DataFrame): Unit = {
    if (batchId <= latestBatchId) {
      logInfo(s"Skipping already committed batch $batchId")
    } else {
      KafkaWriter.write(sqlContext.sparkSession,
        data.queryExecution, executorKafkaParams, topic)
      latestBatchId = batchId
    }
  }
}
```
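For context, a usage sketch of the streaming write path this sink serves. This is not part of the diff: the `rate` test source, broker address, and checkpoint path are stand-ins, and the option names are those proposed by this PR (`kafka.bootstrap.servers`, `topic`).

```scala
// Usage sketch (assumptions: a broker at localhost:9092, the built-in "rate"
// test source, and the option names proposed in this PR).
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kafka-sink-sketch").getOrCreate()

val query = spark.readStream
  .format("rate")                               // emits (timestamp, value) rows
  .load()
  .selectExpr("CAST(value AS STRING) AS value") // the sink requires a 'value' column
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")        // placeholder broker
  .option("topic", "events")                                  // default topic: no 'topic' column above
  .option("checkpointLocation", "/tmp/kafka-sink-checkpoint") // placeholder path
  .start()

query.awaitTermination()
```

Each micro-batch then arrives at `KafkaSink.addBatch`, which skips batch ids it has already committed and otherwise hands the data to `KafkaWriter.write`.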
external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaWriteTask.scala (123 additions, 0 deletions)
```scala
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.spark.sql.kafka010

import java.{util => ju}

import org.apache.kafka.clients.producer.{KafkaProducer, _}
import org.apache.kafka.common.serialization.ByteArraySerializer

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{Attribute, Cast, Literal, UnsafeProjection}
import org.apache.spark.sql.types.{BinaryType, StringType}

/**
 * A simple trait for writing out data in a single Spark task, without any concerns about how
 * to commit or abort tasks. Exceptions thrown by the implementation of this class will
 * automatically trigger task aborts.
 */
private[kafka010] class KafkaWriteTask(
    producerConfiguration: ju.Map[String, Object],
    inputSchema: Seq[Attribute],
    topic: Option[String]) {
  // used to synchronize with Kafka callbacks
  @volatile private var failedWrite: Exception = null
  private val projection = createProjection
  private var producer: KafkaProducer[Array[Byte], Array[Byte]] = _

  /**
   * Writes key value data out to topics.
   */
  def execute(iterator: Iterator[InternalRow]): Unit = {
    producer = new KafkaProducer[Array[Byte], Array[Byte]](producerConfiguration)
    while (iterator.hasNext && failedWrite == null) {
      val currentRow = iterator.next()
      val projectedRow = projection(currentRow)
      val topic = projectedRow.getUTF8String(0)
      val key = projectedRow.getBinary(1)
      val value = projectedRow.getBinary(2)
      if (topic == null) {
        throw new NullPointerException(s"null topic present in the data. Use the " +
          s"${KafkaSourceProvider.TOPIC_OPTION_KEY} option for setting a default topic.")
      }
      val record = new ProducerRecord[Array[Byte], Array[Byte]](topic.toString, key, value)
      val callback = new Callback() {
        override def onCompletion(recordMetadata: RecordMetadata, e: Exception): Unit = {
          if (failedWrite == null && e != null) {
            failedWrite = e
          }
        }
      }
      producer.send(record, callback)
    }
  }

  def close(): Unit = {
    if (producer != null) {
      checkForErrors
      producer.close()
      checkForErrors
      producer = null
    }
  }

  private def createProjection: UnsafeProjection = {
    val topicExpression = topic.map(Literal(_)).orElse {
      inputSchema.find(_.name == KafkaWriter.TOPIC_ATTRIBUTE_NAME)
    }.getOrElse {
      throw new IllegalStateException(s"topic option required when no " +
        s"'${KafkaWriter.TOPIC_ATTRIBUTE_NAME}' attribute is present")
    }
    topicExpression.dataType match {
      case StringType => // good
      case t =>
        throw new IllegalStateException(s"${KafkaWriter.TOPIC_ATTRIBUTE_NAME} " +
          s"attribute unsupported type $t. ${KafkaWriter.TOPIC_ATTRIBUTE_NAME} " +
          s"must be a ${StringType}")
    }
    val keyExpression = inputSchema.find(_.name == KafkaWriter.KEY_ATTRIBUTE_NAME)
      .getOrElse(Literal(null, BinaryType))
    keyExpression.dataType match {
      case StringType | BinaryType => // good
      case t =>
        throw new IllegalStateException(s"${KafkaWriter.KEY_ATTRIBUTE_NAME} " +
          s"attribute unsupported type $t")
    }
    val valueExpression = inputSchema
      .find(_.name == KafkaWriter.VALUE_ATTRIBUTE_NAME).getOrElse(
        throw new IllegalStateException(s"Required attribute " +
          s"'${KafkaWriter.VALUE_ATTRIBUTE_NAME}' not found")
      )
    valueExpression.dataType match {
      case StringType | BinaryType => // good
      case t =>
        throw new IllegalStateException(s"${KafkaWriter.VALUE_ATTRIBUTE_NAME} " +
          s"attribute unsupported type $t")
    }
    UnsafeProjection.create(
      Seq(topicExpression, Cast(keyExpression, BinaryType),
        Cast(valueExpression, BinaryType)), inputSchema)
  }

  private def checkForErrors: Unit = {
    if (failedWrite != null) {
      throw failedWrite
    }
  }
}
```
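The `@volatile failedWrite` handshake above is the subtle part: `producer.send` is asynchronous, so broker-side failures only surface inside the callback, and `close()` rethrows them on the task thread. Below is a standalone sketch of the same pattern against the plain Kafka producer API; the broker address and topic are assumed placeholders, and it is not part of this diff.

```scala
// Standalone illustration of KafkaWriteTask's async error propagation.
// Assumptions: a reachable broker at localhost:9092 and a topic "events".
import java.util.Properties
import org.apache.kafka.clients.producer.{Callback, KafkaProducer, ProducerRecord, RecordMetadata}
import org.apache.kafka.common.serialization.ByteArraySerializer

object ErrorPropagationSketch {
  // Written by producer I/O threads, read by the task thread.
  @volatile private var failedWrite: Exception = null

  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092") // placeholder
    props.put("key.serializer", classOf[ByteArraySerializer].getName)
    props.put("value.serializer", classOf[ByteArraySerializer].getName)

    val producer = new KafkaProducer[Array[Byte], Array[Byte]](props)
    val rows = Iterator("a", "b", "c")
    // Stop enqueueing as soon as an earlier send has failed, as execute() does.
    while (rows.hasNext && failedWrite == null) {
      val record = new ProducerRecord[Array[Byte], Array[Byte]](
        "events", null, rows.next().getBytes("UTF-8"))
      producer.send(record, new Callback {
        override def onCompletion(md: RecordMetadata, e: Exception): Unit = {
          if (failedWrite == null && e != null) failedWrite = e // keep first failure
        }
      })
    }
    producer.close()                           // flushes and fires remaining callbacks
    if (failedWrite != null) throw failedWrite // rethrow on the caller's thread
  }
}
```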
external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaWriter.scala (97 additions, 0 deletions)
```scala
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.spark.sql.kafka010

import java.{util => ju}

import org.apache.spark.internal.Logging
import org.apache.spark.sql.{AnalysisException, SparkSession}
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.execution.{QueryExecution, SQLExecution}
import org.apache.spark.sql.types.{BinaryType, StringType}
import org.apache.spark.util.Utils

/**
 * The [[KafkaWriter]] class is used to write data from a batch query
 * or structured streaming query, given by a [[QueryExecution]], to Kafka.
 * The data is assumed to have a value column, and an optional topic and key
 * columns. If the topic column is missing, then the topic must come from
 * the 'topic' configuration option. If the key column is missing, then a
 * null valued key field will be added to the
 * [[org.apache.kafka.clients.producer.ProducerRecord]].
 */
private[kafka010] object KafkaWriter extends Logging {
  val TOPIC_ATTRIBUTE_NAME: String = "topic"
  val KEY_ATTRIBUTE_NAME: String = "key"
  val VALUE_ATTRIBUTE_NAME: String = "value"

  override def toString: String = "KafkaWriter"

  def validateQuery(
      queryExecution: QueryExecution,
      kafkaParameters: ju.Map[String, Object],
      topic: Option[String] = None): Unit = {
    val schema = queryExecution.logical.output
    schema.find(_.name == TOPIC_ATTRIBUTE_NAME).getOrElse(
      if (topic == None) {
        throw new AnalysisException(s"topic option required when no " +
          s"'$TOPIC_ATTRIBUTE_NAME' attribute is present. Use the " +
          s"${KafkaSourceProvider.TOPIC_OPTION_KEY} option for setting a topic.")
      } else {
        Literal(topic.get, StringType)
      }
    ).dataType match {
      case StringType => // good
      case _ =>
        throw new AnalysisException(s"Topic type must be a String")
    }
    schema.find(_.name == KEY_ATTRIBUTE_NAME).getOrElse(
      Literal(null, StringType)
    ).dataType match {
      case StringType | BinaryType => // good
      case _ =>
        throw new AnalysisException(s"$KEY_ATTRIBUTE_NAME attribute type " +
          s"must be a String or BinaryType")
    }
    schema.find(_.name == VALUE_ATTRIBUTE_NAME).getOrElse(
      throw new AnalysisException(s"Required attribute '$VALUE_ATTRIBUTE_NAME' not found")
    ).dataType match {
      case StringType | BinaryType => // good
      case _ =>
        throw new AnalysisException(s"$VALUE_ATTRIBUTE_NAME attribute type " +
          s"must be a String or BinaryType")
    }
  }

  def write(
      sparkSession: SparkSession,
      queryExecution: QueryExecution,
      kafkaParameters: ju.Map[String, Object],
      topic: Option[String] = None): Unit = {
    val schema = queryExecution.logical.output
    validateQuery(queryExecution, kafkaParameters, topic)
    SQLExecution.withNewExecutionId(sparkSession, queryExecution) {
      queryExecution.toRdd.foreachPartition { iter =>
        val writeTask = new KafkaWriteTask(kafkaParameters, schema, topic)
        Utils.tryWithSafeFinally(block = writeTask.execute(iter))(
          finallyBlock = writeTask.close())
      }
    }
  }
}
```
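Since `KafkaWriter` also backs the batch path, here is a minimal batch-write sketch under the same assumptions (placeholder broker; not part of the diff). Because every row carries a `topic` column, `validateQuery` accepts the frame without the `topic` option.

```scala
// Batch usage sketch (placeholder broker address).
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kafka-batch-sketch").getOrCreate()
import spark.implicits._

// The per-row 'topic' column supplies the destination; 'key' and 'value'
// are cast to binary by KafkaWriteTask's projection.
Seq(("events", "k1", "v1"), ("events", "k2", "v2"))
  .toDF("topic", "key", "value")
  .write
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // placeholder
  .save()
```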
Looks like this return value is used in CreateDataSourceTableAsSelectCommand, which Kafka cannot support. I think it's better to make the methods of this special BaseRelation throw UnsupportedOperationException in case the returned relation is used by mistake.
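A minimal sketch of that suggestion (the class name is hypothetical and this is not part of the diff): a write-only `BaseRelation` whose read-side members fail fast.

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.sources.BaseRelation
import org.apache.spark.sql.types.StructType

// Hypothetical: the relation handed back by the write path throws if any
// code (e.g. CreateDataSourceTableAsSelectCommand) tries to read through it.
private[kafka010] class KafkaWriteOnlyRelation(
    override val sqlContext: SQLContext) extends BaseRelation {
  override def schema: StructType =
    throw new UnsupportedOperationException(
      "This Kafka relation is write-only and cannot be read")
}
```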