[SPARK-29509][SQL][SS] Deduplicate codes from Kafka data source #26158
Conversation
Test build #112252 has finished for PR 26158 at commit
I think the direction is good, but I'm not sure it's a minor change.

Thanks for the suggestion. Filed an issue and changed the title.
gaborgsomogyi left a comment:
Some minors found.
```scala
def assertDataType(attrName: String, desired: Seq[DataType], actual: DataType): Unit = {
  if (!desired.exists(_.sameType(actual))) {
    throw new IllegalStateException(s"$attrName attribute unsupported type " +
      s"${actual.catalogString}")
  }
}
```
Maybe we can add `...$attrName must be a $desired`?
```scala
val topicExpression = topic.map(Literal(_)).orElse {
```
I was thinking about whether it's possible to fold the `expression` function into `assertDataType`, but then saw that `topicExpression` is calculated in a different way. Do you think we can do this somehow?
Yeah, I thought it would be complicated or require more changes to the expression handling, but it turned out it's not. I'll make a change like below:
```scala
val topicExpression = topic.map(Literal(_)).getOrElse(
  expression(KafkaWriter.TOPIC_ATTRIBUTE_NAME) { () =>
    throw new IllegalStateException(s"topic option required when no " +
      s"'${KafkaWriter.TOPIC_ATTRIBUTE_NAME}' attribute is present")
  }
)
```
```scala
    defaultFn: () => Expression): Unit = {
  val attr = schema.find(_.name == attrName).getOrElse(defaultFn())
  if (!desired.exists(_.sameType(attr.dataType))) {
    throw new AnalysisException(s"$attrName attribute type must be a " +
```
I think it would be helpful to print the actual type and the expected types, just like in the previous case.
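A minimal sketch of the suggested message, with types modeled as plain strings to keep the example self-contained (`typeErrorMessage` is a hypothetical helper, not from the PR):

```scala
// Hypothetical sketch: report both the expected types and the actual type in
// the error message, mirroring the earlier assertDataType case.
def typeErrorMessage(attrName: String, desired: Seq[String], actual: String): String =
  s"$attrName attribute type must be a ${desired.mkString(" or ")}, but found: $actual"

println(typeErrorMessage("topic", Seq("string"), "int"))
// → topic attribute type must be a string, but found: int
```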
```scala
assert(ex.getMessage.toLowerCase(Locale.ROOT).contains("topic type must be a string"))
assertWrongType(input.toDF(), Seq("CAST('1' as INT) as topic", "value"),
```
Nit: `input.toDF()` is repeated.
```scala
test("streaming - write data with valid schema but wrong types") {
  def assertWrongType(df: DataFrame, selectExpr: Seq[String], expectErrorMsg: String): Unit = {
```
Nit: `expectedErrorMsg`.
Test build #112493 has finished for PR 26158 at commit
gaborgsomogyi left a comment:
LGTM.
cc @srowen
```scala
}.getOrElse {
  throw new IllegalStateException(s"topic option required when no " +
    s"'${KafkaWriter.TOPIC_ATTRIBUTE_NAME}' attribute is present")

def expression(attrName: String)(defaultFn: () => Expression): Expression = {
```
`defaultFn: => Expression`. Then you don't need `() =>` everywhere.
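To illustrate the suggestion, here is a self-contained sketch of the two signatures (the `Attribute`/`expression` names mirror the PR, but the types are simplified and hypothetical):

```scala
// Simplified stand-in for Spark's Attribute, just for this example.
case class Attribute(name: String, dataType: String)

// Variant with an explicit function value: call sites must write `() => ...`.
def expressionFn(schema: Seq[Attribute], attrName: String)(defaultFn: () => Attribute): Attribute =
  schema.find(_.name == attrName).getOrElse(defaultFn())

// Variant with a by-name parameter: the default block is passed unevaluated
// and only runs if the attribute lookup misses.
def expressionByName(schema: Seq[Attribute], attrName: String)(defaultFn: => Attribute): Attribute =
  schema.find(_.name == attrName).getOrElse(defaultFn)

val schema = Seq(Attribute("topic", "string"))

// By-name call site reads naturally, with no `() =>` wrapper:
val attr = expressionByName(schema, "value") {
  Attribute("value", "binary") // default only computed when the lookup misses
}
```

Because `Option.getOrElse` itself takes a by-name argument, laziness propagates: a default that throws (like the `topic option required` case above) is never evaluated when the attribute exists.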
```scala
private def validateAttribute(
```
So this looks like exactly the same thing you have in KafkaWriterTask. You could even use the same one-method approach there, as far as I can see, instead of calling expression + assertDataType.
If the goal is to deduplicate code, then here's another one you can deduplicate.
Nice suggestion. I'll need to see where the good place to put it is. Thanks!
```scala
    selectExpr: Seq[String],
    expectErrorMsg: String): Unit = {
  var writer: StreamingQuery = null
  var ex: Exception = null
```
`val ex = try { ... }`
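A sketch of what the reviewer is pointing at: the exception can be captured as an immutable `val` from a `try` expression instead of mutating a `var ex` inside the block (`captureException` is a hypothetical helper for illustration):

```scala
// Hypothetical sketch: a try expression yields a value, so the caught
// exception can be bound once as a val instead of assigned to a var.
def captureException(body: => Unit): Option[Exception] =
  try { body; None } catch { case e: Exception => Some(e) }

val ex = captureException {
  throw new IllegalStateException("topic option required")
}
```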
```scala
    writer.processAllAvailable()
  }
} finally {
  writer.stop()
```
`writer` can be null here.
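A simplified, hypothetical reproduction of the hazard: if query creation throws before `writer` is assigned, an unguarded `writer.stop()` in the `finally` block raises a `NullPointerException` that masks the original failure. Guarding the cleanup avoids that:

```scala
// Minimal stand-in for Spark's StreamingQuery, just for this sketch.
trait StreamingQuery { def stop(): Unit }

def runQuery(start: () => StreamingQuery): Unit = {
  var writer: StreamingQuery = null
  try {
    writer = start() // may throw before writer is assigned
  } finally {
    if (writer != null) writer.stop() // guard: writer may never be assigned
  }
}
```

With the guard, a failing `start()` propagates its own exception rather than an NPE from the cleanup path.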
Test build #112628 has finished for PR 26158 at commit
I just spent more time reducing things further, as I saw more spots that were easy to deduplicate while addressing review comments. Please take a look again. Thanks!
Test build #112663 has finished for PR 26158 at commit
#26153 merged, so this has to be adapted...
Force-pushed from 272d91a to 5a2371b.
Rebased. Please take a look at the next round of review. Thanks!
Test build #112683 has finished for PR 26158 at commit
```scala
}

def keyExpression(schema: Seq[Attribute]): Expression = {
  expression(schema, KEY_ATTRIBUTE_NAME, Seq(StringType, BinaryType))(
```
Prefer `{ ... }` for the function argument (like in the case of throwing an exception).
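A small, hypothetical sketch of this style point: when the last parameter list takes a function or by-name argument, braces read like a control structure and scale to multi-line bodies, matching the exception-throwing call sites elsewhere in the PR:

```scala
// Hypothetical helper for illustration; the last parameter list is by-name.
def withDefault(opt: Option[String])(default: => String): String = opt.getOrElse(default)

// Parentheses work, but get awkward for multi-line arguments:
val a = withDefault(None)("fallback")

// Braces are preferred for block-shaped arguments:
val b = withDefault(None) {
  val reason = "no topic attribute present"
  s"fallback ($reason)"
}
```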
```scala
import org.apache.spark.sql.types.{BinaryType, DataType}
import org.apache.spark.util.Utils

```
Remove this extra blank line.
```scala
    input: DataFrame,
    selectExpr: Seq[String],
    expectErrorMsg: String): Unit = {
  verifyException[AnalysisException](expectErrorMsg)(
```
Can you reuse runAndVerifyStreamingQueryException here like in the other suite? These seem to expect the error to happen when the writer is created, so the stuff that method does after it calls writeFn shouldn't influence the test result.
That could make it possible to reuse these for both test suites (e.g. by adding them to KafkaTestUtils).
I separated the two because one needs an actual input topic and the other doesn't, but there's no harm in providing an input topic for the latter as well. I'll make the change.
Btw, there are some differences in how the query result is awaited and the exception is checked between batch/micro-batch and continuous mode, so it doesn't seem to be complete duplication. Actually, the purpose of this PR was deduplicating the code that repeats per writer-schema field, and the scope seems to keep growing. Maybe we can revisit deduplicating the code between batch/micro-batch and continuous in a follow-up PR. WDYT?
Test build #112703 has finished for PR 26158 at commit
Merging to master.
Thanks all for reviewing and merging!
What changes were proposed in this pull request?
This patch deduplicates code blocks in Kafka data source which are being repeated multiple times in a method.
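A simplified sketch of the deduplication pattern discussed in the review (types reduced to strings; `validateAttribute` mirrors names from the diff, but this code is illustrative, not the actual Spark implementation):

```scala
// Before: each writer field (key, value, topic, ...) repeated its own
// "find attribute, fall back to a default, check its type" block.
// After: one shared helper called once per field.
case class Attr(name: String, dataType: String)

def validateAttribute(
    schema: Seq[Attr],
    attrName: String,
    desired: Seq[String])(default: => Attr): Attr = {
  val attr = schema.find(_.name == attrName).getOrElse(default)
  require(desired.contains(attr.dataType),
    s"$attrName attribute type must be one of ${desired.mkString(", ")}, " +
      s"but found: ${attr.dataType}")
  attr
}

val schema = Seq(Attr("topic", "string"), Attr("value", "binary"))
val topic = validateAttribute(schema, "topic", Seq("string"))(
  sys.error("topic option required when no 'topic' attribute is present"))
val key = validateAttribute(schema, "key", Seq("string", "binary"))(
  Attr("key", "binary")) // a missing key falls back to a default attribute
```

Adding a new field to the writer schema then means one new call site rather than another copy of the lookup-and-validate block.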
Why are the changes needed?
This change would simplify the code and open possibility to simplify future code whenever fields are added to Kafka writer schema.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Existing UTs.