[SPARK-29509][SQL][SS] Deduplicate codes from Kafka data source #26158
Conversation
Test build #112252 has finished for PR 26158 at commit
I think the direction is good, but I'm not sure it's a minor change.

Thanks for the suggestion. Filed an issue and changed the title.
gaborgsomogyi left a comment:
Some minors found.
```scala
def assertDataType(attrName: String, desired: Seq[DataType], actual: DataType): Unit = {
  if (!desired.exists(_.sameType(actual))) {
    throw new IllegalStateException(s"$attrName attribute unsupported type " +
      s"${actual.catalogString}")
  }
}
```
Maybe we can add `...$attrName must be a $desired`?
```scala
val topicExpression = topic.map(Literal(_)).orElse {
```
I was thinking about whether it's possible to fold the `expression` function into `assertDataType`, but then saw that `topicExpression` is calculated in a different way. Do you think we can do this somehow?
Yeah, I thought it would be complicated or require more changes to the expression handling, but it turned out it's not. I'll make a change like below:
```scala
val topicExpression = topic.map(Literal(_)).getOrElse(
  expression(KafkaWriter.TOPIC_ATTRIBUTE_NAME) { () =>
    throw new IllegalStateException(s"topic option required when no " +
      s"'${KafkaWriter.TOPIC_ATTRIBUTE_NAME}' attribute is present")
  }
)
```
```scala
    defaultFn: () => Expression): Unit = {
  val attr = schema.find(_.name == attrName).getOrElse(defaultFn())
  if (!desired.exists(_.sameType(attr.dataType))) {
    throw new AnalysisException(s"$attrName attribute type must be a " +
```
I think it would be helpful to print the actual type and the expected types, just like in the previous case.
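A minimal sketch of the suggested message, with types modeled as plain strings to keep the example self-contained (`typeErrorMessage` is a hypothetical helper, not from the PR):

```scala
// Hypothetical sketch: report both the expected types and the actual type in
// the error message, mirroring the earlier assertDataType case.
def typeErrorMessage(attrName: String, desired: Seq[String], actual: String): String =
  s"$attrName attribute type must be a ${desired.mkString(" or ")}, but found: $actual"

println(typeErrorMessage("topic", Seq("string"), "int"))
// → topic attribute type must be a string, but found: int
```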
```scala
assert(ex.getMessage.toLowerCase(Locale.ROOT).contains("topic type must be a string"))
assertWrongType(input.toDF(), Seq("CAST('1' as INT) as topic", "value"),
```
Nit: `input.toDF()` is repeated.
```scala
test("streaming - write data with valid schema but wrong types") {
  def assertWrongType(df: DataFrame, selectExpr: Seq[String], expectErrorMsg: String): Unit = {
```
Nit: `expectedErrorMsg`.
Test build #112493 has finished for PR 26158 at commit
gaborgsomogyi left a comment:
LGTM.
cc @srowen
```scala
}.getOrElse {
  throw new IllegalStateException(s"topic option required when no " +
    s"'${KafkaWriter.TOPIC_ATTRIBUTE_NAME}' attribute is present")

def expression(attrName: String)(defaultFn: () => Expression): Expression = {
```
`defaultFn: => Expression`. Then you don't need `() =>` everywhere.
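To illustrate the suggestion, here is a self-contained sketch of the two signatures (the `Attribute`/`expression` names mirror the PR, but the types are simplified and hypothetical):

```scala
// Simplified stand-in for Spark's Attribute, just for this example.
case class Attribute(name: String, dataType: String)

// Variant with an explicit function value: call sites must write `() => ...`.
def expressionFn(schema: Seq[Attribute], attrName: String)(defaultFn: () => Attribute): Attribute =
  schema.find(_.name == attrName).getOrElse(defaultFn())

// Variant with a by-name parameter: the default block is passed unevaluated
// and only runs if the attribute lookup misses.
def expressionByName(schema: Seq[Attribute], attrName: String)(defaultFn: => Attribute): Attribute =
  schema.find(_.name == attrName).getOrElse(defaultFn)

val schema = Seq(Attribute("topic", "string"))

// By-name call site reads naturally, with no `() =>` wrapper:
val attr = expressionByName(schema, "value") {
  Attribute("value", "binary") // default only computed when the lookup misses
}
```

Because `Option.getOrElse` itself takes a by-name argument, laziness propagates: a default that throws (like the `topic option required` case above) is never evaluated when the attribute exists.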
```scala
private def validateAttribute(
```
So this looks like exactly the same thing you have in KafkaWriterTask. You could even use the same one-method approach there, as far as I can see, instead of calling expression + assertDataType.
If the goal is to deduplicate code, then here's another one you can deduplicate.
Nice suggestion. I'll need to see where the good place to put it is. Thanks!
```scala
    selectExpr: Seq[String],
    expectErrorMsg: String): Unit = {
  var writer: StreamingQuery = null
  var ex: Exception = null
```
`val ex = try { ... }`
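A sketch of what the reviewer is pointing at: the exception can be captured as an immutable `val` from a `try` expression instead of mutating a `var ex` inside the block (`captureException` is a hypothetical helper for illustration):

```scala
// Hypothetical sketch: a try expression yields a value, so the caught
// exception can be bound once as a val instead of assigned to a var.
def captureException(body: => Unit): Option[Exception] =
  try { body; None } catch { case e: Exception => Some(e) }

val ex = captureException {
  throw new IllegalStateException("topic option required")
}
```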
```scala
    writer.processAllAvailable()
  }
} finally {
  writer.stop()
```
`writer` can be null here.
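A simplified, hypothetical reproduction of the hazard: if query creation throws before `writer` is assigned, an unguarded `writer.stop()` in the `finally` block raises a `NullPointerException` that masks the original failure. Guarding the cleanup avoids that:

```scala
// Minimal stand-in for Spark's StreamingQuery, just for this sketch.
trait StreamingQuery { def stop(): Unit }

def runQuery(start: () => StreamingQuery): Unit = {
  var writer: StreamingQuery = null
  try {
    writer = start() // may throw before writer is assigned
  } finally {
    if (writer != null) writer.stop() // guard: writer may never be assigned
  }
}
```

With the guard, a failing `start()` propagates its own exception rather than an NPE from the cleanup path.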
Test build #112628 has finished for PR 26158 at commit
I just spent more time reducing things further, as I saw more spots that were easy to deduplicate while addressing review comments. Please take a look again. Thanks!
Test build #112663 has finished for PR 26158 at commit
#26153 merged, so this has to be adapted...
Force-pushed from 272d91a to 5a2371b.
Rebased. Please take a look at the next round of review. Thanks!
Test build #112683 has finished for PR 26158 at commit
```scala
}

def keyExpression(schema: Seq[Attribute]): Expression = {
  expression(schema, KEY_ATTRIBUTE_NAME, Seq(StringType, BinaryType))(
```
Prefer `{ ... }` for the function argument (like in the case of throwing an exception).
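A small, hypothetical sketch of this style point: when the last parameter list takes a function or by-name argument, braces read like a control structure and scale to multi-line bodies, matching the exception-throwing call sites elsewhere in the PR:

```scala
// Hypothetical helper for illustration; the last parameter list is by-name.
def withDefault(opt: Option[String])(default: => String): String = opt.getOrElse(default)

// Parentheses work, but get awkward for multi-line arguments:
val a = withDefault(None)("fallback")

// Braces are preferred for block-shaped arguments:
val b = withDefault(None) {
  val reason = "no topic attribute present"
  s"fallback ($reason)"
}
```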
```scala
import org.apache.spark.sql.types.{BinaryType, DataType}
import org.apache.spark.util.Utils

```
Remove this extra blank line.
```scala
    input: DataFrame,
    selectExpr: Seq[String],
    expectErrorMsg: String): Unit = {
  verifyException[AnalysisException](expectErrorMsg)(
```
Can you reuse runAndVerifyStreamingQueryException here like in the other suite? These seem to expect the error to happen when the writer is created, so the stuff that method does after it calls writeFn shouldn't influence the test result.
That could make it possible to reuse these for both test suites (e.g. by adding them to KafkaTestUtils).
I separated the two because one needs an actual input topic and the other doesn't, but there's no harm in providing an input topic for the latter as well. I'll make the change.
Btw, there are some differences in how the query result is awaited and the exception is checked between batch/micro-batch and continuous mode, so it doesn't seem to be complete duplication. Actually, the purpose of this PR was deduplicating the code that repeats per writer-schema field, and the scope seems to keep growing. Maybe we can revisit deduplicating the code between batch/micro-batch and continuous in a follow-up PR. WDYT?
Test build #112703 has finished for PR 26158 at commit
Merging to master.
Thanks all for reviewing and merging!
What changes were proposed in this pull request?
This patch deduplicates code blocks in Kafka data source which are being repeated multiple times in a method.
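A simplified sketch of the deduplication pattern discussed in the review (types reduced to strings; `validateAttribute` mirrors names from the diff, but this code is illustrative, not the actual Spark implementation):

```scala
// Before: each writer field (key, value, topic, ...) repeated its own
// "find attribute, fall back to a default, check its type" block.
// After: one shared helper called once per field.
case class Attr(name: String, dataType: String)

def validateAttribute(
    schema: Seq[Attr],
    attrName: String,
    desired: Seq[String])(default: => Attr): Attr = {
  val attr = schema.find(_.name == attrName).getOrElse(default)
  require(desired.contains(attr.dataType),
    s"$attrName attribute type must be one of ${desired.mkString(", ")}, " +
      s"but found: ${attr.dataType}")
  attr
}

val schema = Seq(Attr("topic", "string"), Attr("value", "binary"))
val topic = validateAttribute(schema, "topic", Seq("string"))(
  sys.error("topic option required when no 'topic' attribute is present"))
val key = validateAttribute(schema, "key", Seq("string", "binary"))(
  Attr("key", "binary")) // a missing key falls back to a default attribute
```

Adding a new field to the writer schema then means one new call site rather than another copy of the lookup-and-validate block.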
Why are the changes needed?
This change would simplify the code and open possibility to simplify future code whenever fields are added to Kafka writer schema.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Existing UTs.