[SPARK-27042][SS] Close cached Kafka producer in case of task retry #23956
Conversation
Test build #102982 has finished for PR 23956 at commit
```scala
private[kafka010] def getOrCreate(kafkaParams: ju.Map[String, Object]): Producer = {
  if (TaskContext.get != null && TaskContext.get.attemptNumber >= 1) {
    logDebug(s"Reattempt detected, invalidating cached producer for params $kafkaParams")
    close(kafkaParams)
```
This is probably fine; is there a way to close this earlier, when the task fails?
Since any part of the code can throw an exception, which may or may not be caught, I thought this was the safest solution. The other consideration was that the consumer side works in a similar way without problems.
@gaborgsomogyi We cannot close a cached producer that can still be used by other tasks. A Kafka producer can be shared by all tasks that are using the same Kafka parameters. It is different from the consumer cache.
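To illustrate the sharing described here, a minimal sketch of a parameter-keyed producer cache (the names and the `ConcurrentHashMap` backing are assumptions for illustration, not the actual `CachedKafkaProducer` internals, which use Guava with timeout-based eviction):

```scala
import java.{util => ju}
import java.util.concurrent.ConcurrentHashMap

import org.apache.kafka.clients.producer.KafkaProducer

// Hypothetical, simplified cache: the key is the Kafka parameter map, so
// every task on the executor that writes with the same parameters gets the
// same KafkaProducer instance back.
object ProducerCacheSketch {
  private val cache =
    new ConcurrentHashMap[ju.Map[String, Object], KafkaProducer[Array[Byte], Array[Byte]]]()

  def getOrCreate(params: ju.Map[String, Object]): KafkaProducer[Array[Byte], Array[Byte]] =
    cache.computeIfAbsent(params, p => new KafkaProducer[Array[Byte], Array[Byte]](p))

  // Closing by key tears the producer down for *all* tasks that share it,
  // which is why a retry of one task must not blindly call this.
  def close(params: ju.Map[String, Object]): Unit = {
    val producer = cache.remove(params)
    if (producer != null) producer.close()
  }
}
```

Under this model, `close(kafkaParams)` from a retried task would also pull the producer out from under healthy tasks that happen to use the same parameters.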
But isn't the assumption that a bad producer will cause all those tasks to fail anyway? This would recover from that situation (and prevent the task retries from failing).
It may be that the task failed for other reasons and other tasks using the same producer would make progress, but that sounds both less likely and more complicated to handle.
@vanzin even if a bad producer can happen, this approach is still not correct. The newly created producer can be closed immediately by a retry attempt of a different task.
AFAIK, the current issue about the cached Kafka producer is https://issues.apache.org/jira/browse/SPARK-21869, which can definitely be solved in a smarter way.
By the way, I have never seen anyone report an issue about corrupt Kafka producers in the Spark or Kafka community. @gaborgsomogyi do you have any ticket related to this one?
> The newly created producer can be closed immediately by a retry attempt of a different task.
Good point. Seems hard to solve without keeping more state about the producer... :-/
Agree with @zsxwing, and since the Kafka producer is designed to be thread-safe, it should have a self-healing mechanism in itself to prevent one broken request-response from breaking the others.
> The newly created producer can be closed immediately by a retry attempt of a different task.

I also think it's a good point. Some sort of `if (!inUse) close()` mechanism would be correct (a rough sketch of the idea follows below).
@zsxwing Just for the sake of my deeper understanding: in which scenario can it happen that two tasks in the same executor are writing the same topic-partition?
@ScrapCodes are you proceeding with SPARK-21869? This PR needs the `inUse` flag you've shown in #19096. Happy to help in any way.
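For illustration only, a minimal sketch of such a guard, assuming hypothetical names (`GuardedProducer`, `acquire`, `release`, `invalidate`) rather than the actual SPARK-21869 / #19096 design:

```scala
import java.util.concurrent.atomic.AtomicInteger

import org.apache.kafka.clients.producer.KafkaProducer

// Hypothetical reference-counting wrapper: a producer marked for close is
// only really closed once no task is using it anymore.
class GuardedProducer(producer: KafkaProducer[Array[Byte], Array[Byte]]) {
  private val inUse = new AtomicInteger(0)
  @volatile private var markedForClose = false

  def acquire(): KafkaProducer[Array[Byte], Array[Byte]] = {
    inUse.incrementAndGet()
    producer
  }

  def release(): Unit = {
    if (inUse.decrementAndGet() == 0 && markedForClose) producer.close()
  }

  // Called on task retry instead of close(): new tasks no longer see the
  // producer (the cache entry is removed elsewhere), while tasks already
  // holding it keep using it until they release it.
  def invalidate(): Unit = {
    markedForClose = true
    if (inUse.get() == 0) producer.close()
  }
}
```

A production version would also need to guard against the race between `release` and `invalidate` (e.g. by closing under a lock), but the sketch shows the shape of the `inUse` idea.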
@gaborgsomogyi, I wanted to revive it soon; sorry for the delay. Now I am on it. I will need your help for sure, to discuss possible approaches.
Cool, ping me and I'll come...
vanzin left a comment:
Looks ok except for a small test thing.
```scala
  CachedKafkaProducer.invokePrivate(getAsMap())
}

private def getCacheMapItem(map: ConcurrentMap[Seq[(String, Object)], KP], offset: Int): KP = {
```
Hmm... maps don't necessarily have deterministic iteration order, so this method only really makes sense if the map has a single item. Since you always call it with 0 as the offset anyway, it'd be better to just simplify it (`map.values().iterator().next()`).
Or maybe explicitly use `map.get(kafkaParams)` in the tests.
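A sketch of the suggested simplification, assuming `KP` is the test suite's producer type alias and that the test guarantees a single cached entry (the helper name is illustrative):

```scala
import java.util.concurrent.ConcurrentMap

// Assumes the test has put exactly one producer into the cache, so
// iteration order no longer matters and no offset parameter is needed.
private def getSingleCachedProducer(map: ConcurrentMap[Seq[(String, Object)], KP]): KP = {
  assert(map.size() == 1, "expected exactly one cached producer")
  map.values().iterator().next()
}
```

Alternatively, looking the entry up by its key (`map.get(...)` with the Kafka parameters converted to the cache's key type) avoids the single-entry assumption altogether.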
(I don't really have enough context to meaningfully review this.)
I think this shouldn't be mixed up with the solution for SPARK-21869 and has to be kept as a separate feature. On the other hand, SPARK-21869 is definitely needed to proceed with this.
Can one of the admins verify this patch?
#25853 has been merged, which allows invalidating producers (and not closing them when in use) in case of task retry. I'm going to come up with a new PR soon...
What changes were proposed in this pull request?
If a task is failing due to a corrupt cached `KafkaProducer` and the task is retried in the same executor, then the task gets the same `KafkaProducer` over and over again unless it's invalidated by the timeout configured with `spark.kafka.producer.cache.timeout`, which is not really probable. After several retries the query stops.

In this PR I'm closing the old cached `KafkaProducer` and reopening a new one. The functionality is similar to the `KafkaConsumer` side here.

How was this patch tested?
Additional unit tests + on cluster.