Source doesn't resume partitions after a failed seek() operation #1333

gygabyte · 2021-03-04T17:24:29Z

A typical way to achieve exactly-once semantics when consuming from Kafka is to store offsets outside Kafka, atomically with the corresponding state. For that, the Alpakka library provides the committablePartitionedManualOffsetSource source, where the offsets to start consuming from are supplied through an onAssign function whose result is a Future containing a TopicPartition -> Offset (Long) entry for every partition assigned to this consumer.

However, because this operation is asynchronous, a rebalance can occur between the assignment and the completion of the Future with the offsets.
The source will then attempt to seek to all of these TopicPartition(s), but some of them may no longer be assigned to this Kafka consumer.
https://github.com/akka/alpakka-kafka/blob/master/core/src/main/scala/akka/kafka/internal/SubSourceLogic.scala#L154
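The race can be illustrated without Kafka or Akka. The following is only a minimal sketch: `TopicPartition`, `store`, `offsetsOnAssign`, and the `gate` promise are hypothetical stand-ins, not Alpakka API. It shows that offsets looked up for one assignment can include partitions that are gone by the time the Future completes.

```scala
import scala.concurrent.{Await, Future, Promise}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

object RebalanceRaceSketch {
  // Hypothetical stand-in for org.apache.kafka.common.TopicPartition.
  final case class TopicPartition(topic: String, partition: Int)

  val tp0 = TopicPartition("topic", 0)
  val tp1 = TopicPartition("topic", 1)

  // Hypothetical external offset store.
  val store: Map[TopicPartition, Long] = Map(tp0 -> 42L, tp1 -> 7L)

  // Same shape as the onAssign function described above:
  // Set[TopicPartition] => Future[Map[TopicPartition, Long]].
  // The gate delays completion to simulate a slow lookup.
  def offsetsOnAssign(gate: Promise[Unit])(assigned: Set[TopicPartition]): Future[Map[TopicPartition, Long]] =
    gate.future.map(_ => assigned.map(tp => tp -> store.getOrElse(tp, 0L)).toMap)

  // Returns the partitions the source would try to seek to
  // although they are no longer assigned.
  def staleAfterRebalance(): Set[TopicPartition] = {
    val gate = Promise[Unit]()
    val firstAssignment = Set(tp0, tp1)
    val pending = offsetsOnAssign(gate)(firstAssignment)

    // A rebalance revokes tp1 while the lookup is still in flight.
    val currentAssignment = Set(tp0)

    gate.success(()) // the lookup completes only now
    val offsets = Await.result(pending, 5.seconds)
    offsets.keySet -- currentAssignment
  }
}
```

In this sketch `staleAfterRebalance()` returns the revoked partition, which is the one a subsequent seek would fail on.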

This results in a failure (in the internal KafkaConsumerActor) that is sent back to the SubSourceLogic actor. However, the SubSourceLogic only expects an AskTimeoutException and not a Failure(Throwable) message.

https://github.com/akka/alpakka-kafka/blob/master/core/src/main/scala/akka/kafka/internal/SubSourceLogic.scala#L157
https://github.com/akka/alpakka-kafka/blob/master/core/src/main/scala/akka/kafka/internal/KafkaConsumerActor.scala#L287

The end result is that the Source does not resume the partition sources, but also doesn't fail the consumer/source. It is a silent failure, and there is no indication that the stream will not recover from it.
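The silent failure follows from how `recover` works on a Scala Future: a handler that matches only one exception type leaves every other failure untouched. The sketch below uses only the standard library and is not the SubSourceLogic code; `java.util.concurrent.TimeoutException` stands in for AskTimeoutException, and `seekReply` is a hypothetical stand-in for the failed seek reply.

```scala
import java.util.concurrent.TimeoutException
import java.util.concurrent.atomic.AtomicBoolean
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global
import scala.util.Try

object NarrowRecoverSketch {
  // Returns (handlerRan, futureStillFailed).
  def run(): (Boolean, Boolean) = {
    val handlerRan = new AtomicBoolean(false)

    // Stand-in for the reply to the seek request failing with the
    // exception reported in this issue.
    val seekReply: Future[Unit] =
      Future.failed(new IllegalStateException("No current assignment for partition topic-0"))

    val handled = seekReply
      .map(_ => ()) // on success, the partitions would be resumed here
      .recover {
        // Only the timeout case is handled; anything else passes through.
        case _: TimeoutException => handlerRan.set(true)
      }

    // Nothing observes `handled` in the real stage; here we await it
    // only to inspect the outcome.
    val outcome = Try(Await.result(handled, 5.seconds))
    (handlerRan.get, outcome.isFailure)
  }
}
```

Here `run()` yields `(false, true)`: the handler never runs and the Future stays failed, so neither the resume path nor the failure path is taken.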

I can suggest two approaches for this:

Recover by failing the Source. However, this will cause more rebalances and potentially the same race condition again, and so on, until no rebalance happens between the two operations (assignment and Future completion). It may also cause unnecessary thrashing/load on the external offset store from which the consumers retrieve their offsets.

.recover {
  case _: Exception =>
    stageFailCB.invoke(
      new ConsumerFailed(
        s"$idLogPrefix Consumer failed during seek for partitions: ${offsets.keys.mkString(", ")}."
      )
    )
}

Ignore the seek() failures in the KafkaConsumerActor and just send back Done when completed. I believe this would also be a suitable approach, and it doesn't cause further rebalances that would let the race condition keep occurring.

case Seek(offsets) =>
  offsets.foreach {
    case (tp, offset) =>
      try {
        consumer.seek(tp, offset)
      } catch {
        case NonFatal(e) =>
          // Log and carry on: the partition may have been revoked by a rebalance.
          log.warning("seek failed on consumer from {}, {} -> {}: {}", sender(), tp, offset, e.getMessage)
      }
  }
  sender() ! Done
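The behaviour of this second approach can be tried in isolation. This is a sketch under stated assumptions, not the KafkaConsumerActor implementation: `FakeConsumer` and `seekAll` are hypothetical, and the fake only mimics the consumer's seek throwing IllegalStateException for a partition that is not assigned.

```scala
import scala.collection.mutable
import scala.util.control.NonFatal

object TolerantSeekSketch {
  // Hypothetical stand-in for org.apache.kafka.common.TopicPartition.
  final case class TopicPartition(topic: String, partition: Int)

  // Hypothetical fake consumer: seek fails for unassigned partitions.
  final class FakeConsumer(assigned: Set[TopicPartition]) {
    val positions: mutable.Map[TopicPartition, Long] = mutable.Map.empty

    def seek(tp: TopicPartition, offset: Long): Unit = {
      if (!assigned.contains(tp))
        throw new IllegalStateException(
          s"No current assignment for partition ${tp.topic}-${tp.partition}"
        )
      positions(tp) = offset
    }
  }

  // Seeks every partition, tolerating individual failures.
  // Returns the partitions for which the seek failed.
  def seekAll(consumer: FakeConsumer, offsets: Map[TopicPartition, Long]): Set[TopicPartition] = {
    val failed = Set.newBuilder[TopicPartition]
    offsets.foreach {
      case (tp, offset) =>
        try consumer.seek(tp, offset)
        catch {
          case NonFatal(e) =>
            println(s"seek failed for $tp -> $offset: ${e.getMessage}")
            failed += tp
        }
    }
    failed.result()
  }
}
```

With a consumer that only owns `topic-0`, calling `seekAll` with offsets for both `topic-0` and `topic-1` positions `topic-0` and reports `topic-1` as failed instead of aborting the whole batch.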

I am running on Scala 2.11 with version 2.0.7 of the library. It would be great if we could have another release for Scala 2.11.
This is a critical issue with a large impact in volatile environments (for example, consumers running on AWS spot instances), where consumers may come and go at will.


seglo · 2021-03-09T20:31:28Z

Thanks for raising the issue. What exception is raised when the seek fails? I think it would be acceptable to log a warning for such failures and carry on. Do you have some time to put up a PR?

gygabyte · 2021-03-18T17:09:11Z

Sorry for the delay getting back to you.

The exception is:

java.lang.IllegalStateException: No current assignment for partition topic-0

I should have some time, but need to go through some internal legal stuff before submitting a PR.
