
Conversation

@jerrypeng jerrypeng (Contributor) commented Oct 25, 2025

What changes were proposed in this pull request?

Add support for Real-time Mode in the Kafka source. This means KafkaMicroBatchStream needs to implement the SupportsRealTimeMode interface and KafkaPartitionBatchReader needs to extend the SupportRealTimeRead interface.
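A structural sketch of what that wiring looks like (stand-in trait and class stubs only; the real interfaces and their members live in Spark's DSv2 connector API):

```scala
// Stand-in declarations so the sketch is self-contained; NOT the Spark definitions.
trait SupportsRealTimeMode            // marks a MicroBatchStream as real-time capable
trait SupportRealTimeRead             // lets a partition reader keep polling for new records
abstract class MicroBatchStream       // stand-in for the DSv2 streaming interface
abstract class PartitionReader[T]     // stand-in for the DSv2 reader interface

// The shape of the change described above:
class KafkaMicroBatchStream extends MicroBatchStream with SupportsRealTimeMode
class KafkaPartitionBatchReader[T] extends PartitionReader[T] with SupportRealTimeRead
```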

Why are the changes needed?

So that the Kafka source and sink can be used by Real-time Mode queries.

Does this PR introduce any user-facing change?

Yes, the Kafka source and sink can now be used by Real-time Mode queries.

How was this patch tested?

Many tests were added.

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot removed the CORE label Oct 28, 2025
@jerrypeng jerrypeng changed the title [WIP] [SPARK-54027] Kafka Source RTM support [SPARK-54027] Kafka Source RTM support Oct 28, 2025
)
}

// This function is used by Low Latency Mode, where we expect 1:1 mapping between a
Member

Suggested change
// This function is used by Low Latency Mode, where we expect 1:1 mapping between a
// This function is used by real time mode, where we expect 1:1 mapping between a

Comment on lines +119 to +125
if (record.timestampType() == TimestampType.LOG_APPEND_TIME ||
    record.timestampType() == TimestampType.CREATE_TIME) {
  if (!timestampTypeLogged) {
    logInfo(log"Kafka source record timestamp type is " +
      log"${MDC(LogKeys.TIMESTAMP_COLUMN_NAME, record.timestampType())}")
    timestampTypeLogged = true
  }
Member

Could you explain more about this logging behavior? Why do we need this logging?

Contributor Author

This tells us the semantics of the timestamp column of a Kafka record, i.e. whether the timestamp for records from this topic is set to the log append time (when the record is persisted by the Kafka brokers) or the create time (which is either when the record is produced by a Kafka producer or is user-defined). This information is used when calculating latency, to understand what journey we are actually measuring.
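For context, a minimal sketch of the kind of latency computation this enables (the helper below is hypothetical, not part of the PR):

```scala
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.record.TimestampType

// Hypothetical helper: what a per-record latency means depends on the timestamp type.
// LOG_APPEND_TIME measures broker-persist -> processing time; CREATE_TIME measures
// producer/user timestamp -> processing time.
def recordLatencyMs(record: ConsumerRecord[Array[Byte], Array[Byte]]): Option[Long] =
  record.timestampType() match {
    case TimestampType.LOG_APPEND_TIME | TimestampType.CREATE_TIME =>
      Some(System.currentTimeMillis() - record.timestamp())
    case _ => None // NO_TIMESTAMP_TYPE: nothing meaningful to measure
  }
```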

// admin function call. But we consider new partition is rare and getting earliest offset
// aligns with what we do in micro-batch mode and can potentially enable more sanity checks
// in executor side.
val newPartitionOffsets = kafkaOffsetReader.fetchEarliestOffsets(newPartitions.toSeq)
Member

KafkaMicroBatchStream's existing planInputPartitions calls kafkaOffsetReader.getOffsetRangesFromResolvedOffsets to handle partition offsets.

It handles the deleted-partitions case, but this new planInputPartitions doesn't; should we also do the same?

Contributor Author

Kafka doesn't support deleting partitions, so I am not sure that case is worth checking. If the topic was deleted and recreated, the offsets would not be valid and we would fail in that case anyway.

Member

Hmm, this is what is currently in getOffsetRangesFromResolvedOffsets, called by KafkaMicroBatchStream.planInputPartitions:

if (newPartitionInitialOffsets.keySet != newPartitions) {
  // We cannot get from offsets for some partitions. It means they got deleted.
  val deletedPartitions = newPartitions.diff(newPartitionInitialOffsets.keySet)
  reportDataLoss(
    s"Cannot find earliest offsets of ${deletedPartitions}. Some data may have been missed",
    () => KafkaExceptions.initialOffsetNotFoundForPartitions(deletedPartitions))
}

The behavior of reportDataLoss is configurable. It can be a failure, like what you did here, or a log warning.

I would suggest following the existing behavior instead of having two different behaviors.
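For reference, the configurable behavior being referred to is roughly the following (a simplified sketch; in the actual Kafka source the choice is driven by the failOnDataLoss source option):

```scala
// Simplified sketch: the same condition either fails the query or only warns,
// depending on failOnDataLoss.
def reportDataLoss(failOnDataLoss: Boolean)(message: String, cause: () => Throwable): Unit = {
  if (failOnDataLoss) {
    throw cause()
  } else {
    // stand-in for logWarning in the real code
    Console.err.println(s"WARN: $message. Continuing because failOnDataLoss=false.")
  }
}
```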

Contributor Author

ok

Comment on lines 285 to 293
// Filter out new partition offsets that are not 0 and log a warning
val nonZeroNewPartitionOffsets = newPartitionOffsets.filter {
  case (_, offset) => offset != 0
}
// Log the non-zero new partition offsets
if (nonZeroNewPartitionOffsets.nonEmpty) {
  logWarning(log"new partitions should start from offset 0: " +
    log"${MDC(OFFSETS, nonZeroNewPartitionOffsets)}")
}
Member

For the case of new partitions with non-zero offsets, getOffsetRangesFromResolvedOffsets delegates to the reportDataLoss closure. Should we do the same?

Contributor Author

Let me add that and make the behavior consistent.


case class WaitUntilBatchProcessed(batchId: Long) extends StreamAction with StreamMustBeRunning

case object WaitUntilCurrentBatchProcessed extends StreamAction with StreamMustBeRunning
Member

Why do we need this? Can't we use WaitUntilBatchProcessed?

Contributor Author

This waits until the current batch finishes. It is just an easier API to use in tests when we only need to wait for whatever batch is currently running, rather than for a specific batch ID.
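For illustration, a hypothetical usage in a StreamTest-based suite (the stream and data values are placeholders):

```scala
// Hypothetical test sketch: wait for whatever batch is currently running to finish
// before checking results, instead of tracking a concrete batch id.
testStream(kafkaStream)(                  // kafkaStream: a Kafka source mapped to Int values
  StartStream(),
  AddKafkaData(Set(topic), 1, 2, 3),      // helper from the Kafka source test suites
  WaitUntilCurrentBatchProcessed,         // vs. WaitUntilBatchProcessed(batchId = 0)
  CheckAnswer(1, 2, 3)
)
```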

viirya added a commit that referenced this pull request Oct 30, 2025
### What changes were proposed in this pull request?

This patch extracts the same method `getOffsetRangesFromResolvedOffsets` from two `KafkaOffsetReader` implementations.

### Why are the changes needed?

When reviewing #52729, I found that `KafkaOffsetReaderConsumer` and `KafkaOffsetReaderAdmin` have exactly the same `getOffsetRangesFromResolvedOffsets` method. The method is fairly long, so it seems good to extract it into a common one.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #52788 from viirya/kafkaoffsetreader_refactor.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
@jerrypeng jerrypeng requested a review from viirya October 30, 2025 23:25
// If we are in micro-batch mode, we need to get the latest partition offsets at the
// start of the batch and recalculate the latest offsets at the end for backlog
// estimation.
Some(kafkaOffsetReader.fetchLatestOffsets(Some(latestPartitionOffsets)))
Member

This changes the original behavior? Previously it just used latestPartitionOffsets without fetching the latest offsets again.

Contributor Author

This is actually fixing an issue with non-RTM queries using Kafka. The calculation here is not correct and will always result in the backlog metrics being zero. "latestPartitionOffsets" is calculated when "latestOffset" is called at the beginning of a batch. It is basically the offset this batch will read up to, so for non-RTM streaming queries latestConsumedOffset will be the same as latestPartitionOffsets, resulting in zero backlog. What we should be doing is getting the latest offsets from the source Kafka topic after the batch is processed, i.e. when metrics() is called, to calculate a useful backlog metric. I know this is not really related to RTM, so let me know if I should just create a separate PR for this.
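To make the issue concrete, a simplified sketch of the backlog computation being described (names are illustrative, not the actual method):

```scala
import org.apache.kafka.common.TopicPartition

// Illustrative only: backlog is the gap between the newest offsets available in Kafka
// and the offsets the query has consumed. If "available" is snapshotted at the start
// of the batch (the current behavior), it equals "consumed" for non-RTM queries and
// the backlog is always reported as zero.
def estimatedBacklog(
    availableAtMetricsTime: Map[TopicPartition, Long],
    consumedByBatch: Map[TopicPartition, Long]): Long = {
  availableAtMetricsTime.map { case (tp, available) =>
    math.max(0L, available - consumedByBatch.getOrElse(tp, available))
  }.sum
}
```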

Member

Yea, let's focus on RTM in this PR and don't change/fix existing behavior here. Please open a separate PR to fix it if you think it is an issue.

Contributor Author

ok

val latestOffsets = kafkaOffsetReader.fetchLatestOffsets(
  Some(latestConsumedOffset.get.asInstanceOf[KafkaSourceOffset].partitionToOffsets))
val endTime = System.currentTimeMillis()
rtmFetchLatestOffsetsTimeMs = Some(endTime - startTime)
Member

Hmm, I'm not sure if I'm missing something. rtmFetchLatestOffsetsTimeMs is assigned here, but it is not used anywhere?

Contributor Author

Let me remove this variable; it is not necessary.

*
* @param startOffsets, the starting positions to read from, inclusive.
*/
def getIterator(offset: Long): KafkaDataConsumerIterator = {
Member

The param doc says startOffsets instead of offset.

Suggested change
def getIterator(offset: Long): KafkaDataConsumerIterator = {
def getIterator(startOffsets: Long): KafkaDataConsumerIterator = {

props
}

test("Union two kafka streams, for each write to sink") {
Member

Why have this test separately in KafkaRealTimeModeE2ESuite instead of in KafkaRealTimeModeSuite or KafkaRealTimeIntegrationSuite?

KafkaRealTimeModeE2ESuite and KafkaRealTimeIntegrationSuite both look like e2e test suites. Should we have just one e2e test suite?

Contributor Author

There is some difference. KafkaRealTimeModeSuite only has tests that use the StreamTest framework, which allows us to perform step-wise testing that writes to a memory sink. KafkaRealTimeIntegrationSuite and KafkaRealTimeModeE2ESuite are more end-to-end, in a more realistic setting. KafkaRealTimeIntegrationSuite deploys a multi-worker cluster to run tests. KafkaRealTimeModeE2ESuite deploys a local in-process cluster to test a foreach use case. The reason we are not consolidating the two is that it is easier to retrieve results from the foreach sink writer if it is in process. Though we could consolidate the two by creating a Kafka producer in the foreach sink to write results to Kafka. Perhaps as a follow-up item I can look into consolidating KafkaRealTimeModeE2ESuite and KafkaRealTimeIntegrationSuite?

Member

Okay. Thanks for the explanation. Looks good to me.

* Tests with a distributed spark cluster with
* separate executors processes deployed.
*/
class KafkaRealTimeIntegrationSuite
Member

The difference between this test suite and KafkaRealTimeModeSuite is that this one is specifically for a distributed Spark cluster?

Contributor Author

Yes. I explain the differences between the test suites in more detail here:
#52729 (comment)

*/
class KafkaRealTimeIntegrationSuite
extends KafkaSourceTest
with StreamRealTimeModeSuiteBase
Member

Hmm, I saw you created two suite base classes: StreamRealTimeModeE2ESuiteBase and StreamRealTimeModeSuiteBase. This test suite looks like it is for e2e tests too, so why doesn't it use StreamRealTimeModeE2ESuiteBase instead of StreamRealTimeModeSuiteBase?

Contributor Author

We can; it is just that this test suite doesn't really need any of the additional functionality provided in StreamRealTimeModeE2ESuiteBase.

@viirya viirya left a comment (Member)

Overall looks good. Some minor comments.

@viirya viirya left a comment (Member)

To catch tomorrow's cut, since this looks good already, I will merge this later today. If some comments are not addressed yet, we can address them in follow-ups.

cc @HeartSaVioR

Comment on lines 360 to 362
// If we are in micro-batch mode, we need to get the latest partition offsets at the
// start of the batch and recalculate the latest offsets at the end for backlog
// estimation.
Member

This comment is not needed, right?

Contributor Author

Will revert this code based on this thread:
https://github.com/apache/spark/pull/52729/files#r2482113567

So this is not needed

@viirya viirya closed this in 928f253 Nov 1, 2025
@viirya viirya (Member) commented Nov 1, 2025

Merged to master branch.

Thanks @jerrypeng
