
[SPARK-50017] Support Avro encoding for TransformWithState operator #48401

Open · wants to merge 10 commits into base: master
Conversation

@ericm-db (Contributor) commented Oct 9, 2024

What changes were proposed in this pull request?

Currently, stateful streaming operators store state in the StateStore using UnsafeRow's internal byte representation. This PR introduces Avro serialization and deserialization capabilities in the RocksDBStateEncoder so that state can instead be stored with Avro encoding. This is currently enabled for the TransformWithState operator via a SQLConf, and supports all functionality of that operator.

Why are the changes needed?

UnsafeRow is an inherently unstable format that makes no backward-compatibility guarantees. Therefore, if the format changes between Spark releases, existing state in the StateStore could be corrupted. Avro is more stable, and natively supports schema evolution.
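To illustrate the schema-evolution property the PR relies on (a hypothetical sketch, not the PR's code, with maps standing in for Avro records): an Avro reader resolves an old record against a newer schema by filling fields added later from their declared defaults.

```java
import java.util.HashMap;
import java.util.Map;

public class EvolutionSketch {
    // Resolve an old-format record against a newer schema: fields added to the
    // schema are filled from their defaults; fields present in the record win.
    public static Map<String, Object> resolve(Map<String, Object> oldRecord,
                                              Map<String, Object> newDefaults) {
        Map<String, Object> out = new HashMap<>(newDefaults);
        out.putAll(oldRecord);
        return out;
    }

    public static void main(String[] args) {
        Map<String, Object> old = Map.of("count", 3L);
        // "ttlMs" is a field added in a later schema version, with default -1.
        Map<String, Object> defaults = Map.of("count", 0L, "ttlMs", -1L);
        System.out.println(resolve(old, defaults));
    }
}
```

Real Avro performs this resolution via writer/reader schema pairs; the sketch only shows why a default-carrying format survives schema changes where a fixed byte layout would not.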

Does this PR introduce any user-facing change?

No

How was this patch tested?

Amended existing unit tests and added new ones.

Was this patch authored or co-authored using generative AI tooling?

No

@ericm-db ericm-db changed the title [WIP] Avrfo [WIP] Avro Oct 9, 2024
@ericm-db ericm-db changed the title [WIP] Avro [SPARK-50017] Support Avro encoding for TransformWithState operator - ValueState Oct 17, 2024
Member:

why do we need to move this file?

Contributor (author):

Because it's used in AvroOptions

Member:

Have we considered introducing a deprecated class under org.apache.spark.sql.avro that retains all the existing public methods, while moving their implementations into sql/core?

Contributor (author):

Sure, we can do this.
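The suggestion above can be sketched as a deprecated forwarder: the old public entry point stays in place and delegates to the relocated implementation. All names below are hypothetical stand-ins, not the actual Spark classes.

```java
public class ForwarderSketch {
    // Stand-in for the implementation after it moves to sql/core.
    static class NewSchemaConverters {
        static String toAvroType(String catalystType) {
            return "avro(" + catalystType + ")"; // placeholder for the real conversion
        }
    }

    // Old public entry point, kept for source compatibility but deprecated.
    @Deprecated
    static class SchemaConverters {
        static String toAvroType(String catalystType) {
            // Delegate to the relocated implementation so existing callers keep working.
            return NewSchemaConverters.toAvroType(catalystType);
        }
    }

    public static void main(String[] args) {
        System.out.println(SchemaConverters.toAvroType("int"));
    }
}
```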

sql/core/pom.xml: review comments (outdated, resolved)
@ericm-db ericm-db changed the title [SPARK-50017] Support Avro encoding for TransformWithState operator - ValueState [SPARK-50017] Support Avro encoding for TransformWithState operator - ValueState, ListState Oct 24, 2024

@deprecated("Use org.apache.spark.sql.core.avro.SchemaConverters instead", "4.0.0")
@Evolving
object DeprecatedSchemaConverters {
Member:

Let's keep the name SchemaConverters and don't have Deprecated in the object name

@@ -104,7 +106,10 @@ case class TransformWithStateExec(
* @return a new instance of the driver processor handle
*/
private def getDriverProcessorHandle(): DriverStatefulProcessorHandleImpl = {
val driverProcessorHandle = new DriverStatefulProcessorHandleImpl(timeMode, keyEncoder)

Contributor:

nit: extra newline ?

def encodePrefixKeyForRangeScan(
row: UnsafeRow,
avroType: Schema
): Array[Byte] = {
Contributor:

nit: let's confirm the style here

@@ -91,7 +91,8 @@ case class CkptIdCollectingStateStoreWrapper(innerStore: StateStore) extends Sta
valueSchema: StructType,
keyStateEncoderSpec: KeyStateEncoderSpec,
useMultipleValuesPerKey: Boolean = false,
isInternal: Boolean = false): Unit = {
isInternal: Boolean = false,
avroEnc: Option[AvroEncoder]): Unit = {
Contributor:

nit: let's use default args here as well?

StructField("key2", StringType, false),
StructField("ordering-2", IntegerType, false),
Contributor:

can we add a test to verify the behavior if '-' is used within the state var names, since it's not supported in Avro?

@anishshri-db (Contributor) left a comment:

lgtm with pending nits

@brkyvz (Contributor) left a comment:

Working through the PR, but some first comments to work on while I continue review

@@ -24,6 +24,7 @@ import org.apache.avro.generic.GenericDatumReader
import org.apache.avro.io.{BinaryDecoder, DecoderFactory}

import org.apache.spark.SparkException
import org.apache.spark.sql.avro.SchemaConverters
Contributor:

is this a stray change?

Contributor (author):

No, because we changed the directory of the file, we had to add imports.

Comment on lines 2213 to 2214
.checkValue(v => Set("UnsafeRow", "Avro").contains(v),
"Valid values are 'UnsafeRow' and 'Avro'")
Contributor:

nit: do we want to be case insensitive here?
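The case-insensitivity nit can be addressed by normalizing before membership checks, e.g. (a minimal sketch, not the SQLConf API itself):

```java
import java.util.Locale;
import java.util.Set;

public class EncodingCheck {
    // Canonical lowercase names of the supported encoders.
    private static final Set<String> VALID = Set.of("unsaferow", "avro");

    // Accept any casing of the supported encoder names.
    public static boolean isValid(String v) {
        return v != null && VALID.contains(v.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(isValid("Avro") + " " + isValid("UNSAFEROW") + " " + isValid("json"));
    }
}
```

Using Locale.ROOT avoids locale-dependent surprises (e.g. the Turkish dotless i) when lowercasing config values.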

Comment on lines 269 to 271
case statefulOp: StatefulOperator =>
statefulOp match {
case op: TransformWithStateExec =>
Contributor:

nit: I assume you're doing this two step matching, because the avro serde will be added to other operators too in follow ups?

Contributor (author):

Yeah, it will be.

Comment on lines 119 to 121
* Fetching the columnFamilySchemas from the StatefulProcessorHandle
* after init is called.
*/
private def getColFamilySchemas(): Map[String, StateStoreColFamilySchema] = {
def getColFamilySchemas(): Map[String, StateStoreColFamilySchema] = {
val columnFamilySchemas = getDriverProcessorHandle().getColumnFamilySchemas
closeProcessorHandle()
columnFamilySchemas
Contributor:

should this be moved to a static method?

Contributor (author):

Good question - I guess if we can pass the stateful processor in, it can be.

Contributor (author):

Actually, I don't think you can make it static: we need the statefulProcessor that is passed into this particular instance of the TransformWithStateExec class.

Comment on lines 272 to 274
op.copy(
columnFamilySchemas = op.getColFamilySchemas()
)
Contributor:

this is a bit confusing imho. If you have the getColFamilySchemas method as part of the class available, why do you have to set it on the class with a copy.

Two possible suggestions:

  1. Make the getColFamilySchemas a static method. Not sure if that's possible though looking at the logic a bit more in TransformWithStateExec. It feels weird that you're opening and closing these handles just to get some of the information out.
  2. Add a comment here that this needs to be run on the Driver, and also instead rename the method to: withColumnFamilySchemas which calls copy internally.
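The second suggestion (a copy-returning "with" method) can be sketched generically; the class and field names below are illustrative, not the real TransformWithStateExec:

```java
import java.util.List;

public class WithCopySketch {
    // An immutable node returns a modified copy instead of being mutated in place.
    static final class ExecNode {
        final List<String> columnFamilySchemas;

        ExecNode(List<String> schemas) {
            this.columnFamilySchemas = schemas;
        }

        // Named with* so callers see that a new instance is produced.
        ExecNode withColumnFamilySchemas(List<String> schemas) {
            return new ExecNode(schemas);
        }
    }

    public static void main(String[] args) {
        ExecNode base = new ExecNode(List.of());
        ExecNode updated = base.withColumnFamilySchemas(List.of("valueState"));
        System.out.println(base.columnFamilySchemas.size() + " " + updated.columnFamilySchemas.size());
    }
}
```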

ttlValSchema,
Some(RangeKeyScanStateEncoderSpec(ttlKeySchema, Seq(0))),
avroEnc = getAvroSerde(
StructType(ttlKeySchema.drop(2)),
Contributor:

this drop(2) looks magical. Can you add a comment mentioning that these represent the null/positive/negative byte and the big endian representation?
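The layout being referenced can be sketched as follows: range-scan keys are prefixed with a marker byte (distinguishing null/negative/positive) followed by the big-endian bytes of the value, so that unsigned lexicographic byte order matches numeric order. The marker constants below are illustrative; the real encoder's values may differ.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class RangeScanSketch {
    // Illustrative marker values; the actual encoder's constants may differ.
    static final byte NEGATIVE = 0x01;
    static final byte POSITIVE = 0x02;

    // 1 marker byte + 8 big-endian bytes. Negatives sort before positives via
    // the marker; within a sign, big-endian two's-complement bytes compare
    // correctly as unsigned values.
    public static byte[] encodeLong(long v) {
        ByteBuffer buf = ByteBuffer.allocate(9);
        buf.put(v < 0 ? NEGATIVE : POSITIVE);
        buf.putLong(v); // ByteBuffer writes big-endian by default
        return buf.array();
    }

    public static void main(String[] args) {
        byte[] a = encodeLong(-100), b = encodeLong(-5), c = encodeLong(7);
        System.out.println(Arrays.compareUnsigned(a, b) < 0 && Arrays.compareUnsigned(b, c) < 0);
    }
}
```

This is why the first bytes get dropped when deriving the Avro schema for the remaining key columns: they are ordering scaffolding, not row data.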

Comment on lines 194 to 198
// This function creates the StateStoreColFamilySchema for
// the TTL secondary index.
// Because we want to encode fixed-length types as binary types
// if we are using Avro, we need to do some schema conversion to ensure
// we can use range scan
Contributor:

ditto on docs

Contributor:

Also please specify when this method should be used and not the one above

ttlValSchema,
Some(RangeKeyScanStateEncoderSpec(ttlKeySchema, Seq(0))),
avroEnc = getAvroSerde(
StructType(ttlKeySchema.drop(2)),
Contributor:

ditto

))
}

// This function creates the StateStoreColFamilySchema for
Contributor:

ditto

valSchema,
Some(RangeKeyScanStateEncoderSpec(keySchema, Seq(0))),
avroEnc = getAvroSerde(
StructType(avroKeySchema.drop(2)),
Contributor:

ditto on comment

@brkyvz (Contributor) left a comment:

I think we have an abstraction leak here. Ideally the Avro encoders should be created in the StateStore, not passed around in the XStateImpl classes. The serialization format should be the duty of the StateStore; giving the state impl classes knowledge of the serialization format seems like an abstraction leak.

Comment on lines 126 to 130
if (!schemas.contains(stateName)) {
None
} else {
schemas(stateName).avroEnc
}
Contributor:

schemas.get(stateName).map(_.avroEnc)?
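One subtlety with the suggested one-liner: if avroEnc is itself an Option, a plain map produces a nested Option, so flatMap is the flattening form. Sketched here with Java's Optional and hypothetical types:

```java
import java.util.Map;
import java.util.Optional;

public class LookupSketch {
    // schemas maps a state name to a schema whose avroEnc is itself optional.
    public static Optional<String> avroEncFor(Map<String, Optional<String>> schemas,
                                              String stateName) {
        // map(...) here would yield Optional<Optional<String>>; flatMap flattens it.
        return Optional.ofNullable(schemas.get(stateName)).flatMap(enc -> enc);
    }

    public static void main(String[] args) {
        Map<String, Optional<String>> schemas = Map.of(
            "listState", Optional.of("enc"),
            "valueState", Optional.empty());
        System.out.println(avroEncFor(schemas, "listState").orElse("none") + " "
            + avroEncFor(schemas, "missing").orElse("none"));
    }
}
```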

@@ -343,7 +497,8 @@ class StatefulProcessorHandleImpl(
* actually done. We need this class because we can only collect the schemas after
* the StatefulProcessor is initialized.
*/
class DriverStatefulProcessorHandleImpl(timeMode: TimeMode, keyExprEnc: ExpressionEncoder[Any])
class DriverStatefulProcessorHandleImpl(
timeMode: TimeMode, keyExprEnc: ExpressionEncoder[Any], initializeAvroEnc: Boolean)
Contributor:

nit: one line per parameter please

Comment on lines 599 to 605
(f: StateStore => CompletionIterator[InternalRow, Iterator[InternalRow]]):
(f: StateStore =>
CompletionIterator[InternalRow, Iterator[InternalRow]]):
Contributor:

uber nit: change necessary?

Contributor:

also nit: should we use type aliasing to shorten this CompletionIterator[InternalRow, Iterator[InternalRow]]? Like type ResultIterator = CompletionIterator[InternalRow, Iterator[InternalRow]]

Comment on lines 740 to 750
def getDriverProcessorHandle(): DriverStatefulProcessorHandleImpl = {
val driverProcessorHandle = new DriverStatefulProcessorHandleImpl(
timeMode, keyEncoder, initializeAvroEnc =
avroEncodingEnabled(stateStoreEncoding))
driverProcessorHandle.setHandleState(StatefulProcessorHandleState.PRE_INIT)
statefulProcessor.setHandle(driverProcessorHandle)
statefulProcessor.init(outputMode, timeMode)
driverProcessorHandle
}

val columnFamilySchemas = getDriverProcessorHandle().getColumnFamilySchemas
Contributor:

nit: maybe if you add a withColumnFamilySchema method, you can remove the need for this duplication and can just call it below after creating the class

@@ -104,7 +106,9 @@ case class TransformWithStateExec(
* @return a new instance of the driver processor handle
*/
private def getDriverProcessorHandle(): DriverStatefulProcessorHandleImpl = {
Contributor:

nit: should this method have an assertion that it is being called on the Driver?

@brkyvz (Contributor) left a comment:

The changes are SOOO much cleaner now, thank you. It can get even cleaner though:

  1. I feel like you can add a Serde interface for the StateEncoder code changes. That should simplify the code even further
  2. Any reason we just didn't extend the suites with a different SQLConf to test out the different encoding type? I feel that would remove a ton of code changes as well

@@ -563,13 +684,233 @@ class RangeKeyScanStateEncoder(
writer.getRow()
}

def encodePrefixKeyForRangeScan(
Contributor:

Can you add a scaladoc please?

out.toByteArray
}

def decodePrefixKeyForRangeScan(
Contributor:

ditto on scaladoc please

virtualColFamilyId: Option[Short] = None)
extends RocksDBKeyStateEncoderBase(useColumnFamilies, virtualColFamilyId) {
virtualColFamilyId: Option[Short] = None,
avroEnc: Option[AvroEncoder] = None)
Contributor:

Instead of avroEnc, I would honestly introduce another interface:

trait Serde {

  def encodeToBytes(...)

  def decodeToUnsafeRow(...)
  
  def encodePrefixKeyForRangeScan(...)

  def decodePrefixKeyForRangeScan(...)
}

and move the logic in there so that you don't have to keep on doing avroEnc.isDefined for these

The logic seems pretty similar except for the input data. The AvroStateSerde or whatever you want to name it would have the private lazy val remainingKeyAvroType = SchemaConverters.toAvroType(remainingKeySchema)

Contributor (author):

Spoke offline - it doesn't look like this simplifies things an awful lot - can be a follow-up.

virtualColFamilyId: Option[Short] = None)
extends RocksDBKeyStateEncoderBase(useColumnFamilies, virtualColFamilyId) {
virtualColFamilyId: Option[Short] = None,
avroEnc: Option[AvroEncoder] = None)
Contributor:

ditto on the Serde.

Some(newColFamilyId), avroEnc), RocksDBStateEncoder.getValueEncoder(valueSchema,
useMultipleValuesPerKey, avroEnc)))
}
private def getAvroSerializer(schema: StructType): AvroSerializer = {
Contributor:

nit: line before the method please

@@ -74,10 +75,71 @@ private[sql] class RocksDBStateStoreProvider
isInternal: Boolean = false): Unit = {
verifyColFamilyCreationOrDeletion("create_col_family", colFamilyName, isInternal)
val newColFamilyId = rocksDB.createColFamilyIfAbsent(colFamilyName)
// Create cache key using store ID to avoid collisions
val avroEncCacheKey = s"${stateStoreId.operatorId}_" +
Contributor:

Do we have the stream runId (maybe it's available in the HadoopConf)? We should add runId, otherwise there could be collisions

Comment on lines 41 to 42
// Avro encoder that is used by the RocksDBStateStoreProvider and RocksDBStateEncoder
// in order to serialize from UnsafeRow to a byte array of Avro encoding.
Contributor:

Can you please turn this into a proper scaladoc?

/**
 * ...
 */

TestWithBothChangelogCheckpointingEnabledAndDisabled ) { colFamiliesEnabled =>
val testSchema: StructType = StructType(
Seq(
StructField("ordering-1", LongType, false),
Contributor:

oh, why'd you have to change these? If these are not supported by Avro, do we have any check anywhere to disallow the usage of the Avro encoder?

@ericm-db (Contributor, author) commented Nov 21, 2024:

Avro code would just throw an error, saying that there are invalid characters in the field name
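For context on why such names fail (a minimal sketch of the rule, not Spark's or Avro's actual validation code): the Avro specification requires names to match [A-Za-z_][A-Za-z0-9_]*, so a field like "ordering-2" is rejected before any record can be written.

```java
import java.util.regex.Pattern;

public class AvroNameCheck {
    // The name rule from the Avro specification: a letter or underscore,
    // followed by letters, digits, or underscores.
    private static final Pattern AVRO_NAME = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");

    public static boolean isValidAvroName(String name) {
        return name != null && AVRO_NAME.matcher(name).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidAvroName("ordering-2") + " " + isValidAvroName("ordering_2"));
    }
}
```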

Comment on lines 131 to 132
def testWithEncodingTypes(testName: String, testTags: Tag*)
(testBody: => Any): Unit = {
Contributor:

one parameter per line like below please

@brkyvz (Contributor) commented Nov 21, 2024:

oh forgot - we need to add the stream run id to the Avro encoder cache key, otherwise we may risk some unintended re-use of avro encoders. we should limit the size of that cache and add expiry to it
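A size-bounded, access-ordered cache of the kind being asked for can be sketched with LinkedHashMap's LRU hook; the key scheme and capacity below are hypothetical, and a production version would likely also want time-based expiry (e.g. via a library like Guava or Caffeine):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of a bounded LRU cache for encoder reuse. Keying by run id plus
// operator/state ids avoids cross-run collisions; the size bound keeps
// entries from abandoned runs from accumulating.
public class EncoderCacheSketch<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    public EncoderCacheSketch(int maxEntries) {
        super(16, 0.75f, true); // accessOrder = true gives LRU eviction order
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called after each put; returning true evicts the least recently used entry.
        return size() > maxEntries;
    }

    public static void main(String[] args) {
        EncoderCacheSketch<String, String> cache = new EncoderCacheSketch<>(2);
        cache.put("run1_op0_state", "encA");
        cache.put("run1_op1_state", "encB");
        cache.put("run2_op0_state", "encC"); // evicts the least recently used entry
        System.out.println(cache.size() + " " + cache.containsKey("run1_op0_state"));
    }
}
```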

@brkyvz (Contributor) left a comment:

LGTM!
