[SPARK-51065][SQL] Disallowing non-nullable schema when Avro encoding is used for TransformWithState #49751

Open
wants to merge 8 commits into
base: master


@ericm-db ericm-db commented Jan 31, 2025

What changes were proposed in this pull request?

Right now, we effectively set all fields in a schema to nullable, regardless of what the user specifies. With this change, if a field is specified as non-nullable and Avro encoding is used, we throw an error.
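
The check described above can be sketched roughly as follows. This is a minimal illustration in plain Python, not the PR's Scala implementation: `Field`, `validate_nullable`, and the `"avro"` string value are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Field:
    """Hypothetical stand-in for a schema field: a name plus a nullable flag."""
    name: str
    nullable: bool


def validate_nullable(col_family: str, fields: List[Field], encoding_format: str) -> None:
    """Raise if any field is non-nullable while Avro encoding is in use.

    Mirrors only the behaviour described in the PR text; the function name
    and the "avro" string value are assumptions made for illustration.
    """
    if encoding_format.lower() != "avro":
        return  # the check is skipped for other encodings
    offending = [f.name for f in fields if not f.nullable]
    if offending:
        raise ValueError(
            f"All fields in the schema for column family {col_family} "
            f"must be nullable; non-nullable fields: {offending}"
        )
```

Under these assumptions, `validate_nullable("state", [Field("id", False)], "UnsafeRow")` returns without error, while the same call with `"avro"` raises a `ValueError` naming `id`.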

Why are the changes needed?

To keep the user-specified schema in parity with the schema that we actually use.

Does this PR introduce any user-facing change?

Yes. The following error is thrown if the schema is defined as non-nullable and Avro encoding is used:

Traceback (most recent call last):
  File "/Users/eric.marnadi/spark/python/pyspark/sql/tests/pandas/test_pandas_transform_with_state.py", line 1496, in test_not_nullable_fails
    self._run_evolution_test(
  File "/Users/eric.marnadi/spark/python/pyspark/sql/tests/pandas/test_pandas_transform_with_state.py", line 1344, in _run_evolution_test
    q.processAllAvailable()
  File "/Users/eric.marnadi/spark/python/pyspark/sql/streaming/query.py", line 351, in processAllAvailable
    return self._jsq.processAllAvailable()
  File "/Users/eric.marnadi/spark/python/lib/py4j-0.10.9.9-src.zip/py4j/java_gateway.py", line 1362, in __call__
    return_value = get_return_value(
  File "/Users/eric.marnadi/spark/python/pyspark/errors/exceptions/captured.py", line 258, in deco
    raise converted from None
pyspark.errors.exceptions.captured.StreamingQueryException: [STREAM_FAILED] Query [id = 541c5df0-24e4-4702-b87a-c4edfb6a952c, runId = 4259c7b9-3846-4f73-9204-c3d71b07018c] terminated with exception: [STATE_STORE_SCHEMA_MUST_BE_NULLABLE] If schema evolution is enabled, all the fields in the schema for column family state must be nullable.
Please set the 'spark.sql.streaming.stateStore.encodingFormat' to 'UnsafeRow' or make the schema nullable.
Current schema: StructType(StructField(id,IntegerType,false),StructField(name,StringType,false)) SQLSTATE: XXKST SQLSTATE: XXKST
=== Streaming Query ===
Identifier: evolution_test [id = 541c5df0-24e4-4702-b87a-c4edfb6a952c, runId = 4259c7b9-3846-4f73-9204-c3d71b07018c]
Current Committed Offsets: {}
Current 
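
One way to follow the error's advice to make the schema nullable is to flip the `nullable` flag on every top-level field. The sketch below works on the JSON (dict) form of a Spark schema, the shape produced by `StructType.jsonValue()` and accepted by `StructType.fromJson()` in PySpark. The helper name `make_fields_nullable` is hypothetical, and it only touches top-level fields, matching the fields listed in the error above.

```python
import copy


def make_fields_nullable(schema_json: dict) -> dict:
    """Return a copy of a struct schema dict with every top-level field nullable."""
    result = copy.deepcopy(schema_json)  # leave the caller's schema untouched
    for field in result["fields"]:
        field["nullable"] = True
    return result


# JSON form of the schema shown in the error message above.
strict_schema = {
    "type": "struct",
    "fields": [
        {"name": "id", "type": "integer", "nullable": False, "metadata": {}},
        {"name": "name", "type": "string", "nullable": False, "metadata": {}},
    ],
}

nullable_schema = make_fields_nullable(strict_schema)
```

With PySpark available, `StructType.fromJson(nullable_schema)` would turn the result back into a `StructType`. Defining the fields as nullable in the first place is simpler where that is an option.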

How was this patch tested?

Unit tests

Was this patch authored or co-authored using generative AI tooling?

No

"STATE_STORE_SCHEMA_MUST_BE_NULLABLE" : {
  "message" : [
    "If schema evolution is enabled, all the fields in the schema for column family <columnFamilyName> must be nullable.",
    "Please set the 'spark.sql.streaming.stateStore.encodingFormat' to 'UnsafeRow' or make the schema nullable.",

This is stored in the offset log, though, so maybe just say that they should make the schema nullable?

schemas.map { case (colFamilyName, schema) =>
  // assert that each field is nullable if schema evolution is enabled
  schema.valueSchema.fields.foreach { field =>
    if (!field.nullable && ensureNullableFields) {

Should we just always enforce this for transformWithState?

@anishshri-db

@ericm-db - can you add the SPARK ticket to the PR title?

@anishshri-db

@ericm-db - also, is the test failure related to the change?

@ericm-db ericm-db changed the title from "Disallowing non-nullable schema when Avro encoding is used for TransformWithState" to "[SPARK-51065] Disallowing non-nullable schema when Avro encoding is used for TransformWithState" Feb 3, 2025
@HyukjinKwon HyukjinKwon changed the title from "[SPARK-51065] Disallowing non-nullable schema when Avro encoding is used for TransformWithState" to "[SPARK-51065][SQL] Disallowing non-nullable schema when Avro encoding is used for TransformWithState" Feb 3, 2025
schemas.map { case (colFamilyName, schema) =>
  // assert that each field is nullable if schema evolution is enabled
  schema.valueSchema.fields.foreach { field =>
    if (!field.nullable && shouldCheckNullable && !isInternal(colFamilyName)) {

Why do we need to treat internal column families differently?
