[SPARK-46633][SQL] Fix Avro reader to handle zero-length blocks #44635

sadikovi · 2024-01-09T04:56:04Z

What changes were proposed in this pull request?

This PR fixes a bug in the Avro connector's handling of zero-length blocks. If a file contains such a block, the Avro connector may return an incorrect number of records, or even an empty DataFrame in some cases.

This was due to the way the hasNextRow check worked. The hasNext method in Avro loads the next block, so if that block is empty it returns false and the Avro connector stops reading rows. Instead, the connector should keep checking subsequent blocks until it passes the sync point.
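
To illustrate the failure mode, here is a minimal, self-contained sketch in Python. It is not the connector code: `ToyBlockReader`, `read_stop_on_false`, `read_until_end`, and `past_end` are hypothetical names made up for this example, and `past_end` only stands in for the real check against the sync point. The toy reader mimics the behavior described above, where `has_next` loads the next block and returns false if that block is empty.

```python
from collections import deque


class ToyBlockReader:
    """Hypothetical stand-in for a block-based file reader (not the Avro API)."""

    def __init__(self, blocks):
        self._blocks = [list(b) for b in blocks]
        self._next_block = 0        # index of the next block to load
        self._remaining = deque()   # records left in the current block

    def has_next(self):
        # Loads at most ONE block per call; an empty block yields False
        # even when later blocks still contain records.
        if not self._remaining and self._next_block < len(self._blocks):
            self._remaining = deque(self._blocks[self._next_block])
            self._next_block += 1
        return bool(self._remaining)

    def next(self):
        return self._remaining.popleft()

    def past_end(self):
        # Assumption: stands in for "past the sync point" in the real reader.
        return self._next_block >= len(self._blocks) and not self._remaining


def read_stop_on_false(reader):
    """Old behavior: stop as soon as has_next() returns False."""
    rows = []
    while reader.has_next():
        rows.append(reader.next())
    return rows


def read_until_end(reader):
    """Fixed behavior: keep checking blocks until the end marker is passed."""
    rows = []
    while not reader.past_end():
        if reader.has_next():
            rows.append(reader.next())
    return rows


blocks = [[1, 2], [], [3]]  # the middle block is zero-length
old_rows = read_stop_on_false(ToyBlockReader(blocks))
new_rows = read_until_end(ToyBlockReader(blocks))
```

With the zero-length block in the middle, the first loop stops after the first block and drops the record in the third, while the second loop keeps going and reads all three records.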

Why are the changes needed?

Fixes a correctness bug in the Avro connector.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

I added a unit test and a generated sample file to verify the fix. Without the patch, reading such a file returns fewer records than it actually contains, or none at all, depending on the maxPartitionBytes config.

Was this patch authored or co-authored using generative AI tooling?

No.

sadikovi · 2024-01-09T04:56:33Z

cc @cloud-fan @dongjoon-hyun

cloud-fan

good catch!

HyukjinKwon · 2024-01-10T00:47:05Z

Merged to master.

JoshRosen · 2024-03-22T00:32:20Z

Cross-linking for discoverability: this PR introduced a regression that was fixed in #45578

github-actions bot added SQL AVRO labels Jan 9, 2024

cloud-fan approved these changes Jan 9, 2024

update

41fdc27

sadikovi force-pushed the SPARK-46633 branch from 40a53f6 to 41fdc27 Compare January 9, 2024 20:46

HyukjinKwon approved these changes Jan 10, 2024

HyukjinKwon closed this in 3a6b9ad Jan 10, 2024
