@sadikovi (Contributor) commented Jan 9, 2024

What changes were proposed in this pull request?

This PR fixes a bug in the Avro connector's handling of zero-length blocks. If a file contains such a block, the Avro connector may return an incorrect number of records, or in some cases an empty DataFrame.

This was caused by how the hasNextRow check worked. Avro's hasNext method loads the next block, so if that block is empty it returns false and the Avro connector stops reading rows. Instead, the connector should keep checking subsequent blocks until it reaches the sync point.
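To illustrate the failure mode, here is a minimal toy model in Python. It is not the Avro API or the connector code: `ToyBlockReader`, `read_naive`, `has_next_row`, and `read_skipping_empty` are hypothetical names, and the end of the block list stands in for the sync point. It only assumes the behaviour described above, namely that `has_next` loads one block and reports whether that block has records.

```python
class ToyBlockReader:
    """Toy stand-in for a block-based file reader (hypothetical, not the Avro API)."""

    def __init__(self, blocks):
        self._blocks = [list(b) for b in blocks]
        self._next_block = 0   # index of the next block to load
        self._remaining = []   # unread records of the current block

    def blocks_left(self):
        # Stand-in for "we have not reached the sync point yet".
        return self._next_block < len(self._blocks)

    def has_next(self):
        # When the current block is exhausted, load exactly one more block
        # and report whether *that* block has records.
        if not self._remaining and self.blocks_left():
            self._remaining = self._blocks[self._next_block]
            self._next_block += 1
        return bool(self._remaining)

    def next(self):
        return self._remaining.pop(0)


def read_naive(reader):
    # Stops at the first empty block, even if later blocks hold records.
    rows = []
    while reader.has_next():
        rows.append(reader.next())
    return rows


def has_next_row(reader):
    # Keep probing past empty blocks until a record is found or no blocks remain.
    while True:
        if reader.has_next():
            return True
        if not reader.blocks_left():
            return False


def read_skipping_empty(reader):
    rows = []
    while has_next_row(reader):
        rows.append(reader.next())
    return rows


blocks = [[1, 2], [], [3]]
naive_rows = read_naive(ToyBlockReader(blocks))
fixed_rows = read_skipping_empty(ToyBlockReader(blocks))
```

In this model the naive loop drops every record after the empty block, and a file whose first block is empty yields no rows at all, which matches the two symptoms described above.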

Why are the changes needed?

Fixes a correctness bug in the Avro connector.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

I added a unit test and a generated sample file to verify the fix. Without the patch, reading such a file returns fewer records than it actually contains, or none at all, depending on the maxPartitionBytes config.

Was this patch authored or co-authored using generative AI tooling?

No.

@sadikovi (Contributor, Author) commented Jan 9, 2024

cc @cloud-fan @dongjoon-hyun

@cloud-fan (Contributor) left a comment

good catch!

@HyukjinKwon (Member) commented

Merged to master.

@JoshRosen (Contributor) commented

Cross-linking for discoverability: this PR introduced a regression that was fixed in #45578.
