Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-47125][SQL] Return null if Univocity never triggers parsing #45210

Closed
wants to merge 1 commit into from

Conversation

HyukjinKwon
Copy link
Member

What changes were proposed in this pull request?

This PR proposes to prevent null for tokenizer.getContext. This is similar with #28029. getContext seemingly via the univocity library, it can return null if begingParsing is not invoked (https://github.com/uniVocity/univocity-parsers/blob/master/src/main/java/com/univocity/parsers/common/AbstractParser.java#L53). This can happen when parseLine is not invoked at

- parseLine invokes begingParsing.

Why are the changes needed?

To fix up a bug.

Does this PR introduce any user-facing change?

Yes. In a very rare case, when CsvToStructs is used as a sole predicate against an empty row, it might trigger NPE. This PR fixes it.

How was this patch tested?

Manually tested, but test case will be done in a separate PR. We should backport this to all branches.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label Feb 22, 2024
@HyukjinKwon
Copy link
Member Author

cc @wzhfy

@dongjoon-hyun
Copy link
Member

Feel free to land to the release branches too if needed, @HyukjinKwon . I'll leave this to you.

@HyukjinKwon
Copy link
Member Author

HyukjinKwon commented Feb 22, 2024

Merged to master, branch-3.5 and branch-3.4.

HyukjinKwon added a commit that referenced this pull request Feb 22, 2024
### What changes were proposed in this pull request?

This PR proposes to prevent `null` for `tokenizer.getContext`. This is similar with #28029. `getContext` seemingly via the univocity library, it can return null if `begingParsing` is not invoked (https://github.com/uniVocity/univocity-parsers/blob/master/src/main/java/com/univocity/parsers/common/AbstractParser.java#L53). This can happen when `parseLine` is not invoked at https://github.com/apache/spark/blob/e081f06ea401a2b6b8c214a36126583d35eaf55f/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala#L300 - `parseLine` invokes `begingParsing`.

### Why are the changes needed?

To fix up a bug.

### Does this PR introduce _any_ user-facing change?

Yes. In a very rare case, when `CsvToStructs` is used as a sole predicate against an empty row, it might trigger NPE. This PR fixes it.

### How was this patch tested?

Manually tested, but test case will be done in a separate PR. We should backport this to all branches.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45210 from HyukjinKwon/SPARK-47125.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
HyukjinKwon added a commit that referenced this pull request Feb 22, 2024
This PR proposes to prevent `null` for `tokenizer.getContext`. This is similar with #28029. `getContext` seemingly via the univocity library, it can return null if `begingParsing` is not invoked (https://github.com/uniVocity/univocity-parsers/blob/master/src/main/java/com/univocity/parsers/common/AbstractParser.java#L53). This can happen when `parseLine` is not invoked at https://github.com/apache/spark/blob/e081f06ea401a2b6b8c214a36126583d35eaf55f/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala#L300 - `parseLine` invokes `begingParsing`.

To fix up a bug.

Yes. In a very rare case, when `CsvToStructs` is used as a sole predicate against an empty row, it might trigger NPE. This PR fixes it.

Manually tested, but test case will be done in a separate PR. We should backport this to all branches.

No.

Closes #45210 from HyukjinKwon/SPARK-47125.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit a87015e)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
ericm-db pushed a commit to ericm-db/spark that referenced this pull request Mar 5, 2024
### What changes were proposed in this pull request?

This PR proposes to prevent `null` for `tokenizer.getContext`. This is similar with apache#28029. `getContext` seemingly via the univocity library, it can return null if `begingParsing` is not invoked (https://github.com/uniVocity/univocity-parsers/blob/master/src/main/java/com/univocity/parsers/common/AbstractParser.java#L53). This can happen when `parseLine` is not invoked at https://github.com/apache/spark/blob/e081f06ea401a2b6b8c214a36126583d35eaf55f/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala#L300 - `parseLine` invokes `begingParsing`.

### Why are the changes needed?

To fix up a bug.

### Does this PR introduce _any_ user-facing change?

Yes. In a very rare case, when `CsvToStructs` is used as a sole predicate against an empty row, it might trigger NPE. This PR fixes it.

### How was this patch tested?

Manually tested, but test case will be done in a separate PR. We should backport this to all branches.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#45210 from HyukjinKwon/SPARK-47125.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
snmvaughan pushed a commit to snmvaughan/spark that referenced this pull request Mar 26, 2024
This PR proposes to prevent `null` for `tokenizer.getContext`. This is similar with apache#28029. `getContext` seemingly via the univocity library, it can return null if `begingParsing` is not invoked (https://github.com/uniVocity/univocity-parsers/blob/master/src/main/java/com/univocity/parsers/common/AbstractParser.java#L53). This can happen when `parseLine` is not invoked at https://github.com/apache/spark/blob/e081f06ea401a2b6b8c214a36126583d35eaf55f/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala#L300 - `parseLine` invokes `begingParsing`.

To fix up a bug.

Yes. In a very rare case, when `CsvToStructs` is used as a sole predicate against an empty row, it might trigger NPE. This PR fixes it.

Manually tested, but test case will be done in a separate PR. We should backport this to all branches.

No.

Closes apache#45210 from HyukjinKwon/SPARK-47125.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit a87015e)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants