Incorrect column pruning #529

Open
MaxGekk opened this issue Jan 13, 2024 · 1 comment
MaxGekk commented Jan 13, 2024

On the following file:

"1","DE","","Yes"
"5",",","",","
"3","SA","","No"
"10","abcd""efgh"" \ndef","",""

When I select only index 0, I expect the values 1, 5, 3, 10 but get 1, 5, 10: the third row (3) is missing.

The following example reproduces the issue:

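    import com.univocity.parsers.csv.CsvFormat;
    import com.univocity.parsers.csv.CsvParser;
    import com.univocity.parsers.csv.CsvParserSettings;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.util.List;

    CsvParserSettings settings = new CsvParserSettings();
    CsvFormat format = settings.getFormat();
    format.setQuoteEscape('"');   // quotes inside a value are escaped by doubling them
    settings.selectIndexes(0);    // keep only the first column

    CsvParser parser = new CsvParser(settings);
    File initialFile = new File("test.csv");
    InputStream inputStream = new FileInputStream(initialFile);
    List<String[]> allLines = parser.parseAll(inputStream);

    // Print every parsed record followed by its selected values.
    int count = 0;
    for (String[] line : allLines) {
      System.out.println("Line " + ++count);
      for (String element : line) {
        System.out.println("\t" + element);
      }
      System.out.println();
    }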

The output is:

Line 1
	1

Line 2
	5

Line 3
	10

However, when I select at least three indexes (0, 1, 2), or remove the settings.selectIndexes(0) call, the output is correct:

    settings.selectIndexes(0, 1, 2);

Line 1
	1
	DE
	null

Line 2
	5
	,
	null

Line 3
	3
	SA
	null

Line 4
	10
	abcd"efgh" \ndef
	null
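
For reference, the expected first-column values can be checked independently of uniVocity. The following is a minimal sketch of a quote-aware CSV reader in plain Java, not uniVocity's implementation; the class name `CsvColumnCheck` and its methods are hypothetical. It assumes that `\n` in the fourth record is a literal backslash followed by `n` (as the printed output above suggests) and that quotes are escaped by doubling. Unlike uniVocity, it returns empty strings rather than null for empty values.

```java
import java.util.ArrayList;
import java.util.List;

public class CsvColumnCheck {

    // The four records from the report, joined with line feeds.
    // Assumption: "\ndef" is a literal backslash + 'n', not a line break.
    static final String SAMPLE = String.join("\n",
        "\"1\",\"DE\",\"\",\"Yes\"",
        "\"5\",\",\",\"\",\",\"",
        "\"3\",\"SA\",\"\",\"No\"",
        "\"10\",\"abcd\"\"efgh\"\" \\ndef\",\"\",\"\"");

    // Splits the text into records and fields, honoring quoted values.
    static List<List<String>> parse(String text) {
        List<List<String>> records = new ArrayList<>();
        List<String> record = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < text.length() && text.charAt(i + 1) == '"') {
                        field.append('"'); // doubled quote is a literal quote
                        i++;
                    } else {
                        inQuotes = false;  // closing quote
                    }
                } else {
                    field.append(c);       // commas and newlines are data here
                }
            } else if (c == '"') {
                inQuotes = true;
            } else if (c == ',') {
                record.add(field.toString());
                field.setLength(0);
            } else if (c == '\n') {
                record.add(field.toString());
                field.setLength(0);
                records.add(record);
                record = new ArrayList<>();
            } else if (c != '\r') {
                field.append(c);
            }
        }
        // Flush the last record when the input has no trailing newline.
        if (field.length() > 0 || !record.isEmpty()) {
            record.add(field.toString());
            records.add(record);
        }
        return records;
    }

    // Returns the value at the given index from every record (null if absent).
    static List<String> column(String text, int index) {
        List<String> values = new ArrayList<>();
        for (List<String> rec : parse(text)) {
            values.add(index < rec.size() ? rec.get(index) : null);
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(column(SAMPLE, 0)); // prints [1, 5, 3, 10]
    }
}
```

Every record in the sample yields four fields under this reading, so selecting index 0 should return one value per record, including the row with 3.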

MaxGekk commented Jan 13, 2024

We ran into this issue in Apache Spark, where the column pruning feature is enabled by default in the CSV datasource. It would be better to fix the issue in uniVocity than to disable the feature by default. cc @cloud-fan @HyukjinKwon

MaxGekk added a commit to apache/spark that referenced this issue Jan 26, 2024
### What changes were proposed in this pull request?
In the PR, I propose to disable the column pruning feature in the CSV datasource for the `multiLine` mode.

### Why are the changes needed?
To work around the issue in the `uniVocity` parser used by the CSV datasource: uniVocity/univocity-parsers#529

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
By running the affected test suites:
```
$ build/sbt "test:testOnly *CSVv1Suite"
$ build/sbt "test:testOnly *CSVv2Suite"
$ build/sbt "test:testOnly *CSVLegacyTimeParserSuite"
$ build/sbt "testOnly *.CsvFunctionsSuite"
```

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #44872 from MaxGekk/csv-disable-column-pruning.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
MaxGekk added a commit to apache/spark that referenced this issue Jan 26, 2024
(cherry picked from commit 829e742)
MaxGekk added a commit to apache/spark that referenced this issue Jan 26, 2024
(cherry picked from commit 829e742)
szehon-ho pushed a commit to szehon-ho/spark that referenced this issue Feb 7, 2024
(cherry picked from commit 829e742)