Incorrect column pruning #529

MaxGekk · 2024-01-13T08:57:56Z

On the following file:

"1","DE","","Yes"
"5",",","",","
"3","SA","","No"
"10","abcd""efgh"" \ndef","",""

when I select only index 0, I expect the values 1, 5, 3, 10 but get 1, 5, 10; the row starting with 3 is dropped.

Here is the example which reproduces the issue:

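    import com.univocity.parsers.csv.CsvFormat;
    import com.univocity.parsers.csv.CsvParser;
    import com.univocity.parsers.csv.CsvParserSettings;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.util.List;

    // Configure the parser: '"' escapes quotes, and only column 0 is selected.
    CsvParserSettings settings = new CsvParserSettings();
    CsvFormat format = settings.getFormat();
    format.setQuoteEscape('"');
    settings.selectIndexes(0);

    // Parse the whole file into a list of rows.
    CsvParser parser = new CsvParser(settings);
    File initialFile = new File("test.csv");
    InputStream inputStream = new FileInputStream(initialFile);
    List<String[]> allLines = parser.parseAll(inputStream);

    // Print every parsed row with its selected values.
    int count = 0;
    for (String[] line : allLines) {
      System.out.println("Line " + ++count);
      for (String element : line) {
        System.out.println("\t" + element);
      }
      System.out.println();
    }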

The output contains only three rows:

Line 1
	1

Line 2
	5

Line 3
	10

However, when I select at least three indexes (0, 1, 2) or remove the settings.selectIndexes(0) call, the output is correct:

    settings.selectIndexes(0, 1, 2);

Line 1
	1
	DE
	null

Line 2
	5
	,
	null

Line 3
	3
	SA
	null

Line 4
	10
	abcd"efgh" \ndef
	null
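
To make the reproduction easier to run, the input file can be generated instead of typed by hand. The following is a minimal sketch that uses only the JDK; the class name `WriteTestCsv` and its `rows()` and `write()` helpers are hypothetical and not part of the report. It assumes that the `\n` in the last row is a literal backslash followed by `n` rather than a real line break, which is what the printed output above suggests.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

public class WriteTestCsv {
    // The four rows from the report, with Java string escaping applied.
    // Assumption: "\\n" below is a literal backslash + 'n', not a newline.
    static List<String> rows() {
        return Arrays.asList(
            "\"1\",\"DE\",\"\",\"Yes\"",
            "\"5\",\",\",\"\",\",\"",
            "\"3\",\"SA\",\"\",\"No\"",
            "\"10\",\"abcd\"\"efgh\"\" \\ndef\",\"\",\"\"");
    }

    // Writes one row per line, using the platform line separator.
    static Path write(Path target) throws IOException {
        return Files.write(target, rows(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path written = write(Paths.get("test.csv"));
        System.out.println("Wrote " + written.toAbsolutePath());
    }
}
```

Running `main` creates `test.csv` in the working directory, which is the path the snippet above reads from.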

MaxGekk · 2024-01-13T09:01:56Z

We ran into this issue in Apache Spark, where the column pruning feature is enabled by default in the CSV datasource. It would be nice to fix the issue in uniVocity instead of disabling the feature by default. cc @cloud-fan @HyukjinKwon

### What changes were proposed in this pull request?
In the PR, I propose to disable the column pruning feature in the CSV datasource for the `multiLine` mode.

### Why are the changes needed?
To workaround the issue in the `uniVocity` parser used by the CSV datasource: uniVocity/univocity-parsers#529

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
By running the affected test suites:
```
$ build/sbt "test:testOnly *CSVv1Suite"
$ build/sbt "test:testOnly *CSVv2Suite"
$ build/sbt "test:testOnly *CSVLegacyTimeParserSuite"
$ build/sbt "testOnly *.CsvFunctionsSuite"
```

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #44872 from MaxGekk/csv-disable-column-pruning.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>

### What changes were proposed in this pull request?
In the PR, I propose to disable the column pruning feature in the CSV datasource for the `multiLine` mode.

### Why are the changes needed?
To workaround the issue in the `uniVocity` parser used by the CSV datasource: uniVocity/univocity-parsers#529

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
By running the affected test suites:
```
$ build/sbt "test:testOnly *CSVv1Suite"
$ build/sbt "test:testOnly *CSVv2Suite"
$ build/sbt "test:testOnly *CSVLegacyTimeParserSuite"
$ build/sbt "testOnly *.CsvFunctionsSuite"
```

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #44872 from MaxGekk/csv-disable-column-pruning.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit 829e742)
Signed-off-by: Max Gekk <max.gekk@gmail.com>

MaxGekk mentioned this issue Jan 24, 2024

[SPARK-46862][SQL] Disable CSV column pruning in the multi-line mode apache/spark#44872

Closed

mythrocks mentioned this issue Mar 12, 2024

[AUDIT][TASK] Test CSV Reader for multi-line mode NVIDIA/spark-rapids#10577

Closed
