[SPARK-17583][SQL] Remove uesless rowSeparator variable and set auto-expanding buffer as default for maxCharsPerColumn option in CSV #15138

HyukjinKwon · 2016-09-18T09:16:46Z

What changes were proposed in this pull request?

This PR includes the changes below:

Upgrade Univocity library from 2.1.1 to 2.2.1

This includes some performance improvement and also enabling auto-extending buffer in maxCharsPerColumn option in CSV. Please refer the release notes.
Remove useless rowSeparator variable existing in CSVOptions

We have this unused variable in CSVOptions.scala#L127 but it seems possibly causing confusion that it actually does not care of \r\n. For example, we have an issue open about this, SPARK-17227, describing this variable.

This variable is virtually not being used because we rely on LineRecordReader in Hadoop which deals with only both \n and \r\n.
Set the default value of maxCharsPerColumn to auto-expending.

We are setting 1000000 for the length of each column. It'd be more sensible we allow auto-expending rather than fixed length by default.

To make sure, using -1 is being described in the release note, 2.2.0.

How was this patch tested?

N/A

…default for maxCharsPerColumn option in CSV

HyukjinKwon · 2016-09-18T09:24:15Z

Do you mind if I ask to review this @holdenk and @srowen ? I remember bumping up the version was reviewed by you both.

SparkQA · 2016-09-18T11:02:00Z

Test build #65566 has finished for PR 15138 at commit ee04aad.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-09-18T13:18:36Z

Test build #65569 has finished for PR 15138 at commit 70cb567.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2016-09-18T15:57:15Z

I think you know this area better than anyone. The update seems reasonable. So the only substantive change is to not limit the maximum length of a line? I tend to agree with removing arbitrary limits (especially if one can still set them if desired) but was this set for a particular reason before, like, guarding against someone parsing non-CSV and chewing up a load of memory?

SparkQA · 2016-09-18T16:34:57Z

Test build #65575 has finished for PR 15138 at commit 70cb567.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2016-09-19T00:26:37Z

Yes, you are right and also yes, the purpose of the setting is to prevent OOM (documentation). I believe this limit was initially set by @falaki and I remember I had a positive answer when I try to increase this value.

If this is a normal case, it'd make sense if we set explicit limit because it is possible to try to read a whole file as a value within a column. However, I guess we are already reading and parsing line by line via LineRecordReader and via CsvReader.parseLine(line: String). Therefore, I think the limit can't exceed the length of each line which I think is okay as a default value.

BTW, Apache common CSV does not have this limit IIRU.

srowen

It seems OK to me. Even with LineRecordReader it's possible you read some binary file with no real line separators in sight but that's a corner, error case anyway.

srowen · 2016-09-21T09:35:49Z

Merged to master

HyukjinKwon · 2016-09-21T09:39:03Z

Thanks!

Remove unused rowSeparator variable and set auto-expanding buffer as …

ee04aad

…default for maxCharsPerColumn option in CSV

Adds the comment for default value

70cb567

HyukjinKwon closed this Sep 18, 2016

HyukjinKwon deleted the SPARK-17583 branch September 18, 2016 14:05

HyukjinKwon restored the SPARK-17583 branch September 18, 2016 14:05

HyukjinKwon reopened this Sep 18, 2016

srowen approved these changes Sep 19, 2016

View reviewed changes

asfgit closed this in 25a020b Sep 21, 2016

HyukjinKwon deleted the SPARK-17583 branch January 2, 2018 03:39

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[SPARK-17583][SQL] Remove uesless rowSeparator variable and set auto-expanding buffer as default for maxCharsPerColumn option in CSV #15138

[SPARK-17583][SQL] Remove uesless rowSeparator variable and set auto-expanding buffer as default for maxCharsPerColumn option in CSV #15138

Uh oh!

HyukjinKwon commented Sep 18, 2016 •

edited

Loading

Uh oh!

HyukjinKwon commented Sep 18, 2016

Uh oh!

SparkQA commented Sep 18, 2016

Uh oh!

SparkQA commented Sep 18, 2016

Uh oh!

srowen commented Sep 18, 2016

Uh oh!

SparkQA commented Sep 18, 2016

Uh oh!

HyukjinKwon commented Sep 19, 2016 •

edited

Loading

Uh oh!

srowen left a comment

Uh oh!

srowen commented Sep 21, 2016

Uh oh!

HyukjinKwon commented Sep 21, 2016

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

[SPARK-17583][SQL] Remove uesless rowSeparator variable and set auto-expanding buffer as default for maxCharsPerColumn option in CSV #15138

[SPARK-17583][SQL] Remove uesless rowSeparator variable and set auto-expanding buffer as default for maxCharsPerColumn option in CSV #15138

Uh oh!

Conversation

HyukjinKwon commented Sep 18, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this pull request?

How was this patch tested?

Uh oh!

HyukjinKwon commented Sep 18, 2016

Uh oh!

SparkQA commented Sep 18, 2016

Uh oh!

SparkQA commented Sep 18, 2016

Uh oh!

srowen commented Sep 18, 2016

Uh oh!

SparkQA commented Sep 18, 2016

Uh oh!

HyukjinKwon commented Sep 19, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

srowen left a comment

Choose a reason for hiding this comment

Uh oh!

srowen commented Sep 21, 2016

Uh oh!

HyukjinKwon commented Sep 21, 2016

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

HyukjinKwon commented Sep 18, 2016 •

edited

Loading

HyukjinKwon commented Sep 19, 2016 •

edited

Loading