Conversation

@HyukjinKwon (Member) commented Sep 18, 2016

What changes were proposed in this pull request?

This PR includes the changes below:

  1. Upgrade Univocity library from 2.1.1 to 2.2.1

    This includes some performance improvements and also enables the auto-extending buffer for the maxCharsPerColumn option in CSV. Please refer to the release notes.

  2. Remove the useless rowSeparator variable in CSVOptions

    We have this unused variable in CSVOptions.scala#L127, and it can cause confusion because it does not actually handle \r\n. For example, there is an open issue about this, SPARK-17227, describing this variable.

    This variable is effectively unused because we rely on Hadoop's LineRecordReader, which handles both \n and \r\n.

  3. Set the default value of maxCharsPerColumn to auto-expanding.

    We currently set 1000000 as the maximum length of each column. It would be more sensible to allow auto-expanding rather than a fixed length by default (users who want a hard cap can still set one explicitly; see the sketch after this list).

    To be sure, using -1 for auto-expanding is described in the 2.2.0 release notes.
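
As a side note, here is a minimal sketch of how a user can still cap the per-column buffer explicitly once auto-expanding becomes the default. The `spark` session, path, and option values below are illustrative assumptions, not part of this PR:

```scala
// Sketch: reading CSV with an explicit per-column character cap.
// "maxCharsPerColumn" is the existing CSV option; the path and values are examples.
val df = spark.read
  .option("header", "true")
  .option("maxCharsPerColumn", "1000000") // restore the previous fixed limit if desired
  .csv("/path/to/data.csv")

// Omitting the option (or passing -1) would let the parser auto-expand
// the buffer instead of failing on long column values.
```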

How was this patch tested?

N/A

@HyukjinKwon (Member, Author) commented:
Do you mind reviewing this, @holdenk and @srowen? I remember the last version bump was reviewed by you both.

@SparkQA commented Sep 18, 2016

Test build #65566 has finished for PR 15138 at commit ee04aad.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon HyukjinKwon changed the title [SPARK-17583][SQL] Remove unused rowSeparator variable and set auto-expanding buffer as default for maxCharsPerColumn option in CSV [SPARK-17583][SQL] Remove u rowSeparator variable and set auto-expanding buffer as default for maxCharsPerColumn option in CSV Sep 18, 2016
@HyukjinKwon HyukjinKwon changed the title [SPARK-17583][SQL] Remove u rowSeparator variable and set auto-expanding buffer as default for maxCharsPerColumn option in CSV [SPARK-17583][SQL] Remove uesless rowSeparator variable and set auto-expanding buffer as default for maxCharsPerColumn option in CSV Sep 18, 2016
@SparkQA commented Sep 18, 2016

Test build #65569 has finished for PR 15138 at commit 70cb567.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon HyukjinKwon deleted the SPARK-17583 branch September 18, 2016 14:05
@HyukjinKwon HyukjinKwon restored the SPARK-17583 branch September 18, 2016 14:05
@HyukjinKwon HyukjinKwon reopened this Sep 18, 2016
@srowen (Member) commented Sep 18, 2016

I think you know this area better than anyone. The update seems reasonable. So the only substantive change is to not limit the maximum length of a line? I tend to agree with removing arbitrary limits (especially if one can still set them if desired) but was this set for a particular reason before, like, guarding against someone parsing non-CSV and chewing up a load of memory?

@SparkQA commented Sep 18, 2016

Test build #65575 has finished for PR 15138 at commit 70cb567.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member, Author) commented Sep 19, 2016

Yes, you are right, and also yes, the purpose of the setting is to prevent OOM (documentation). I believe this limit was initially set by @falaki, and I remember I got a positive answer when I tried to increase this value.

In the general case, it would make sense to set an explicit limit, because a parser could otherwise try to read a whole file as a single column value. However, we already read and parse line by line via LineRecordReader and CsvReader.parseLine(line: String). Therefore, the buffer cannot grow beyond the length of each line, which I think is okay as a default.

BTW, Apache Commons CSV does not have this limit, IIUC.
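
To illustrate the parser-level behavior being discussed, here is a small sketch against the univocity-parsers 2.2.x API; it is illustrative only, not the exact code in CSVOptions:

```scala
import com.univocity.parsers.csv.{CsvParser, CsvParserSettings}

// Sketch: -1 asks univocity 2.2.x to auto-expand the per-column buffer,
// while a positive value keeps a fixed cap as a guard against OOM.
val settings = new CsvParserSettings()
settings.setMaxCharsPerColumn(-1)          // auto-expanding (the proposed default)
// settings.setMaxCharsPerColumn(1000000)  // the previous fixed default

val parser = new CsvParser(settings)
// Spark feeds the parser one line at a time (via Hadoop's LineRecordReader),
// so a column can never grow beyond the length of a single input line.
val row: Array[String] = parser.parseLine("a,b,c")
```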

@srowen (Member) left a review comment:

It seems OK to me. Even with LineRecordReader it's possible you read some binary file with no real line separators in sight, but that's a corner/error case anyway.

@srowen (Member) commented Sep 21, 2016

Merged to master

@asfgit asfgit closed this in 25a020b Sep 21, 2016
@HyukjinKwon (Member, Author) commented:
Thanks!

@HyukjinKwon HyukjinKwon deleted the SPARK-17583 branch January 2, 2018 03:39