[SPARK-17583][SQL] Remove useless rowSeparator variable and set auto-expanding buffer as default for maxCharsPerColumn option in CSV #15138
Test build #65566 has finished for PR 15138 at commit
Test build #65569 has finished for PR 15138 at commit
I think you know this area better than anyone. The update seems reasonable. So the only substantive change is to not limit the maximum length of a line? I tend to agree with removing arbitrary limits (especially if one can still set them if desired) but was this set for a particular reason before, like, guarding against someone parsing non-CSV and chewing up a load of memory?
Test build #65575 has finished for PR 15138 at commit
Yes, you are right, and yes, the purpose of the setting is to prevent OOM (documentation). I believe this limit was initially set by @falaki, and I remember getting a positive answer when I tried to increase this value. If this were a normal case, it would make sense to set an explicit limit, because it is possible to end up reading a whole file as a single value within a column. However, I guess we are already reading and parsing line by line via LineRecordReader. BTW, Apache Commons CSV does not have this limit IIRC.
srowen
left a comment
It seems OK to me. Even with LineRecordReader it's possible you read some binary file with no real line separators in sight but that's a corner, error case anyway.
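The trade-off discussed in this thread can be illustrated with a minimal sketch. It is not Univocity's or Spark's implementation: `split_fields`, `ColumnTooLongError`, and `AUTO_EXPAND` are hypothetical names, quoting and escaping are ignored, and the only behavior borrowed from the PR is that `-1` means "no fixed limit" while a positive value caps the characters in one column.

```python
from typing import List

# Assumed sentinel, mirroring the -1 value described in the PR.
AUTO_EXPAND = -1


class ColumnTooLongError(Exception):
    """Raised when a single field exceeds the configured character limit."""


def split_fields(line: str,
                 max_chars_per_column: int = AUTO_EXPAND,
                 delimiter: str = ",") -> List[str]:
    """Split one line into fields, optionally capping each field's length.

    Simplified on purpose: no quoting or escaping is handled.
    """
    fields: List[str] = []
    current: List[str] = []  # grows as needed, like an auto-expanding buffer
    for ch in line:
        if ch == delimiter:
            fields.append("".join(current))
            current = []
            continue
        # With a fixed limit, fail fast instead of buffering unbounded input.
        if (max_chars_per_column != AUTO_EXPAND
                and len(current) >= max_chars_per_column):
            raise ColumnTooLongError(
                f"column exceeds {max_chars_per_column} characters")
        current.append(ch)
    fields.append("".join(current))
    return fields
```

With the default, a very long field is simply buffered; with a fixed limit, the same input fails early, which is the guard against runaway memory use mentioned above. In Spark itself the value is passed as a reader option, for example `.option("maxCharsPerColumn", "-1")`.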
Merged to master
Thanks!
What changes were proposed in this pull request?
This PR includes the changes below:
- Upgrade the Univocity library from 2.1.1 to 2.2.1.
  This brings some performance improvements and also enables an auto-expanding buffer for the `maxCharsPerColumn` option in CSV. Please refer to the release notes.
- Remove the useless `rowSeparator` variable in `CSVOptions`.
  We have this unused variable in CSVOptions.scala#L127, and it can cause confusion because it does not actually handle `\r\n`. For example, there is an open issue about this variable, SPARK-17227.
  The variable is effectively unused because we rely on Hadoop's `LineRecordReader`, which handles only `\n` and `\r\n`.
- Set the default value of `maxCharsPerColumn` to auto-expanding.
  We currently set 1000000 as the maximum length of each column. It is more sensible to allow auto-expanding by default rather than a fixed length.
  For reference, using `-1` for this purpose is described in the 2.2.0 release notes.
How was this patch tested?
N/A
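As a rough illustration of why a separate `rowSeparator` setting has no effect, the sketch below splits input into records the way the description says the line reader does, treating both `\n` and `\r\n` as terminators before any CSV parsing happens. The function name `iter_records` is hypothetical and this is not Hadoop's `LineRecordReader` code; it only mirrors the behavior stated in the PR.

```python
from typing import Iterator


def iter_records(data: str) -> Iterator[str]:
    """Yield records split on "\n", dropping a trailing "\r" if present.

    Records are already separated here, so a CSV-level row separator
    setting would never be consulted.
    """
    start = 0
    while start < len(data):
        end = data.find("\n", start)
        if end == -1:
            # Last record without a terminating newline.
            yield data[start:]
            return
        line = data[start:end]
        if line.endswith("\r"):
            line = line[:-1]  # treat "\r\n" the same as "\n"
        yield line
        start = end + 1
```

Each yielded record would then be handed to the CSV parser on its own, which is why the per-column limit applies to one line's worth of input at a time.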