Empty lines should have a single empty field #208
Comments
@tomwhoiscontrary Thanks for this well-documented issue. I agree with you that picocsv should follow the RFC 4180 spec closely.
I still have one problem.
When selecting the second column, how do I interpret the last row if the end-of-line characters are absent, as the spec permits?
You're right, that case is ambiguous. I can't think of anything in the RFC that lets you choose one over the other. Practically, I think you have to interpret it as a single record followed by end of file, not a record followed by an "empty" record, followed by end of file. Otherwise, every file ending with a line break (as is traditional on Unix) would grow such an empty record. In this specific example, personally, I would want the file to contain Champagne Riesling . If someone elided the last , I would blame them for the ambiguity! There is no good answer here. It's CSV!
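To make the interpretation above concrete, here is a minimal sketch in Java. It is not picocsv code: `RecordSplitter` and `records` are hypothetical names, and it assumes no quoted field contains a line break. A line break terminates a record, and any text left over at end of file counts as one more record.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of picocsv.
final class RecordSplitter {
    // Splits text into records. A line break (LF or CRLF) ends a record;
    // a final record without a line break is still returned.
    // Assumption: no quoted field contains a line break.
    static List<String> records(String text) {
        List<String> records = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == '\n') {
                // Drop the CR of a CRLF pair, if present.
                int end = (i > start && text.charAt(i - 1) == '\r') ? i - 1 : i;
                records.add(text.substring(start, end));
                start = i + 1;
            }
        }
        if (start < text.length()) {
            // Leftover text at end of file is a record with no line break.
            records.add(text.substring(start));
        }
        return records;
    }
}
```

Under this reading, a file with and without the final line break yields the same records, and an empty last record is only visible when it has its own line break. That is exactly where the ambiguity sits: an empty final record with no line break cannot be told apart from no record at all.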
Indeed :) It seems that there is a new RFC in preparation, and a newline after the last record will be mandatory.
picocsv lists amongst its features that it:
RFC 4180 says (my emphasis):
And gives this ABNF production for a record:
I think it is clear from both of those that every record (line, in picocsv parlance) comprises at least one field. The grammar does not allow a record with no fields.
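As an illustration of that reading of the grammar, here is a minimal sketch in Java of a splitter for a single record. It is not how picocsv parses: `RecordFields` and `split` are hypothetical names, and the input is assumed to be one record with its line break already removed. Because a record is a field followed by zero or more comma-plus-field pairs, the final field is always emitted, even when no characters were read.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of picocsv.
final class RecordFields {
    // Splits one record (line break already removed) into its fields.
    // Follows the shape "field, then zero or more (comma, field)".
    static List<String> split(String record) {
        List<String> fields = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean quoted = false;
        for (int i = 0; i < record.length(); i++) {
            char c = record.charAt(i);
            if (quoted) {
                if (c == '"') {
                    if (i + 1 < record.length() && record.charAt(i + 1) == '"') {
                        field.append('"'); // doubled quote is an escaped quote
                        i++;
                    } else {
                        quoted = false; // closing quote
                    }
                } else {
                    field.append(c);
                }
            } else if (c == '"') {
                quoted = true; // opening quote
            } else if (c == ',') {
                fields.add(field.toString());
                field.setLength(0);
            } else {
                field.append(c);
            }
        }
        // The last field always exists, so an empty record has one empty field.
        fields.add(field.toString());
        return fields;
    }
}
```

With this sketch, an empty string produces a list holding a single empty field, never an empty list.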
Therefore, even an empty line must contain at least one field. Because there are no characters in the line, that field must itself be empty. So, I would expect this test to pass:
But, as of commit 3c9908c, it fails at the line marked `// here`.
This is not a purely theoretical point. Consider taking this file:
which is valid, and contains four rows of three fields each, then selecting the second column (as with `cut -d , -f 2` on the Unix command line). The result must also be valid, and contain four rows of one field each.
But with a natural use of the picocsv API, this is not what a user would see; rather, it seems like the file has one line with no fields at all.
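The column selection can be sketched in Java as follows. This is only an illustration: `SecondColumn` and `select` are hypothetical names, the sample rows in the usage note are made up rather than taken from the file above, and the sketch assumes unquoted fields and that every record has a field at the requested index.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of picocsv.
final class SecondColumn {
    // Keeps only the field at the given index from each record,
    // roughly what `cut -d , -f 2` does for index 1.
    // Assumptions: fields are unquoted and every record is long enough.
    static List<String> select(List<String> records, int index) {
        List<String> out = new ArrayList<>();
        for (String record : records) {
            // Limit -1 keeps trailing empty fields instead of dropping them.
            String[] fields = record.split(",", -1);
            out.add(fields[index]);
        }
        return out;
    }
}
```

For made-up rows such as `1,a,x`, `2,,y`, `3,c,z` and `4,d,w`, selecting index 1 gives four one-field records, the second of which is an empty line. Under the strict reading of the RFC, a reader of that output should still report four records with one field each.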
I appreciate that this is a very pedantic point. But do you think this can be considered a bug?
I can imagine that some users might prefer the current semantics. And changing them is definitely not backward compatible. If you don't think the default behaviour here should be changed, would it make sense to add a flag to the ReaderOptions to opt in to the 'pedantic' interpretation?
Also, perhaps it would be useful to add a method like `isEmptyLine()` to the reader, that would return true if the line contains no text, regardless of whether that is interpreted as no fields, or one empty field. If so, perhaps this should also return true if the line contains an empty quoted string, like the second line here:
I think that's semantically equivalent to the line containing no characters at all.
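A minimal sketch of the proposed check, in Java. This is not an existing picocsv method: `EmptyLines` is a hypothetical class, and the sketch works on the raw text of a line (line break removed) rather than on reader state, which is an assumption made to keep it self-contained.

```java
// Hypothetical helper, not part of picocsv.
final class EmptyLines {
    // Returns true when the raw line carries no text: either it has no
    // characters at all, or it is exactly one empty quoted string ("").
    static boolean isEmptyLine(String rawLine) {
        return rawLine.isEmpty() || rawLine.equals("\"\"");
    }
}
```

A line holding only a comma is not empty under this definition, since it contains two (empty) fields.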