Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Empty lines should have a single empty field #208

Closed
tomwhoiscontrary opened this issue Mar 2, 2024 · 5 comments
Closed

Empty lines should have a single empty field #208

tomwhoiscontrary opened this issue Mar 2, 2024 · 5 comments
Labels
enhancement New feature or request

Comments

@tomwhoiscontrary
Copy link

picocsv lists amongst its features that it:

  • follows the RFC4180 specification

RFC 4180 says (my emphasis):

  1. Within the header and each record, there may be one or more fields, separated by commas.

And gives this ABNF production for a record:

record = field *(COMMA field)

I think it is clear from both of those that every record (line, in picocsv parlance) comprises at least one field. The grammar does not allow a record with no fields.

Therefore, even an empty line must contain at least one record. Because there are no characters in the line, that field must itself be empty. So, i would expect this test to pass:

    @Test
    void everyLineHasAtLeastOneField() throws IOException {
        String csv = "A\r\n"
                     + "\r\n"
                     + "B\r\n";

        try (Csv.Reader reader = Csv.Reader.of(Csv.Format.RFC4180, Csv.ReaderOptions.DEFAULT, new StringReader(csv))) {
            assertThat(reader.readLine()).isTrue();
            assertThat(reader.readField()).isTrue();
            assertThat(reader.toString()).isEqualTo("A");
            assertThat(reader.readField()).isFalse();

            assertThat(reader.readLine()).isTrue();
            assertThat(reader.readField()).isTrue(); // here
            assertThat(reader.toString()).isEqualTo("");
            assertThat(reader.readField()).isFalse();

            assertThat(reader.readLine()).isTrue();
            assertThat(reader.readField()).isTrue();
            assertThat(reader.toString()).isEqualTo("B");
            assertThat(reader.readField()).isFalse();

            assertThat(reader.readLine()).isFalse();
        }
    }

But, as of commit 3c9908c, it fails at the line marked // here.

This is not a purely theoretical point. Consider taking this file:

Country,Best Wine,Best Beer
France,Champagne,saison
UK,,bitter
Germany,Riesling,doppelbock

which is valid, and contains four rows of three fields each, then selecting the second column (as with cut -d , -f 2 on the unix command line). The result must also be valid, and contain four rows of one field each.

But with a natural use of the picocsv API, this is not what a user would see - rather, it seems like the file has one line with no fields at all.

I appreciate that this is a very pedantic point. But do you think this can be considered a bug?

I can imagine that some users might prefer the current semantics. And changing them is definitely not backward compatible. If you don't think the default behaviour here should be changed, would it make sense to add a flag to the ReaderOptions to opt in to the 'pedantic' interpretation?

Also, perhaps it would be useful to add a method like isEmptyLine() to the reader, that would return true if the line contains no text, regardless of whether that is interpeted as no fields, or one empty field. If so, perhaps this should also return true if the line contains an empty quoted string, like the second line here:

A
""
B

As i think that's semantically equivalent to the line containing no characters at all.

@charphi
Copy link
Member

charphi commented Mar 4, 2024

@tomwhoiscontrary Thanks for this well documented issue.

I agree with you that picocsv should follow the RFC4180 spec closely.
So I'm planning to add an option to interpret an empty line as an empty single field in v2.x.
In v3.x this option will be the default.

@charphi charphi added the bug Something isn't working label Mar 4, 2024
@charphi
Copy link
Member

charphi commented Mar 4, 2024

I still have one problem.
If I take your example and switch the 2 last rows, I obtain this:

Country,Best Wine,Best Beer
France,Champagne,saison
Germany,Riesling,doppelbock
UK,,bitter

When selecting the second column, how do I interpret the last row if the end-of-line characters are absent as it is permitted by the spec ?

2.  The last record in the file may or may not have an ending line
     break.  For example:

     aaa,bbb,ccc CRLF
     zzz,yyy,xxx

@tomwhoiscontrary
Copy link
Author

You're right, that case is ambiguous. I can't think of anything in the RFC that lets you choose one over the other.

In my opinion, I think practically, you have to interpret as a single record followed by end of file, not a record followed by an "empty" record, followed by end of file. Otherwise, every file ending with a line break (as is traditional on Unix) would grow such an empty record.

In this specific example, personally, I would want the file to contain Champagne Riesling . If someone elided the last , I would blame them for the ambiguity!

There is no good answer here. It's CSV!

@charphi
Copy link
Member

charphi commented Mar 11, 2024

Indeed :)

It seems that there is a new RFC in preparation and newline after last record will be mandatory.
See https://github.com/osiegmar/FastCSV/blob/main/doc/interpretation.md#newline-after-last-record

@charphi charphi added enhancement New feature or request and removed bug Something isn't working labels Mar 13, 2024
@charphi
Copy link
Member

charphi commented Mar 13, 2024

Done in https://github.com/nbbrd/picocsv/releases/tag/v2.3.0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants