Empty lines should have a single empty field #208

tomwhoiscontrary · 2024-03-02T16:33:37Z

picocsv lists amongst its features that it:

follows the RFC4180 specification

RFC 4180 says (my emphasis):

Within the header and each record, there may be one or more fields, separated by commas.

And gives this ABNF production for a record:

record = field *(COMMA field)

I think it is clear from both of those that every record (line, in picocsv parlance) comprises at least one field. The grammar does not allow a record with no fields.

Therefore, even an empty line must contain at least one record. Because there are no characters in the line, that field must itself be empty. So, i would expect this test to pass:

    @Test
    void everyLineHasAtLeastOneField() throws IOException {
        String csv = "A\r\n"
                     + "\r\n"
                     + "B\r\n";

        try (Csv.Reader reader = Csv.Reader.of(Csv.Format.RFC4180, Csv.ReaderOptions.DEFAULT, new StringReader(csv))) {
            assertThat(reader.readLine()).isTrue();
            assertThat(reader.readField()).isTrue();
            assertThat(reader.toString()).isEqualTo("A");
            assertThat(reader.readField()).isFalse();

            assertThat(reader.readLine()).isTrue();
            assertThat(reader.readField()).isTrue(); // here
            assertThat(reader.toString()).isEqualTo("");
            assertThat(reader.readField()).isFalse();

            assertThat(reader.readLine()).isTrue();
            assertThat(reader.readField()).isTrue();
            assertThat(reader.toString()).isEqualTo("B");
            assertThat(reader.readField()).isFalse();

            assertThat(reader.readLine()).isFalse();
        }
    }

But, as of commit 3c9908c, it fails at the line marked // here.

This is not a purely theoretical point. Consider taking this file:

Country,Best Wine,Best Beer
France,Champagne,saison
UK,,bitter
Germany,Riesling,doppelbock

which is valid, and contains four rows of three fields each, then selecting the second column (as with cut -d , -f 2 on the unix command line). The result must also be valid, and contain four rows of one field each.

But with a natural use of the picocsv API, this is not what a user would see - rather, it seems like the file has one line with no fields at all.

I appreciate that this is a very pedantic point. But do you think this can be considered a bug?

I can imagine that some users might prefer the current semantics. And changing them is definitely not backward compatible. If you don't think the default behaviour here should be changed, would it make sense to add a flag to the ReaderOptions to opt in to the 'pedantic' interpretation?

Also, perhaps it would be useful to add a method like isEmptyLine() to the reader, that would return true if the line contains no text, regardless of whether that is interpeted as no fields, or one empty field. If so, perhaps this should also return true if the line contains an empty quoted string, like the second line here:

A
""
B

As i think that's semantically equivalent to the line containing no characters at all.

The text was updated successfully, but these errors were encountered:

charphi · 2024-03-04T17:18:59Z

@tomwhoiscontrary Thanks for this well documented issue.

I agree with you that picocsv should follow the RFC4180 spec closely.
So I'm planning to add an option to interpret an empty line as an empty single field in v2.x.
In v3.x this option will be the default.

charphi · 2024-03-04T17:26:25Z

I still have one problem.
If I take your example and switch the 2 last rows, I obtain this:

Country,Best Wine,Best Beer
France,Champagne,saison
Germany,Riesling,doppelbock
UK,,bitter

When selecting the second column, how do I interpret the last row if the end-of-line characters are absent as it is permitted by the spec ?

2.  The last record in the file may or may not have an ending line
     break.  For example:

     aaa,bbb,ccc CRLF
     zzz,yyy,xxx

tomwhoiscontrary · 2024-03-04T19:35:05Z

You're right, that case is ambiguous. I can't think of anything in the RFC that lets you choose one over the other.

In my opinion, I think practically, you have to interpret as a single record followed by end of file, not a record followed by an "empty" record, followed by end of file. Otherwise, every file ending with a line break (as is traditional on Unix) would grow such an empty record.

In this specific example, personally, I would want the file to contain Champagne Riesling . If someone elided the last , I would blame them for the ambiguity!

There is no good answer here. It's CSV!

charphi · 2024-03-11T20:49:15Z

Indeed :)

It seems that there is a new RFC in preparation and newline after last record will be mandatory.
See https://github.com/osiegmar/FastCSV/blob/main/doc/interpretation.md#newline-after-last-record

charphi · 2024-03-13T15:17:38Z

Done in https://github.com/nbbrd/picocsv/releases/tag/v2.3.0

charphi added the bug Something isn't working label Mar 4, 2024

charphi added enhancement New feature or request and removed bug Something isn't working labels Mar 13, 2024

charphi closed this as completed Mar 13, 2024

tomwhoiscontrary mentioned this issue Sep 2, 2024

add picocsv osiegmar/JavaCsvComparison#3

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Empty lines should have a single empty field #208

Empty lines should have a single empty field #208

tomwhoiscontrary commented Mar 2, 2024

charphi commented Mar 4, 2024

charphi commented Mar 4, 2024

tomwhoiscontrary commented Mar 4, 2024

charphi commented Mar 11, 2024

charphi commented Mar 13, 2024

Empty lines should have a single empty field #208

Empty lines should have a single empty field #208

Comments

tomwhoiscontrary commented Mar 2, 2024

charphi commented Mar 4, 2024

charphi commented Mar 4, 2024

tomwhoiscontrary commented Mar 4, 2024

charphi commented Mar 11, 2024

charphi commented Mar 13, 2024