Line numbers in error messages should refer to the raw file #2428

etienne-s · 2017-10-18T11:28:08Z

Consider the following example:

src1 <- 'col1,col2
a,b
c,d,e'

fread(src1)
# Error in fread(src1) : 
#   Line 3 from sampling jump 0 starting <<c,d,e>> has more than the expected 2 fields. Separator ',' occurs at position 4 which is character 2 of the last field: <<d,e>>. Consider setting 'comment.char=' if there is a trailing comment to be ignored.

The error message reports a problem on line 3, which is fine.

Now suppose b contains a line break:

src2 <- 'col1,col2
a,"b1 
b2"
c,d,e'

fread(src2)
# Error in fread(src2) : 
#   Line 3 from sampling jump 0 starting <<c,d,e>> has more than the expected 2 fields. Separator ',' occurs at position 4 which is character 2 of the last field: <<d,e>>. Consider setting 'comment.char=' if there is a trailing comment to be ignored.

The problem is reported to be on line 3, which is wrong.

This could seem harmless, but when reading a very large CSV with lots of records containing line breaks, the line number in the message is totally wrong, and we end up with no clue to fix the misformatted file.

sessionInfo()

R version 3.4.2 (2017-09-28)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=French_France.1252  LC_CTYPE=French_France.1252    LC_MONETARY=French_France.1252 LC_NUMERIC=C                  
[5] LC_TIME=French_France.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.10.5

loaded via a namespace (and not attached):
[1] compiler_3.4.2 tools_3.4.2

The text was updated successfully, but these errors were encountered:

st-pasha · 2017-10-18T16:54:00Z

Unfortunately it gets worse with the large file: during multi-chunk reading each chunk knows how to find the beginning of a "next good line", but has no idea what line number that is. As a result, it may report errors like "Expecting 3 cols but row 12 contains only 2 cols ...", where that 12 is relative to the current chunk not to the whole file (and therefore is less than useful).

mattdowle · 2018-03-03T09:27:31Z

Thanks @etienne-s for the great report.
This has been recently fixed in dev. Here's new behaviour which looks good to me now.

> src1 <- 'col1,col2
+ a,b
+ c,d,e'
> fread(src1)
   col1 col2
1:    a    b
Warning message:
In fread(src1) : Discarded single-line footer: <<c,d,e>>

> src2 <- 'col1,col2
+ a,"b1 
+ b2"
+ c,d,e'
> fread(src2)
   col1    col2
1:    a b1 \nb2
Warning message:
In fread(src2) : Discarded single-line footer: <<c,d,e>>
>

Also, the line numbers on errors in large files, which @st-pasha pointed out was less than useful before, I've now fixed as well. There's now a concept of a thread being at the head position, which corresponds to the last row written to the result, so that thread knows how many rows have passed. Only the thread at the head position now stops the team and throws the error, including an accurate row number. But, it is a row number and not a line number. If there are no embedded newlines, then they are the same and the line number reported will be correct (even when multi-chunked). But, in the event of embedded newlines, the error won't ever report a line number that could accurately be used with head -n. The error does include quite a lot of the line in error so you greping for a substring from it can often work.
Having said this i) if chunk boundaries fall within fields with many embedded newlines, it's much better handled now in dev. It does it's best and then ultimately the head position is now able to sweep up such awkward jump landings unambiguously since it has full lineage from the start of file. i.e. errors should not happen now due to jumping, so long as the file is valid when read from the start. and ii) on the todo list somewhere is, because I'd prefer not to try and track the number of newlines inside quoted fields (and therefore the true line number of error in that case), instead, fread could output the 2 or 3 lines either side of the line in error, automatically for you. Like a grep -A 2 -B 2 but automatically.

Please test latest dev again and open a new issue if any problems. I'm pretty sure it should be all fine now in this area. Thanks again.

st-pasha · 2018-03-04T02:54:29Z

I believe this is still a useful feature to have, and not very hard to implement after your work in #2627 .
All that is needed is to increment the thread-local line counter in the string field parser when we check if (*ch=='\n') { ... }. These line counters would then propagate from thread to thread in the same way as myDTi counter does.

The @etienne-s example can better be given as

> fread('C1,C2\na,"b1\nb2"\nc,d,e\nf,g\nh,k\n')
   C1     C2
1:  a b1\nb2
Warning message:
In fread("C1,C2\na,\"b1\nb2\"\nc,d,e\nf,g\nh,k\n") :
  Stopped early on line 3. Expected 2 fields but found 3. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<c,d,e>>

Here the warning message says "stopped early on line 3 <<c,d,e>>", however this is the 4th line in the file. Of course, this example is trivial, but in a multi-GB file the discrepancies between lines/rows may reach tens of thousands making it hard to find the error.

etienne-s · 2018-03-12T14:37:24Z

Thanks @mattdowle!

I've tested again and my particular problem is solved in the new version, since the new warning message includes a large part of the faulty line. (The old message included only the first 10 characters, and it was not enough to locate the problem with grep.)

In some other cases it might be useful to have an exact line number reported in the message instead of a row number.

mattdowle changed the title ~~[fread] Line numbers in error messages should refer to the raw file~~ Line numbers in error messages should refer to the raw file Oct 18, 2017

mattdowle added fread dev labels Oct 18, 2017

mattdowle closed this as completed Mar 3, 2018

mattdowle mentioned this issue Mar 3, 2018

Wrong line number in error message #2427

Closed

mattdowle reopened this Apr 22, 2018

mattdowle added this to the v1.11.2 milestone Apr 22, 2018

jangorecki modified the milestones: 1.12.0, 1.11.6 Jun 6, 2018

mattdowle modified the milestones: 1.11.6, 1.12.0 Sep 20, 2018

mattdowle modified the milestones: 1.12.0, 1.12.2 Jan 6, 2019

jangorecki modified the milestones: 1.12.2, 1.12.4 Jan 24, 2019

jangorecki removed the dev label Jan 29, 2019

jangorecki modified the milestones: 1.12.4, 1.13.0 Sep 17, 2019

mattdowle modified the milestones: 1.12.7, 1.12.9 Dec 8, 2019

mattdowle modified the milestones: 1.13.1, 1.13.3 Oct 17, 2020

jangorecki modified the milestones: 1.14.3, 1.14.5 Jul 19, 2022

jangorecki modified the milestones: 1.14.11, 1.15.1 Oct 29, 2023

jangorecki removed this from the 1.16.0 milestone Nov 6, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Line numbers in error messages should refer to the raw file #2428

Line numbers in error messages should refer to the raw file #2428

etienne-s commented Oct 18, 2017

st-pasha commented Oct 18, 2017

mattdowle commented Mar 3, 2018

st-pasha commented Mar 4, 2018

etienne-s commented Mar 12, 2018

Line numbers in error messages should refer to the raw file #2428

Line numbers in error messages should refer to the raw file #2428

Comments

etienne-s commented Oct 18, 2017

st-pasha commented Oct 18, 2017

mattdowle commented Mar 3, 2018

st-pasha commented Mar 4, 2018

etienne-s commented Mar 12, 2018