Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Line numbers in error messages should refer to the raw file #2428

Open
etienne-s opened this issue Oct 18, 2017 · 4 comments
Open

Line numbers in error messages should refer to the raw file #2428

etienne-s opened this issue Oct 18, 2017 · 4 comments
Labels

Comments

@etienne-s
Copy link

Consider the following example:

src1 <- 'col1,col2
a,b
c,d,e'

fread(src1)
# Error in fread(src1) : 
#   Line 3 from sampling jump 0 starting <<c,d,e>> has more than the expected 2 fields. Separator ',' occurs at position 4 which is character 2 of the last field: <<d,e>>. Consider setting 'comment.char=' if there is a trailing comment to be ignored.

The error message reports a problem on line 3, which is fine.

Now suppose b contains a line break:

src2 <- 'col1,col2
a,"b1 
b2"
c,d,e'

fread(src2)
# Error in fread(src2) : 
#   Line 3 from sampling jump 0 starting <<c,d,e>> has more than the expected 2 fields. Separator ',' occurs at position 4 which is character 2 of the last field: <<d,e>>. Consider setting 'comment.char=' if there is a trailing comment to be ignored.

The problem is reported to be on line 3, which is wrong.

This could seem harmless, but when reading a very large CSV with lots of records containing line breaks, the line number in the message is totally wrong, and we end up with no clue to fix the misformatted file.

sessionInfo()

R version 3.4.2 (2017-09-28)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=French_France.1252  LC_CTYPE=French_France.1252    LC_MONETARY=French_France.1252 LC_NUMERIC=C                  
[5] LC_TIME=French_France.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.10.5

loaded via a namespace (and not attached):
[1] compiler_3.4.2 tools_3.4.2   
@st-pasha
Copy link
Contributor

Unfortunately it gets worse with the large file: during multi-chunk reading each chunk knows how to find the beginning of a "next good line", but has no idea what line number that is. As a result, it may report errors like "Expecting 3 cols but row 12 contains only 2 cols ...", where that 12 is relative to the current chunk not to the whole file (and therefore is less than useful).

@mattdowle mattdowle changed the title [fread] Line numbers in error messages should refer to the raw file Line numbers in error messages should refer to the raw file Oct 18, 2017
@mattdowle
Copy link
Member

Thanks @etienne-s for the great report.
This has been recently fixed in dev. Here's new behaviour which looks good to me now.

> src1 <- 'col1,col2
+ a,b
+ c,d,e'
> fread(src1)
   col1 col2
1:    a    b
Warning message:
In fread(src1) : Discarded single-line footer: <<c,d,e>>

> src2 <- 'col1,col2
+ a,"b1 
+ b2"
+ c,d,e'
> fread(src2)
   col1    col2
1:    a b1 \nb2
Warning message:
In fread(src2) : Discarded single-line footer: <<c,d,e>>
> 

Also, the line numbers on errors in large files, which @st-pasha pointed out was less than useful before, I've now fixed as well. There's now a concept of a thread being at the head position, which corresponds to the last row written to the result, so that thread knows how many rows have passed. Only the thread at the head position now stops the team and throws the error, including an accurate row number. But, it is a row number and not a line number. If there are no embedded newlines, then they are the same and the line number reported will be correct (even when multi-chunked). But, in the event of embedded newlines, the error won't ever report a line number that could accurately be used with head -n. The error does include quite a lot of the line in error so you greping for a substring from it can often work.
Having said this i) if chunk boundaries fall within fields with many embedded newlines, it's much better handled now in dev. It does it's best and then ultimately the head position is now able to sweep up such awkward jump landings unambiguously since it has full lineage from the start of file. i.e. errors should not happen now due to jumping, so long as the file is valid when read from the start. and ii) on the todo list somewhere is, because I'd prefer not to try and track the number of newlines inside quoted fields (and therefore the true line number of error in that case), instead, fread could output the 2 or 3 lines either side of the line in error, automatically for you. Like a grep -A 2 -B 2 but automatically.

Please test latest dev again and open a new issue if any problems. I'm pretty sure it should be all fine now in this area. Thanks again.

@st-pasha
Copy link
Contributor

st-pasha commented Mar 4, 2018

I believe this is still a useful feature to have, and not very hard to implement after your work in #2627 .
All that is needed is to increment the thread-local line counter in the string field parser when we check if (*ch=='\n') { ... }. These line counters would then propagate from thread to thread in the same way as myDTi counter does.

The @etienne-s example can better be given as

> fread('C1,C2\na,"b1\nb2"\nc,d,e\nf,g\nh,k\n')
   C1     C2
1:  a b1\nb2
Warning message:
In fread("C1,C2\na,\"b1\nb2\"\nc,d,e\nf,g\nh,k\n") :
  Stopped early on line 3. Expected 2 fields but found 3. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<c,d,e>>

Here the warning message says "stopped early on line 3 <<c,d,e>>", however this is the 4th line in the file. Of course, this example is trivial, but in a multi-GB file the discrepancies between lines/rows may reach tens of thousands making it hard to find the error.

@etienne-s
Copy link
Author

Thanks @mattdowle!

I've tested again and my particular problem is solved in the new version, since the new warning message includes a large part of the faulty line. (The old message included only the first 10 characters, and it was not enough to locate the problem with grep.)

In some other cases it might be useful to have an exact line number reported in the message instead of a row number.

@mattdowle mattdowle reopened this Apr 22, 2018
@mattdowle mattdowle added this to the v1.11.2 milestone Apr 22, 2018
@jangorecki jangorecki modified the milestones: 1.12.0, 1.11.6 Jun 6, 2018
@mattdowle mattdowle modified the milestones: 1.11.6, 1.12.0 Sep 20, 2018
@mattdowle mattdowle modified the milestones: 1.12.0, 1.12.2 Jan 6, 2019
@jangorecki jangorecki modified the milestones: 1.12.2, 1.12.4 Jan 24, 2019
@jangorecki jangorecki removed the dev label Jan 29, 2019
@jangorecki jangorecki modified the milestones: 1.12.4, 1.13.0 Sep 17, 2019
@mattdowle mattdowle modified the milestones: 1.12.7, 1.12.9 Dec 8, 2019
@mattdowle mattdowle modified the milestones: 1.13.1, 1.13.3 Oct 17, 2020
@jangorecki jangorecki modified the milestones: 1.14.3, 1.14.5 Jul 19, 2022
@jangorecki jangorecki modified the milestones: 1.14.11, 1.15.1 Oct 29, 2023
@jangorecki jangorecki removed this from the 1.16.0 milestone Nov 6, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants