fread does not stop when input is unsupported (e.g. binary file) #2834

qgeissmann · 2018-05-04T11:19:19Z

For instance, if user put a random png file as input, no error is raised:

dt <- fread("https://raw.githubusercontent.com/wiki/Rdatatable/data.table/icons/sticker.png")

We get:

Warning message:
In fread("https://raw.githubusercontent.com/wiki/Rdatatable/data.table/icons/sticker.png") :
  Stopped early on line 3. Expected 1 fields but found 1. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<>>

We even get a data table:

> dt
   \x89PNG
1:    \032

Note sure I should, but I would expect failure instead.

The text was updated successfully, but these errors were encountered:

st-pasha · 2018-05-04T17:01:41Z

This is a valid concern of course.

Current behavior is caused by the recent change in fread's logic. Previously upon encountering an error in the file (such inconsistent number of fields) we'd throw an outright error telling the user that something went wrong on a particular line. However now we issue a warning instead, and return the data.table that was parsed up to the point of failure. Presumably this could be useful to some users in certain circumstances.
If you're wondering why the message says Expected 1 fields but found 1 -- this is because of issue fread error reading Latin-1 file containing NUL byte <0x00> #2435 (fread stops reading a file upon encountering a NUL byte).
Unfortunately, fread was not designed to handle binary inputs. There is not a single test with a binary input for example. It is also hard to define (and therefore detect) what constitutes a binary input. The file has invalid utf8 sequences? Well, maybe it's just in a different encoding. The file has control characters? Some CSV files use them as field separators. Moreover, it is perfectly legal to have a CSV with some binary data embedded (although that data better be properly quoted!).

Overall, this is not as simple as it looks. One possible way to approach this problem is to sniff popular file formats from the initial bytes, similar to how we detect a BOM mark currently. But even then care must be taken not to fall into the same sin as Excel with its ID column bug.

This could be a good place to start.

MichaelChirico · 2018-05-05T00:17:42Z

agree we should err strongly on the side of false positives for identifying valid input.

is there any security/crash vulnerability from allowing arbitrary binary files to be parsed? if not, I don't see much cost to delegating this to the user.

it seems almost certain to wind up with a 1x1 data.table as output -- perhaps in such cases, we can append a proviso to the warning message "Are you sure you've provided valid input?"

what would happen if fill = TRUE?

MichaelChirico · 2024-04-13T06:25:30Z

Closed pending further motivation to support anything here.

qgeissmann mentioned this issue May 4, 2018

linking fails on single file rethomics/scopr#13

Closed

st-pasha added feature request fread labels May 4, 2018

st-pasha mentioned this issue May 4, 2018

Master task for fread bugs / proposals #2247

Closed

MichaelChirico closed this as completed Apr 13, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fread does not stop when input is unsupported (e.g. binary file) #2834

fread does not stop when input is unsupported (e.g. binary file) #2834

qgeissmann commented May 4, 2018 •

edited

Loading

st-pasha commented May 4, 2018 •

edited

Loading

MichaelChirico commented May 5, 2018 •

edited

Loading

MichaelChirico commented Apr 13, 2024

fread does not stop when input is unsupported (e.g. binary file) #2834

fread does not stop when input is unsupported (e.g. binary file) #2834

Comments

qgeissmann commented May 4, 2018 • edited Loading

st-pasha commented May 4, 2018 • edited Loading

MichaelChirico commented May 5, 2018 • edited Loading

MichaelChirico commented Apr 13, 2024

qgeissmann commented May 4, 2018 •

edited

Loading

st-pasha commented May 4, 2018 •

edited

Loading

MichaelChirico commented May 5, 2018 •

edited

Loading