Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fread does not stop when input is unsupported (e.g. binary file) #2834

Closed
qgeissmann opened this issue May 4, 2018 · 3 comments
Closed

fread does not stop when input is unsupported (e.g. binary file) #2834

qgeissmann opened this issue May 4, 2018 · 3 comments

Comments

@qgeissmann
Copy link

qgeissmann commented May 4, 2018

For instance, if user put a random png file as input, no error is raised:

dt <- fread("https://raw.githubusercontent.com/wiki/Rdatatable/data.table/icons/sticker.png")

We get:

Warning message:
In fread("https://raw.githubusercontent.com/wiki/Rdatatable/data.table/icons/sticker.png") :
  Stopped early on line 3. Expected 1 fields but found 1. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<>>

We even get a data table:

> dt
   \x89PNG
1:    \032

Note sure I should, but I would expect failure instead.

@st-pasha
Copy link
Contributor

st-pasha commented May 4, 2018

This is a valid concern of course.

  • Current behavior is caused by the recent change in fread's logic. Previously upon encountering an error in the file (such inconsistent number of fields) we'd throw an outright error telling the user that something went wrong on a particular line. However now we issue a warning instead, and return the data.table that was parsed up to the point of failure. Presumably this could be useful to some users in certain circumstances.
  • If you're wondering why the message says Expected 1 fields but found 1 -- this is because of issue fread error reading Latin-1 file containing NUL byte <0x00> #2435 (fread stops reading a file upon encountering a NUL byte).
  • Unfortunately, fread was not designed to handle binary inputs. There is not a single test with a binary input for example. It is also hard to define (and therefore detect) what constitutes a binary input. The file has invalid utf8 sequences? Well, maybe it's just in a different encoding. The file has control characters? Some CSV files use them as field separators. Moreover, it is perfectly legal to have a CSV with some binary data embedded (although that data better be properly quoted!).

Overall, this is not as simple as it looks. One possible way to approach this problem is to sniff popular file formats from the initial bytes, similar to how we detect a BOM mark currently. But even then care must be taken not to fall into the same sin as Excel with its ID column bug.

This could be a good place to start.

@MichaelChirico
Copy link
Member

MichaelChirico commented May 5, 2018

agree we should err strongly on the side of false positives for identifying valid input.

is there any security/crash vulnerability from allowing arbitrary binary files to be parsed? if not, I don't see much cost to delegating this to the user.

it seems almost certain to wind up with a 1x1 data.table as output -- perhaps in such cases, we can append a proviso to the warning message "Are you sure you've provided valid input?"

what would happen if fill = TRUE?

@MichaelChirico
Copy link
Member

Closed pending further motivation to support anything here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants