Extend functionality of fill parameter in fread #2848

Open
st-pasha opened this issue May 7, 2018 · 2 comments

Comments

@st-pasha
Contributor

st-pasha commented May 7, 2018

This FR was suggested by @HughParsonage in #2524. Reposting it here so that it doesn't get lost.

Currently fill can be either TRUE or FALSE (default). When true, all incomplete lines in the input will be padded with NAs. If false, any incomplete line in the input will cause a warning to be shown (previously it was an error).
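For reference, the current `fill=TRUE` behavior can be reproduced with a small sketch like the one below. The input string and column names are made up for illustration; only `fread()` and its existing `text`, `sep` and `fill` arguments are used.

```r
library(data.table)

# Hypothetical ragged input: the second data row has only two of three fields.
ragged <- "a,b,c\n1,2,3\n4,5\n6,7,8\n"

# With fill = TRUE the short row is padded, so column `c` is NA in row 2.
dt <- fread(text = ragged, sep = ",", fill = TRUE)
print(dt)
```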

Sometimes, however, users might want to fill with something other than NA. We could consider the following:

  • fill=c(class1=value1, class2=value2, ...): fill columns of type class1 with value1, columns of type class2 with value2, ..., all other columns are still filled with NAs.
  • fill=value: same as fill=c(class(value) = value).
  • fill=c(col1=value1, col2=value2, ...): fill column named col1 with values value1, column named col2 with values value2, ..., and all other columns with NAs.
  • fill=c(value1, value2, ...): fill first column with value1, second column with value2, etc.
  • fill=c(col1=c(value1=repl1, value2=repl2, ...), ...): in column col1 replace value1 with repl1, value2 with repl2, etc. This variant merges na.strings with fill.
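To make the by-name variant concrete, here is a rough sketch of how `fill=c(col1=value1, ...)` could be emulated today, after reading. `fill_by_name()` is a hypothetical helper written for this example and is not part of data.table; the input data is also made up. Note that it cannot tell a padded field from a genuine NA, which the proposed `fill` extension presumably could.

```r
library(data.table)

# Hypothetical helper (not a data.table function): replace NAs in the named
# columns with the supplied values, modifying `dt` by reference.
fill_by_name <- function(dt, values) {
  for (col in names(values)) {
    idx <- which(is.na(dt[[col]]))          # rows that are NA in this column
    set(dt, i = idx, j = col, value = values[[col]])
  }
  invisible(dt)
}

# The second data row has a single field, so `b` and `c` are padded with NA.
dt <- fread(text = "a,b,c\n1,2,3\n4\n7,8,9\n", sep = ",", fill = TRUE)

# A list (rather than c()) keeps each replacement value's own type.
fill_by_name(dt, list(b = 0L, c = -1L))
print(dt)
```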

We might also want to consider a different parameter name here. Right now fill controls the behavior of fread when the rows are ragged (i.e. different rows have different numbers of values). It seems that a more natural extension of this functionality (though not of the name) is to allow more choices for what to do when rows are ragged: fill-with-NAs, fill-with-NAs-and-warn, error, warn-and-stop, etc.

On the other hand, the question of what to replace the missing values with seems to be orthogonal to the treatment of ragged rows. In particular, it is perfectly reasonable to ask for strict behavior (i.e. current fill=TRUE) but to fill all NAs in integer columns with, say, -999.

@MichaelChirico
Member

#5119 extended fill= so that passing an integer, fill=n, means "I know there are at most n columns in the table". I haven't read the OP carefully enough to say for sure, but at a high level it looks like setnafill() works well to achieve what it's after. @ben-schwen could you PTAL and rule on whether this can be closed as out-of-scope? Or is there more functionality worth exploring here?
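As a minimal sketch of the setnafill() route, assuming made-up data and the -999 sentinel from the OP: read with `fill=TRUE`, then replace NAs in the integer columns by reference. Only existing data.table functions are used; selecting the integer columns via `vapply()` is just one way to do it.

```r
library(data.table)

# Hypothetical input: the second data row is missing its last field.
dt <- fread(text = "id,x,y\n1,10,100\n2,20\n3,30,300\n", sep = ",", fill = TRUE)

# Pick the integer columns, then fill their NAs with a constant in place.
int_cols <- names(dt)[vapply(dt, is.integer, logical(1))]
setnafill(dt, type = "const", fill = -999L, cols = int_cols)
print(dt)
```

This treats every NA in those columns alike, whether it came from a padded field or from an na.strings match.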

@ben-schwen
Member

I think the only corner case where this might give more functionality than fread followed by setnafill is when I want to distinguish between a missing field and a value matched by another na.string.

As Michael mentioned, this interferes with fill=integer providing a user guess of the number of columns, which was an often-requested feature and was implemented in #5119.

I would keep it as an FR but with a different parameter name, which might be put into ... and then get evaluated. Definitely only worth implementing if requested by more users.
