na.strings getOption() added so ,, can be read as NA by default in future #2652
Codecov Report
@@ Coverage Diff @@
## master #2652 +/- ##
==========================================
+ Coverage 93.31% 93.31% +<.01%
==========================================
Files 61 61
Lines 12191 12196 +5
==========================================
+ Hits 11376 11381 +5
Misses 815 815
I like how `fwrite` no longer selects its `na` attribute based on the number of columns - makes the output more predictable.
However as for `fread` now parsing `,,` as an NA string, I have doubts. On one hand I appreciate how the treatment of `,,` now becomes consistent across types (except booleans?), and it is also compatible with `fwrite` which writes empty strings as always quoted. On the other hand, all other CSV writers do not follow such a convention, and unless the user used option `quoting="all"`, it will output empty string as `,,`. Most CSV writers don't even have the notion of NA string.
The end result is that this change might be a breaking change for some users who have regular CSV files. If they have code that reads a file, obtains a character column, and then manipulates that column somehow, then having NAs where they used to have empty strings will likely lead to unexpected results.
I do not think such change should be introduced without weighing all pros and cons, and without the usual deprecation cycle.
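A minimal base-R sketch of the breakage described above, assuming a character column that used to arrive with an empty string and now arrives with `NA`. The vectors `old` and `new` are hypothetical stand-ins for the same column before and after the change; no file is read here.

```r
# Hypothetical column as read before the change (",," -> "") and after (",," -> NA)
old <- c("a", "", "b")
new <- c("a", NA, "b")

# A common "is it blank?" test no longer yields a plain TRUE/FALSE
old == ""            # FALSE  TRUE FALSE
new == ""            # FALSE    NA FALSE

# Counting blanks becomes NA unless missing values are handled
sum(old == "")       # 1
sum(new == "")       # NA

# String manipulation turns the missing value into the literal text "NA"
paste0(old, "_x")    # "a_x" "_x"   "b_x"
paste0(new, "_x")    # "a_x" "NA_x" "b_x"

# Code that should treat both cases alike can test for either explicitly
is_blank <- function(x) is.na(x) | x == ""
is_blank(old)        # FALSE  TRUE FALSE
is_blank(new)        # FALSE  TRUE FALSE
```

A helper along the lines of `is_blank()` is one way for user code to stay correct under either default while a deprecation cycle runs.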
For booleans both Good points on breakage. Usual I'll sweep through all fread issues and see if any others are in this area. More can be done to auto detect files which have used Views welcome from others and we'll keep this PR open a while. So far I've tried to fix the issues linked at the top.
There's
It's more about choice of defaults. I'm finding the choice of
NEWS.md
This option controls how `,,` is read in character columns. It does not affect numeric columns which read `,,` as `NA` regardless. We would like `,,`=>`NA` for consistency with numeric types, and `,"",`=>empty string to be the standard default for `fwrite/fread` character columns so that `fread(fwrite(DT))==DT` without needing any change to any parameters. `fwrite` has never written `NA` as `"NA"`, by default it already writes `,,`. The use of R's `getOption()` allows data.table users to move forward early, or restore old behaviour when the default's default is changed in future.

2. `fread` now reads a column of all 0's and 1's as `logical` rather than `integer`, for convenience to avoid needing to change the type afterwards or use `colClasses`. The old behaviour can be restored with `options(datatable.logical01=FALSE)`. We felt this default change was ok to make because in all operations there should be no difference: R treats `logical` and `integer` the same. If this change does cause a problem, the option is provided to restore old behaviour while you update your code. Similarly, `fwrite` now writes `logical` columns as `0/1` by default, controlled by the same option. `0/1` is smaller and faster than `"TRUE"/"FALSE"`, which can make a significant difference to space and time the more `logical` columns there are. Further, a column of `TRUE/FALSE`s is ok, as well as `True/False`s and `true/false`s, but mixing styles (e.g. `TRUE/false`) is not and will be read as type `character`.
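The `getOption()` mechanism mentioned in the first NEWS paragraph above (an argument whose default is itself looked up from an option) can be sketched in base R as follows. The function `read_demo` and the option name `demo.na.strings` are hypothetical, chosen only for illustration; this is not data.table's implementation or its actual option name.

```r
# Hypothetical reader: the argument's default is looked up from an option,
# and the option's fallback ("NA" here) is the "default's default".
read_demo <- function(na.strings = getOption("demo.na.strings", "NA")) {
  na.strings
}

read_demo()                        # "NA"  (option unset, fallback used)

old_opts <- options(demo.na.strings = "")
read_demo()                        # ""    (user opted in to other behaviour)
read_demo(na.strings = "NA")       # "NA"  (an explicit argument still wins)

options(old_opts)                  # restore the previous (unset) state
read_demo()                        # "NA"
```

With this pattern a package can later change only the fallback value, and users who set the option explicitly see no change.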
I would be a bit more careful on the wording:
> in all operations there should be no difference: R treats `logical` and `integer` the same.
But that's not true:
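    library(data.table)
    DT = data.table(l_int = c(0, 0, 1, 0), l_log = c(FALSE, FALSE, TRUE, FALSE), i = 1:4)
    # numeric i is treated as row numbers: the 0s are dropped and 1 selects row 1
    DT[(l_int)]
    #    l_int l_log i
    # 1:     0 FALSE 1
    # logical i is treated as a row filter: only row 3 is TRUE
    DT[(l_log)]
    #    l_int l_log i
    # 1:     1  TRUE 3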
I think (?) more accurate is that all arithmetic expecting `integer` and getting `logical` will go through as expected (i.e., that sending `0/1` to `FALSE/TRUE` should be safe, whereas the reverse would cause more issues). Of course any function running an `is.integer` test will fail (and vice versa for `is.logical` on `integer` columns).
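The distinction drawn here can be checked in base R without data.table. This is a small sketch on two hypothetical vectors, `x_int` and `x_log`, holding the same 0/1 pattern as integer and as logical.

```r
x_int <- c(0L, 0L, 1L, 0L)
x_log <- c(FALSE, FALSE, TRUE, FALSE)

# Arithmetic and comparison coerce logical to integer, so these agree
sum(x_int) == sum(x_log)     # TRUE
all(x_int == x_log)          # TRUE
x_log + 1L                   # 1 1 2 1

# Type tests and exact comparison do not agree
is.integer(x_log)            # FALSE
identical(x_int, x_log)      # FALSE

# Subsetting differs: integers are positions, logicals are a mask
v <- c("a", "b", "c", "d")
v[x_int]                     # "a"
v[x_log]                     # "c"
```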
I had the same concern. I think it will be fine as long as we add the option in 1.10.6 and change its default from 1.10.8.
@@ -157,6 +167,8 @@ the behaviour of `base:::merge.data.frame()`. Thanks to @sritchie73 for reportin

35. `CJ()` now fails with proper error message when results would exceed max integer, [#2636](https://github.com/Rdatatable/data.table/issues/2636).

36. `NA` in character columns now display as `<NA>` just like base R to distinguish from `""` and `"NA"`.
this is nice, no need for `quote = TRUE` argument by default 👍
LGTM, don't see an option to approve the PR anywhere though 🤔
If we do 1/0 instead of TRUE/FALSE we could also make #1656, at least as option @mattdowle
consistency of fwrite and fread should be most important, then speed and options to customize.
@MichaelChirico For me it was: from this page, select the
@HughParsonage thanks, I thought I remembered it on this (Conversation) tab
… comments and reflected at the top of NEWS.
Closes #2106 (again)
Closes #2217
Closes #2214
Closes #2281
Closes #1159
#2524 is related and in discussion as to whether filled character columns should have NA always independently of `na.strings`. Can address that separately to this PR.

Standardizing `fread`'s default:
- `,,` means `NA` for all types consistently (in particular in string columns).
- `,"",` means empty string as written by `fwrite` by default (change made in dev some months back).

See also comment in reopened issue here that this PR reverts the `fwrite` change in dev for 1-column DTs back to the same consistent default in v1.10.4 as on CRAN.

For all input data (i.e. all types, NA or "", 1 column or >1 column), `fread(fwrite(DT)) == DT` should be true without needing to change any arguments. This is not true before this PR.

TODO:
- allow quoted na.strings as per "na.strings is too literal when column is quoted on file" #2586 and change doc again. Left open issue in this milestone to address separately to this PR. This PR is just about `,,` -vs- `,"",` default.
- `<NA>` just like base R to distinguish from `""` and `"NA"`
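The round-trip property stated above can be tried with a small table that mixes an empty string and an `NA` in one character column. This is only a sketch: whether the final comparison holds depends on the installed data.table version and on the `na.strings` default in effect, so no expected output is shown. The table contents and the temporary file are made up for illustration.

```r
library(data.table)

# Empty string and NA in the same character column
DT <- data.table(id = 1:3, s = c("a", "", NA))

f <- tempfile(fileext = ".csv")
fwrite(DT, f)
readLines(f)            # inspect how "" and NA were written to the file

back <- fread(f)
all.equal(DT, back)     # TRUE only if the round trip preserved "" versus NA
is.na(back$s)           # which rows came back as NA
back$s == ""            # which rows came back as empty string

unlink(f)               # remove the temporary file
```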