-
Notifications
You must be signed in to change notification settings - Fork 993
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fread: use fill with integer as ncol guess #5119
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #5119 +/- ##
==========================================
+ Coverage 97.50% 97.51% +0.01%
==========================================
Files 80 80
Lines 14884 14913 +29
==========================================
+ Hits 14513 14543 +30
+ Misses 371 370 -1 ☔ View full report in Codecov by Sentry. |
It feels to me, on first glance, like |
@mattdowle Indeed, providing an estimate for the number of columns seems a better choice than reading the file twice. I guess the only thing left to do is to delete the overallocated column or is that a user problem? |
@ben-schwen Wow - great! Will look. Overallocated columns can be left there. They only take up 8 bytes per slot (or 4 bytes on 32bit) and they're not normally user visible. |
inst/tests/tests.Rraw
Outdated
@@ -18330,3 +18330,22 @@ if (test_bit64) { | |||
apple = data.table(id = c("a", "b", "b"), time = c(1L, 1L, 2L), y = i64v[1:3]) | |||
test(2248, dcast(apple, id ~ time, value.var = "y"), data.table(id = c('a', 'b'), `1` = i64v[1:2], `2` = i64v[4:3], key='id')) | |||
} | |||
|
|||
# fread(...,fill) can also be used to specify a guess on the maximum number of columns #2691 #1812 #4130 #3436 #2727 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see one example here derived from the closing issues, but I think it'd be better if we included more of those as specific regression tests before closing the linked issue.
Looks pretty good overall! One question, I see all of the linked issues marked as "closed" -- does that mean there's no other way to fulfill those requests? I.e., we might see this issue as providing a workaround for the linked issues, not necessarily addressing them directly. WDYT? |
Co-authored-by: Michael Chirico <michaelchirico4@gmail.com>
Co-authored-by: Michael Chirico <michaelchirico4@gmail.com>
Good point. I will recheck all issues and update the first post on whether we address it directly or provide an acceptable solution. AFAIR I couldn't come up with a better solution than user guesses without double reading (which is something we wanted to avoid) |
isTRUEorFALSE(verbose), isTRUEorFALSE(check.names), isTRUEorFALSE(logical01), isTRUEorFALSE(keepLeadingZeros), isTRUEorFALSE(yaml), | ||
isTRUEorFALSE(stringsAsFactors) || (is.double(stringsAsFactors) && length(stringsAsFactors)==1L && 0.0<=stringsAsFactors && stringsAsFactors<=1.0), | ||
is.numeric(nrows), length(nrows)==1L | ||
) | ||
fill=as.integer(fill) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this means that fill = 1L
is handled identically to fill=TRUE
right? I am not sure there's a use case for the former, but maybe it should be documented?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's handle in follow-up if needed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes but the handling is anyway that it uses the max of data.table estimate and user guess.
I'm going to go ahead and merge. Please do follow up with the linked issues and (suggestions, not a checklist) (1) file PRs for associated tests (2) comment on the issue itself if you see fit, could be "this issue can be fixed by supplying your own upper bound guess on the # of columns" "please file a follow-up if you're convinced the workaround with |
Great stuff! Not sure we've ever had a PR close seven issues at once before 🎉 |
Closes #2727
Closes #1812
Closes #2691
Closes #5378
Closes #4130
Closes #3436
Closes #2691
Changes:
data.table
internally makes a guess about the number of columns (ncol). This PR changes that the user can provide a guess for ncol and uses ncol = max(user_guess, data.table_estimate) whenfill=TRUE
orfill>0
.Now the user guess can also be too high, so we would end up with a
data.table
with a lot of "empty" boolean NA columns. Therefore, we count the maximum number of columns during reading and afterwards clean up the overallocated empty columns.Moreover, the error messages for
fill=TRUE
andfill=int
are modified to nudge the user into the direction of using a guess or a higher guess