fread: use fill with integer as ncol guess #5119

Merged
merged 39 commits into from
Mar 21, 2024

Conversation

Member

@ben-schwen ben-schwen commented Aug 28, 2021

Closes #2727
Closes #1812
Closes #2691
Closes #5378
Closes #4130
Closes #3436
Closes #2691

Changes:
data.table internally guesses the number of columns (ncol). This PR lets the user supply their own guess for ncol through fill, and uses ncol = max(user_guess, data.table_estimate) when fill=TRUE or fill>0.
Because the user's guess can also be too high, we could end up with a data.table containing many "empty" logical NA columns. Therefore, we track the maximum number of columns seen while reading and drop the overallocated empty columns afterwards.

Moreover, the error messages for fill=TRUE and fill=int are modified to nudge the user towards supplying a guess, or a higher guess.
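
As a rough illustration of the column-count rule described above, here is a minimal C sketch. It is not the code in src/fread.c: the function names `resolve_ncol` and `used_ncol` are hypothetical, and it assumes fill has already been coerced to an integer (0 for FALSE, 1 for TRUE, larger values for a user guess).

```c
#include <assert.h>

/* Hypothetical helper: how many columns to allocate before reading.
 * fill == 0 stands for fill=FALSE, fill == 1 for fill=TRUE, and
 * fill > 1 for a user-supplied guess of the column count. */
int resolve_ncol(int fill, int sampled_ncol)
{
    assert(fill >= 0 && sampled_ncol >= 0);
    if (fill == 0) return sampled_ncol;               /* no filling: trust the sample */
    return fill > sampled_ncol ? fill : sampled_ncol; /* max(user_guess, estimate) */
}

/* Hypothetical helper: how many columns to keep after reading.
 * fields_per_row[i] is the number of fields found on row i. Columns
 * beyond the widest row were overallocated and can be dropped. */
int used_ncol(const int *fields_per_row, int nrow, int allocated_ncol)
{
    int max_seen = 0;
    for (int i = 0; i < nrow; ++i)
        if (fields_per_row[i] > max_seen) max_seen = fields_per_row[i];
    return max_seen < allocated_ncol ? max_seen : allocated_ncol;
}
```

Under this rule a guess that is too low is harmless whenever the sampled estimate is larger, and a guess that is too high only costs temporary allocation, since the result is trimmed to the widest row actually seen.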

@codecov

codecov bot commented Aug 28, 2021

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 97.51%. Comparing base (3eefbca) to head (5b96e1b).
Report is 1 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5119      +/-   ##
==========================================
+ Coverage   97.50%   97.51%   +0.01%     
==========================================
  Files          80       80              
  Lines       14884    14913      +29     
==========================================
+ Hits        14513    14543      +30     
+ Misses        371      370       -1     


@mattdowle
Member

mattdowle commented Aug 30, 2021

At first glance, it feels to me like fill=TRUE could and should just work rather than adding a new argument. If we add this new argument now, it will be hard to take away once it is in use. The new argument works by reading every row in the file to find the max number of columns, iiuc, which isn't ideal (although it is better than the current situation).
Since data.table is over-allocated, and fread is creating a new one for the first time, it should be possible to add new columns as they are encountered out-of-sample when fill=TRUE. I saw someone suggest that fill=20 could be used if the user knows there are 20 columns, which seems a good way to control the over-allocation too; e.g. if the user knows there are not more than 100 columns, they could specify fill=100, and if only 23 were found, that would be fine and the result would have 23 columns. If there are 1,000,000 columns and the sample finds only 20, then fill=TRUE (i.e. the user not specifying an upper bound) could result in a few shallow copies internally as more and more columns are found out-of-sample and the over-allocated amount keeps getting filled. Those shallow copies could be avoided if the user specified a very large upper bound, fill=1e7 for example. But that's just an extreme edge case.
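
To make the trade-off above concrete, here is a small C sketch of growing the column allocation as wider rows show up. It is only an illustration under assumptions: `count_fields` and `grow_capacity` are hypothetical names, quoting is ignored, and the doubling policy is an assumption standing in for whatever over-allocation scheme would actually be used.

```c
#include <assert.h>
#include <stddef.h>

/* Count separator-delimited fields on one line. Quoted fields are not
 * handled, and an empty line counts as one (empty) field in this sketch. */
size_t count_fields(const char *line, char sep)
{
    size_t n = 1;
    for (const char *p = line; *p != '\0' && *p != '\n'; ++p)
        if (*p == sep) ++n;
    return n;
}

/* Grow *capacity by doubling (an assumed policy) until it can hold
 * `needed` columns. The return value is the number of growth steps,
 * which stands in for the shallow copies mentioned above. */
int grow_capacity(size_t *capacity, size_t needed)
{
    assert(capacity != NULL && *capacity > 0);
    int steps = 0;
    while (*capacity < needed) {
        *capacity *= 2;
        ++steps;
    }
    return steps;
}
```

With a starting capacity of 20 and a row of 23 fields, one growth step is needed; with an upper bound of 100 supplied up front, none is.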

@mattdowle mattdowle added the WIP label Aug 30, 2021
@ben-schwen ben-schwen changed the title from "fread: turn off sampling for fill" to "fread: use fill with integer as ncol guess" Aug 31, 2021
@ben-schwen ben-schwen removed the WIP label Aug 31, 2021
@ben-schwen ben-schwen added the WIP label Aug 31, 2021
@ben-schwen
Member Author

@mattdowle Indeed, providing an estimate for the number of columns seems a better choice than reading the file twice. I guess the only thing left to do is to delete the overallocated columns, or is that a user problem?

@mattdowle
Member

@ben-schwen Wow - great! Will look. Overallocated columns can be left there. They only take up 8 bytes per slot (or 4 bytes on 32bit) and they're not normally user visible.
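
As a back-of-the-envelope check of that cost, a tiny C sketch, assuming each spare slot is one pointer wide (the function name `spare_slot_bytes` is hypothetical):

```c
#include <assert.h>
#include <stddef.h>

/* Bytes held by unused column slots, assuming one pointer per slot:
 * 8 bytes on a typical 64-bit build, 4 bytes on a 32-bit build. */
size_t spare_slot_bytes(size_t allocated_slots, size_t used_slots)
{
    assert(used_slots <= allocated_slots);
    return (allocated_slots - used_slots) * sizeof(void *);
}
```

For example, 100 allocated slots with 23 in use leaves 77 spare pointers, which is 616 bytes on a 64-bit build.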

@mattdowle mattdowle removed the WIP label Aug 31, 2021
@@ -18330,3 +18330,22 @@ if (test_bit64) {
apple = data.table(id = c("a", "b", "b"), time = c(1L, 1L, 2L), y = i64v[1:3])
test(2248, dcast(apple, id ~ time, value.var = "y"), data.table(id = c('a', 'b'), `1` = i64v[1:2], `2` = i64v[4:3], key='id'))
}

# fread(...,fill) can also be used to specify a guess on the maximum number of columns #2691 #1812 #4130 #3436 #2727
Member

I see one example here derived from the issues being closed, but I think it'd be better if we included more of those as specific regression tests before closing the linked issues.

@MichaelChirico
Member

Looks pretty good overall! One question, I see all of the linked issues marked as "closed" -- does that mean there's no other way to fulfill those requests? I.e., we might see this issue as providing a workaround for the linked issues, not necessarily addressing them directly. WDYT?

ben-schwen and others added 5 commits March 21, 2024 00:16
Co-authored-by: Michael Chirico <michaelchirico4@gmail.com>
@ben-schwen
Member Author

> Looks pretty good overall! One question, I see all of the linked issues marked as "closed" -- does that mean there's no other way to fulfill those requests? I.e., we might see this issue as providing a workaround for the linked issues, not necessarily addressing them directly. WDYT?

Good point. I will recheck all issues and update the first post on whether we address it directly or provide an acceptable solution. AFAIR I couldn't come up with a better solution than user guesses without double reading (which is something we wanted to avoid)

isTRUEorFALSE(verbose), isTRUEorFALSE(check.names), isTRUEorFALSE(logical01), isTRUEorFALSE(keepLeadingZeros), isTRUEorFALSE(yaml),
isTRUEorFALSE(stringsAsFactors) || (is.double(stringsAsFactors) && length(stringsAsFactors)==1L && 0.0<=stringsAsFactors && stringsAsFactors<=1.0),
is.numeric(nrows), length(nrows)==1L
)
fill=as.integer(fill)
Member

This means that fill=1L is handled identically to fill=TRUE, right? I am not sure there's a use case for the former, but maybe it should be documented?

Member

Let's handle in follow-up if needed

Member Author

Yes, but either way the handling uses the max of the data.table estimate and the user guess.

@MichaelChirico
Member

> > Looks pretty good overall! One question, I see all of the linked issues marked as "closed" -- does that mean there's no other way to fulfill those requests? I.e., we might see this issue as providing a workaround for the linked issues, not necessarily addressing them directly. WDYT?
>
> Good point. I will recheck all issues and update the first post on whether we address it directly or provide an acceptable solution. AFAIR I couldn't come up with a better solution than user guesses without double reading (which is something we wanted to avoid)

I'm going to go ahead and merge. Please do follow up with the linked issues and (suggestions, not a checklist) (1) file PRs for associated tests (2) comment on the issue itself if you see fit, could be "this issue can be fixed by supplying your own upper bound guess on the # of columns" "please file a follow-up if you're convinced the workaround with fill= can be further improved", etc. (3) possibly re-open if you see fit.

@MichaelChirico
Member

Great stuff! Not sure we've ever had a PR close seven issues at once before 🎉
