fread: use fill with integer as ncol guess #5119

ben-schwen · 2021-08-28T15:47:55Z

Closes #2727
Closes #1812
Closes #2691
Closes #5378
Closes #4130
Closes #3436

Changes:
data.table internally estimates the number of columns (ncol). With this PR the user can supply their own guess for ncol through fill, and fread uses ncol = max(user_guess, data.table_estimate) when fill=TRUE or fill>0.
The user's guess can also be too high, which would leave the result with many "empty" logical NA columns. Therefore, we track the maximum number of columns seen during reading and drop the overallocated empty columns afterwards.

Moreover, the error messages for fill=TRUE and fill=int are modified to nudge the user toward supplying a guess or a higher guess.
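
As a rough illustration of the two steps described above (choosing the allocation width, then trimming after the read), here is a minimal C sketch. The helper names `planned_ncol`, `count_fields`, and `trimmed_ncol` are hypothetical and are not taken from `src/fread.c`; the field counter ignores quoting and is only meant to show the bookkeeping.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of column slots to allocate up front.
   Mirrors ncol = max(user_guess, data.table_estimate) from the description. */
int planned_ncol(int user_guess, int sampled_estimate) {
    return user_guess > sampled_estimate ? user_guess : sampled_estimate;
}

/* Hypothetical helper: count separator-delimited fields on one line.
   Simplification: no quote handling; an empty line counts as 0 fields. */
int count_fields(const char *line, char sep) {
    assert(line != NULL);
    if (*line == '\0') return 0;
    int n = 1;
    for (const char *p = line; *p != '\0'; ++p) {
        if (*p == sep) ++n;
    }
    return n;
}

/* Hypothetical helper: after reading, keep only as many columns as the
   widest row actually used, never more than were allocated. */
int trimmed_ncol(int allocated, int max_fields_seen) {
    return max_fields_seen < allocated ? max_fields_seen : allocated;
}
```

With a guess of 100, a sampled estimate of 3, and a widest row of 23 fields, this sketch allocates 100 slots and keeps 23 columns.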

codecov · 2021-08-28T15:55:40Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 97.51%. Comparing base (3eefbca) to head (5b96e1b).
Report is 1 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #5119      +/-   ##
==========================================
+ Coverage   97.50%   97.51%   +0.01%     
==========================================
  Files          80       80              
  Lines       14884    14913      +29     
==========================================
+ Hits        14513    14543      +30     
+ Misses        371      370       -1


mattdowle · 2021-08-30T21:17:20Z

It feels to me, at first glance, like fill=TRUE could and should just work rather than adding a new argument. If we add this new argument now, it will be hard to remove in future once it is in use. The new argument works by reading every row in the file to know the max number of columns, iiuc, which isn't ideal (although it is better than the current situation).
Since data.table is over-allocated, and fread is creating a new one for the first time, it should be possible to add new columns as they are encountered out-of-sample when fill=TRUE. I saw someone suggested that fill=20 could be used if the user knows there are 20 columns, which seems a good way to control the over-allocation too; e.g. if user knows there are not more than 100 columns, then they could specify fill=100 and if only 23 were found, that would be fine and the result would have 23 columns. If there are 1,000,000 columns and the sample finds only 20, then fill=TRUE (i.e. user not specifying an upper bound) could result in a few shallow copies internally as it found more and more columns out-of-sample and the over-allocated amount kept getting filled. Those shallow copies could be avoided if the user specified a very large upper bound fill=1e7 for example. But that's just an extreme edge case.
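
The trade-off described here can be sketched as a small simulation. Everything below is an assumption made for illustration: the doubling growth policy and the name `simulate_regrowths` are hypothetical and do not describe how data.table actually over-allocates; the point is only that a generous upper bound removes the need to grow.

```c
#include <assert.h>

/* Hypothetical simulation: how many times must the column-slot capacity
   grow before `final_ncol` columns fit, starting from `initial_capacity`?
   Assumption: capacity doubles on each growth (illustrative policy only).
   Keep the arguments modest; the sketch does not guard against int overflow. */
int simulate_regrowths(int initial_capacity, int final_ncol) {
    assert(initial_capacity > 0 && final_ncol >= 0);
    int capacity = initial_capacity;
    int regrowths = 0;
    while (final_ncol > capacity) {
        capacity *= 2;   /* each growth stands in for one shallow copy */
        ++regrowths;
    }
    return regrowths;
}
```

Under these assumptions, `simulate_regrowths(20, 1000000)` needs 16 growth steps, while starting from an upper bound of 10,000,000 slots needs none.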

ben-schwen · 2021-08-31T20:06:10Z

@mattdowle Indeed, providing an estimate for the number of columns seems a better choice than reading the file twice. I guess the only thing left to do is to delete the overallocated columns, or is that a user problem?

mattdowle · 2021-08-31T21:34:28Z

@ben-schwen Wow - great! Will look. Overallocated columns can be left there. They only take up 8 bytes per slot (or 4 bytes on 32bit) and they're not normally user visible.

MichaelChirico · 2024-03-15T19:06:21Z

inst/tests/tests.Rraw

@@ -18330,3 +18330,22 @@ if (test_bit64) {
  apple = data.table(id = c("a", "b", "b"), time = c(1L, 1L, 2L), y = i64v[1:3])
  test(2248, dcast(apple, id ~ time, value.var = "y"), data.table(id = c('a', 'b'), `1` = i64v[1:2], `2` = i64v[4:3], key='id'))
 }
+
+# fread(...,fill) can also be used to specify a guess on the maximum number of columns #2691 #1812 #4130 #3436 #2727


I see one example here derived from the closing issues, but I think it'd be better if we included more of those as specific regression tests before closing the linked issue.

MichaelChirico · 2024-03-15T19:31:31Z

Looks pretty good overall! One question, I see all of the linked issues marked as "closed" -- does that mean there's no other way to fulfill those requests? I.e., we might see this issue as providing a workaround for the linked issues, not necessarily addressing them directly. WDYT?

Co-authored-by: Michael Chirico <michaelchirico4@gmail.com>

ben-schwen · 2024-03-20T23:44:08Z

> Looks pretty good overall! One question, I see all of the linked issues marked as "closed" -- does that mean there's no other way to fulfill those requests? I.e., we might see this issue as providing a workaround for the linked issues, not necessarily addressing them directly. WDYT?

Good point. I will recheck all issues and update the first post on whether we address it directly or provide an acceptable solution. AFAIR I couldn't come up with a better solution than user guesses without double reading (which is something we wanted to avoid)

MichaelChirico · 2024-03-21T06:32:45Z

R/fread.R

    isTRUEorFALSE(verbose), isTRUEorFALSE(check.names), isTRUEorFALSE(logical01), isTRUEorFALSE(keepLeadingZeros), isTRUEorFALSE(yaml),
    isTRUEorFALSE(stringsAsFactors) || (is.double(stringsAsFactors) && length(stringsAsFactors)==1L && 0.0<=stringsAsFactors && stringsAsFactors<=1.0),
    is.numeric(nrows), length(nrows)==1L
  )
+  fill=as.integer(fill)


This means that fill=1L is handled identically to fill=TRUE, right? I am not sure there's a use case for the former, but maybe it should be documented?

Let's handle in follow-up if needed.

Yes, but either way fread uses the max of the data.table estimate and the user guess.

MichaelChirico · 2024-03-21T06:37:25Z

> > Looks pretty good overall! One question, I see all of the linked issues marked as "closed" -- does that mean there's no other way to fulfill those requests? I.e., we might see this issue as providing a workaround for the linked issues, not necessarily addressing them directly. WDYT?
>
> Good point. I will recheck all issues and update the first post on whether we address it directly or provide an acceptable solution. AFAIR I couldn't come up with a better solution than user guesses without double reading (which is something we wanted to avoid)

I'm going to go ahead and merge. Please do follow up with the linked issues and (suggestions, not a checklist) (1) file PRs for associated tests (2) comment on the issue itself if you see fit, could be "this issue can be fixed by supplying your own upper bound guess on the # of columns" "please file a follow-up if you're convinced the workaround with fill= can be further improved", etc. (3) possibly re-open if you see fit.

MichaelChirico · 2024-03-21T06:38:20Z

Great stuff! Not sure we've ever had a PR close seven issues at once before 🎉

ben-schwen added 7 commits August 28, 2021 16:54

fread: turn off sampling for fill

93f6db2

fixed stop

23ce31e

add stopf

117bc4b

fread: turn off sampling for fill

cb3d03b

Merge branch 'fill_sample' of github.com:Rdatatable/data.table into f…

d47da4a

…ill_sample

fread: turn off sampling for fill

c4fdc2e

fread turn off sampling for fill

1044f98

ben-schwen added 2 commits August 28, 2021 21:34

added coverage

6dc2c9d

coverage

a3e5864

mattdowle added the WIP label Aug 30, 2021

ben-schwen added 4 commits August 31, 2021 18:33

revert additional argument

9b6bdb3

fill upperbound

96f6a8d

fixed comment

79874c4

integer as fill argument

99303e2

ben-schwen changed the title ~~fread: turn off sampling for fill~~ fread: use fill with integer as ncol guess Aug 31, 2021

ben-schwen added 3 commits August 31, 2021 20:58

fix typo

7bc34e3

fix L

62ea4e7

add NEWS

c12bb77

ben-schwen removed the WIP label Aug 31, 2021

update verbose

a189b73

ben-schwen added the WIP label Aug 31, 2021

undo verbose

de8ff85

mattdowle removed the WIP label Aug 31, 2021

mcol reviewed Sep 22, 2021

ben-schwen added 2 commits October 31, 2021 17:47

init cleanup

d363f94

merge master

fbc2027

Refine NEWS

7ec8dc8