dcast only computes default fill if necessary #5549

tdhock · 2022-11-30T18:48:13Z

Closes #5512
Closes #5390

In current master, dcast(fill=NULL) always computes a default fill value, even when there are no missing cells. For example, this is the result of a new test case using current master:

> DT <- data.table(chr=c("a","b","b"), int=1:3)[, num := as.numeric(int)]
> dcast(DT, num ~ chr, min, value.var="int")
Key: <num>
     num     a     b
   <num> <int> <int>
1:     1     1    NA
2:     2    NA     2
3:     3    NA     3
Warning message:
In dcast.data.table(DT, num ~ chr, min, value.var = "int") :
  NAs introduced by coercion to integer range

In the code above, it is normal to compute the fill value (fill_value=as.integer(min(integer())), which is NA) because it is used three times: min(integer()) returns Inf, and coercing Inf to integer gives NA with the warning shown.

However, the code below gives the following result using current master, indicating that a default fill value is computed even though it is not used:

> dcast(DT, . ~ chr, min, value.var="int")
Key: <.>
        .     a     b
   <char> <int> <int>
1:      .     1     2
Warning message:
In dcast.data.table(DT, . ~ chr, min, value.var = "int") :
  NAs introduced by coercion to integer range

Using this branch, we get the output below (no warning), indicating that no default fill value was computed because none was needed:

> dcast(DT, . ~ chr, min, value.var="int")
Key: <.>
        .     a     b
   <char> <int> <int>
1:      .     1     2
>
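The idea can be sketched in plain C, outside of R. This is only an illustration under stated assumptions, not the code in this branch: every `sketch_*` name is hypothetical, `SKETCH_NA_INTEGER` mirrors R's representation of `NA_integer_` as `INT_MIN`, and `sketch_default_fill` stands in for evaluating the aggregation function on empty input (the step that produces the warning).

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* R stores NA_integer_ as INT_MIN; redefined here so the sketch builds without R headers. */
#define SKETCH_NA_INTEGER INT_MIN

/* Hypothetical helper: true if at least one cell index is missing. */
static bool sketch_any_na_int(const int *idx, size_t n) {
  for (size_t i = 0; i < n; ++i) {
    if (idx[i] == SKETCH_NA_INTEGER) return true;
  }
  return false;
}

/* Counts how often the (possibly warning-producing) default fill is computed. */
static int sketch_fill_calls = 0;

/* Stand-in for evaluating the aggregation function on empty input. */
static int sketch_default_fill(void) {
  ++sketch_fill_calls;
  return SKETCH_NA_INTEGER;
}

/* Casts one output column: idx holds 1-based positions into col, or NA for an empty cell. */
static void sketch_cast_column(const int *col, const int *idx, size_t n, int *out) {
  const bool some_fill = sketch_any_na_int(idx, n);
  /* Only compute the default fill when some cell actually needs it. */
  const int fill = some_fill ? sketch_default_fill() : 0;
  for (size_t k = 0; k < n; ++k) {
    out[k] = (idx[k] == SKETCH_NA_INTEGER) ? fill : col[idx[k] - 1];
  }
}
```

With an index vector that has no missing entries, `sketch_fill_calls` stays at zero, which corresponds to the warning-free output above.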

codecov · 2022-11-30T18:56:47Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 97.47%. Comparing base (15c127e) to head (4b96d35).
Report is 2 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #5549      +/-   ##
==========================================
- Coverage   97.49%   97.47%   -0.02%     
==========================================
  Files          80       80              
  Lines       14861    14873      +12     
==========================================
+ Hits        14488    14498      +10     
- Misses        373      375       +2

ben-schwen · 2022-11-30T20:30:38Z

src/fcast.c

      for (int j=0; j<ncols; ++j) {
        SET_VECTOR_ELT(ans, nlhs+j+i*ncols, target=allocVector(thistype, nrows) );
        int *itarget = INTEGER(target);
        copyMostAttrib(thiscol, target);
        for (int k=0; k<nrows; ++k) {
          int thisidx = idx[k*ncols + j];
-          itarget[k] = (thisidx == NA_INTEGER) ? ithisfill[0] : ithiscol[thisidx-1];
+          itarget[k] = (thisidx == NA_INTEGER) ? INTEGER(thisfill)[0] : INTEGER(thiscol)[thisidx-1];


Can't we just use the same code for INTSXP, LGLSXP, REALSXP and CPLXSXP with coerceAs?

This is my first time hacking on this code, so I'm not sure, but I also had the feeling that it would be desirable to avoid the repeated logic in these switch cases. As for coerceAs, would it introduce unwanted overhead or a performance penalty? I was thinking of solving that with a C macro. In any case, I would suggest saving that for another PR.

The value added by coerceAs is that it handles attributes, and therefore classes like int64; not sure if that is relevant here.
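For reference, the macro idea mentioned above could look roughly like the following. This is a hedged sketch on plain C arrays rather than SEXPs: the macro name and the generated function names are hypothetical, and it deliberately ignores attributes, which is exactly the part coerceAs would handle.

```c
#include <limits.h>
#include <stddef.h>

/* R stores NA_integer_ as INT_MIN; redefined here so the sketch builds without R headers. */
#define SKETCH_NA_INTEGER INT_MIN

/* Generates one typed fill loop per element type, so the loop body is written once. */
#define SKETCH_DEFINE_FILL_COLUMN(NAME, CTYPE)                                 \
  static void NAME(CTYPE *target, const CTYPE *col, const CTYPE *fill,         \
                   const int *idx, int nrows, int ncols, int j) {              \
    for (int k = 0; k < nrows; ++k) {                                          \
      const int thisidx = idx[k * ncols + j];                                  \
      target[k] = (thisidx == SKETCH_NA_INTEGER) ? fill[0] : col[thisidx - 1]; \
    }                                                                          \
  }

/* One instantiation per type that would otherwise need its own switch case. */
SKETCH_DEFINE_FILL_COLUMN(sketch_fill_int, int)
SKETCH_DEFINE_FILL_COLUMN(sketch_fill_double, double)
```

Because each instantiation is an ordinary typed function, the inner loop stays free of per-element dispatch; whether that matters in practice compared with coerceAs would need measuring.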

shrektan

Thanks. It looks good to me.

MichaelChirico · 2024-03-01T19:06:23Z

From #4586, let's include this as a test case & make sure it doesn't warn:

test(xxxxxxx, dcast(data.table(a = 1, b = 2, c = 3), a ~ b, value.var = 'c', fill = '2'),
              data.table(a=1, `2`=3, key='a'))

tdhock · 2024-03-04T18:06:50Z

I added that test case alongside a similar one.

MichaelChirico · 2024-03-08T04:16:29Z

@tdhock is this PR ready for review? I'm wary of the conflict

MichaelChirico · 2024-03-12T00:53:05Z

Basically ready to go, some more minor feedback this round. Thanks!

MichaelChirico

thanks for your efforts!

tdhock · 2024-03-14T15:09:31Z

You are welcome! Thanks very much for your feedback, too!
However, we have one failing test. I've figured out which commit we need to revert (747c76c), but I'm not sure why. Any ideas?

(base) tdhock@tdhock-MacBook:~/R/data.table(fix5512)$ git checkout 747c76cd8a65e46294dd42edf69294a60f3b9e94 && R CMD INSTALL . && R -e "library(data.table);dcast(data.table(id=1, grp=1, e=expression(1)), id ~ grp, value.var='e')"
HEAD is now at 747c76cd dat_for_default_fill is zero-row dt
Loading required package: grDevices
* installing to library ‘/home/tdhock/lib/R/library’
* installing *source* package ‘data.table’ ...
** using staged installation
gcc 12.3.0
zlib 1.2.11 is available ok
R CMD SHLIB supports OpenMP without any extra hint
** libs
using C compiler: ‘gcc (GCC) 12.3.0’
gcc -shared -L/home/tdhock/lib/R/lib -L/usr/local/lib -o data.table.so assign.o between.o bmerge.o chmatch.o cj.o coalesce.o dogroups.o fastmean.o fcast.o fifelse.o fmelt.o forder.o frank.o fread.o freadR.o froll.o frollR.o frolladaptive.o fsort.o fwrite.o fwriteR.o gsumm.o idatetime.o ijoin.o init.o inrange.o nafill.o negate.o nqrecreateindices.o openmp-utils.o programming.o quickselect.o rbindlist.o reorder.o shift.o snprintf.o subset.o transpose.o types.o uniqlist.o utils.o vecseq.o wrappers.o -fopenmp -lz -L/home/tdhock/lib/R/lib -lR
PKG_CFLAGS = -fopenmp
PKG_LIBS = -fopenmp -lz
if [ "data.table.so" != "data_table.so" ]; then mv data.table.so data_table.so; fi
if [ "" != "Windows_NT" ] && [ `uname -s` = 'Darwin' ]; then install_name_tool -id data_table.so data_table.so; fi
installing to /home/tdhock/lib/R/library/00LOCK-data.table/00new/data.table/libs
** R
** inst
** byte-compile and prepare package for lazy loading
Loading required package: grDevices
** help
*** installing help indices
** building package indices
Loading required package: grDevices
** installing vignettes
** testing if installed package can be loaded from temporary location
Loading required package: grDevices
** checking absolute paths in shared objects and dynamic libraries
** testing if installed package can be loaded from final location
Loading required package: grDevices
** testing if installed package keeps a record of temporary installation path
* DONE (data.table)

R version 4.3.2 (2023-10-31) -- "Eye Holes"
Copyright (C) 2023 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

Loading required package: grDevices
> library(data.table);dcast(data.table(id=1, grp=1, e=expression(1)), id ~ grp, value.var='e')
Error in `[.data.table`(dat, 0) : 
  Internal error: column type 'expression' not supported by data.table subset. All known types are supported so please report as bug.
Calls: dcast -> dcast.data.table -> [ -> [.data.table
Execution halted
(base) tdhock@tdhock-MacBook:~/R/data.table((HEAD detached at 747c76cd))$ git checkout ee93c5fc7b7fde45b37ebf95d06ddf7f58af1c5e && R CMD INSTALL . && R -e "library(data.table);dcast(data.table(id=1, grp=1, e=expression(1)), id ~ grp, value.var='e')"
Previous HEAD position was 747c76cd dat_for_default_fill is zero-row dt
HEAD is now at ee93c5fc Update R/fcast.R
Loading required package: grDevices
* installing to library ‘/home/tdhock/lib/R/library’
* installing *source* package ‘data.table’ ...
** using staged installation
gcc 12.3.0
zlib 1.2.11 is available ok
R CMD SHLIB supports OpenMP without any extra hint
** libs
using C compiler: ‘gcc (GCC) 12.3.0’
gcc -shared -L/home/tdhock/lib/R/lib -L/usr/local/lib -o data.table.so assign.o between.o bmerge.o chmatch.o cj.o coalesce.o dogroups.o fastmean.o fcast.o fifelse.o fmelt.o forder.o frank.o fread.o freadR.o froll.o frollR.o frolladaptive.o fsort.o fwrite.o fwriteR.o gsumm.o idatetime.o ijoin.o init.o inrange.o nafill.o negate.o nqrecreateindices.o openmp-utils.o programming.o quickselect.o rbindlist.o reorder.o shift.o snprintf.o subset.o transpose.o types.o uniqlist.o utils.o vecseq.o wrappers.o -fopenmp -lz -L/home/tdhock/lib/R/lib -lR
PKG_CFLAGS = -fopenmp
PKG_LIBS = -fopenmp -lz
if [ "data.table.so" != "data_table.so" ]; then mv data.table.so data_table.so; fi
if [ "" != "Windows_NT" ] && [ `uname -s` = 'Darwin' ]; then install_name_tool -id data_table.so data_table.so; fi
installing to /home/tdhock/lib/R/library/00LOCK-data.table/00new/data.table/libs
** R
** inst
** byte-compile and prepare package for lazy loading
Loading required package: grDevices
** help
*** installing help indices
** building package indices
Loading required package: grDevices
** installing vignettes
** testing if installed package can be loaded from temporary location
Loading required package: grDevices
** checking absolute paths in shared objects and dynamic libraries
** testing if installed package can be loaded from final location
Loading required package: grDevices
** testing if installed package keeps a record of temporary installation path
* DONE (data.table)

R version 4.3.2 (2023-10-31) -- "Eye Holes"
Copyright (C) 2023 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

Loading required package: grDevices
> library(data.table);dcast(data.table(id=1, grp=1, e=expression(1)), id ~ grp, value.var='e')
Error in dcast.data.table(data.table(id = 1, grp = 1, e = expression(1)),  : 
  Unsupported column type in fcast val: 'expression'
Calls: dcast -> dcast.data.table
Execution halted

MichaelChirico · 2024-03-14T16:11:05Z

I think the codecov thing is spurious, merging.

jangorecki · 2024-03-14T18:23:27Z

R/fcast.R

    }
-    dat = dat[, eval(fun.call), by=c(varnames)]
+    dat = dat[, maybe_err(eval(fun.call)), by=c(varnames)]


I think this could affect the code path in [.
We check for eval in j to handle it specially; I'm not sure whether we check for a nested eval.
I would go with the new env argument, or alternatively modify fun.call by wrapping it in a maybe_err call.

Are you thinking of efficiency? Otherwise, passing tests means your concern is moot, right?

Possibly for efficiency, but just touching the edge like that raises some concerns.

jangorecki · 2024-03-14T18:23:48Z

R/fcast.R

+    some_fill = anyNA(idx)
+    fill.default = if (run_agg_funs && is.null(fill) && some_fill) dat_for_default_fill[, maybe_err(eval(fun.call))]
+    if (run_agg_funs && is.null(fill) && some_fill) {
+      fill.default = dat_for_default_fill[0L][, maybe_err(eval(fun.call))]


tdhock added 3 commits November 30, 2022 11:08

delete old commented code

2886c4f

new test for no warning fails

90f0647

only compute default fill if missing cells present

26745f4

any_NA_int helper

03dc91d

tdhock requested review from mattdowle and shrektan November 30, 2022 19:15

ben-schwen reviewed Nov 30, 2022

bugfix #5512

258befb

shrektan approved these changes Dec 3, 2022

mattdowle requested changes Dec 3, 2022

mattdowle reviewed Dec 3, 2022

tdhock and others added 6 commits December 4, 2022 09:32

Update src/fcast.c

360ba9d

Co-authored-by: Xianying Tan <shrektan@126.com>

Update src/fcast.c

75102bf

Co-authored-by: Xianying Tan <shrektan@126.com>

mention warning text

6225799

const int args

5055306

add back ithiscol

6a93cb1

get pointer before for loop

a40d969

tdhock added this to the 1.16.0 milestone Jan 5, 2024

tdhock added the reshape dcast melt label Jan 5, 2024

Merge branch 'master' into fix5512

2019a5c

MichaelChirico mentioned this pull request Feb 21, 2024

coerceAs applied to fill in dcast #4586

Merged

tdhock and others added 3 commits March 4, 2024 11:04

Merge branch 'master' into fix5512

c46cfaa

add test case from Michael

1a8ba9c

Merge branch 'fix5512' of https://github.com/Rdatatable/data.table in…

47d735e

…to fix5512

merge

7198d08

tdhock commented Mar 8, 2024