dcast only computes default fill if necessary #5549

Merged: MichaelChirico merged 32 commits into master from fix5512 on Mar 14, 2024

tdhock (Member) commented Nov 30, 2022

Closes #5512
Closes #5390

On current master, dcast(fill=NULL) always computes a default fill value, even when there are no missing cells. For example, this is the result of a new test case on current master:

> DT <- data.table(chr=c("a","b","b"), int=1:3)[, num := as.numeric(int)]
> dcast(DT, num ~ chr, min, value.var="int")
Key: <num>
     num     a     b
   <num> <int> <int>
1:     1     1    NA
2:     2    NA     2
3:     3    NA     3
Warning message:
In dcast.data.table(DT, num ~ chr, min, value.var = "int") :
  NAs introduced by coercion to integer range

In the code above, it is normal to compute the fill value (fill_value=as.integer(min(integer())), which is NA), because it is used three times, once for each NA cell in the result.

However, the code below gives the following result on current master. The warning indicates that a default fill value is computed even though it is never used:

> dcast(DT, . ~ chr, min, value.var="int")
Key: <.>
        .     a     b
   <char> <int> <int>
1:      .     1     2
Warning message:
In dcast.data.table(DT, . ~ chr, min, value.var = "int") :
  NAs introduced by coercion to integer range

Using this branch we get the output below (no warning), indicating that no default fill value was computed, because it is not necessary:

> dcast(DT, . ~ chr, min, value.var="int")
Key: <.>
        .     a     b
   <char> <int> <int>
1:      .     1     2
> 
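
The behaviour described above amounts to a simple pattern: check whether any cell is actually missing before evaluating the default fill. The standalone C sketch below illustrates that pattern only; it is not the code in src/fcast.c. The names NA_IDX, compute_default_fill, any_missing, and cast_column are hypothetical, and INT_MIN is assumed as a stand-in for R's NA_INTEGER.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumption: INT_MIN marks a missing index, mirroring R's NA_INTEGER. */
#define NA_IDX INT_MIN

/* Counts how often the (hypothetical) fill computation runs. */
static int fill_calls = 0;

/* Hypothetical stand-in for evaluating the aggregate on empty input. */
static int compute_default_fill(void) {
  ++fill_calls;
  return NA_IDX;
}

/* True if at least one cell of the cast result has no source row. */
static bool any_missing(const int *idx, size_t n) {
  for (size_t k = 0; k < n; ++k)
    if (idx[k] == NA_IDX) return true;
  return false;
}

/* Copy col[idx[k]-1] into target[k] (idx is 1-based); the fill value is
   computed only when some cell is missing. */
static void cast_column(const int *col, const int *idx, size_t n, int *target) {
  assert(col != NULL && idx != NULL && target != NULL);
  const bool some_fill = any_missing(idx, n);
  const int fill = some_fill ? compute_default_fill() : 0; /* 0 is never read */
  for (size_t k = 0; k < n; ++k)
    target[k] = (idx[k] == NA_IDX) ? fill : col[idx[k] - 1];
}
```

In this sketch, calling cast_column with an index that has no NA_IDX entries leaves fill_calls at zero, which corresponds to the warning-free output on this branch; an index with at least one NA_IDX entry triggers exactly one fill computation.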

codecov bot commented Nov 30, 2022

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 97.47%. Comparing base (15c127e) to head (4b96d35).
Report is 2 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5549      +/-   ##
==========================================
- Coverage   97.49%   97.47%   -0.02%     
==========================================
  Files          80       80              
  Lines       14861    14873      +12     
==========================================
+ Hits        14488    14498      +10     
- Misses        373      375       +2     


@tdhock tdhock requested review from mattdowle and shrektan November 30, 2022 19:15
src/fcast.c Outdated
   for (int j=0; j<ncols; ++j) {
     SET_VECTOR_ELT(ans, nlhs+j+i*ncols, target=allocVector(thistype, nrows) );
     int *itarget = INTEGER(target);
     copyMostAttrib(thiscol, target);
     for (int k=0; k<nrows; ++k) {
       int thisidx = idx[k*ncols + j];
-      itarget[k] = (thisidx == NA_INTEGER) ? ithisfill[0] : ithiscol[thisidx-1];
+      itarget[k] = (thisidx == NA_INTEGER) ? INTEGER(thisfill)[0] : INTEGER(thiscol)[thisidx-1];
Member

can't we just use the same code for INTSXP, LGLSXP, REALSXP and CPLXSXP with coerceAs?

Member Author

This is my first time hacking on this code, so I'm not sure, but I also had the feeling that it would be desirable to avoid the repeated logic in these switch cases. About usage of coerceAs, would that introduce unwanted overhead / performance penalty? I was thinking of solving that via a C macro. Anyway I would suggest saving that for another PR, though.

Member

Value added from coerceAs is handling attributes and therefore classes like int64, not sure if relevant here
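
As a rough illustration of the macro idea raised in this thread, the standalone C sketch below generates one type-specific fill loop per element type from a single definition. It is a sketch under stated assumptions, not data.table code: DEFINE_FILL_LOOP, fill_int, fill_double, and NA_IDX are hypothetical names, INT_MIN stands in for NA_INTEGER, and attribute handling (the part coerceAs would contribute) is not covered.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Assumption: INT_MIN marks a missing index, mirroring R's NA_INTEGER. */
#define NA_IDX INT_MIN

/* Generates fill_<NAME>(): target[k] takes the fill value when idx[k] is
   missing, otherwise col[idx[k]-1] (idx is 1-based). */
#define DEFINE_FILL_LOOP(NAME, CTYPE)                                   \
  static void fill_##NAME(const CTYPE *col, const int *idx, size_t n,   \
                          CTYPE fill, CTYPE *target) {                  \
    assert(col != NULL && idx != NULL && target != NULL);               \
    for (size_t k = 0; k < n; ++k)                                      \
      target[k] = (idx[k] == NA_IDX) ? fill : col[idx[k] - 1];          \
  }

/* One instantiation per element type replaces a hand-written switch case. */
DEFINE_FILL_LOOP(int, int)
DEFINE_FILL_LOOP(double, double)
```

The macro keeps each loop monomorphic, so there is no per-element type dispatch; whether that outweighs the readability cost compared with a coerceAs-based approach is the open question in the comments above.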

shrektan (Member) left a comment

Thanks. It looks good to me.

tdhock and others added 6 commits December 4, 2022 09:32
@tdhock tdhock added this to the 1.16.0 milestone Jan 5, 2024
@tdhock tdhock added the reshape dcast melt label Jan 5, 2024
MichaelChirico (Member) commented

From #4586, let's include this as a test case & make sure it doesn't warn:

test(xxxxxxx, dcast(data.table(a = 1, b = 2, c = 3), a ~ b, value.var = 'c', fill = '2'),
              data.table(a=1, `2`=3, key='a'))

tdhock (Member Author) commented Mar 4, 2024

I added that test case alongside a similar one.

MichaelChirico (Member) commented

@tdhock is this PR ready for review? I'm wary of the conflict

MichaelChirico (Member) commented

Basically ready to go, some more minor feedback this round. Thanks!

Co-authored-by: Michael Chirico <michaelchirico4@gmail.com>
MichaelChirico (Member) left a comment

thanks for your efforts!

tdhock (Member Author) commented Mar 14, 2024

you are welcome! thanks very much for your feedback, too!
However, we have one failing test, and I've figured out which commit we need to revert (747c76c), but I'm not sure why. Any ideas?

(base) tdhock@tdhock-MacBook:~/R/data.table(fix5512)$ git checkout 747c76cd8a65e46294dd42edf69294a60f3b9e94 && R CMD INSTALL . && R -e "library(data.table);dcast(data.table(id=1, grp=1, e=expression(1)), id ~ grp, value.var='e')"
HEAD is now at 747c76cd dat_for_default_fill is zero-row dt
Loading required package: grDevices
* installing to library ‘/home/tdhock/lib/R/library’
* installing *source* package ‘data.table’ ...
** using staged installation
gcc 12.3.0
zlib 1.2.11 is available ok
R CMD SHLIB supports OpenMP without any extra hint
** libs
using C compiler: ‘gcc (GCC) 12.3.0’
gcc -shared -L/home/tdhock/lib/R/lib -L/usr/local/lib -o data.table.so assign.o between.o bmerge.o chmatch.o cj.o coalesce.o dogroups.o fastmean.o fcast.o fifelse.o fmelt.o forder.o frank.o fread.o freadR.o froll.o frollR.o frolladaptive.o fsort.o fwrite.o fwriteR.o gsumm.o idatetime.o ijoin.o init.o inrange.o nafill.o negate.o nqrecreateindices.o openmp-utils.o programming.o quickselect.o rbindlist.o reorder.o shift.o snprintf.o subset.o transpose.o types.o uniqlist.o utils.o vecseq.o wrappers.o -fopenmp -lz -L/home/tdhock/lib/R/lib -lR
PKG_CFLAGS = -fopenmp
PKG_LIBS = -fopenmp -lz
if [ "data.table.so" != "data_table.so" ]; then mv data.table.so data_table.so; fi
if [ "" != "Windows_NT" ] && [ `uname -s` = 'Darwin' ]; then install_name_tool -id data_table.so data_table.so; fi
installing to /home/tdhock/lib/R/library/00LOCK-data.table/00new/data.table/libs
** R
** inst
** byte-compile and prepare package for lazy loading
Loading required package: grDevices
** help
*** installing help indices
** building package indices
Loading required package: grDevices
** installing vignettes
** testing if installed package can be loaded from temporary location
Loading required package: grDevices
** checking absolute paths in shared objects and dynamic libraries
** testing if installed package can be loaded from final location
Loading required package: grDevices
** testing if installed package keeps a record of temporary installation path
* DONE (data.table)

R version 4.3.2 (2023-10-31) -- "Eye Holes"
Copyright (C) 2023 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

Loading required package: grDevices
> library(data.table);dcast(data.table(id=1, grp=1, e=expression(1)), id ~ grp, value.var='e')
Error in `[.data.table`(dat, 0) : 
  Internal error: column type 'expression' not supported by data.table subset. All known types are supported so please report as bug.
Calls: dcast -> dcast.data.table -> [ -> [.data.table
Execution halted
(base) tdhock@tdhock-MacBook:~/R/data.table((HEAD detached at 747c76cd))$ git checkout ee93c5fc7b7fde45b37ebf95d06ddf7f58af1c5e && R CMD INSTALL . && R -e "library(data.table);dcast(data.table(id=1, grp=1, e=expression(1)), id ~ grp, value.var='e')"
Previous HEAD position was 747c76cd dat_for_default_fill is zero-row dt
HEAD is now at ee93c5fc Update R/fcast.R
Loading required package: grDevices
* installing to library ‘/home/tdhock/lib/R/library’
* installing *source* package ‘data.table’ ...
** using staged installation
gcc 12.3.0
zlib 1.2.11 is available ok
R CMD SHLIB supports OpenMP without any extra hint
** libs
using C compiler: ‘gcc (GCC) 12.3.0’
gcc -shared -L/home/tdhock/lib/R/lib -L/usr/local/lib -o data.table.so assign.o between.o bmerge.o chmatch.o cj.o coalesce.o dogroups.o fastmean.o fcast.o fifelse.o fmelt.o forder.o frank.o fread.o freadR.o froll.o frollR.o frolladaptive.o fsort.o fwrite.o fwriteR.o gsumm.o idatetime.o ijoin.o init.o inrange.o nafill.o negate.o nqrecreateindices.o openmp-utils.o programming.o quickselect.o rbindlist.o reorder.o shift.o snprintf.o subset.o transpose.o types.o uniqlist.o utils.o vecseq.o wrappers.o -fopenmp -lz -L/home/tdhock/lib/R/lib -lR
PKG_CFLAGS = -fopenmp
PKG_LIBS = -fopenmp -lz
if [ "data.table.so" != "data_table.so" ]; then mv data.table.so data_table.so; fi
if [ "" != "Windows_NT" ] && [ `uname -s` = 'Darwin' ]; then install_name_tool -id data_table.so data_table.so; fi
installing to /home/tdhock/lib/R/library/00LOCK-data.table/00new/data.table/libs
** R
** inst
** byte-compile and prepare package for lazy loading
Loading required package: grDevices
** help
*** installing help indices
** building package indices
Loading required package: grDevices
** installing vignettes
** testing if installed package can be loaded from temporary location
Loading required package: grDevices
** checking absolute paths in shared objects and dynamic libraries
** testing if installed package can be loaded from final location
Loading required package: grDevices
** testing if installed package keeps a record of temporary installation path
* DONE (data.table)

R version 4.3.2 (2023-10-31) -- "Eye Holes"
Copyright (C) 2023 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

Loading required package: grDevices
> library(data.table);dcast(data.table(id=1, grp=1, e=expression(1)), id ~ grp, value.var='e')
Error in dcast.data.table(data.table(id = 1, grp = 1, e = expression(1)),  : 
  Unsupported column type in fcast val: 'expression'
Calls: dcast -> dcast.data.table
Execution halted

MichaelChirico (Member) commented

I think the codecov thing is spurious, merging.

@MichaelChirico MichaelChirico merged commit f92aee6 into master Mar 14, 2024
2 of 3 checks passed
@MichaelChirico MichaelChirico deleted the fix5512 branch March 14, 2024 16:11
 }
-dat = dat[, eval(fun.call), by=c(varnames)]
+dat = dat[, maybe_err(eval(fun.call)), by=c(varnames)]
Member

I think this could possibly affect the code path in [.
We check for eval in j to handle that specially, not sure if we are checking for nested eval.
I would go with new env argument, or eventually modify fun.call by prefixing it with maybe_err call.

Member

Are you thinking for efficiency? Otherwise passing tests ensures your concern is moot right?

Member

possibly for efficiency but just touching the edge like that raises some concerns

some_fill = anyNA(idx)
fill.default = if (run_agg_funs && is.null(fill) && some_fill) dat_for_default_fill[, maybe_err(eval(fun.call))]
if (run_agg_funs && is.null(fill) && some_fill) {
fill.default = dat_for_default_fill[0L][, maybe_err(eval(fun.call))]
Member

same here

Labels
reshape dcast melt
6 participants