na.fill long call times #259

ckatsulis · 2018-07-25T16:59:39Z

na.fill() takes much longer than other methods to fill NA values. I expected near constant time to find and fill NA values.

Reproducible timings are below. I will send testData via email independently as it is too large for this forum.

testData[is.na(testData)] = 0
Unit: milliseconds
                     expr      min       lq     mean   median       uq      max neval
testData[is.na(testData)] 26.52425 35.31587 37.19453 37.23217 38.80555 46.61541    10

> microbenchmark::microbenchmark(na.fill(testData,0),times = 1)
Unit: seconds
                 expr      min       lq     mean   median       uq      max neval
 na.fill(testData, 0) 150.9054 150.9054 150.9054 150.9054 150.9054 150.9054     1

Unit: seconds
                           expr      min       lq     mean   median       uq      max neval
 na.fill(coredata(testData), 0) 52.90975 52.90975 52.90975 52.90975 52.90975 52.90975     1

Session Info

R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.2 LTS

Matrix products: default
BLAS: /usr/lib/openblas-base/libblas.so.3
LAPACK: /usr/lib/libopenblasp-r0.2.18.so

locale:
 [1] LC_CTYPE=en_US.UTF-8          LC_NUMERIC=C                  LC_TIME=en_US.UTF-8           LC_COLLATE=en_US.UTF-8        LC_MONETARY=en_US.UTF-8       LC_MESSAGES=en_US.UTF-8       LC_PAPER=en_US.UTF-8         
 [8] LC_NAME=en_US.UTF-8           LC_ADDRESS=en_US.UTF-8        LC_TELEPHONE=en_US.UTF-8      LC_MEASUREMENT=en_US.UTF-8    LC_IDENTIFICATION=en_US.UTF-8

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] roll_1.0.7                 PerformanceAnalytics_1.5.2 xts_0.10-2.2               zoo_1.8-2                  TTR_0.23-3                 dygraphs_1.1.1.5          

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.17         magrittr_1.5         xtable_1.8-2         lattice_0.20-35      R6_2.2.2             quadprog_1.5-5       xlsx_0.6.1           tools_3.5.1          grid_3.5.1           data.table_1.11.4    qmao_1.6.11         
[12] R.oo_1.22.0          htmltools_0.3.6      yaml_2.1.19          RcppParallel_4.4.0   digest_0.6.15        RJSONIO_1.3-0        shiny_1.0.5          rJava_0.9-10         later_0.7.3          microbenchmark_1.4-4 promises_1.0.1      
[23] bitops_1.0-6         htmlwidgets_0.9      R.utils_2.6.0        xlsxjars_0.6.1       RCurl_1.95-4.10      curl_3.2             mime_0.5             pander_0.6.2         compiler_3.5.1       R.methodsS3_1.7.1    jsonlite_1.5        
[34] httpuv_1.4.4.2

The text was updated successfully, but these errors were encountered:

joshuaulrich · 2018-07-25T18:43:33Z

Note that there is no na.fill.xts() method, so zoo:::na.fill.zoo() is dispatched. We can probably add an xts method, and only dispatch to the zoo method if fill is not a scalar.

ckatsulis · 2018-07-25T18:45:58Z

That would be consistent, just curious why the zoo method is so much slower than simply specifying the condition in brackets.

…

On Wed, Jul 25, 2018 at 1:43 PM, Joshua Ulrich ***@***.***> wrote: Note that there is no na.fill.xts() method, so zoo:::na.fill.zoo() is dispatched. We can probably add an xts method, and only dispatch to the zoo method if fill is not a scalar. — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#259 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AHKwqnDtyu2BaFf5gx1klzac3DjMVSF3ks5uKLxXgaJpZM4Vgcoi> .

joshuaulrich · 2018-08-11T12:30:10Z

@ckatsulis it looks like a lot of the time is spent in creating symbol names from the ... arguments passed to merge.xts(). This is related to #248.

The column names of a merge() result can be monstrous for raw, unnamed numeric objects. Something similar to dput() for the entire column is used, but zoo deparses the dput() output and limits the deparse length to ~60 characters. Use a modified version of makeNames() (defined in merge.zoo()) to make the column names. Major differences are: - Use substitute(alist(...)) to avoid evaluating '...' - Use deparse(x, nlines = 1L) to deparse only the first expression NOTE: this may break existing code that relies on behavior for integer objects. For example, current behavior is: x <- .xts(1:5, 1:5) lx <- list(x, x) do.call(merge, lx) X1.5 X1.5.1 1969-12-31 18:00:01 1 1 1969-12-31 18:00:02 2 2 1969-12-31 18:00:03 3 3 1969-12-31 18:00:04 4 4 1969-12-31 18:00:05 5 5 New behavior has column names: structure.1.5...Dim...c.5L..1L...index...structure.1.5..tclass...c..POSIXct... structure.1.5...Dim...c.5L..1L...index...structure.1.5..tclass...c.. POSIXct....1 Fixes #248. See #259.

joshuaulrich · 2018-11-03T11:58:51Z

Hi @ckatsulis, the fix for #248 improves performance for this case fairly dramatically. Though it's still not nearly as fast as x[is.na(x)] <- 0. Is the new performance sufficient to close this issue?

x <- .xts(1:1e7, 1:1e7)
is.na(x) <- sample(1e7, 1e6) 
microbenchmark::microbenchmark(na.fill(x, 0), times = 1) 
# Unit: seconds
#           expr      min       lq     mean   median       uq      max neval
#  na.fill(x, 0) 1.944266 1.944266 1.944266 1.944266 1.944266 1.944266     1
y <- coredata(x)
microbenchmark::microbenchmark(na.fill(y, 0), times = 1) 
# Unit: seconds
#           expr      min       lq     mean   median       uq      max neval
#  na.fill(y, 0) 6.566733 6.566733 6.566733 6.566733 6.566733 6.566733     1
microbenchmark::microbenchmark(X[is.na(X)] <- 0, times = 1, setup = X <- x)
# Unit: milliseconds
#              expr      min       lq     mean   median       uq      max neval
#  X[is.na(X)] <- 0 176.9659 176.9659 176.9659 176.9659 176.9659 176.9659     1

harvey131 · 2018-12-12T17:29:47Z

Performance using Rcpp ifelse()

library(xts)
Rcpp::cppFunction("
NumericMatrix na_fill(const NumericMatrix x, const double fill) {
  NumericMatrix r = clone(x);
  for (int j=0; j < r.ncol(); j++) {
    r(_,j) = ifelse( is_na(r(_,j)), fill, r(_,j) );
  }
  return r;
}")
f1 = function(x, fill) {
  x[is.na(x)] <- fill
  x
}
x <- .xts(1:1e7, 1:1e7)
is.na(x) <- sample(1e7, 1e6) 
microbenchmark::microbenchmark(na.fill(x, 0), 
                               na_fill(x, 0), 
                               f1(x, 0),
                               times=100)
Unit: milliseconds
          expr        min         lq       mean     median        uq       max neval
 na.fill(x, 0) 1082.35488 1146.39916 1178.56886 1175.79914 1203.5146 1309.6379   100
 na_fill(x, 0)   71.08533   76.49017   97.88448   89.81974  110.8743  166.9555   100
      f1(x, 0)   69.06708   78.01604   98.97025   93.17469  105.8702  177.6743   100

ckatsulis · 2018-12-12T19:18:53Z

Two orders of magnitude better is more than enough for me. Is na_fill exported in the new xts?

…

On Wed, Dec 12, 2018 at 11:29 AM harvey131 ***@***.***> wrote: Performance using Rcpp ifelse() library(xts) Rcpp::cppFunction(" NumericMatrix na_fill(const NumericMatrix x, const double fill) { NumericMatrix r = clone(x); for (int j=0; j < r.ncol(); j++) { r(_,j) = ifelse( is_na(r(_,j)), fill, r(_,j) ); } return r; }") f1 = function(x, fill) { x[is.na(x)] <- fill x } x <- .xts(1:1e7, 1:1e7)is.na(x) <- sample(1e7, 1e6) microbenchmark::microbenchmark(na.fill(x, 0), na_fill(x, 0), f1(x, 0), times=100) Unit: milliseconds expr min lq mean median uq max neval na.fill(x, 0) 1082.35488 1146.39916 1178.56886 1175.79914 1203.5146 1309.6379 100 na_fill(x, 0) 71.08533 76.49017 97.88448 89.81974 110.8743 166.9555 100 f1(x, 0) 69.06708 78.01604 98.97025 93.17469 105.8702 177.6743 100 — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#259 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AHKwqp_LxDNh7N6dHScaIaDp8N7vtM8-ks5u4T0MgaJpZM4Vgcoi> .

joshuaulrich · 2018-12-13T13:23:48Z

@ckatsulis the code @harvey131 posted isn't in xts, let alone exported. I'm not likely to add a dependency on compiled code unless there's significant performance improvement over a native R solution. In this case, the Rcpp code is roughly the same as x[is.na(x)] <- 0.

I was also hoping to make the change upstream in zoo, but time is a constraint, as always.

ckatsulis · 2018-12-13T13:56:48Z

Agreed. No need if the existing call is the same speed for all intents and purposes. Chris Sent from my iPhone

…

On Dec 13, 2018, at 07:23, Joshua Ulrich ***@***.***> wrote: @ckatsulis the code @harvey131 posted isn't in xts, let alone exported. I'm not likely to add a dependency on compiled code unless there's significant performance improvement over a native R solution. In this case, the Rcpp code is roughly the same as x[is.na(x)] <- 0. I was also hoping to make the change upstream in zoo, but time is a constraint, as always. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

joshuaulrich · 2022-10-12T22:14:15Z

Idea: zoo::na.fill0() is significantly faster in this case. Consider updating zoo:::na.fill.zoo() to call na.fill0() in this type of special case, or create a na.fill.xts() method that does it. I'd prefer to keep it all in zoo, so it's in one place.

joshuaulrich added the feature request New features label Jul 25, 2018

joshuaulrich added this to the 0.12.3 milestone Oct 12, 2022

joshuaulrich closed this as completed in fb91144 Nov 2, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

na.fill long call times #259

na.fill long call times #259

ckatsulis commented Jul 25, 2018 •

edited by joshuaulrich

Loading

joshuaulrich commented Jul 25, 2018

ckatsulis commented Jul 25, 2018 via email

joshuaulrich commented Aug 11, 2018 •

edited

Loading

joshuaulrich commented Nov 3, 2018

harvey131 commented Dec 12, 2018

ckatsulis commented Dec 12, 2018 via email

joshuaulrich commented Dec 13, 2018

ckatsulis commented Dec 13, 2018 via email

joshuaulrich commented Oct 12, 2022

na.fill long call times #259

na.fill long call times #259

Comments

ckatsulis commented Jul 25, 2018 • edited by joshuaulrich Loading

Session Info

joshuaulrich commented Jul 25, 2018

ckatsulis commented Jul 25, 2018 via email

joshuaulrich commented Aug 11, 2018 • edited Loading

joshuaulrich commented Nov 3, 2018

harvey131 commented Dec 12, 2018

ckatsulis commented Dec 12, 2018 via email

joshuaulrich commented Dec 13, 2018

ckatsulis commented Dec 13, 2018 via email

joshuaulrich commented Oct 12, 2022

ckatsulis commented Jul 25, 2018 •

edited by joshuaulrich

Loading

joshuaulrich commented Aug 11, 2018 •

edited

Loading