Big performance regression in subsetting in v0.11 #251

TomAndrews · 2018-07-18T17:10:28Z

Description

There seems to be a big performance regression when subsetting an xts object using a vector of dates. In the example below it goes from 41 milliseconds to 17 seconds to perform the subsetting.

Using `v0.10-2`:

library(xts)
library(microbenchmark)
ts <- .xts(1:1e5, 1:1e5)
dates <- .POSIXct(1:1e5)
microbenchmark(ts[dates])

 Unit: milliseconds
       expr      min       lq     mean   median       uq      max neval
  ts[dates] 40.54739 40.67655 41.37819 40.96016 41.41218 45.32007   100

Session Info

R version 3.5.1 (2018-07-02)
 Platform: x86_64-pc-linux-gnu (64-bit)
 Running under: Debian GNU/Linux 9 (stretch)

 Matrix products: default
 BLAS: /usr/lib/libblas/libblas.so.3.7.0
 LAPACK: /usr/lib/lapack/liblapack.so.3.7.0

 locale:
  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8
  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C
  [9] LC_ADDRESS=C               LC_TELEPHONE=C
 [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

 attached base packages:
 [1] stats     graphics  grDevices utils     datasets  methods   base

 other attached packages:
 [1] xts_0.10-2           zoo_1.8-3            microbenchmark_1.4-4

 loaded via a namespace (and not attached):
 [1] compiler_3.5.1  tools_3.5.1     grid_3.5.1      lattice_0.20-35

Using `v0.11`:

library(xts)
library(microbenchmark)
ts <- .xts(1:1e5, 1:1e5)
dates <- .POSIXct(1:1e5)
microbenchmark(ts[dates], times=1)  # fewer iterations, otherwise it takes too long

Unit: seconds
       expr      min       lq     mean   median       uq      max neval
  ts[dates] 17.16381 17.16381 17.16381 17.16381 17.16381 17.16381     1

Session Info

R version 3.5.1 (2018-07-02)
 Platform: x86_64-pc-linux-gnu (64-bit)
 Running under: Debian GNU/Linux 9 (stretch)

 Matrix products: default
 BLAS: /usr/lib/libblas/libblas.so.3.7.0
 LAPACK: /usr/lib/lapack/liblapack.so.3.7.0

 locale:
  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8
  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C
  [9] LC_ADDRESS=C               LC_TELEPHONE=C
 [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

 attached base packages:
 [1] stats     graphics  grDevices utils     datasets  methods   base

 other attached packages:
 [1] microbenchmark_1.4-4 xts_0.11-0           zoo_1.8-3

 loaded via a namespace (and not attached):
 [1] compiler_3.5.1  tools_3.5.1     grid_3.5.1      lattice_0.20-35

Obviously this is a bit of a stupid example since I'm just fetching all the dates in ts. In my actual problem, dates is arbitrary but large. Should I be re-writing the subsetting in a different way?

The text was updated successfully, but these errors were encountered:

TomAndrews · 2018-07-18T17:27:17Z

Thinking about it some more, it looks like I should be using merge.xts:

library(xts)
library(microbenchmark)
ts <- .xts(1:1e5, 1:1e5)
dates <- .POSIXct(1:1e5)
microbenchmark(ts[dates], merge.xts(ts, xts(, dates), join="right", retside=c(TRUE,FALSE)), times=1)  # fewer iterations, otherwise it takes too long

 Unit: milliseconds
                                                                        expr
                                                                   ts[dates]
  merge.xts(ts, xts(, dates), join = "right", retside = c(TRUE,      FALSE))
           min           lq         mean       median           uq          max
  18403.239120 18403.239120 18403.239120 18403.239120 18403.239120 18403.239120
      2.835482     2.835482     2.835482     2.835482     2.835482     2.835482

joshuaulrich · 2018-07-19T22:24:10Z

A quick profiling session indicates that the bottleneck is in window_idx(). Specifically, the call to c() in the for and repeat loop inside the block below. @corwinjoy, do you remember if you wrote test cases for the situation handled in this code block? I quickly looked at the tests in runit.xts.methods.R, but didn't see anything that mentioned testing duplicate index values.

  if(usr_idx && !is.null(firstlast)) {
    # Translate from user .index to xts index
    # We get back upper bound of index as per findInterval
    tmp <- base_idx[firstlast]
    
    # Iterate in reverse to grab all matches
    # We have to do this to handle duplicate dates in the xts index.
    tmp <- rev(tmp)
    res <- NULL
    for(i in tmp) {
      dt <- idx[i]
      j <- i
      repeat {
        res <- c(res, j)
        j <- j -1
        if(j < 1 || idx[j] != dt) break
      }
    }
    firstlast <- rev(res)
  }

I'd like a solution that is more performant with duplicate values, but a quick fix for the non-duplicate case would be to add an anyDuplicated() check.

corwinjoy · 2018-07-20T05:10:01Z

Let's see:
I wrote a test for duplicated dates in the xts series. This is on line 190 of runit.xts.methods.R:

# Test index parameter with repeated dates in xts series
  idx <- sort(rep(1:5, 5))
  x <- xts(1:length(idx), as.Date("1999-12-31")+idx)
  bin <- window(x, index = as.Date("1999-12-31")+c(1,3,5))

So here bin gives the duplicated dates in the xts series as expected.
Looking further I don't see a test for what happens when the index is duplicated.
So, the test might be something like:
bin <- window(x, index = as.Date("1999-12-31")+c(1,1,2))
We should add an explicit test for validation.

Now we did have a discussion about this because there are two aspects:

Are duplicate values allowed, we said yes in keeping with what R does.
But, the returned values are sorted by date, unlike normal subsetting.
(See Add support for window function #100 (comment))

Now, back to the performance of this routine. I agree it is slow. It is trying to do a fast subset of the dates, but making it work with duplicated dates in the xts index slows everything down.
Options:

If we don't have duplicated dates the above routine can be fast. It just needs to return base_idx[firstlast]. No iteration needed. So maybe a flag indicating if the xts data has duplicated dates?
I tried to solve this using just R functions and findInterval(). At the time I thought it was cool, but really it is probably too complicated. Maybe there is a better way to do this in R?
A better approach is probably just to do this in C. It is not quite a standard set intersection algorithm because we can have duplicated dates in the index data. But it is quite close, just choose all the data from the xts series where we match. Here is the standard linear time algorithm (https://www.geeksforgeeks.org/union-and-intersection-of-two-sorted-arrays-2/)
I don't think we need the O(sqrt(n)) time algorithm like Baeza-Yates unless folks are doing just huge date sets.

Looking further, you could say that 3) is already implemented by merge.xts(). However, despite the design discussion, it does not correctly handle joins with duplicates. In a normal inner join you can have duplicates and this is OK. (https://stackoverflow.com/questions/40218138/will-inner-join-allow-duplicates). But when I look at merge.xts I see:

x <- xts(4:10, Sys.Date()+c(4,4,4, 7:10))
y <- xts(1:6, Sys.Date()+1:6)
merge(x,y, join='inner')
#            x y
# 2018-07-23 4 4

Instead, the inner join should return 3 records matching the values in x. (Here is a more correct version of a sort-merge-join that allows for duplicate values: http://www.dcs.ed.ac.uk/home/tz/phd/thesis/node20.htm).
So, probably the best solution here is to fix the logic in merge.xts() and then just call it to do an inner join once we have found the range bounded by start and end date.

TomAndrews · 2018-07-20T10:50:21Z

I'm not at all familiar with the working of xts subsetting, so I may have misunderstood what you are discussing in which case I apologise.

I just wanted to point out that in the example above there are no duplicates in the index of ts nor in dates. It seems to be the sheer number of entries in dates that is causing the slowdown.

joshuaulrich · 2018-07-21T14:26:09Z

Thanks for your detailed comments, @corwinjoy!

It looks like window.xts() currently behaves different from window.zoo() when index. contains duplicate values. window.zoo() seems to ignore the duplicates, while window.xts() returns all the matching .index() values multiple times. For example:

xidx <- as.Date("1999-12-31") + rep(1:3, each = 2)
# duplicate values in xts index, and duplicate values in window index.
x <- xts(seq_along(xidx), xidx)
widx <- as.Date("1999-12-31") + c(1, 1)
window(x, widx)
#            [,1]
# 2000-01-01    1
# 2000-01-01    2
# 2000-01-01    1
# 2000-01-01    2
zoo:::window.zoo(x, widx)
#            [,1]
# 2000-01-01    1
# 2000-01-01    2

I lean more toward consistency with zoo, but I do realize that this is a bit of an edge case.

I agree with your assessment of the result of inner joins performed by merge.xts(). I created issue #106 to document it, but have not attempted a fix. I plan to refactor the merge C code and address the issue afterward.

@TomAndrews the slowdown is caused by re-allocating the res vector inside the for and repeat loops. You don't need to worry about the details of xts subsetting. The benchmark tests you provided are helpful by themselves.

corwinjoy · 2018-07-23T19:30:52Z

Sounds good Josh. Hopefully you can just refactor the merge routine like what you did for binsearch. That is, just create an == operation that works over multiple types. For subsetting, it is not too clear what the behavior should be. Instead of looking at zoo, I modelled standard R behavior. E.g. x<- 1:5 x[c(1,1)] Gives 1, 1 This is in line with what standard inner join does. Either way, the behavior should be documented. Best, Corwin

…

On Sat, Jul 21, 2018, 7:26 AM Joshua Ulrich ***@***.***> wrote: Thanks for your detailed comments, @corwinjoy <https://github.com/corwinjoy>! It looks like window.xts() currently behaves different from window.zoo() when index. contains duplicate values. window.zoo() seems to ignore the duplicates, while window.xts() returns all the matching .index() values multiple times. For example: xidx <- as.Date("1999-12-31") + rep(1:3, each = 2)# duplicate values in xts index, and duplicate values in window index.x <- xts(seq_along(xidx), xidx)widx <- as.Date("1999-12-31") + c(1, 1) window(x, widx)# [,1]# 2000-01-01 1# 2000-01-01 2# 2000-01-01 1# 2000-01-01 2zoo:::window.zoo(x, widx)# [,1]# 2000-01-01 1# 2000-01-01 2 I lean more toward consistency with zoo, but I do realize that this is a bit of an edge case. I agree with your assessment of the result of inner joins performed by merge.xts(). I created issue #106 <#106> to document it, but have not attempted a fix. I plan to refactor the merge C code and address the issue afterward. @TomAndrews <https://github.com/TomAndrews> the slowdown is caused by re-allocating the res vector inside the for and repeat loops. You don't need to worry about the details of xts subsetting. The benchmark tests you provided are helpful by themselves. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#251 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AMJeauIVkcicbchV9_EJW1kgc0Vn_BK_ks5uIzoCgaJpZM4VU-p2> .

This basically moves the R code into C, with minor changes. The C code pre-(and over-)allocates the result and loops over the locations in reverse order. Corwin suggested better long-term solutions (e.g. fix merge.xts() to handle duplicate index values correctly), but this this commit addresses the immediate performance regression. Also add more benchmarks. See #251.

joshuaulrich · 2018-07-30T18:28:24Z

@TomAndrews, thanks again for the report! My timings suggest this solution is 2-3x faster than 0.10-2. Hopefully you see similar results.

Moving the duplicate index value handling to C created an error when the index type was integer. This was not caught in the unit tests. Add a loop around the window/subset unit tests that run them on both index types (double and integer). Add support for both index types to the fill_window_dups_rev() C function. Change the scalar '1' to '1L' in the pmax() call after findInterval() to ensure that 'base_index' is not coerced to double from integer. See #251.

joshuaulrich self-assigned this Jul 30, 2018

joshuaulrich added the bug label Jul 30, 2018

joshuaulrich closed this as completed in 61975ad Jul 30, 2018

joshuaulrich added this to the Release 0.11-1 milestone Jul 30, 2018

joshuaulrich mentioned this issue Oct 11, 2018

Wrong results when date-indexed xts with duplicate index values is subset by its own index #275

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Big performance regression in subsetting in v0.11 #251

Big performance regression in subsetting in v0.11 #251

TomAndrews commented Jul 18, 2018

TomAndrews commented Jul 18, 2018

joshuaulrich commented Jul 19, 2018 •

edited

Loading

corwinjoy commented Jul 20, 2018 •

edited by joshuaulrich

Loading

TomAndrews commented Jul 20, 2018

joshuaulrich commented Jul 21, 2018

corwinjoy commented Jul 23, 2018 via email

joshuaulrich commented Jul 30, 2018

Big performance regression in subsetting in v0.11 #251

Big performance regression in subsetting in v0.11 #251

Comments

TomAndrews commented Jul 18, 2018

Description

Using v0.10-2:

Session Info

Using v0.11:

Session Info

TomAndrews commented Jul 18, 2018

joshuaulrich commented Jul 19, 2018 • edited Loading

corwinjoy commented Jul 20, 2018 • edited by joshuaulrich Loading

TomAndrews commented Jul 20, 2018

joshuaulrich commented Jul 21, 2018

corwinjoy commented Jul 23, 2018 via email

joshuaulrich commented Jul 30, 2018

Using `v0.10-2`:

Using `v0.11`:

joshuaulrich commented Jul 19, 2018 •

edited

Loading

corwinjoy commented Jul 20, 2018 •

edited by joshuaulrich

Loading