Add support for window function #100

corwinjoy · 2015-06-03T22:44:06Z

I propose that a window function be added to xts so that users can have binary lookup over a range of dates.
This function is the same as in the base package zoo, that is window.zoo, however we use binary search over the start and end date.

I propose that a window function be added to xts so that users can have binary lookup over a range of dates. This function is the same as in the base package zoo, that is window.zoo, however we use binary search over the start and end date.

Also, window_bin probably can't be easily reused in .subset.xts - too much overhead relative to a two line binary search. If I wasn't lazy - it could actually be done as follows: 1. Have window_bin take a .index parameter like window.zoo and return a set of index numbers (possibly NULL for no match). Then .subset.xts could re-use. Even better, subset.xts should be rewritten to do binary search when it is passed a set of dates. How? Actually, R has a nice built-in function for this findInterval. This uses binary search, but even better, it starts at the last matched value on subsequent searches. This ends up being O(n) on the number of dates to lookup. Much faster than O(n*N) with the current match logic.

We need to make sure we are accessing the internal POSIXct index, so switch the call to .index. The index function (potentially) converts to an end-user index.

joshuaulrich · 2015-06-05T19:06:31Z

Thank you very much for taking the initiative on this! A couple things I would like to see done (either by you or me; let me know your preference):

Import stats::window, rename window_bin to window.xts and register it as a S3 method
Remove all the test code from R/xts.methods.R
- xts does have RUnit-based tests, but they've not been run in ages. I would like to to get them running again.
Write man/window.xts.Rd
Add index. argument to match window.zoo definition (since xts extends zoo)
Possibly be more defensive about potential values for start and end
- I'm thinking about situations where start or end are already POSIXct, but have a different timezone. Or if they're some other time-based class (e.g. timeDate).

corwinjoy · 2015-06-05T21:46:32Z

@joshuaulrich wrote a number of nice comments as reproduced below. Basically I'm open to all of these ideas but would like to iron out specifics. Here are my thoughts:

Import stats::window, rename window_bin to window.xts and register it as a S3 method.
- The renaming makes sense. In terms of registering an S3 method, I'm not sure what is needed. zoo already defines window as a generic - does window.xts need to be added to the package export list?
Remove all the test code from R/xts.methods.R
xts does have RUnit-based tests, but they've not been run in ages. I would like to to get them running again.
- Will do - let me get closer to the final form before I rip this out since I want to be careful and not make a mistake here. I do like mini-tests to help make sure the obvious cases are covered.
Write man/window.xts.Rd
- Can I pass the buck here? I think this is mostly a case of copying the zoo docs.
Add index. argument to match window.zoo definition (since xts extends zoo)
- Maybe? I'd like some feedback on how this would work since I don't have a deep understanding of zoo and xts. What does it mean if you pass in an index that is a different time index than what is used in xts? Does it have to follow the rules for the index in the xts constructor. (Unique, ordered, time-based class). If we get away from this life gets hard (duplicates should return duplicates?), (non-ordered means no binary search / or do we sort?) I suppose xts has conversion functions to the native index. For a first pass I left it out since I was afraid of getting this wrong.
Possibly be more defensive about potential values for start and end
I'm thinking about situations where start or end are already POSIXct, but have a different timezone. Or if they're some other time-based class (e.g. timeDate).
- This is worth checking carefully. If we were in C the function could take only POSIXct dates and all would be well. R users are a bit spoiled. I know that I often like to use window with a pair of string dates to inspect a series. window.zoo is pretty loose and says only that the index must be comparable to the start and end dates (which may be character if the index support character comparisons). I went for a standard more in line with the subset operator and said the dates needed to be covertible to POSIXct.

The .subset.xts function has the following code:

if (timeBased(i)) {
      if(inherits(i, "POSIXct")) {
        i <- which(!is.na(match(.index(x), i)))
      } else if(inherits(i, "Date")) {
        i <- which(!is.na(match(.index(x), as.POSIXct(as.character(i),tz=indexTZ(x)))))
      } else {
        # force all other time classes to be POSIXct
        i <- which(!is.na(match(.index(x), as.POSIXct(i,tz=indexTZ(x)))))
      }
      i[is.na(i)] <- 0
    }

Instead, I simply ask for a conversion to POSIXct and fail if there is not an unambiguous conversion.
start <- as.POSIXct(start, tz=indexTZ(x)) # If start is already POSIXct this does nothing.

I think this is cleaner, but let's look at the cases. First, imagine they pass in a POSIXct date with a different time zone. I think this is OK because the internal representation will be the same. For example:

x <- as.POSIXct("2015-06-05", tz = "GMT")
y <- x
attr(y, "tzone") = ""

# The timezone changes the pretty printing
> x
[1] "2015-06-05 GMT"

# My default time zone is PDT
> y
[1] "2015-06-04 17:00:00 PDT"

# The time zone attribute does not change the underlying representation because this is always in UTC.
# From the docs for POSIXct:
# Class "POSIXct" represents the (signed) number of seconds since the beginning of 1970 (in the UTC timezone) as a numeric vector. 

> unclass(x)
[1] 1433462400
attr(,"tzone")
[1] "GMT"

> unclass(y)
[1] 1433462400
attr(,"tzone")
[1] ""

Conversion from a Date and conversion from other time based classes tries to trust the conversion operator. Here I think

start <- as.POSIXct(start, tz=indexTZ(x))

is safer than

 else if(inherits(i, "Date")) {
        i <- which(!is.na(match(.index(x), as.POSIXct(as.character(i),tz=indexTZ(x)))))
      }

What if as.character(i) converts the date to one time zone (the default) then as.POSIXct reads this as another? Then I think we get the wrong date.

Really this time zone, indexTZ, may only be used if start is a string.

Finally, another option would be to emulate window.zoo and call the .binsearch function. Looking at the code this compares index > key, index < key and index != key. This would be a lower standard, but I don't like it. These comparisons may be slow and if we can't get an unambiguous conversion to our index type I think we should make the user clarify.

window.xts. Make .index efficient by using findInterval. 2. Get unit tests running again. Had to disable the reclass tests for now because the attributes don't match in checkIdentical. Disabled with the prefix no_test. 3. Fixed class order for POSIXct in timeBasedSeq.R so the result matches POSIXct (and passes unit tests). 4. Added window.xts to the namespace.

when the xts series has repeated dates in series. 2. Add support for dates of the form I(character dates). The default conversion did not work. 3. Get a clean R CMD check result. To do this I had to fix the line endings for the source files (via dos2unix) and makefiles. Check also complained about missing Author and Maintainer fields in the DESCRIPTION file. 4. Add documentation for window.xts

corwinjoy · 2015-06-11T18:25:47Z

Josh,
As requested I've gone ahead and made these changes. I think the new
window function is ready for you to review. Here is what I have done:

Add support for the .index parameter and make subset use window for sets
of dates.
Add unit tests for window.xts and subset.xts. Rename the old
runit.xts.methods.R to runit.bind.R.
Get all the unit tests working again. For the moment, I had to disable
the reclass tests since they seem to be broken.
Get R CMD check to run cleanly. This involved fixing line endings to
be Unix style and adding a couple missing fields to the DESCRIPTION file.
Write man/window.xts.Rd.

I think the window code is pretty solid now and ready for you to review
when you get a chance. Also, in the process of working on this I noticed a
feature of subset that surprised me a bit. That is, returned subsets are
always forced to be in ascending order, not the order given by the user.

> x <- xts(1:1000, as.Date("2000-01-01")+1:1000)

> x[c(3,2,1),]  # Result is sorted
           [,1]
2000-01-02    1
2000-01-03    2
2000-01-04    3

> x[c(1,1,2,3),]  # Duplicates are allowed
           [,1]
2000-01-02    1
2000-01-02    1
2000-01-03    2
2000-01-04    3

Now, you are probably thinking "Of course, you idiot, the returned subset
must be an xts series and therefore the dates must be sorted.". That's
fair, but this was still a bit of a surprise to me since it differs from
normal subsetting. Anyway, I might not be the only idiot out there and I'm
wondering if we should add something explicit to the subset documentation
about this.

Thanks again. I ended up touching a bit more code than I planned on for
this feature, but I wanted a clean build and I think the changes make sense.

Corwin

Rewrite binary search routine to make the code clearer and more maintainable.

joshuaulrich · 2018-04-20T15:16:57Z

Ugh, I think I closed this because I was working on merging it... then got distracted and completely lost track of it. Reopening because these are useful contributions.

corwinjoy · 2018-04-20T16:12:39Z

Awesome! Hopefully some of these ideas can make it into the project, I certainly found them useful in some of the code I have. At any rate, I think the idea of generalizing the window function and cleaning up the binary search function that was in there is the right direction.

…

On Fri, Apr 20, 2018 at 8:16 AM, Joshua Ulrich ***@***.***> wrote: Reopened #100 <#100>. — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#100 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AMJeasl9C93RfctIBAOFMhqJPBYzLhUMks5tqfvqgaJpZM4E3OE1> .

Lines 62-74, 94-104, and 114-116 had no test coverage, and even basic coverage is better than nothing when refactoring. These tests pass on the current code, so they should indicate any serious errors with Corwin's binary search/subset refactor. See #100.

There were conflicts in basically every file, largely due to the time between the last branch commit and the merge. The branch also contained LF (Windows) line endings, plus the archival of the 'its' package, and the xts PR that ensured all headers had "GPL 2" contributed to number of conflicts. I also reverted all the changes to unit tests that were made to stop them from running. The unit tests were not being run at all when Corwin started his patch, but are running successfully now. Other than what's mentioned above, I merged the branch as-is, so this merge commit will be followed by some cleanup commits. Fixes #100.

joshuaulrich · 2018-04-23T17:03:04Z

Bummer... looks like there's performance regression in at least one subsetting case. It looks like the bulk of it is from not having an integer binary search branch (i.e. the as.double() calls).

I will investigate and patch, unless anyone else wants to get to it quickly.

I would also appreciate any contributed benchmark cases (not just for subsetting)!

> cat xts/inst/benchmarks/benchmark.subset.R
x <- xts::.xts(1:2e7, 1.0*seq(86400*365*20, 86400*365*40, length.out = 2e7))
# warmup
for(i in 1:2) {
  x["1990/2000"]
}
# profile
Rprof()
for(i in 1:25) {
  x["1990/2000"]
}
Rprof(NULL)
(srp <- summaryRprof())

> # Using current master (1716103af1d3c7ede69e6abda986daf2bc74e90c)
> Rscript xts/inst/benchmarks/benchmark.subset.R
$by.self
           self.time self.pct total.time total.pct
".Call"         1.86    47.69       1.86     47.69
"c"             1.68    43.08       1.68     43.08
"seq.int"       0.34     8.72       0.34      8.72
"strptime"      0.02     0.51       0.02      0.51

$by.total
                total.time total.pct self.time self.pct
"["                   3.90    100.00      0.00     0.00
"[.xts"               3.90    100.00      0.00     0.00
".Call"               1.86     47.69      1.86    47.69
"c"                   1.68     43.08      1.68    43.08
"isOrdered"           0.76     19.49      0.00     0.00
"seq.int"             0.34      8.72      0.34     8.72
"strptime"            0.02      0.51      0.02     0.51
"<Anonymous>"         0.02      0.51      0.00     0.00
"as.POSIXct"          0.02      0.51      0.00     0.00
"as.POSIXlt"          0.02      0.51      0.00     0.00
"do.call"             0.02      0.51      0.00     0.00
"ISOdatetime"         0.02      0.51      0.00     0.00
".parseISO8601"       0.02      0.51      0.00     0.00

$sample.interval
[1] 0.02

$sampling.time
[1] 3.9

> # Using current window-xts HEAD (bfedeb00cc5c35e31fb7f71519c840a36fd85437)
> Rscript xts/inst/benchmarks/benchmark.subset.R
            self.time self.pct total.time total.pct
"as.double"      3.00    34.97       3.00     34.97
"[.xts"          2.90    33.80       8.58    100.00
".Call"          1.46    17.02       1.46     17.02
"c"              0.80     9.32       0.80      9.32
"seq.int"        0.38     4.43       0.38      4.43
"attr"           0.02     0.23       0.02      0.23
"gsub"           0.02     0.23       0.02      0.23

$by.total
                     total.time total.pct self.time self.pct
"[.xts"                    8.58    100.00      2.90    33.80
"["                        8.58    100.00      0.00     0.00
"index_bsearch"            3.38     39.39      0.00     0.00
"as.double"                3.00     34.97      3.00    34.97
"binsearch"                3.00     34.97      0.00     0.00
".Call"                    1.46     17.02      1.46    17.02
"c"                        0.80      9.32      0.80     9.32
"seq.int"                  0.38      4.43      0.38     4.43
"isOrdered"                0.38      4.43      0.00     0.00
".parseISO8601"            0.04      0.47      0.00     0.00
"attr"                     0.02      0.23      0.02     0.23
"gsub"                     0.02      0.23      0.02     0.23
"as.list"                  0.02      0.23      0.00     0.00
"as_numeric"               0.02      0.23      0.00     0.00
"as.POSIXct"               0.02      0.23      0.00     0.00
"as.POSIXct.POSIXlt"       0.02      0.23      0.00     0.00
"as.POSIXlt"               0.02      0.23      0.00     0.00
"do.call"                  0.02      0.23      0.00     0.00
"parse.side"               0.02      0.23      0.00     0.00

$sample.interval
[1] 0.02

$sampling.time
[1] 8.58

corwinjoy · 2018-04-23T18:11:16Z

Yeah, I would agree. It looks like the cast to as.double is a big hit here so maybe we need an int version of the binsearch which should be straightforward. Also something is up with the "[.xts" call, I'm guessing that the date constructor here is very inefficient.

…

On Mon, Apr 23, 2018 at 10:03 AM, Joshua Ulrich ***@***.***> wrote: Bad news... looks like there's performance regression in at least one subsetting case. It looks like the bulk of it is from not having an integer binary search branch (i.e. the as.double() calls). I will investigate and patch, unless anyone else wants to get to it quickly. I would also appreciate any contributed benchmark cases (not just for subsetting)! > cat xts/inst/benchmarks/benchmark.subset.R x <- xts::.xts(1:2e7, 1.0*seq(86400*365*20, 86400*365*40, length.out = 2e7))# warmup for(i in 1:2) { x["1990/2000"] }# profileRprof() for(i in 1:25) { x["1990/2000"] } Rprof(NULL) (srp <- summaryRprof()) > # Using current master (1716103)> Rscript xts/inst/benchmarks/benchmark.subset.R$by.self self.time self.pct total.time total.pct".Call" 1.86 47.69 1.86 47.69"c" 1.68 43.08 1.68 43.08"seq.int" 0.34 8.72 0.34 8.72"strptime" 0.02 0.51 0.02 0.51 $by.total total.time total.pct self.time self.pct"[" 3.90 100.00 0.00 0.00"[.xts" 3.90 100.00 0.00 0.00".Call" 1.86 47.69 1.86 47.69"c" 1.68 43.08 1.68 43.08"isOrdered" 0.76 19.49 0.00 0.00"seq.int" 0.34 8.72 0.34 8.72"strptime" 0.02 0.51 0.02 0.51"<Anonymous>" 0.02 0.51 0.00 0.00"as.POSIXct" 0.02 0.51 0.00 0.00"as.POSIXlt" 0.02 0.51 0.00 0.00"do.call" 0.02 0.51 0.00 0.00"ISOdatetime" 0.02 0.51 0.00 0.00".parseISO8601" 0.02 0.51 0.00 0.00 $sample.interval [1] 0.02 $sampling.time [1] 3.9 > # Using current window-xts HEAD (bfedeb0)> Rscript xts/inst/benchmarks/benchmark.subset.R self.time self.pct total.time total.pct"as.double" 3.00 34.97 3.00 34.97"[.xts" 2.90 33.80 8.58 100.00".Call" 1.46 17.02 1.46 17.02"c" 0.80 9.32 0.80 9.32"seq.int" 0.38 4.43 0.38 4.43"attr" 0.02 0.23 0.02 0.23"gsub" 0.02 0.23 0.02 0.23 $by.total total.time total.pct self.time self.pct"[.xts" 8.58 100.00 2.90 33.80"[" 8.58 100.00 0.00 0.00"index_bsearch" 3.38 39.39 0.00 0.00"as.double" 3.00 34.97 3.00 34.97"binsearch" 3.00 34.97 0.00 0.00".Call" 1.46 17.02 1.46 17.02"c" 0.80 9.32 0.80 9.32"seq.int" 0.38 4.43 0.38 4.43"isOrdered" 0.38 4.43 0.00 0.00".parseISO8601" 0.04 0.47 0.00 0.00"attr" 0.02 0.23 0.02 0.23"gsub" 0.02 0.23 0.02 0.23"as.list" 0.02 0.23 0.00 0.00"as_numeric" 0.02 0.23 0.00 0.00"as.POSIXct" 0.02 0.23 0.00 0.00"as.POSIXct.POSIXlt" 0.02 0.23 0.00 0.00"as.POSIXlt" 0.02 0.23 0.00 0.00"do.call" 0.02 0.23 0.00 0.00"parse.side" 0.02 0.23 0.00 0.00 $sample.interval [1] 0.02 $sampling.time [1] 8.58 — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#100 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AMJeauGgsE9Gmyb4TxxNwf4NmuLcm6HDks5trglKgaJpZM4E3OE1> .

joshuaulrich · 2018-04-24T22:16:57Z

Appears we were both wrong. Removing the as.double() call and adding an integer branch only took ~2s off the run time. The main culprit was always calling i <- i[i != 0] instead of only calling it if there was at least one zero in i.

corwinjoy · 2018-04-24T22:20:30Z

Bah! That's performance profiling for you! Sometimes the most innocuous looking lines are the stinkers.... Thanks...

…

On Tue, Apr 24, 2018 at 3:16 PM, Joshua Ulrich ***@***.***> wrote: Appears we were both wrong. Removing the as.double() call and adding an integer branch only took ~2s off the run time. The main culprit was always calling i <- i[i != 0] instead of only calling it if there was at least one zero in i. — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#100 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AMJeaq8xIHQ55qsMTBLDdHNCuuHDQ23vks5tr6RagaJpZM4E3OE1> .

The binsearch C function now branches on double and integer types, so no conversion is done if 'key' and 'vec' are the same type. Cast both to double if they're not the same type. Create a struct that contains int and double fields, so we can pass it to a generic compare function. Create 4 compare functions, one for each type/start combination. All these components are static inline, with the hope the compiler can optimize them. Removing the as.double() call means ISO-8601 range-based subsetting is nearly twice as fast for integer vectors and keys. See #100.

The binsearch C function now branches on double and integer types, so no conversion is done if 'key' and 'vec' are the same type. Cast both to double if they're not the same type. Remove the bound() function and call lower_bound() and upper_bound() directly from binsearch(). Also move the edge case checks into the respective *_bound() functions, since we need to do the checks inside the switch() statement. Removing the as.double() call means ISO-8601 range-based subsetting is nearly twice as fast for integer vectors and keys. See #100.

Commit 988d513 removed this check. It isn't strictly necessary, but it's considerably faster than always trying to remove 'i' equal to zero at any location in 'i'. Also add benchmark for subsetting by ISO8601 range. The benchmark results before this commit are: Unit: milliseconds expr min lq mean median uq max neval x[rng, ] 371.8408 384.7454 409.4091 399.3384 426.9196 486.1258 10 After restoring the check, the benchmark results are: Unit: milliseconds expr min lq mean median uq max neval x[rng, ] 280.5653 287.1048 304.1365 303.6579 316.7202 333.2147 10 See #100.

The binsearch C function now branches on double and integer types, so no conversion is done if 'key' and 'vec' are the same type. Cast both to double if they're not the same type. Create a struct that contains int and double fields, so we can pass it to a generic compare function. Create 4 compare functions, one for each type/start combination. All these components are static inline, with the hope the compiler can optimize them. Removing the as.double() call means ISO-8601 range-based subsetting is nearly twice as fast for integer vectors and keys. See #100.

joshuaulrich · 2018-04-30T20:41:26Z

Progress update. Restoring the zero-search before trying to remove i equal to zero gets the run time down to ~2.5 seconds.

> # from commit 7463da7a636e87f0958add9e64ef27f6c929d077
> Rscript xts/inst/benchmarks/benchmark.subset.R
Loading required package: microbenchmark
$by.self
            self.time self.pct total.time total.pct
"as.double"      1.42    56.35       1.42     56.35
".Call"          0.60    23.81       0.60     23.81
"c"              0.34    13.49       0.34     13.49
"seq.int"        0.16     6.35       0.16      6.35

$by.total
                total.time total.pct self.time self.pct
"["                   2.52    100.00      0.00     0.00
"[.xts"               2.52    100.00      0.00     0.00
"as.double"           1.42     56.35      1.42    56.35
"binsearch"           1.42     56.35      0.00     0.00
"index_bsearch"       1.30     51.59      0.00     0.00
".Call"               0.60     23.81      0.60    23.81
"c"                   0.34     13.49      0.34    13.49
"seq.int"             0.16      6.35      0.16     6.35
"isOrdered"           0.14      5.56      0.00     0.00

$sample.interval
[1] 0.02

$sampling.time
[1] 2.52

Adding the integer branch gets it down to ~1s.

> # from branch 'one' or 'two' (results are similar)
> Rscript xts/inst/benchmarks/benchmark.subset.R
Loading required package: microbenchmark
$by.self
             self.time self.pct total.time total.pct
".Call"           0.58    53.70       0.58     53.70
"c"               0.34    31.48       0.34     31.48
"seq.int"         0.14    12.96       0.14     12.96
"as.numeric"      0.02     1.85       0.02      1.85

$by.total
                total.time total.pct self.time self.pct
"["                   1.08    100.00      0.00     0.00
"[.xts"               1.08    100.00      0.00     0.00
".Call"               0.58     53.70      0.58    53.70
"c"                   0.34     31.48      0.34    31.48
"isOrdered"           0.16     14.81      0.00     0.00
"seq.int"             0.14     12.96      0.14    12.96
"index_bsearch"       0.14     12.96      0.00     0.00
"as.numeric"          0.02      1.85      0.02     1.85
"as.list"             0.02      1.85      0.00     0.00
"as_numeric"          0.02      1.85      0.00     0.00
"as.POSIXlt"          0.02      1.85      0.00     0.00
"do.call"             0.02      1.85      0.00     0.00
".parseISO8601"       0.02      1.85      0.00     0.00
"parse.side"          0.02      1.85      0.00     0.00

$sample.interval
[1] 0.02

$sampling.time
[1] 1.08

I have implemented the integer branch two different ways. One took @corwinjoy's code and essentially moved stuff around so the integer/double logic could live in one place. The other is a much heavier refactor, but I wanted to try and reduce the code duplication inside the switch() statements and while() loops.

I would appreciate any reviews, thoughts, comments, etc.

corwinjoy · 2018-04-30T22:24:06Z

HIi @joshuaulrich . I really like what you did with the branch with the comparison functions. I think this cleans up the code very nicely. I think that the only part I would change is the end part of the binary search function which handles the edge cases. It has been several months since I looked at this code, so I found it hard to remember / follow the logic. Here is what I propose which simplifies and explains more of what is going on: From line 87ff ( https://github.com/joshuaulrich/xts/blob/421366cb264b333313d0bada20f4eb32746f28a3/src/binsearch.c#L87 ) /* lo is the least index for which cmp_func is true */ /* handle edge cases where item may be at the lo/hi end of array */ if (use_start) { /* cmp_func is: vec[lo] >= k */ /* if start == true, return lowest index s.t. vals[index] >= key */ if (!cmp_func(data, lo)) { /* entire array < key */ return NA_INTEGER; } } else { /* cmp_func is: vec[lo] > k */ /* lo is the least index for which vec[index] > k */ /* if start == false, return highest index s.t. vals[index] <= key */ if (lo > 0 && cmp_func(data, lo)) { // ?? Do we really need another call to cmp_func above?? I think this probably should be: if(lo>0) lo--; } if (cmp_func(data, lo)) { /* entire array > key */ return NA_INTEGER; } } /* Convert from 0 based index to 1 based index */ lo++; return ScalarInteger(lo); Note that I haven't compiled this, so I may have made an error. However, I think this clarifies the last section for anyone else using this code. Thanks for cleaning up my previous code and taking the time to review this! Cheers! Corwin

…

On Mon, Apr 30, 2018 at 1:41 PM, Joshua Ulrich ***@***.***> wrote: Progress update. Restoring the zero-search before trying to remove i equal to zero gets the run time down to ~2.5 seconds. > Rscript xts/inst/benchmarks/benchmark.subset.RLoading required package: microbenchmark$by.self self.time self.pct total.time total.pct"as.double" 1.42 56.35 1.42 56.35".Call" 0.60 23.81 0.60 23.81"c" 0.34 13.49 0.34 13.49"seq.int" 0.16 6.35 0.16 6.35 $by.total total.time total.pct self.time self.pct"[" 2.52 100.00 0.00 0.00"[.xts" 2.52 100.00 0.00 0.00"as.double" 1.42 56.35 1.42 56.35"binsearch" 1.42 56.35 0.00 0.00"index_bsearch" 1.30 51.59 0.00 0.00".Call" 0.60 23.81 0.60 23.81"c" 0.34 13.49 0.34 13.49"seq.int" 0.16 6.35 0.16 6.35"isOrdered" 0.14 5.56 0.00 0.00 $sample.interval [1] 0.02 $sampling.time [1] 2.52 Adding the integer branch gets it down to ~1s. > Rscript xts/inst/benchmarks/benchmark.subset.RLoading required package: microbenchmark$by.self self.time self.pct total.time total.pct".Call" 0.58 53.70 0.58 53.70"c" 0.34 31.48 0.34 31.48"seq.int" 0.14 12.96 0.14 12.96"as.numeric" 0.02 1.85 0.02 1.85 $by.total total.time total.pct self.time self.pct"[" 1.08 100.00 0.00 0.00"[.xts" 1.08 100.00 0.00 0.00".Call" 0.58 53.70 0.58 53.70"c" 0.34 31.48 0.34 31.48"isOrdered" 0.16 14.81 0.00 0.00"seq.int" 0.14 12.96 0.14 12.96"index_bsearch" 0.14 12.96 0.00 0.00"as.numeric" 0.02 1.85 0.02 1.85"as.list" 0.02 1.85 0.00 0.00"as_numeric" 0.02 1.85 0.00 0.00"as.POSIXlt" 0.02 1.85 0.00 0.00"do.call" 0.02 1.85 0.00 0.00".parseISO8601" 0.02 1.85 0.00 0.00"parse.side" 0.02 1.85 0.00 0.00 $sample.interval [1] 0.02 $sampling.time [1] 1.08 I have implemented the integer branch two different ways. One took @corwinjoy <https://github.com/corwinjoy>'s code and essentially moved stuff around so the integer/double logic could live in one place. The other is a much heavier refactor, but I wanted to try and reduce the code duplication inside the switch() statements and while() loops. I would appreciate any reviews, thoughts, comments, etc. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#100 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AMJeapDjTmz7VtkFD7wXA_u1bSSM4vdrks5tt3b4gaJpZM4E3OE1> .

joshuaulrich · 2018-05-01T16:28:38Z

Thanks for the feedback! I agree that section was quite terse and not particularly easy to follow. I've incorporated your changes and comments. I also forgot to include a bunch of specific binsearch unit tests that I wrote. I'm going to amend the commit and add them, so here's the diff for posterity:

diff --git a/src/binsearch.c b/src/binsearch.c
index 67c0632..cb3b2eb 100644
--- a/src/binsearch.c
+++ b/src/binsearch.c
@@ -84,23 +91,35 @@ SEXP binsearch(SEXP key, SEXP vec, SEXP start)
     }
   }

-  /* handle edge cases where item may be at the lo/hi end of array */
+  /* 'lo' contains the smallest index where cmp_func() is true, but we need
+   * to handle edge cases where 'lo' is at the max/min end of the vector.
+   */
   if (use_start) {
-    /* cmp_func is: vec[lo] >= k */
-    if (cmp_func(data, lo)) {
-      lo++;
-    } else {
-      lo = NA_INTEGER;
+    /* cmp_func() := vector[index] >= key when start == true, and we need
+     * to return the smallest index subject to vector[index] >= key.
+     */
+    if (!cmp_func(data, length(vec)-1)) {
+      /* entire vector < key */
+      return ScalarInteger(NA_INTEGER);
     }
   } else {
-    if (lo > 0 && cmp_func(data, lo)) lo--;
-    /* cmp_func is: vec[lo] > k */
+    /* cmp_func() := vector[index] > key when start == false, and we need
+     * to return the largest index subject to vector[index] <= key.
+     */
     if (cmp_func(data, lo)) {
-      lo = NA_INTEGER;
-    } else {
-      lo++;
+      /* previous index value must satisfy vector[index] <= key, unless
+       * current index value is zero.
+       */
+      lo--;
+      if (lo < 0) {
+        /* entire vector > key */
+        return ScalarInteger(NA_INTEGER);
+      }
     }
   }

+  /* Convert from 0-based index to 1-based index */
+  lo++;
+
   return ScalarInteger(lo);
 }

corwinjoy · 2018-05-01T17:28:48Z

Thanks for this! This looks good. The only line I had to really think about was this: if (!cmp_func(data, length(vec)-1)) {...} Originally, I had written if (!cmp_func(data, lo)) {...} The idea is that in your binary search, it could happen that your predicate (cmp_func) is never true. So, you have to verify that you have a valid value for lo before you return it. The test you have is a bit less direct, but I can see where you are coming from, it directly reflects the comment on the next line /* entire vector < key */ so the test you have should also be true. I guess which test you want is a matter of taste, I will leave it up to you. Thanks again for looking at this, hopefully this improved binary search and indexing gives a good speedup in the end. It certainly made a big difference for the code I was running where I used subsets quite a bit. Corwin

…

On Tue, May 1, 2018 at 9:28 AM, Joshua Ulrich ***@***.***> wrote: Thanks for the feedback! I agree that section was quite terse and not particularly easy to follow. I've incorporated your changes and comments. I also forgot to include a bunch of specific binsearch unit tests that I wrote. I'm going to amend the commit and add them, so here's the diff for posterity: diff --git a/src/binsearch.c b/src/binsearch.c index 67c0632..cb3b2eb 100644--- a/src/binsearch.c+++ b/src/binsearch.c@@ -84,23 +91,35 @@ SEXP binsearch(SEXP key, SEXP vec, SEXP start) } } - /* handle edge cases where item may be at the lo/hi end of array */+ /* 'lo' contains the smallest index where cmp_func() is true, but we need+ * to handle edge cases where 'lo' is at the max/min end of the vector.+ */ if (use_start) {- /* cmp_func is: vec[lo] >= k */- if (cmp_func(data, lo)) {- lo++;- } else {- lo = NA_INTEGER;+ /* cmp_func() := vector[index] >= key when start == true, and we need+ * to return the smallest index subject to vector[index] >= key.+ */+ if (!cmp_func(data, length(vec)-1)) {+ /* entire vector < key */+ return ScalarInteger(NA_INTEGER); } } else {- if (lo > 0 && cmp_func(data, lo)) lo--;- /* cmp_func is: vec[lo] > k */+ /* cmp_func() := vector[index] > key when start == false, and we need+ * to return the largest index subject to vector[index] <= key.+ */ if (cmp_func(data, lo)) {- lo = NA_INTEGER;- } else {- lo++;+ /* previous index value must satisfy vector[index] <= key, unless+ * current index value is zero.+ */+ lo--;+ if (lo < 0) {+ /* entire vector > key */+ return ScalarInteger(NA_INTEGER);+ } } } + /* Convert from 0-based index to 1-based index */+ lo++;+ return ScalarInteger(lo); } — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#100 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AMJeapUKzWUv_eiddPGOWofIgjzzYTXzks5tuI04gaJpZM4E3OE1> .

The binsearch C function now branches on double and integer types, so no conversion is done if 'key' and 'vec' are the same type. Cast both to double if they're not the same type. Create a struct that contains int and double fields, so we can pass it to a generic compare function. Create 4 compare functions, one for each type/start combination. All these components are static inline, with the hope the compiler can optimize them. Removing the as.double() call means ISO-8601 range-based subsetting is nearly twice as fast for integer vectors and keys. And the integer branch is faster still. See #100.

joshuaulrich · 2018-05-01T18:09:11Z

I waffled back and forth between those two alternatives. I think the former is clearer for that specific case, but the latter is more general. I couldn't think of a test that would fail the former and pass the latter. I may end up changing it back anyway.

Speaking of tests, quite a few of the binsearch unit tests are commented because the new function doesn't do what the old function did in various edge cases. I'm more inclined to the new function's results, because the old results seem unusual and/or wrong. Any feedback you have there would be phenomenal, as would be any tests you add.

There is no need to hope that your changes have improved things. I've already seen at least one case (ISO8601 subsetting) where this has improved performance. And your thoughts and review are very valuable! It's extremely hard to write good code without input from others (at least for me).

corwinjoy · 2018-05-01T21:13:03Z

Sure, I can take a look at the binsearch tests and see what is going on / see if the edge cases are being properly tested. If you don't mind, just to clarify: 1. Where are the binsearch tests? 2. What branch are you working in at this point? Is it window-xts? Then I guess I can just branch off of that for any proposed changes. Thanks again! Corwin

…

On Tue, May 1, 2018 at 11:09 AM, Joshua Ulrich ***@***.***> wrote: I waffled back and forth between those two alternatives. I think the former is clearer for that specific case, but the latter is more general. I couldn't think of a test that would fail the former and pass the latter. I may end up changing it back anyway. Speaking of tests, quite a few of the binsearch unit tests are commented because the new function doesn't do what the old function did in various edge cases. I'm more inclined to the new function's results, because the old results seem unusual and/or wrong. Any feedback you have there would be phenomenal, as would be any tests you add. There is no need to hope that your changes have improved things. I've already seen at least one case (ISO8601 subsetting) where this has improved performance. And your thoughts and review are very valuable! It's extremely hard to write good code without input from others (at least for me). — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#100 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AMJeavevw0d765a91HfjHSuqCcg5XZ3Jks5tuKTIgaJpZM4E3OE1> .

joshuaulrich · 2018-05-01T21:24:00Z

The latest is in the window-xts branch, commit 6ef8014. The tests are in inst/unitTests/runit.binsearch.R.

All the cases where start = NULL are commented, since the new function doesn't support that value.
Best I can understand, start = NULL was used to force binsearch() to return NA if an exact match couldn't be found.

Some cases where the key is outside the vector are also commented, depending on the value of start. The original binsearch() would return an index value outside the bounds of the search vector, which seems very undesirable. I think returning NA makes more sense.

There are several tests with FIXME / TODO because it isn't clear what the correct / best behavior is (e.g. what should happen if either the key or vector are zero-length?).

One thing to keep in mind is that binsearch() is unexported and is likely to remain that way, so we only need to worry about our internal use-cases.

Fix up edge cases in binsearch unit tests Remove the start = NULL test cases, since it will no longer be supported. Always return NA if key is NA or zero-length. Fix compiler warning in binsearch.c ('static' is unnecessary as the struct is already local to the file). See joshuaulrich#100.

Remove the start = NULL test cases, since it will no longer be supported. Always return NA if key is NA or zero-length. Fix compiler warning in binsearch.c ('static' is unnecessary as the struct is already local to the file). See joshuaulrich#100.

joshuaulrich · 2018-05-03T11:50:01Z

Thanks for the fixes in #240. Now that it's merged, I'm going to run reverse dependency checks to ensure nothing downstream is broken. I don't usually do this until preparing for a CRAN release, but these changes are fairly significant. I'd prefer to address any issues now, while the code is still fresh in my mind. Assuming those reverse dependency checks go well, I think this is ready to merge.

joshuaulrich · 2018-05-03T18:56:04Z

And merged! You win the award for most patient PR ever. ;-)

corwinjoy added 6 commits June 3, 2015 15:02

Add support for window function

9bf9d16

I propose that a window function be added to xts so that users can have binary lookup over a range of dates. This function is the same as in the base package zoo, that is window.zoo, however we use binary search over the start and end date.

Correct assignment syntax

b1cb96a

Add support for start = NULL, or end = NULL as per window.zoo

6f41215

Add name to contributors.

4de3c8d

Use .index in window function

12e7dd2

We need to make sure we are accessing the internal POSIXct index, so switch the call to .index. The index function (potentially) converts to an end-user index.

Add comment/proposed minor change to binsearch

28e811a

corwinjoy added 2 commits June 10, 2015 10:24

corwinjoy added 3 commits June 11, 2015 11:42

Remove extra declaration of binsearch that was added for debugging.

4fb1b58

Upgrade binary_search

988d513

Rewrite binary search routine to make the code clearer and more maintainable.

Fix indenting in binsearch

508d679

joshuaulrich closed this Oct 11, 2015

joshuaulrich changed the base branch from develop to master April 20, 2018 15:16

joshuaulrich reopened this Apr 20, 2018

joshuaulrich merged commit 508d679 into joshuaulrich:master May 3, 2018

joshuaulrich mentioned this pull request Jul 16, 2018

Replace the binary search function in utils.R #101

Closed

corwinjoy mentioned this pull request Jul 20, 2018

Big performance regression in subsetting in v0.11 #251

Closed

joshuaulrich added this to the Release 0.11-0 milestone Jul 30, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add support for window function #100

Add support for window function #100

corwinjoy commented Jun 3, 2015

joshuaulrich commented Jun 5, 2015

corwinjoy commented Jun 5, 2015

corwinjoy commented Jun 11, 2015

joshuaulrich commented Apr 20, 2018

corwinjoy commented Apr 20, 2018 via email

joshuaulrich commented Apr 23, 2018 •

edited

Loading

corwinjoy commented Apr 23, 2018 via email

joshuaulrich commented Apr 24, 2018

corwinjoy commented Apr 24, 2018 via email

joshuaulrich commented Apr 30, 2018 •

edited

Loading

corwinjoy commented Apr 30, 2018 via email

joshuaulrich commented May 1, 2018

corwinjoy commented May 1, 2018 via email

joshuaulrich commented May 1, 2018

corwinjoy commented May 1, 2018 via email

joshuaulrich commented May 1, 2018 •

edited

Loading

joshuaulrich commented May 3, 2018

joshuaulrich commented May 3, 2018

Add support for window function #100

Add support for window function #100

Conversation

corwinjoy commented Jun 3, 2015

joshuaulrich commented Jun 5, 2015

corwinjoy commented Jun 5, 2015

corwinjoy commented Jun 11, 2015

joshuaulrich commented Apr 20, 2018

corwinjoy commented Apr 20, 2018 via email

joshuaulrich commented Apr 23, 2018 • edited Loading

corwinjoy commented Apr 23, 2018 via email

joshuaulrich commented Apr 24, 2018

corwinjoy commented Apr 24, 2018 via email

joshuaulrich commented Apr 30, 2018 • edited Loading

corwinjoy commented Apr 30, 2018 via email

joshuaulrich commented May 1, 2018

corwinjoy commented May 1, 2018 via email

joshuaulrich commented May 1, 2018

corwinjoy commented May 1, 2018 via email

joshuaulrich commented May 1, 2018 • edited Loading

joshuaulrich commented May 3, 2018

joshuaulrich commented May 3, 2018

joshuaulrich commented Apr 23, 2018 •

edited

Loading

joshuaulrich commented Apr 30, 2018 •

edited

Loading

joshuaulrich commented May 1, 2018 •

edited

Loading