Faster $driver$get_hash() for the RDS driver #96

wlandau · 2019-01-08T12:34:15Z

I am trying to speed up drake (ref: ropensci/drake#647, ropensci/drake#655). Note where self_hash() appears in the middle of this flame graph. Candidate alternatives to readLines(): https://gist.github.com/hadley/6353939.

Would you be open to a PR that dropped down into C?

The text was updated successfully, but these errors were encountered:

wlandau · 2019-01-08T16:15:57Z

Hmm... the speed differences from https://gist.github.com/hadley/6353939 do seem really nice at first glance.

path <- file.path(R.home("doc"), "COPYING")
size <- file.info(path)$size
microbenchmark::microbenchmark(
  old = readLines(path),
  new = readChar(path, size, useBytes = TRUE)
)
#> Unit: microseconds
#>  expr     min       lq      mean   median       uq     max neval cld
#>   old 224.137 226.2785 243.32306 232.1250 255.6135 302.260   100   b
#>   new  61.374  64.3230  70.43012  67.8185  72.3140 124.196   100  a

However, at least for readChar(), the difference is less impressive for the small key files that storr uses. Part of the speedup below could just be from pre-computing the hash length.

writeLines(digest::digest(NA, "md5"), "key")
path <- "key"
microbenchmark::microbenchmark(
  old = readLines(path),
  new = readChar(path, nchars = 32, useBytes = FALSE)
)
#> Unit: microseconds
#>  expr    min      lq     mean  median      uq     max neval cld
#>   old 32.795 35.1310 37.86379 36.4545 37.7230  59.820   100   b
#>   new 24.267 25.8445 28.56502 26.5890 27.6045 171.191   100  a

I will still open a PR in case you think it is worth the change.

wlandau · 2019-01-08T16:25:34Z

The first C++ solution from https://gist.github.com/hadley/6353939 is much faster.

writeLines(digest::digest(NA, "md5"), "key")
path <- "key"
Rcpp::sourceCpp("~/read-file.cpp")
microbenchmark::microbenchmark(
  old = readLines(path),
  new = readChar(path, nchars = 32, useBytes = FALSE),
  cpp1 = read_file_cpp1(path),
  cpp2 = read_file_cpp1(path)
)
#> Unit: microseconds
#>  expr    min      lq      mean  median      uq       max neval cld
#>   old 31.072 33.9355  36.03173 35.1105 36.6300   112.651   100   a
#>   new 22.445 23.9945  25.72207 25.5470 26.4130    44.078   100   a
#>  cpp1  6.665  7.1135   9.42972  7.8900 11.1230    42.210   100   a
#>  cpp2  6.679  7.3545 139.04901 10.5105 11.8915 12931.184   100   a

wlandau · 2019-01-08T17:57:02Z

One thing I wondered: is it possible to go any faster if we drop even farther down to pure C? Overly optimistic C function:

// read.c
void readkey_c() {
  char buf[32];
  FILE *fp;
  fp = fopen("key", "rb");
  fread(buf, 32, 1, fp);
  //  printf("%s", buf);
  fclose(fp);
  return;
}

Benchmarks:

writeLines(digest::digest(NA, "md5"), "key")
path <- "key"
Rcpp::sourceCpp("~/read-file.cpp")
dyn.load("~/read.so") # R CMD SHLIB read.c -i read.so
microbenchmark::microbenchmark(
  old = readLines(path),
  new = readChar(path, nchars = 32, useBytes = FALSE),
  cpp1 = read_file_cpp1(path),
  cpp2 = read_file_cpp1(path),
  c = .C("readkey_c")
)
#> Unit: microseconds
#>  expr    min     lq      mean  median      uq       max neval cld
#>   old 35.464 38.750  43.38168 40.7825 43.3590   129.943   100   a
#>   new 25.399 29.286  31.87871 30.7565 31.9520    56.339   100   a
#>  cpp1  7.128  8.148 161.75054 11.0595 12.4700 15076.276   100   a
#>  cpp2  7.196  8.290  11.45663 10.9160 12.4450    81.264   100   a
#>     c  6.968  9.398  10.89305 10.4880 11.1485    32.054   100   a

^{Created on 2019-01-08 by the reprex package (v0.2.1)}

The performance bump from dropping down to C is negligible. I think the best we can do is this:

#include <fstream>
#include <sstream>
#include <string>
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
CharacterVector read_file_cpp1(std::string path) {
  std::ifstream t(path.c_str());
  std::stringstream ss;
  ss << t.rdbuf();
  return ss.str();
}

richfitz · 2019-01-14T13:58:19Z

Fixed in #98

wlandau mentioned this issue Jan 8, 2019

Slight speed improvement in $get_hash() for RDS storrs #97

Closed

This was referenced Jan 9, 2019

Memoize config$cache$get_hash() on up-to-date targets ropensci/drake#660

Merged

Try to speed up dependency_hash() ropensci/drake#655

Closed

richfitz closed this as completed Jan 14, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Faster $driver$get_hash() for the RDS driver #96

Faster $driver$get_hash() for the RDS driver #96

wlandau commented Jan 8, 2019

wlandau commented Jan 8, 2019 •

edited

Loading

wlandau commented Jan 8, 2019

wlandau commented Jan 8, 2019

richfitz commented Jan 14, 2019

Faster $driver$get_hash() for the RDS driver #96

Faster $driver$get_hash() for the RDS driver #96

Comments

wlandau commented Jan 8, 2019

wlandau commented Jan 8, 2019 • edited Loading

wlandau commented Jan 8, 2019

wlandau commented Jan 8, 2019

richfitz commented Jan 14, 2019

wlandau commented Jan 8, 2019 •

edited

Loading