Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Faster $driver$get_hash() for the RDS driver #96

Closed
wlandau opened this issue Jan 8, 2019 · 4 comments
Closed

Faster $driver$get_hash() for the RDS driver #96

wlandau opened this issue Jan 8, 2019 · 4 comments

Comments

@wlandau
Copy link
Contributor

wlandau commented Jan 8, 2019

I am trying to speed up drake (ref: ropensci/drake#647, ropensci/drake#655). Note where self_hash() appears in the middle of this flame graph. Candidate alternatives to readLines(): https://gist.github.com/hadley/6353939.

Would you be open to a PR that dropped down into C?

@wlandau
Copy link
Contributor Author

wlandau commented Jan 8, 2019

Hmm... the speed differences from https://gist.github.com/hadley/6353939 do seem really nice at first glance.

path <- file.path(R.home("doc"), "COPYING")
size <- file.info(path)$size
microbenchmark::microbenchmark(
  old = readLines(path),
  new = readChar(path, size, useBytes = TRUE)
)
#> Unit: microseconds
#>  expr     min       lq      mean   median       uq     max neval cld
#>   old 224.137 226.2785 243.32306 232.1250 255.6135 302.260   100   b
#>   new  61.374  64.3230  70.43012  67.8185  72.3140 124.196   100  a

However, at least for readChar(), the difference is less impressive for the small key files that storr uses. Part of the speedup below could just be from pre-computing the hash length.

writeLines(digest::digest(NA, "md5"), "key")
path <- "key"
microbenchmark::microbenchmark(
  old = readLines(path),
  new = readChar(path, nchars = 32, useBytes = FALSE)
)
#> Unit: microseconds
#>  expr    min      lq     mean  median      uq     max neval cld
#>   old 32.795 35.1310 37.86379 36.4545 37.7230  59.820   100   b
#>   new 24.267 25.8445 28.56502 26.5890 27.6045 171.191   100  a

I will still open a PR in case you think it is worth the change.

@wlandau
Copy link
Contributor Author

wlandau commented Jan 8, 2019

The first C++ solution from https://gist.github.com/hadley/6353939 is much faster.

writeLines(digest::digest(NA, "md5"), "key")
path <- "key"
Rcpp::sourceCpp("~/read-file.cpp")
microbenchmark::microbenchmark(
  old = readLines(path),
  new = readChar(path, nchars = 32, useBytes = FALSE),
  cpp1 = read_file_cpp1(path),
  cpp2 = read_file_cpp1(path)
)
#> Unit: microseconds
#>  expr    min      lq      mean  median      uq       max neval cld
#>   old 31.072 33.9355  36.03173 35.1105 36.6300   112.651   100   a
#>   new 22.445 23.9945  25.72207 25.5470 26.4130    44.078   100   a
#>  cpp1  6.665  7.1135   9.42972  7.8900 11.1230    42.210   100   a
#>  cpp2  6.679  7.3545 139.04901 10.5105 11.8915 12931.184   100   a

@wlandau
Copy link
Contributor Author

wlandau commented Jan 8, 2019

One thing I wondered: is it possible to go any faster if we drop even farther down to pure C? Overly optimistic C function:

// read.c
void readkey_c() {
  char buf[32];
  FILE *fp;
  fp = fopen("key", "rb");
  fread(buf, 32, 1, fp);
  //  printf("%s", buf);
  fclose(fp);
  return;
}

Benchmarks:

writeLines(digest::digest(NA, "md5"), "key")
path <- "key"
Rcpp::sourceCpp("~/read-file.cpp")
dyn.load("~/read.so") # R CMD SHLIB read.c -i read.so
microbenchmark::microbenchmark(
  old = readLines(path),
  new = readChar(path, nchars = 32, useBytes = FALSE),
  cpp1 = read_file_cpp1(path),
  cpp2 = read_file_cpp1(path),
  c = .C("readkey_c")
)
#> Unit: microseconds
#>  expr    min     lq      mean  median      uq       max neval cld
#>   old 35.464 38.750  43.38168 40.7825 43.3590   129.943   100   a
#>   new 25.399 29.286  31.87871 30.7565 31.9520    56.339   100   a
#>  cpp1  7.128  8.148 161.75054 11.0595 12.4700 15076.276   100   a
#>  cpp2  7.196  8.290  11.45663 10.9160 12.4450    81.264   100   a
#>     c  6.968  9.398  10.89305 10.4880 11.1485    32.054   100   a

Created on 2019-01-08 by the reprex package (v0.2.1)

The performance bump from dropping down to C is negligible. I think the best we can do is this:

#include <fstream>
#include <sstream>
#include <string>
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
CharacterVector read_file_cpp1(std::string path) {
  std::ifstream t(path.c_str());
  std::stringstream ss;
  ss << t.rdbuf();
  return ss.str();
}

@richfitz
Copy link
Owner

Fixed in #98

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants