Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

read_stan_csv is mandatory in brm() #1331

Closed
neringabr opened this issue Mar 30, 2022 · 15 comments
Closed

read_stan_csv is mandatory in brm() #1331

neringabr opened this issue Mar 30, 2022 · 15 comments

Comments

@neringabr
Copy link

Hello,

brm() function by default uses read_stan_csv() function:

  1. In brm.R script there is a function brm() in which at the end fit_model() function is called.
  2. If backend = cmdstanr, fit_model() calls another function .fit_model_cmdstanr().
  3. fit_model_cmdstanr() function at the end runs rstan::read_stan_csv.

However, saved CSV file, which brm() is trying to read with rstan::read_stan_csv, can reach several gigabytes in size for a large number of parameters. In that case, function read_stan_csv can take up on several hours, even days, to load CSV file. This is because of rstan::read_stan_csv uses slow data reading packages. Moreover, it seems that rstan::read_stan_csv in brm() is mandatory. In my case it tooks >90% of running time. For models with a lot of parameters, I have to kill a proccess when rstan::read_stan_csv is running and read CSV file with a lot faster function read_cmdstan_csv().

My questions/offers:

  1. Could you create in brm() function additional parameter which could disable CSV reading part rstan::read_stan_csv?
  2. Do you consider using read_cmdstan_csv() instead of rstan::read_stan_csv?
@paul-buerkner
Copy link
Owner

I agree it's not ideal. The problem is that I need to translate whatever output we are getting to a stanfit object of the rstan package. Does anybody have an idea how to transform a fitted model object of cmdstanr to a stanfit object without using rstan::read_stan_csv?

@jsocolar
Copy link
Contributor

Hi @paul-buerkner
@SimonCMills just rewrote rstan::read_stan_csv to use data.table::fread and achieved speed-up on large inputs in line with what we expected. That's available here:
https://github.com/SimonCMills/rstan/blob/develop/rstan/rstan/R/stan_csv.R

I've been working with Simon on this, and while we think the best solution is to merge this into Rstan, we aren't sure how Ben feels about a data.table dependency. In the mean time, we'd be happy to put this into brms as a drop-in replacement for rstan::read_stan_csv. The license is a mixture of Rstan's GPL (>=3) and cmdstanr's BSD-3.

The rewrite changes only the first several lines of read_stan_csv to produce the same internal function objects but much faster when inputs have many columns. We verify the equivalency of the output to the first several lines of read_stan_csv on a test suite consisting of:

library(cmdstanr)
setwd(tempdir())

model_code <- "
parameters {
  vector[10] x;
}
model {
  x ~ std_normal();
}
"
stan_file <- cmdstanr::write_stan_file(code = model_code)
mod <- cmdstan_model(stan_file)
fit <- mod$sample(parallel_chains = 4)
fit_warmup <- mod$sample(parallel_chains = 4, save_warmup = T)
fit_dense_warmup <- mod$sample(parallel_chains = 4, metric = "dense_e", save_warmup = T)
fit_optimize <- mod$optimize()
fit_variational <- mod$variational()

test_suite <- list(
  fit$output_files()[[1]], 
  fit$output_files(), 
  fit_warmup$output_files(),
  fit_dense_warmup$output_files(),
  fit_optimize$output_files(),
  fit_variational$output_files()
  )

v1 <- lapply(test_suite, read_initial)
v2 <- lapply(test_suite, read_initial2)

identical(v1[[1]][-2], v2[[1]][-2])
all(v1[[1]][2]$ss_lst[[1]] - v2[[1]][2]$ss_lst[[1]] < 1*10e-15)

Everything is identical except for some floating point issues on the 16th decimal place that apparently are handled inconsistently by fread versus read.csv.

@paul-buerkner
Copy link
Owner

paul-buerkner commented Jun 17, 2022

It would be amazing to have it in brms! Would you mind preparing a PR?

Can we also do selective loading of columns with this? I still need a mechanism in the cmdstanr backend to exclude variables I don't want to save.

@rok-cesnovar
Copy link
Contributor

data.table::fread supports selective loading via the select argument so should be doable. We use that in cmdstanr as well and it works great.

@paul-buerkner
Copy link
Owner

Perfect! I cannot wait for a PR :-)

@jsocolar
Copy link
Contributor

We're waiting to hear back from rstan devs only because pulling out all of the rstan internals behind rstan::read_stan_csv would be a bit of a pain. If rstan is slow or doesn't want the data.table dependency we'll implement directly in brms shortly.

@paul-buerkner
Copy link
Owner

paul-buerkner commented Aug 17, 2022

For some of my current project, I would love to have this feature available in brms. I could copy your code you linked above and make some edits to it in order to put it into brms, but the problem is that this would make brms GPL >= 3 which is something I don't want (brms is GPL2). So basically, do I have @bgoodri's approvement (and yours) to put this into brms under GPL2?

@SimonCMills
Copy link
Contributor

Hi Paul, you are welcome to use any of the code I have written (i.e. the modified version of read_stan_csv). If you'd prefer I'm also happy to implement this myself (in the next week or so) and submit it as a PR to brms. There's very little in the way of new code, and it's mainly a case of ripping out a few rstan internal functions, and (I think) a single cmdstanr internal function, as well as any associated unit tests. Changing to licensing would therefore be up to the rstan and cmdstanr teams, as the additions to brms would consist of the internal functions and unit test taken verbatim from there, plus the modified read_stan_csv function.

@paul-buerkner
Copy link
Owner

Perfect, Thank you! I would very much appreciate a PR from you. And I would be happy to hear from @bgoodri and @jgabry in the meantime about the licensing stuff.

@SimonCMills
Copy link
Contributor

Hi, just to follow up on this: the above approach involves tearing out fairly large quantities of code from rstan and cmdstanr and then making a minor change at the outset of rstan::read_stan_csv() to use data.table::fread(). This is quick to implement, but it isn't the most attractive solution as it brings a large and somewhat complicated mishmash of code within brms, discards all the nice error checking that cmdstanr::read_cmdstan_csv (and associated internals) does, and also introduces a slightly complicated situation wrt licensing.

Better, I think, is to just use cmdstanr::read_cmdstan_csv() to read in the files, and then to repackage these as a stanfit object. This is more involved to write, but is (I think) easier to maintain, and overall a better solution. I've implemented an initial version of this (here), and it seems to work fine. The two downsides to this solution are (1) there are 4 items of information in a stanfit object that are not inherent in the cmdstanr 'metadata' information, and (2) there is possibly a slight memory overhead introduced relative to the above approach as you shuffle the draw info out of the cmdstanr object and into a repackaged stanfit object (though I've not actually checked this yet). Regarding (1), the 4 items of information that are lost are in the @stanargs slot and relate to file locations of the stancflags, diagnostic_file, and metric file (that I don't think are essential). If they are essential, this could be fixed by a second fread call to extract just these four lines, though possibly incurs a speed penalty (but again, not sure, haven't checked yet).

I think the licensing situation is resolved by this, as nothing is taken out of cmdstanr and only a handful of lines are taken out of rstan which would be trivial to rewrite (rest is mine, and happy for it to be licensed as you like).

There's a little bit more work involved in finishing off a couple of smaller details, checking speed/memory on some larger files, and writing some code to more comprehensively check for consistency between the stanfit object this generates and the one that rstan::read_stan_csv() generates. I can do this later this week, but wanted to check that you were happy with the overall approach first! Does this sound alright to you?

@paul-buerkner
Copy link
Owner

The approach you suggested makes total sense to me! Thank you for working on this solution!

@wds15
Copy link
Contributor

wds15 commented Aug 24, 2022

just one thought... as a - very dirty - hack... could one possibly write a function which temporarily redefines the readLines function? Maybe the packages doing mocking could help here? If so, then one could redirect the readLines call possibly to data.table::fread? This way you redefine the function, continue to use the read_stan_csv and that's it.

Even more brutal would be to take the loaded read_stan_csv parse the R code back into an expression ( substitute ), replace the readLines and then evaluate the function magically inside the namespace of rstan? Not sure if it possible, but it would be magic.

(this assumes that swapping out readLines for fread is all what is needed)

@wds15
Copy link
Contributor

wds15 commented Aug 24, 2022

I just tried this approach and I got amazingly far. The culprit is that it's not just about swapping out readLines (which really works line-by-line yack):

library(rstan)

rf <- as.character(body(read_stan_csv))

read_stan_csv

rf2 <- gsub("readLines", "data.table::fread", rf)

pseudoFun <- function(csvfiles, col_major = TRUE) {}

pseudoFun

rstanNamespace <- asNamespace("rstan")

body(pseudoFun) <- parse(text=c(rf2, "}"))

environment(pseudoFun) <-  rstanNamespace

pseudoFun


csvfiles <- dir(system.file('misc', package = 'rstan'),
                pattern = 'rstan_doc_ex_[0-9].csv', full.names = TRUE)

## almost works, but it's more complex than that...
pseudoFun(csvfiles)

maybe this is inspiring to someone. The point is that R is amazing in what it let's you do to the language itself. Essentially one could write your own import function and rely on the internals from rstan by assigning the internal rstan namespace as environment of the function. This way one could be saved from having to rewrite all the internals.

Not sure how CRAN friendly this would be... maybe this is more a fun thing...

@SimonCMills
Copy link
Contributor

A bit slower than promised, but submitted a PR that follows this approach of reading in via cmdstanr and then repackaging into a stanfit object. PR here. The bit that took a bit longer was to include functionality to allow selecting a subset of variables at the read stage, which it now does. The stanfit print method however does not like it if you select a subset of a variable at the read stage (e.g. x[1]). You can still create the stanfit object fine, but it will generate an error whenever it tries to print. Selecting the variable in its entirety (e.g. x rather than x[1]) and there aren't any problems.

@paul-buerkner paul-buerkner added this to the brms 2.17.0++ milestone Sep 1, 2022
@paul-buerkner
Copy link
Owner

Implemented via #1400

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants