`read_stan_csv` is mandatory in `brm()` #1331

neringabr · 2022-03-30T06:55:24Z

Hello,

brm() function by default uses read_stan_csv() function:

In brm.R script there is a function brm() in which at the end fit_model() function is called.
If backend = cmdstanr, fit_model() calls another function .fit_model_cmdstanr().
fit_model_cmdstanr() function at the end runs rstan::read_stan_csv.

However, saved CSV file, which brm() is trying to read with rstan::read_stan_csv, can reach several gigabytes in size for a large number of parameters. In that case, function read_stan_csv can take up on several hours, even days, to load CSV file. This is because of rstan::read_stan_csv uses slow data reading packages. Moreover, it seems that rstan::read_stan_csv in brm() is mandatory. In my case it tooks >90% of running time. For models with a lot of parameters, I have to kill a proccess when rstan::read_stan_csv is running and read CSV file with a lot faster function read_cmdstan_csv().

My questions/offers:

Could you create in brm() function additional parameter which could disable CSV reading part rstan::read_stan_csv?
Do you consider using read_cmdstan_csv() instead of rstan::read_stan_csv?

The text was updated successfully, but these errors were encountered:

paul-buerkner · 2022-03-30T07:03:44Z

I agree it's not ideal. The problem is that I need to translate whatever output we are getting to a stanfit object of the rstan package. Does anybody have an idea how to transform a fitted model object of cmdstanr to a stanfit object without using rstan::read_stan_csv?

jsocolar · 2022-06-16T20:35:45Z

Hi @paul-buerkner
@SimonCMills just rewrote rstan::read_stan_csv to use data.table::fread and achieved speed-up on large inputs in line with what we expected. That's available here:
https://github.com/SimonCMills/rstan/blob/develop/rstan/rstan/R/stan_csv.R

I've been working with Simon on this, and while we think the best solution is to merge this into Rstan, we aren't sure how Ben feels about a data.table dependency. In the mean time, we'd be happy to put this into brms as a drop-in replacement for rstan::read_stan_csv. The license is a mixture of Rstan's GPL (>=3) and cmdstanr's BSD-3.

The rewrite changes only the first several lines of read_stan_csv to produce the same internal function objects but much faster when inputs have many columns. We verify the equivalency of the output to the first several lines of read_stan_csv on a test suite consisting of:

library(cmdstanr)
setwd(tempdir())

model_code <- "
parameters {
  vector[10] x;
}
model {
  x ~ std_normal();
}
"
stan_file <- cmdstanr::write_stan_file(code = model_code)
mod <- cmdstan_model(stan_file)
fit <- mod$sample(parallel_chains = 4)
fit_warmup <- mod$sample(parallel_chains = 4, save_warmup = T)
fit_dense_warmup <- mod$sample(parallel_chains = 4, metric = "dense_e", save_warmup = T)
fit_optimize <- mod$optimize()
fit_variational <- mod$variational()

test_suite <- list(
  fit$output_files()[[1]], 
  fit$output_files(), 
  fit_warmup$output_files(),
  fit_dense_warmup$output_files(),
  fit_optimize$output_files(),
  fit_variational$output_files()
  )

v1 <- lapply(test_suite, read_initial)
v2 <- lapply(test_suite, read_initial2)

identical(v1[[1]][-2], v2[[1]][-2])
all(v1[[1]][2]$ss_lst[[1]] - v2[[1]][2]$ss_lst[[1]] < 1*10e-15)

Everything is identical except for some floating point issues on the 16th decimal place that apparently are handled inconsistently by fread versus read.csv.

paul-buerkner · 2022-06-17T07:00:29Z

It would be amazing to have it in brms! Would you mind preparing a PR?

Can we also do selective loading of columns with this? I still need a mechanism in the cmdstanr backend to exclude variables I don't want to save.

rok-cesnovar · 2022-06-17T07:02:46Z

data.table::fread supports selective loading via the select argument so should be doable. We use that in cmdstanr as well and it works great.

paul-buerkner · 2022-06-17T07:04:04Z

Perfect! I cannot wait for a PR :-)

jsocolar · 2022-06-21T01:39:34Z

We're waiting to hear back from rstan devs only because pulling out all of the rstan internals behind rstan::read_stan_csv would be a bit of a pain. If rstan is slow or doesn't want the data.table dependency we'll implement directly in brms shortly.

paul-buerkner · 2022-08-17T14:21:57Z

For some of my current project, I would love to have this feature available in brms. I could copy your code you linked above and make some edits to it in order to put it into brms, but the problem is that this would make brms GPL >= 3 which is something I don't want (brms is GPL2). So basically, do I have @bgoodri's approvement (and yours) to put this into brms under GPL2?

SimonCMills · 2022-08-18T11:03:05Z

Hi Paul, you are welcome to use any of the code I have written (i.e. the modified version of read_stan_csv). If you'd prefer I'm also happy to implement this myself (in the next week or so) and submit it as a PR to brms. There's very little in the way of new code, and it's mainly a case of ripping out a few rstan internal functions, and (I think) a single cmdstanr internal function, as well as any associated unit tests. Changing to licensing would therefore be up to the rstan and cmdstanr teams, as the additions to brms would consist of the internal functions and unit test taken verbatim from there, plus the modified read_stan_csv function.

paul-buerkner · 2022-08-18T11:44:12Z

Perfect, Thank you! I would very much appreciate a PR from you. And I would be happy to hear from @bgoodri and @jgabry in the meantime about the licensing stuff.

SimonCMills · 2022-08-24T13:06:24Z

Hi, just to follow up on this: the above approach involves tearing out fairly large quantities of code from rstan and cmdstanr and then making a minor change at the outset of rstan::read_stan_csv() to use data.table::fread(). This is quick to implement, but it isn't the most attractive solution as it brings a large and somewhat complicated mishmash of code within brms, discards all the nice error checking that cmdstanr::read_cmdstan_csv (and associated internals) does, and also introduces a slightly complicated situation wrt licensing.

Better, I think, is to just use cmdstanr::read_cmdstan_csv() to read in the files, and then to repackage these as a stanfit object. This is more involved to write, but is (I think) easier to maintain, and overall a better solution. I've implemented an initial version of this (here), and it seems to work fine. The two downsides to this solution are (1) there are 4 items of information in a stanfit object that are not inherent in the cmdstanr 'metadata' information, and (2) there is possibly a slight memory overhead introduced relative to the above approach as you shuffle the draw info out of the cmdstanr object and into a repackaged stanfit object (though I've not actually checked this yet). Regarding (1), the 4 items of information that are lost are in the @stanargs slot and relate to file locations of the stancflags, diagnostic_file, and metric file (that I don't think are essential). If they are essential, this could be fixed by a second fread call to extract just these four lines, though possibly incurs a speed penalty (but again, not sure, haven't checked yet).

I think the licensing situation is resolved by this, as nothing is taken out of cmdstanr and only a handful of lines are taken out of rstan which would be trivial to rewrite (rest is mine, and happy for it to be licensed as you like).

There's a little bit more work involved in finishing off a couple of smaller details, checking speed/memory on some larger files, and writing some code to more comprehensively check for consistency between the stanfit object this generates and the one that rstan::read_stan_csv() generates. I can do this later this week, but wanted to check that you were happy with the overall approach first! Does this sound alright to you?

paul-buerkner · 2022-08-24T13:59:59Z

The approach you suggested makes total sense to me! Thank you for working on this solution!

wds15 · 2022-08-24T14:10:11Z

just one thought... as a - very dirty - hack... could one possibly write a function which temporarily redefines the readLines function? Maybe the packages doing mocking could help here? If so, then one could redirect the readLines call possibly to data.table::fread? This way you redefine the function, continue to use the read_stan_csv and that's it.

Even more brutal would be to take the loaded read_stan_csv parse the R code back into an expression ( substitute ), replace the readLines and then evaluate the function magically inside the namespace of rstan? Not sure if it possible, but it would be magic.

(this assumes that swapping out readLines for fread is all what is needed)

wds15 · 2022-08-24T16:03:02Z

I just tried this approach and I got amazingly far. The culprit is that it's not just about swapping out readLines (which really works line-by-line yack):

library(rstan)

rf <- as.character(body(read_stan_csv))

read_stan_csv

rf2 <- gsub("readLines", "data.table::fread", rf)

pseudoFun <- function(csvfiles, col_major = TRUE) {}

pseudoFun

rstanNamespace <- asNamespace("rstan")

body(pseudoFun) <- parse(text=c(rf2, "}"))

environment(pseudoFun) <-  rstanNamespace

pseudoFun


csvfiles <- dir(system.file('misc', package = 'rstan'),
                pattern = 'rstan_doc_ex_[0-9].csv', full.names = TRUE)

## almost works, but it's more complex than that...
pseudoFun(csvfiles)

maybe this is inspiring to someone. The point is that R is amazing in what it let's you do to the language itself. Essentially one could write your own import function and rely on the internals from rstan by assigning the internal rstan namespace as environment of the function. This way one could be saved from having to rewrite all the internals.

Not sure how CRAN friendly this would be... maybe this is more a fun thing...

SimonCMills · 2022-08-31T17:27:54Z

A bit slower than promised, but submitted a PR that follows this approach of reading in via cmdstanr and then repackaging into a stanfit object. PR here. The bit that took a bit longer was to include functionality to allow selecting a subset of variables at the read stage, which it now does. The stanfit print method however does not like it if you select a subset of a variable at the read stage (e.g. x[1]). You can still create the stanfit object fine, but it will generate an error whenever it tries to print. Selecting the variable in its entirety (e.g. x rather than x[1]) and there aren't any problems.

paul-buerkner · 2022-09-01T14:48:42Z

Implemented via #1400

paul-buerkner added efficiency feature post-processing and removed feature labels Mar 30, 2022

SimonCMills mentioned this issue Jun 20, 2022

Introducing data.table dependency to significantly improve read_stan_csv read times stan-dev/rstan#1018

Open

paul-buerkner added this to the brms 2.17.0++ milestone Sep 1, 2022

paul-buerkner closed this as completed Sep 1, 2022

helske mentioned this issue Apr 4, 2024

Faster extraction of posterior samples from Stan output ropensci/dynamite#78

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`read_stan_csv` is mandatory in `brm()` #1331

`read_stan_csv` is mandatory in `brm()` #1331

neringabr commented Mar 30, 2022

paul-buerkner commented Mar 30, 2022

jsocolar commented Jun 16, 2022

paul-buerkner commented Jun 17, 2022 •

edited

Loading

rok-cesnovar commented Jun 17, 2022

paul-buerkner commented Jun 17, 2022

jsocolar commented Jun 21, 2022

paul-buerkner commented Aug 17, 2022 •

edited

Loading

SimonCMills commented Aug 18, 2022

paul-buerkner commented Aug 18, 2022

SimonCMills commented Aug 24, 2022

paul-buerkner commented Aug 24, 2022

wds15 commented Aug 24, 2022

wds15 commented Aug 24, 2022

SimonCMills commented Aug 31, 2022

paul-buerkner commented Sep 1, 2022

read_stan_csv is mandatory in brm() #1331

read_stan_csv is mandatory in brm() #1331

Comments

neringabr commented Mar 30, 2022

paul-buerkner commented Mar 30, 2022

jsocolar commented Jun 16, 2022

paul-buerkner commented Jun 17, 2022 • edited Loading

rok-cesnovar commented Jun 17, 2022

paul-buerkner commented Jun 17, 2022

jsocolar commented Jun 21, 2022

paul-buerkner commented Aug 17, 2022 • edited Loading

SimonCMills commented Aug 18, 2022

paul-buerkner commented Aug 18, 2022

SimonCMills commented Aug 24, 2022

paul-buerkner commented Aug 24, 2022

wds15 commented Aug 24, 2022

wds15 commented Aug 24, 2022

SimonCMills commented Aug 31, 2022

paul-buerkner commented Sep 1, 2022

`read_stan_csv` is mandatory in `brm()` #1331

`read_stan_csv` is mandatory in `brm()` #1331

paul-buerkner commented Jun 17, 2022 •

edited

Loading

paul-buerkner commented Aug 17, 2022 •

edited

Loading