Introducing `data.table` dependency to significantly improve `read_stan_csv` read times #1018
Comments

Just to chime in that by the time you get up to 1e+5 parameters, the …

Is this still under development, or have there been other solutions somewhere which I have missed? This is quite a big issue with large models, where sometimes the actual sampling is faster than returning the object to R.

No action on the rstan side, but a function that achieves exactly this has been implemented in …

Unfortunately, I do not have access to …

If you can't make use of …

Yeah, I wrote a function for reading the CSVs with …
`read_stan_csv` becomes very slow once models have thousands of parameters, with the bottleneck occurring at the read stage (see e.g. paul-buerkner/brms#1331). A very simple solution would be to alter the code block spanning lines 143:161 of stan_csv.R so that it instead reads in the csv via `data.table::fread()`. I've worked on this a bit with @jsocolar and found that the `readLines()` approach that `rstan` currently uses is faster up to around ~1000 parameters, at which point it becomes increasingly slow relative to `fread()`. Across the range of csv sizes for which the `readLines()` approach is faster, `fread()` times are also fast (<1-2 seconds for a single model), which I think is probably a trivial slowdown for most purposes. Conversely, by the time you are reading a csv with around 5000 parameters, you are saving >10 seconds by using `fread()` (a ~30% saving relative to `readLines()`), with the proportional and absolute savings continuing to widen as the number of parameters increases.

Would you be willing to consider introducing a `data.table` dependency in order to achieve these speedups?

Code checking the timings and equivalence of the two methods is below (comparing just the initial code block, which is then used downstream in the rest of the function). Checking the equivalence of the two methods is complicated by occasional minor floating point differences between `data.table` and base R (discussed e.g. here). The new code owes heavily to the `cmdstanr` implementation of this function.
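As a rough illustration of the comparison described above (not the original benchmark code and not rstan's actual implementation), here is a minimal R sketch that reads a single Stan CSV with both a base-R `readLines()`/`read.csv()` approach and `data.table::fread()`, times the two, and checks that the parsed draws agree up to floating point tolerance. The file path and the exact filtering of Stan's `#` comment lines are assumptions for illustration only:

```r
# Hypothetical sketch, not rstan's code: compare a base-R read of a Stan CSV
# against data.table::fread(), then check that the two results agree.
library(data.table)

path <- "output_chain1.csv"  # assumed path to a single chain's Stan CSV

# Base-R approach, similar in spirit to rstan's readLines() strategy:
# read all lines, drop Stan's "#" comment lines, then parse with read.csv().
read_base <- function(path) {
  lines <- readLines(path)
  draws <- lines[!startsWith(lines, "#")]
  read.csv(text = paste(draws, collapse = "\n"), check.names = FALSE)
}

# data.table approach: same comment filtering, but let fread() do the parsing.
read_dt <- function(path) {
  lines <- readLines(path)
  draws <- lines[!startsWith(lines, "#")]
  as.data.frame(fread(text = paste(draws, collapse = "\n")))
}

# Timings: fread() should pull ahead once the CSV has thousands of columns.
t_base  <- system.time(x_base <- read_base(path))
t_fread <- system.time(x_dt   <- read_dt(path))
print(rbind(readLines = t_base, fread = t_fread))

# Equivalence check, allowing for the minor floating point differences
# between data.table's parser and base R noted above.
all.equal(x_base, x_dt, check.attributes = FALSE, tolerance = 1e-12)
```

With a model of several thousand parameters the `fread()` row should come out well ahead, consistent with the timings described above.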