Efficient saving/reading of data frames containing rvars? #307
Comments
Ah, that's not good! Is the data frame you are saving a tibble? Looking into this, it appears to be an issue with the caching that rvars use so that they can be used efficiently with {vctrs} code (such as that used by tibbles). For example:
library(posterior)
set.seed(1234)
df = data.frame(x = rvar_rng(rnorm, 10))
saveRDS(df, "df.rds")
# about 300 kB
file.size("df.rds") / 1024
Same thing, but with a tibble:
library(posterior)
library(tibble)
set.seed(1234)
df = tibble(x = rvar_rng(rnorm, 10))
saveRDS(df, "df_tibble.rds")
# about 3 MB!
file.size("df_tibble.rds") / 1024
How can we prevent this? One option is to stick to data frames, but that is obviously unsatisfactory. Another option is to convert to a data frame just before saving. You'll also have to clear any rvar caches after conversion:
set.seed(1234)
df = tibble(x = rvar_rng(rnorm, 10))
# convert to a data frame and clear rvar caches
df = as.data.frame(df)
rvar_i = sapply(df, is_rvar)
df[, rvar_i] = lapply(df[, rvar_i, drop = FALSE], posterior:::invalidate_rvar_cache)
# to avoid using the internal invalidate_rvar_cache function, you could
# also just apply an operation to the rvar that does nothing, like adding 0; e.g.:
# df[, rvar_i] = lapply(df[, rvar_i, drop = FALSE], \(x) x + 0)
saveRDS(df, "df_tibble.rds")
# about 300 kB again
file.size("df_tibble.rds") / 1024
The final way is a bit more complicated, but does allow you to keep using tibbles. We can use the refhook argument of saveRDS to skip the rvar caches when writing:
set.seed(1234)
df = tibble(x = rvar_rng(rnorm, 10))
saveRDS(df, "df_tibble.rds", refhook = \(x) if (any(c("vec_proxy", "vec_proxy_equal") %in% names(x))) "")
# about 300 kB again
file.size("df_tibble.rds") / 1024
Reading back in, we must supply a refhook that creates a fresh environment in place of each skipped cache:
new_df = readRDS("df_tibble.rds", refhook = \(x) new.env())
all.equal(df, new_df)
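To see what the two refhook functions do without involving posterior at all, here is a minimal base R sketch. The object and its "cache" attribute are made up for illustration; they only imitate an object that carries an environment holding a large cached value.

```r
# A stand-in object: an integer vector with an environment attached as an
# attribute. The attribute name "cache" and its contents are hypothetical.
cache <- new.env()
cache$vec_proxy <- rnorm(1e5)
obj <- structure(1:3, cache = cache)

plain_file <- tempfile(fileext = ".rds")
hooked_file <- tempfile(fileext = ".rds")

# Without a hook, the environment and everything in it is serialized.
saveRDS(obj, plain_file)

# With a hook, any environment containing "vec_proxy" is replaced by the
# placeholder string "" instead of being written out.
saveRDS(obj, hooked_file,
        refhook = function(x) if ("vec_proxy" %in% names(x)) "")

# On reading, each placeholder is turned back into a fresh, empty environment.
restored <- readRDS(hooked_file, refhook = function(x) new.env())

file.size(hooked_file) < file.size(plain_file)
```

The write hook returns NULL for every other reference object, so those are serialized as usual. Note that the name check would also drop any unrelated environment that happens to contain a variable called vec_proxy.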
I suppose we should probably document this somewhere and/or make it easier to do, e.g. by including the refhook functions above in the package.
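Until something like that exists, one way to make this easier is to wrap the two hooks in a pair of small helpers. This is only a sketch: save_rds_rvar and read_rds_rvar are hypothetical names, not functions provided by posterior.

```r
library(posterior)

# Hypothetical wrapper: saveRDS() that skips environments which look like
# rvar caches (i.e. contain vec_proxy or vec_proxy_equal).
save_rds_rvar <- function(object, file, ...) {
  skip_cache <- function(x) {
    if (any(c("vec_proxy", "vec_proxy_equal") %in% names(x))) ""
  }
  saveRDS(object, file, refhook = skip_cache, ...)
}

# Hypothetical wrapper: readRDS() that puts a fresh empty environment
# wherever a cache was skipped on save.
read_rds_rvar <- function(file) {
  readRDS(file, refhook = function(x) new.env())
}

# Usage
set.seed(1234)
df <- data.frame(x = rvar_rng(rnorm, 10))
path <- tempfile(fileext = ".rds")
save_rds_rvar(df, path)
new_df <- read_rds_rvar(path)
```

Files written this way have to be read with the matching hook, so both helpers need to travel together between scripts.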
The data frame in question is indeed a tibble.
Hmm yeah. Might be able to do something that walks an object tree and just clears all rvar caches for saving.
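A rough sketch of what such a tree walk could look like is below. The name clear_rvar_caches is hypothetical, and the sketch relies on the trick mentioned earlier in this thread (adding 0 to an rvar) rather than on the internal invalidate_rvar_cache function. It has not been checked against every container type.

```r
library(posterior)

# Hypothetical helper: walk a list-like object (list, data frame, tibble)
# and rebuild every rvar found in it, so that no populated cache is saved.
clear_rvar_caches <- function(x) {
  if (is_rvar(x)) {
    # per the workaround above, a no-op operation yields a new rvar
    return(x + 0)
  }
  if (is.list(x)) {
    # recurse over the bare list and then restore the original attributes
    # (names, class, row.names), avoiding the container's own `[<-` method
    out <- lapply(unclass(x), clear_rvar_caches)
    attributes(out) <- attributes(x)
    return(out)
  }
  x
}

# Usage
set.seed(1234)
nested <- list(
  a = data.frame(x = rvar_rng(rnorm, 10)),
  b = list(r = rvar_rng(rnorm, 2), n = 1)
)
cleaned <- clear_rvar_caches(nested)
```

The result would then be passed to saveRDS as usual. Anything that touches the rvars between the walk and the save may fill the caches again, so the walk should be the last step before writing.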
Thank you so much for this fantastic package, especially the rvar data type, which I really enjoy using. I've recently run into a situation where I would like to save a data frame (about 700 rows by 10 variables) that contains a single 1d rvar column from one script and read it in for use in another script. When I try to save the data frame with saveRDS, however, the operation takes 6+ minutes to complete, and the resulting file is 1.5+ GB even though the object is only ~20 MB in R. While this is workable, I was wondering if there are known solutions or recommended alternatives to saveRDS when working with rvars in this way. Thank you!