Efficient saving/reading of data frames containing rvars? #307

kthayashi · 2023-10-31T16:46:50Z

Thank you so much for this fantastic package, especially the rvar data type, which I really enjoy using. I've recently run into a situation where I would like to save a data frame (about 700 rows by 10 variables) containing a single 1d rvar column in one script and read it back in for use in another script. When I try to save the data frame with saveRDS, however, the operation takes 6+ minutes to complete, and the resulting file is 1.5+ GB even though the object is only ~20 MB in R. While this is workable, I was wondering if there are known solutions or recommended alternatives to saveRDS when working with rvars in this way. Thank you!


mjskay · 2023-11-01T00:35:01Z

Ah, that's not good! Is the data frame you are saving a tibble?

Looking into this, it looks like an issue with the caching that rvars use so that they can be used efficiently with {vctrs} code (such as that used by tibbles). This cache contains a number of references to the same rvar object, which the serializer serializes into a bunch of copies of that object, hence the large size of the output.

For example:

library(posterior)

set.seed(1234)
df = data.frame(x = rvar_rng(rnorm, 10))
saveRDS(df, "df.rds")

# about 300 kB
file.size("df.rds") / 1024

[1] 300.3018

Same thing, but with a tibble:

library(posterior)
library(tibble)  # provides tibble(); not attached by posterior

set.seed(1234)
df = tibble(x = rvar_rng(rnorm, 10))
saveRDS(df, "df_tibble.rds")

# about 3 MB!
file.size("df_tibble.rds") / 1024

[1] 3301.285
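The roughly tenfold jump is consistent with how R's serializer treats ordinary vectors: they are written out in full each time they are referenced, even when the references share memory in the running session. The following base R sketch illustrates only that general behavior; it does not reproduce the actual layout of the rvar cache, and the object names are made up for illustration.

```r
# Base R only; `x` and `shared` are illustrative names, not rvar internals.
set.seed(1234)
x <- rnorm(10000)

# A list holding ten references to the same vector.
# In memory these elements share storage with `x`.
shared <- rep(list(x), 10)

# Serialize to raw vectors in memory and compare their lengths in bytes.
size_one <- length(serialize(x, NULL))
size_shared <- length(serialize(shared, NULL))

# Ratio of serialized sizes: close to 10, because each reference
# is written out in full rather than once.
size_shared / size_one
```

Environments are handled differently: the serializer tracks them by reference, which is why the refhook mechanism can intercept them.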

How can we prevent this? One option is to stick to data frames throughout, but that is obviously unsatisfactory. Another option is to convert to a data frame just before saving. You'll also have to clear any rvar caches after conversion:

set.seed(1234)
df = tibble(x = rvar_rng(rnorm, 10))

# convert to a data frame and clear rvar caches
df = as.data.frame(df)
rvar_i = sapply(df, is_rvar)
df[, rvar_i] = lapply(df[, rvar_i, drop = FALSE], posterior:::invalidate_rvar_cache)
# to avoid using the internal invalidate_rvar_cache function, you could
# also just apply an operation to the rvar that does nothing, like adding 0; e.g.:
# df[, rvar_i] = lapply(df[, rvar_i, drop = FALSE], \(x) x + 0)

saveRDS(df, "df_tibble.rds")

# about 300 kB again
file.size("df_tibble.rds") / 1024

[1] 300.3018

The final way is a bit more complicated, but does allow you to keep using tibbles. We can use the refhook argument to saveRDS and readRDS to make it so that the cache environments inside rvars are not saved out (this has no impact on rvar usage, as the rvar will regenerate the cache values automatically). Something like this should work:

set.seed(1234)
df = tibble(x = rvar_rng(rnorm, 10))

saveRDS(df, "df_tibble.rds", refhook = \(x) if (any(c("vec_proxy", "vec_proxy_equal") %in% names(x))) "")
# about 300 kB again
file.size("df_tibble.rds") / 1024

[1] 300.3076

Reading back in, we must supply a refhook as well:

new_df = readRDS("df_tibble.rds", refhook = \(x) new.env())
all.equal(df, new_df)

[1] TRUE

mjskay · 2023-11-01T00:37:12Z

I suppose we should probably document this somewhere and/or make it easier to do, e.g. by including the refhook functions above in the package.
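One possible shape for such helpers, as a minimal sketch: the function names below are hypothetical and not part of posterior. The hooks are the ones shown above, with an added is.environment() check, since refhook can also be called for other reference objects such as external pointers.

```r
# Hypothetical helpers; none of these names exist in posterior.

# TRUE if an environment looks like an rvar cache, judged by the
# cache entry names used in the refhook example above.
is_rvar_cache <- function(env) {
  any(c("vec_proxy", "vec_proxy_equal") %in% names(env))
}

# Save an object, writing an empty placeholder string in place of
# any environment that looks like an rvar cache.
save_rds_without_rvar_cache <- function(object, file, ...) {
  saveRDS(object, file, ..., refhook = function(x) {
    # A character return value is stored instead of the reference object;
    # NULL means "serialize this object normally".
    if (is.environment(x) && is_rvar_cache(x)) "" else NULL
  })
}

# Read such a file back, turning each placeholder into a fresh
# empty environment.
read_rds_without_rvar_cache <- function(file) {
  readRDS(file, refhook = function(x) new.env())
}
```

With these, save_rds_without_rvar_cache(df, "df_tibble.rds") followed by read_rds_without_rvar_cache("df_tibble.rds") would stand in for the explicit refhook arguments. The read hook turns every placeholder into an empty environment, so it is only appropriate for files written by the matching save helper.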

kthayashi · 2023-11-01T22:37:49Z

The data frame in question is indeed a tibble (apologies, I forgot to clarify that upfront). Thank you so much for walking through the issue and the suggested workarounds. It looks like sticking to a data frame might best suit my needs for now, but I agree that having some sort of function in the package that helps handle this (sort of like how cmdstanr has the $save_object() method) would be very convenient!

mjskay · 2023-11-01T23:48:32Z

Hmm yeah. Might be able to do something that walks an object tree and just clears all rvar caches for saving.

mjskay mentioned this issue May 22, 2024

Bypass vec_proxy if it cannot be implemented in constant time? r-lib/vctrs#1411

Open

mjskay mentioned this issue Dec 18, 2024

Standard interface for vectors in S7 (vs vctrs) RConsortium/S7#514

Open
