
qs file format as cache file format #304

Merged
merged 15 commits into KentonWhite:master
Dec 8, 2020

Conversation

gisler
Contributor

gisler commented Nov 23, 2020

Types of change

  • Bugfix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Other (documentation etc)

Pull request checklist

  • Add functionality
  • Add tests
  • Update documentation in man
  • Update website documentation

Proposed changes

As briefly discussed in #225, I implemented "qs" as a possible cache file format (https://CRAN.R-project.org/package=qs). For now, though, the default remains the "RData" file format.

Design considerations:

  • The cache function doesn't require a loaded project to work, and I tried to keep it that way. The drawback is that using the cache function without a loaded project might not pick up possible config overrides.
  • Switching the cache file format is possible without affecting the other format at all. So if .RData files already exist and the file format is changed to .qs, new .qs files are created upon loading the project. This includes the .hash files. The following list shows the file extensions of the different cache file formats:
    • RData: .RData and .hash
    • qs: .qs and .qs.hash

The latter also implies that clear.cache only deletes files of the format specified in the config and leaves the other files untouched (this might need a test, but I first want your opinion on it).
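The per-format file extensions listed above could be captured in a small lookup table. A minimal sketch (the name cache.file.exts and the structure are illustrative, not ProjectTemplate's actual internals):

```r
# Hypothetical mapping from cache file format to its file extensions
cache.file.exts <- list(
  RData = c(obj = ".RData", hash = ".hash"),
  qs    = c(obj = ".qs",    hash = ".qs.hash")
)

# Files belonging to a cached variable "mydata" under the "qs" format
paste0("mydata", cache.file.exts[["qs"]])
# "mydata.qs" "mydata.qs.hash"
```

Because the two formats share no extensions, deleting or rewriting one format's files can never touch the other's.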

@KentonWhite
Owner

Thanks @gisler ! I'll have time over the weekend to review in detail. Please don't think I'm ignoring this...

@gisler
Contributor Author

gisler commented Nov 24, 2020

No worries, we all have our duties, but I appreciate your message nonetheless.

@KentonWhite
Owner

Thanks for this @gisler, And thanks for writing clear tests as well!

I noticed that you are using ifelse statements for implementing the caching strategy. This works if we are never going to support caching formats other than qs, which might not be the case. What do you think about implementing a strategy pattern with a little bit of meta-programming? What I mean is using a pattern like the one used for loading different data types: a simple list with the cache type prefix followed by a function that implements the caching strategy.

For example, lines 228-232 in load.project.R could use a data structure like:

cache.strategies <- list(
  "qs" = "assign(variable, qs::qread(cache.file), .TargetEnv)",
  "R"  = "load(cache.file, envir = .TargetEnv)"
)

Then the code could read:

eval(parse(text = cache.strategies[[config$cache_file_format]]))

Of course, I mean this as pseudo-code and it is not guaranteed to work, etc.

What do you think?

@gisler
Contributor Author

gisler commented Nov 29, 2020

Hi, you're welcome, glad you like it, @KentonWhite. And yes, I did include tests, although I must say I more or less only copied and adapted existing ones.

Anyway, while I very much like metaprogramming (I use it quite often in my DTSg package), I don't think it is necessary here. One could either use an if ... else if ... else statement to include more formats in the future or, probably more elegantly, the rather unknown switch() function: https://stat.ethz.ch/R-manual/R-devel/library/base/html/switch.html

It achieves roughly what you suggest, but without parsing strings. If you like, I can change the statements to make use of this function for future reference.
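A switch()-based dispatch could look roughly like this (a sketch only; load.via.format is a hypothetical name, and cache.file, variable and .TargetEnv are assumed to exist as in ProjectTemplate's cache code):

```r
# Sketch: dispatch on the configured cache file format without parsing strings.
# The unnamed stop() argument acts as switch()'s fall-through default.
load.via.format <- function(cache.file, variable, format) {
  switch(format,
    qs    = assign(variable, qs::qread(cache.file), envir = .TargetEnv),
    RData = load(cache.file, envir = .TargetEnv),
    stop("Unsupported cache file format: ", format)
  )
}
```

Adding a future format then means adding one named branch rather than another nested if ... else.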

@KentonWhite
Owner

KentonWhite commented Nov 29, 2020 via email

@gisler
Contributor Author

gisler commented Nov 29, 2020

Ah, I see. Well, there is one statement for saving, one for loading and one defining the file extensions. If the config value equals the file extension (which it currently does), we can easily remove the need to intervene there. That's a good idea of yours.

Removing the rest is a bit more complicated though, as there are two expressions involved and, for future formats, usually also the requirement to load a package. It should be possible nonetheless using expression(). I'll look into it next week.

@gisler
Contributor Author

gisler commented Dec 3, 2020

Hi @KentonWhite,

I committed a rough implementation of your idea (at least I hope so) yesterday. It still requires some polishing and explanatory comments though:

ProjectTemplate/R/cache.R

Lines 171 to 184 in 88343b8

.cache.formats <- list(
  RData = expression(
    package = "base",
    file_ext = "RData",
    load_expr = load(cache.file, envir = .TargetEnv),
    save_expr = save(list = variable, envir = .TargetEnv, file = cache_filename$obj, ...)
  ),
  qs = expression(
    package = "qs",
    file_ext = "qs",
    load_expr = assign(variable, qs::qread(cache.file), .TargetEnv),
    save_expr = qs::qsave(get(variable, envir = .TargetEnv), file = cache_filename$obj, ...)
  )
)

So with this, cache file formats can in future be specified by a named list of named expressions. One just has to remember to add any newly required packages to the suggested packages list. One drawback is that the expressions used for saving and loading depend on objects living only inside the respective functions. This requires careful explanation.
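Evaluating one of these entries might then look roughly like this (illustrative only; in the actual functions, cache.file and the other referenced objects are set up first):

```r
# Sketch: look up the configured format and run its load expression.
# fmt[["package"]] and fmt[["load_expr"]] index the named expression above;
# cache.file, variable and .TargetEnv must exist in the calling environment.
fmt <- .cache.formats[[config$cache_file_format]]
if (!requireNamespace(fmt[["package"]], quietly = TRUE)) {
  stop("Package ", fmt[["package"]], " is required for this cache file format")
}
eval(fmt[["load_expr"]])
```

This is exactly why the expressions must be evaluated inside the functions that define cache.file and friends, which is the drawback noted above.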

@KentonWhite
Owner

Thanks @gisler! I'll take a more detailed look over the weekend.

@gisler
Contributor Author

gisler commented Dec 3, 2020

Perfect, I finished the polishing and explanations in the meantime.

@gisler
Contributor Author

gisler commented Dec 4, 2020

Sorry, I had to "kick" one final commit into this PR. The code should be "cleaner" now.

@gisler
Contributor Author

gisler commented Dec 4, 2020

@KentonWhite Oh, before I forget it and since you are preparing a release: https://github.com/KentonWhite/ProjectTemplate/blob/f0f07ec386c70674647b55773cf0974905dd7603/R/project.config.R is missing underscore_variables in its documentation.

I don't want to include a fix in this PR, and it's hardly worth opening a separate issue or PR for it, but I can do so if you prefer.

@KentonWhite
Owner

Thanks @gisler, looks good! One question and one comment before I merge:

Have you checked that migrate.project() works as expected? It should, and it is always nice to confirm 👍

Since I needed to bump the version to 0.9.3 to fix an error with the new R release, could you change the version to 0.10.0, please?

@gisler
Contributor Author

gisler commented Dec 8, 2020

Okay, no problem, @KentonWhite. I tested migrating the project and it works, although I wonder why it prints FALSE:

> migrate.project()
Migrating project configuration
[1] FALSE
Your existing project configuration in globals.dcf does not contain up to
date configuration settings in this version 0.10.0 of ProjectTemplate.  They will
be added automatically during migration, but you should review afterward.
Your config.dcf file was missing entries and defaults
have been used.  The missing entries are:
 cache_file_format

It also made me add a suppressWarnings() to .load.config(), which leads to cleaner messages for users.

@gisler
Contributor Author

gisler commented Dec 8, 2020

By the way, I just did a few non-representative benchmarks on my machine:

library(arrow)
library(qs)
library(microbenchmark)
library(data.table)

data(diamonds, package = "ggplot2")
dim(diamonds)

[1] 53940 10

microbenchmark(
  write_parquet(diamonds, "test.parquet", version = "2.0"),
  write_feather(diamonds, "test.feather"),
  qsave(diamonds, "test.qs"),
  save(diamonds, file = "test.RData")
)
Unit: milliseconds
                                                     expr       min        lq      mean    median        uq       max neval
 write_parquet(diamonds, "test.parquet", version = "2.0")  15.31879  15.87638  18.95168  16.46482  18.46235  60.54349   100
                  write_feather(diamonds, "test.feather")  11.07797  12.84449  24.74501  22.12534  30.57151  84.52629   100
                               qsave(diamonds, "test.qs")  25.79260  27.25619  33.19539  27.97628  34.34300  98.84864   100
                      save(diamonds, file = "test.RData") 119.54659 120.99080 122.60779 122.17242 122.56934 190.21427   100
microbenchmark(
  read_parquet("test.parquet"),
  read_feather("test.feather"),
  qread("test.qs"),
  load("test.RData")
)
Unit: milliseconds
                         expr      min       lq      mean   median       uq       max neval
 read_parquet("test.parquet") 3.683474 3.829881  5.178889 3.917530 4.064670  63.03108   100
 read_feather("test.feather") 3.230476 3.459291  4.808098 3.613256 3.767588  62.98305   100
             qread("test.qs") 3.571077 3.724800  9.885440 3.804038 4.074666 247.98589   100
           load("test.RData") 8.170787 8.422276 13.941885 8.486032 8.703510  72.46114   100
n <- 1e6
DT <- data.table(
  integer = 1:n,
  numeric = rnorm(n),
  character = stringi::stri_rand_strings(n, 12)
)

microbenchmark(
  write_parquet(DT, "test.parquet", version = "2.0"),
  write_feather(DT, "test.feather"),
  qsave(DT, "test.qs"),
  save(DT, file = "test.RData"),
  times = 12
)
Unit: milliseconds
                                               expr       min        lq      mean    median        uq       max neval
 write_parquet(DT, "test.parquet", version = "2.0")  699.6421  731.7620  746.7013  751.3195  765.9379  777.2242    12
                  write_feather(DT, "test.feather")  689.9112  719.1069  761.8496  770.7676  798.0216  818.9127    12
                               qsave(DT, "test.qs")  384.6411  392.1583  404.9736  398.8102  420.8910  429.9794    12
                      save(DT, file = "test.RData") 1612.9240 1634.8561 1653.2237 1656.8916 1665.7625 1699.2989    12
microbenchmark(
  read_parquet("test.parquet"),
  read_feather("test.feather"),
  qread("test.qs"),
  load("test.RData"),
  times = 12
)
Unit: milliseconds
                         expr      min       lq     mean   median       uq      max neval
 read_parquet("test.parquet") 240.2281 245.9147 258.9374 247.9435 250.3795 324.1385    12
 read_feather("test.feather") 222.1977 235.2760 325.8992 303.2075 425.3964 487.9860    12
             qread("test.qs") 161.7858 167.5878 213.3876 170.3831 238.4205 426.0841    12
           load("test.RData") 398.0150 401.0278 433.4332 409.4679 475.0042 481.9437    12

I generally used default settings. In the case of DT, the file sizes are:

  • Parquet: 25.9 MB
  • Feather: 26.7 MB
  • qs: 15.9 MB
  • RData: 19.2 MB
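For reference, file sizes like those above can be read off directly with file.size():

```r
# Report each benchmark file's size in MB (files written by the code above)
files <- c("test.parquet", "test.feather", "test.qs", "test.RData")
setNames(round(file.size(files) / 1e6, 1), files)
```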

Scaling DT to 1e7 rows looks like this with three repetitions for writing

Unit: seconds
                                               expr       min        lq      mean    median        uq       max neval
 write_parquet(DT, "test.parquet", version = "2.0")  6.813148  6.888617  6.963231  6.964086  7.038272  7.112459     3
                  write_feather(DT, "test.feather")  7.336708  7.373148  7.465723  7.409587  7.530231  7.650874     3
                               qsave(DT, "test.qs")  4.061056  4.065581  4.068474  4.070105  4.072182  4.074259     3
                      save(DT, file = "test.RData") 15.569769 15.626296 15.741904 15.682823 15.827971 15.973119     3

and like this for reading

Unit: seconds
                         expr      min       lq     mean   median       uq      max neval
 read_parquet("test.parquet") 2.521096 2.525684 2.548139 2.530273 2.561661 2.593049     3
 read_feather("test.feather") 2.329032 2.332010 2.342300 2.334988 2.348934 2.362880     3
             qread("test.qs") 1.609163 1.615720 1.838861 1.622277 1.953710 2.285143     3
           load("test.RData") 3.944234 3.982078 4.316626 4.019922 4.502822 4.985722     3

File sizes scaled approximately linearly.

Leaving one column out at a time changes the timing, but not the ranking, and adding a POSIXct column also doesn't change much:

n <- 1e7
DT <- data.table(
  integer = 1:n,
  numeric = rnorm(n),
  character = stringi::stri_rand_strings(n, 12),
  timestamp = as.POSIXct(1:n, origin = "1970-01-01")
)

microbenchmark(
  write_parquet(DT, "test.parquet", version = "2.0"),
  write_feather(DT, "test.feather"),
  qsave(DT, "test.qs"),
  save(DT, file = "test.RData"),
  times = 3
)
Unit: seconds
                                               expr       min        lq      mean    median        uq       max neval
 write_parquet(DT, "test.parquet", version = "2.0")  7.837889  8.659331  8.979386  9.480773  9.550135  9.619496     3
                  write_feather(DT, "test.feather")  8.493490  8.580802  8.629774  8.668114  8.697916  8.727718     3
                               qsave(DT, "test.qs")  4.232069  4.236450  4.296098  4.240831  4.328113  4.415395     3
                      save(DT, file = "test.RData") 21.064341 21.112579 21.158563 21.160817 21.205675 21.250533     3

Maybe it's different with less random data and non-default settings. I only tried rbinding the diamonds dataset 200 times, resulting in more than ten million rows with ten variables:

Unit: seconds
                                                     expr       min        lq      mean    median        uq       max neval
 write_parquet(diamonds, "test.parquet", version = "2.0")  2.989435  3.051021  3.072574  3.112606  3.114143  3.115681     3
                  write_feather(diamonds, "test.feather")  5.718497  5.860217  6.133362  6.001938  6.340794  6.679651     3
                               qsave(diamonds, "test.qs")  1.916572  1.928476  1.961913  1.940380  1.984583  2.028787     3
                      save(diamonds, file = "test.RData") 24.133917 24.273026 24.345237 24.412135 24.450898 24.489660     3

Anyway, I think qs definitely is a very good cache file format, especially since Parquet and Feather are limited to tabular data as far as I can see.

> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 14393)

Matrix products: default

locale:
[1] LC_COLLATE=German_Austria.1252  LC_CTYPE=German_Austria.1252    LC_MONETARY=German_Austria.1252 LC_NUMERIC=C                   
[5] LC_TIME=German_Austria.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] microbenchmark_1.4-7 qs_0.23.4            data.table_1.13.0    arrow_2.0.0         

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.5          digest_0.6.25       assertthat_0.2.1    R6_2.4.1            formatR_1.7         magrittr_1.5        RcppParallel_5.0.2 
 [8] stringi_1.4.6       rlang_0.4.7         vctrs_0.3.2         stringfish_0.14.2   RApiSerialize_0.1.0 tools_4.0.2         bit64_4.0.2        
[15] glue_1.4.1          purrr_0.3.4         bit_4.0.4           compiler_4.0.2      tidyselect_1.1.0   

P.S. I know it's strange to do such tests once the work is done rather than before, but then, I knew qs would beat RData. Beating Parquet was never the goal.

EDIT: P.P.S. Did some more experimenting: the read and write functions from the feather package outperform those from the arrow package and, together with the fst format, show about the same performance as qs. But this is a very rough estimate.

@KentonWhite
Owner

The migrate test looks bizarre. I will try it on an existing project and see if I get the same issues.

@KentonWhite
Owner

It looks good! merging after tests have passed.

KentonWhite merged commit 6cd3015 into KentonWhite:master Dec 8, 2020
@gisler
Contributor Author

gisler commented Dec 8, 2020

The migrate test looks bizarre. I will try it on an existing project and see if I get the same issues.

Seems like the function prints whether the cache directory was missing, which was FALSE in my case, as it already existed.

And cool, looking forward to making use of it in production. Hopefully there are no bugs.
