Adding caching to `get_file_by_id` #135

beniaminogreen · 2024-08-09T19:45:46Z

It would be useful if the package were able to cache results of calls to the DataVerse API to disk. If the same dataset is requested twice, the result can then be served from the disk instead of re-downloading which would save a lot of time.

Here's a sketch of how I think the behavior could work:

Suggested new behavior for `get_file_by_id`:

Check we are asking to download a specific version of a file. If we are, proceed to the next step. If we are asking for the latest version, the result cannot be cached as the latest version might change between function invocations.
Check if caching is disabled through some sort of Environment variable. If it is, don't cache results either.
Download the file and cache the results.

More complex behavior could be added on in the future such as caching the latest version by checking if the file metadata has changed since the last download, and only re-downloading if there is a new version.

I am prototyping the behavior in the cache branch, and would love feedback on the behavior. Right now, I am caching all calls to get_file_id which have a specific version of the file specified. I will shortly add step 2 which checks if the user wants to turn off caching if we think this is a good way to go.

Best,
Ben

The text was updated successfully, but these errors were encountered:

mtmorgan · 2024-09-12T20:55:35Z

Some comments on 5153bb2 meant to be helpful from my past experience; hope they come across in the right spirit --

use tools::R_user_dir() instead of rappdirs to avoid the need for an additional package installation
use the naming convention get_file_by_id() / get_file_by_id_memoized() so that, for instance with auto completion during code development, it's clear that there are different versions of get_file_by_id*().
minimize the 'fancy stuff' that goes on in .onLoad() to be just, e.g., get_file_by_id_meomoized <<- memoised(get_file_by_id_non_memoised). Put everything else in R/get_file_by_id.R where some future developer is most likely to see the code, and where standard tooling (e.g., code coverage, use of variables) will definitely work
```
#' @export
get_file_by_id <- # the code to execute cached-or-not

## comment indicating what's going on here
get_file_by_id_non_memoized <- # current get_file_by_id

get_file_by_id_memoized <- NULL
```
The definition of get_file_by_id() seen by the user now just has arguments ... (see the man page diff), but one really needs to go to the effort of exposing named arguments to users.
The "version" argument is not documented on get_file_by_id; this argument could be added to the function definition (after ...?), which would be informative to the user and simplify the code (no need for names(list(...)) just if (identical(version, ...)) ... or whatever; also one can easily validate that version is an appropriate value when invoked by the user). But actually it seems like a specific value of version would be required ("latest"?) and maybe it is better to have an argument like use_cache = TRUE instead. I would suggest putting use_cache = TRUE after the ... in the function definition, so that the user is given a hint that this is somehow a control argument rather than an argument to get_file_by_id per se. A reasonable default might be Sys.getenv("DATAVERSE_USE_CACHE", TRUE) allowing an environment variable override.

I'd also really encourage keeping the housekeeping (e.g., removing trailing whitespace in the DESCRIPTION) to a separate commit (and eventual pull request), and editing the commit history on the branch so that it does not contain extraneous (commit one does whitespace changes and other things, commit two undoes the whitespace changes) steps.

kuriwaki · 2024-09-13T14:48:26Z

Much appreciated. Ben or I will consider soon.
Properly tracking the version argument has also been an open issue and it needs documentation (with mwe, #78). I imagine defaulting to "latest" will allow the default to automatically detect new versions, but we should test together.

beniaminogreen · 2024-09-13T16:42:32Z

Thanks for contributing. I will have a look at these suggestions and get back to you soon. @kuriwaki not sure if I mentioned this before, but I had turned on caching only if you specified an exact version of the file. I agree that what you are suggesting is a sensible default - it would be nice to be able to cache the lastest version of the file on disk and only re-download when the file is updated on the Dataverse.

kuriwaki · 2024-09-18T02:05:21Z

Thanks @mtmorgan for the PR. I made a few comments. @beniaminogreen you can see if that does everything you were starting to do in your branch. If so, we can try to merge it in after #136

beniaminogreen added the enhancement label Aug 9, 2024

beniaminogreen assigned kuriwaki and beniaminogreen Aug 9, 2024

kuriwaki mentioned this issue Sep 12, 2024

Faster JSON parser #112

Open

mtmorgan mentioned this issue Sep 15, 2024

Issue 112/135 caching memoise api get #137

Merged

6 tasks

kuriwaki linked a pull request Oct 16, 2024 that will close this issue

Issue 112/135 caching memoise api get #137

Merged

6 tasks

kuriwaki closed this as completed in #137 Oct 16, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Adding caching to `get_file_by_id` #135

Adding caching to `get_file_by_id` #135

beniaminogreen commented Aug 9, 2024

mtmorgan commented Sep 12, 2024

kuriwaki commented Sep 13, 2024

beniaminogreen commented Sep 13, 2024

kuriwaki commented Sep 18, 2024

Adding caching to get_file_by_id #135

Adding caching to get_file_by_id #135

Comments

beniaminogreen commented Aug 9, 2024

Suggested new behavior for get_file_by_id:

mtmorgan commented Sep 12, 2024

kuriwaki commented Sep 13, 2024

beniaminogreen commented Sep 13, 2024

kuriwaki commented Sep 18, 2024

Adding caching to `get_file_by_id` #135

Adding caching to `get_file_by_id` #135

Suggested new behavior for `get_file_by_id`: