Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding caching to get_file_by_id #135

Closed
beniaminogreen opened this issue Aug 9, 2024 · 4 comments · Fixed by #137
Closed

Adding caching to get_file_by_id #135

beniaminogreen opened this issue Aug 9, 2024 · 4 comments · Fixed by #137
Assignees

Comments

@beniaminogreen
Copy link
Collaborator

It would be useful if the package were able to cache results of calls to the DataVerse API to disk. If the same dataset is requested twice, the result can then be served from the disk instead of re-downloading which would save a lot of time.

Here's a sketch of how I think the behavior could work:

Suggested new behavior for get_file_by_id:

  1. Check we are asking to download a specific version of a file. If we are, proceed to the next step. If we are asking for the latest version, the result cannot be cached as the latest version might change between function invocations.
  2. Check if caching is disabled through some sort of Environment variable. If it is, don't cache results either.
  3. Download the file and cache the results.

More complex behavior could be added on in the future such as caching the latest version by checking if the file metadata has changed since the last download, and only re-downloading if there is a new version.

I am prototyping the behavior in the cache branch, and would love feedback on the behavior. Right now, I am caching all calls to get_file_id which have a specific version of the file specified. I will shortly add step 2 which checks if the user wants to turn off caching if we think this is a good way to go.

Best,
Ben

@mtmorgan
Copy link
Collaborator

Some comments on 5153bb2 meant to be helpful from my past experience; hope they come across in the right spirit --

  • use tools::R_user_dir() instead of rappdirs to avoid the need for an additional package installation
  • use the naming convention get_file_by_id() / get_file_by_id_memoized() so that, for instance with auto completion during code development, it's clear that there are different versions of get_file_by_id*().
  • minimize the 'fancy stuff' that goes on in .onLoad() to be just, e.g., get_file_by_id_meomoized <<- memoised(get_file_by_id_non_memoised). Put everything else in R/get_file_by_id.R where some future developer is most likely to see the code, and where standard tooling (e.g., code coverage, use of variables) will definitely work
    #' @export
    get_file_by_id <- # the code to execute cached-or-not
    
    ## comment indicating what's going on here
    get_file_by_id_non_memoized <- # current get_file_by_id
    
    get_file_by_id_memoized <- NULL
  • The definition of get_file_by_id() seen by the user now just has arguments ... (see the man page diff), but one really needs to go to the effort of exposing named arguments to users.
  • The "version" argument is not documented on get_file_by_id; this argument could be added to the function definition (after ...?), which would be informative to the user and simplify the code (no need for names(list(...)) just if (identical(version, ...)) ... or whatever; also one can easily validate that version is an appropriate value when invoked by the user). But actually it seems like a specific value of version would be required ("latest"?) and maybe it is better to have an argument like use_cache = TRUE instead. I would suggest putting use_cache = TRUE after the ... in the function definition, so that the user is given a hint that this is somehow a control argument rather than an argument to get_file_by_id per se. A reasonable default might be Sys.getenv("DATAVERSE_USE_CACHE", TRUE) allowing an environment variable override.

I'd also really encourage keeping the housekeeping (e.g., removing trailing whitespace in the DESCRIPTION) to a separate commit (and eventual pull request), and editing the commit history on the branch so that it does not contain extraneous (commit one does whitespace changes and other things, commit two undoes the whitespace changes) steps.

@kuriwaki
Copy link
Member

Much appreciated. Ben or I will consider soon.
Properly tracking the version argument has also been an open issue and it needs documentation (with mwe, #78). I imagine defaulting to "latest" will allow the default to automatically detect new versions, but we should test together.

@beniaminogreen
Copy link
Collaborator Author

Thanks for contributing. I will have a look at these suggestions and get back to you soon. @kuriwaki not sure if I mentioned this before, but I had turned on caching only if you specified an exact version of the file. I agree that what you are suggesting is a sensible default - it would be nice to be able to cache the lastest version of the file on disk and only re-download when the file is updated on the Dataverse.

@kuriwaki
Copy link
Member

Thanks @mtmorgan for the PR. I made a few comments. @beniaminogreen you can see if that does everything you were starting to do in your branch. If so, we can try to merge it in after #136

@kuriwaki kuriwaki linked a pull request Oct 16, 2024 that will close this issue
6 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants