Only Read/Write from Disk? #1

Open
billdenney opened this issue Sep 6, 2018 · 3 comments

Comments

@billdenney

Thank you for the tool!

I want to use httpcache to help build a data package for United States birth statistics. The files needed to build the package are large (~200 MB each for ~20 years and ~10-100 MB each for 30 years).

I'd like to enable the package building tools to cache the download results so that I don't have to re-download the files multiple times if I update the package later.

As far as I can tell, httpcache is currently designed for in-memory caching. What would you think about adding an on-disk cache, where all queries are cached in a configurable directory?

@nealrichardson
Owner

Thanks for the suggestion. I think it's worth considering alternate backends for the cache, as you suggest. I'll give it some thought, and if you have any ideas for how specifically it might be implemented, please let me know. My initial thought is that my other package httptest effectively has a way to cache responses on disk and read from that cache, though it's not quite set up the way you'd want it for this application.

I also wonder whether httpcache as is might support this, albeit awkwardly. If you use httr::write_disk() in your GETs, you would specify where the files are written to. The response object that httpcache would cache would point to that file and not hold the content in memory. So maybe if you had something like this in your .onLoad() function:

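.onLoad <- function(libname, pkgname) {
    # Build the path directly: system.file() returns "" for a missing file,
    # so the first save would have nowhere to write
    cachefile <- file.path(system.file(package = pkgname), "httpcache.rds")
    if (file.exists(cachefile)) {
        loadCache(cachefile)
    }
    # Save when the session ends; on.exit() would run as soon as .onLoad() returns
    reg.finalizer(parent.env(environment()), function(e) saveCache(cachefile), onexit = TRUE)
}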

you might build up the cache of responses that point to the files that have been written out, and that would have the result you want.
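
To make the write_disk() part of that suggestion concrete, here is a minimal sketch. The helper name fetch_to_disk and the "downloads" directory are hypothetical, and it assumes both httr and httpcache are installed. It only illustrates how a GET with write_disk() might be wrapped, not how httpcache itself handles files.

```r
# Hypothetical helper (not part of httpcache): download `url` into `dir`
# and return the local path. Assumes httr and httpcache are installed.
fetch_to_disk <- function(url, dir = "downloads") {
    dir.create(dir, showWarnings = FALSE, recursive = TRUE)
    dest <- file.path(dir, basename(url))
    # write_disk() streams the body to `dest`, so the response object that
    # httpcache stores refers to the file rather than holding the content.
    resp <- httpcache::GET(url, httr::write_disk(dest, overwrite = TRUE))
    httr::stop_for_status(resp)
    dest
}

# Usage (placeholder URL):
# path <- fetch_to_disk("https://example.com/births_2017.zip")
```

On a cache hit no request is made, so the file written earlier is left in place. That only helps across sessions if the cache itself has been saved and reloaded.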

@billdenney
Author

I’ve been playing around with using httr::write_disk(), and I think that will be the correct direction to go.

For my purposes, that will be fine because I really don’t need to worry about cache invalidation for the most part (historical birth certificate data doesn’t change very much).

For a more general solution, I’m guessing that a write_cache() function would be preferable: it would use a configurable cache directory and auto-generate filenames within it.

httpcache would then need to track which files belong to which cached results, so that dropping a result from the cache deletes its file and clearing the cache deletes the directory.

A more complex case: if you save the cache to disk, the directory should be retained along with any files associated with the saved cache. Any additional files on disk that are not part of a saved cache should be deleted when the R session closes.
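
A rough sketch of that idea in base R follows. Every name here (cache_dir, cache_path, cached_download, drop_cached, clear_cached, and the mycache.dir option) is hypothetical; none of this exists in httpcache. It covers only the simple cases: a configurable directory, auto-generated filenames, dropping one entry, and clearing everything.

```r
# Hypothetical sketch; none of these functions exist in httpcache.

# The cache directory is configurable through an (assumed) option.
cache_dir <- function() {
    getOption("mycache.dir", file.path(tempdir(), "mycache"))
}

# Auto-generate a filename from the URL by sanitising it. Distinct URLs
# could collide after sanitising, so a real version would likely hash.
cache_path <- function(url) {
    file.path(cache_dir(), gsub("[^A-Za-z0-9._-]", "_", url))
}

# Download only when the file is not already in the cache directory.
cached_download <- function(url, fetch = utils::download.file) {
    dest <- cache_path(url)
    if (!file.exists(dest)) {
        dir.create(cache_dir(), showWarnings = FALSE, recursive = TRUE)
        fetch(url, dest, mode = "wb")
    }
    dest
}

# Dropping one entry deletes its file; clearing deletes the directory.
drop_cached <- function(url) unlink(cache_path(url))
clear_cached <- function() unlink(cache_dir(), recursive = TRUE)
```

The fetch argument is there so the download step can be swapped out, for example for httr::GET() with write_disk(). The session-cleanup behaviour described above is not handled here.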

@billdenney
Author

I just learned of an even more general solution: the storr package, https://github.com/richfitz/storr/
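
For reference, a minimal sketch of what an on-disk key-value cache with storr might look like. It assumes the storr package is installed; the helper get_or_compute, the store location, and the key and value are placeholders for illustration.

```r
# Assumes storr is installed; the directory name is illustrative.
st <- storr::storr_rds(file.path(tempdir(), "storr_cache"))

# Hypothetical helper: compute a value once per key, then reuse the copy
# that storr keeps on disk.
get_or_compute <- function(key, compute) {
    if (!st$exists(key)) {
        st$set(key, compute())
    }
    st$get(key)
}

# Placeholder value standing in for a downloaded and parsed file.
births <- get_or_compute("births_2017", function() list(year = 2017))

st$del("births_2017")  # drop one entry
st$destroy()           # remove the whole store from disk
```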
