Rebuild cache if the underlying data changed #276

Open
2 of 5 tasks
Hugovdberg opened this issue Aug 31, 2018 · 9 comments

Comments

@Hugovdberg
Collaborator

Report an Issue / Request a Feature

I'm submitting a (Check one with "x") :

  • bug report
  • feature request

Issue Severity Classification -

(Check one with "x") :

  • 1 - Severe
  • 2 - Moderate
  • 3 - Low
Expected Behavior

When a file in data/ is changed and the resulting variable already exists in the cache, the file should be reloaded and the cache rebuilt.

Current Behavior

Currently, data is cached only after the variable is loaded into memory, and cached variables are not reloaded if the original file has changed.

Version Information
Possible Solution

Update the cache function to also include a file argument, similar to the depends argument. If the digest of the file has changed, reload the file and rebuild the cache. This could be done inside the reader as follows (using the 1.0 reader signature):

csv.reader <- function(file.name, variable.name, ...) {
    cache(variable.name,
          CODE = {
              read.csv(file.name, ...)
          },
          file = file.name
    )
}

This way, assigning the variable in the global namespace is left to cache. The CODE argument is evaluated as it normally is inside the cache function, and the cached value is only updated if the file named in the file argument has changed.
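To make the digest idea concrete, here is a minimal sketch of a file-aware cache. It is not ProjectTemplate's cache() implementation: the names cache_with_file and .file_cache are hypothetical, the cache lives in memory rather than in cache/, and MD5 via tools::md5sum is just one possible digest.

```r
# Hypothetical in-memory store; ProjectTemplate itself caches to disk.
.file_cache <- new.env()

# Sketch only: `cache_with_file` and its `file` argument are assumptions
# made for illustration, not part of the ProjectTemplate API.
cache_with_file <- function(variable, CODE, file) {
    # Digest of the file's current contents (tools ships with base R)
    current <- unname(tools::md5sum(file))
    entry <- .file_cache[[variable]]

    if (is.null(entry) || !identical(entry$digest, current)) {
        # New or changed file: force CODE and remember the digest it was
        # built from. When the digest matches, CODE is never evaluated.
        entry <- list(value = CODE, digest = current)
        .file_cache[[variable]] <- entry
    }

    # Assigning into the global environment is left to the cache function
    assign(variable, entry$value, envir = .GlobalEnv)
    invisible(entry$value)
}

# Usage: the second call is served from the cache, the third reloads.
path <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:3), path, row.names = FALSE)
cache_with_file("example", read.csv(path), file = path)
cache_with_file("example", read.csv(path), file = path)
write.csv(data.frame(x = 4:6), path, row.names = FALSE)
cache_with_file("example", read.csv(path), file = path)
```

Because R evaluates arguments lazily, the read.csv() call passed as CODE only runs when the stored digest is missing or differs from the current one.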

How do you guys feel about this?

Hugovdberg added this to the 1.0 milestone Aug 31, 2018
Hugovdberg self-assigned this Aug 31, 2018
@KentonWhite
Owner

I like the idea of the cache being able to tell whether the file has changed. This should make the workflow easier. The only edge case I can see is researchers working with unstable data and using the cache to capture the particular state they are working with now.

@Hugovdberg
Collaborator Author

This could actually be improved by this change: if you cache the files once and then set data_loading = FALSE, cache_loading = TRUE in the config or in your call to [re]load.project(), the files are loaded from the cache. You could even exclude certain volatile files with data_ignore.
I think we should treat researchers who have volatile data in data/ that should not always be reloaded as the exception, and improve the workflow for the majority of people.
Of course, we should make sure cache_loading = FALSE, data_loading = TRUE still works as expected.

@bugsysiegals
Contributor

To tell whether a file has changed, can we just compare the modification date of the file with the creation date of the cache file? I believe I have seen another "Reproducible Research" project that used a makefile in this way to process only specific files.

@bugsysiegals
Contributor

Rather than implementing this in the cache function, wouldn't it be better to implement it directly in the loading function to automate this process? Perhaps a yes/no question could be asked to allow the user not to load the new file...

@KentonWhite
Owner

Comparing created and modified timestamps is risky. Sometimes modified timestamps are updated by the operating system even though nothing has changed in the file.
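A small illustration of that risk, as a sketch using only base R: the file's modification time is rewritten with Sys.setFileTime() to simulate a tool or the OS touching it, while a content digest (MD5 via tools::md5sum here, chosen only for the example) stays the same.

```r
# Write a small file and record its timestamp and content digest
path <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:3), path, row.names = FALSE)

mtime_before  <- file.mtime(path)
digest_before <- unname(tools::md5sum(path))

# Simulate something touching the file without altering its contents
Sys.setFileTime(path, mtime_before + 3600)

# A timestamp comparison reports a change; a digest comparison does not
mtime_changed  <- file.mtime(path) != mtime_before
digest_changed <- !identical(unname(tools::md5sum(path)), digest_before)
```

A timestamp-based check would trigger a reload here even though the data is identical, which is the case a digest comparison avoids at the cost of reading the file to hash it.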

Asking a user each time a cache file is being updated is also error prone. With many files, the question becomes a nuisance and the user mindlessly hits "y".

Currently, you can pass a list of variable names to clear.cache to rebuild a particular cache.

@bugsysiegals
Contributor

Excellent points, thanks for the clarity.

From an automation standpoint, one would simply call clear.cache() prior to load.project() for a full reload?

Perhaps someday another function could be added, or a parameter could be passed into load.project, that compares files. It’s not critical, but it would allow a person to automate E2E and produce results as quickly as possible without needing to reload very large unchanged datasets.

@KentonWhite
Owner

Yes, call clear.cache() before load.project(). What I do is call clear.cache() with the datasets I expect will be updated. I'll often make a call to a database. It's difficult for ProjectTemplate to tell if the database has changed, so I'll call clear.cache() with the name of the dataset read from the database. In an automated workflow the database is refreshed and everything else stays the same.

@bugsysiegals
Contributor

Yes, it would really only benefit those who are pulling in files. I'll also be trying to connect to DBs where possible but of course will have to rely on some files. At the end of the day, a few extra minutes to load data isn't going to matter unless I'm sitting there watching it load and getting impatient! :)

@Hugovdberg
Collaborator Author

Hugovdberg commented Sep 27, 2018 via email


3 participants