
Incremental updates #29

Open
martin-ueding opened this issue Apr 14, 2020 · 0 comments

A recurring problem was that the addition of a single fit range would require all the other fits to be run as well. We had discussed this with various people over time, and every time we reached the conclusion that it just wasn't easy to implement incremental updates in a safe and consistent way.

Now, while writing about the shortcomings of paramvalf in my thesis, I explained why it is so complicated and why a certain solution would not work, until I realized that it might actually work. So let's have this discussion again, please.

The problem is that the results depend on multiple things:

  1. Input parameters
  2. Input values
  3. R code in paramval
  4. R code in R
  5. R code in ~/R

We always ignore the fifth item, and recently the fourth has been added to the dependencies that are represented in the Makefile. Currently the first two are tracked via the files that hold the PV objects: if they get updated, then everything that uses them needs to be updated as well; there is no granularity in that. Number 3 is the easiest, as it is the code that actually needs to be executed. For number 4 we have file granularity, so it is best to have many small function files such that functions can be edited independently of each other.

For numbers 1 and 2 we could just hash everything. Basically we would hash the row of the param section and the value list that gets passed to the function. If we can find that hash in the old result file, we just copy the corresponding value section and are done instantly. Otherwise we need to do the computation. The resulting file needs to be built up and written to disk; the time stamps would then match.
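To make this concrete, here is a minimal sketch of the hashing and lookup, using the digest package (discussed in the first bullet below). The helpers make_key and lookup_or_compute, and the layout where old values are keyed by their hash, are illustrative assumptions, not existing paramvalf functions:

    library(digest)

    # Hash one unit of work: the parameter row together with the value
    # list that gets passed to the function.
    make_key <- function(param_row, value) {
        digest::digest(list(param = param_row, value = value), algo = "sha1")
    }

    # Re-use the old value section on a hash hit, compute otherwise.
    # `old$values`, keyed by hash, is an assumed storage layout.
    lookup_or_compute <- function(key, old, compute) {
        if (!is.null(old) && key %in% names(old$values)) {
            old$values[[key]]
        } else {
            compute()
        }
    }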

There are two issues that I see:

  • We would need a hash digest that is stable. This might become a pain when environments and references come into play. The serialization to file somehow manages this, but I am not sure whether it is stable. We could always write to RDS, take the hash digest of that file, and then delete the file.

    I have found the digest package. It seems that loading a PV twice in separate R sessions gives stable results!

    > pv_load('3pi', kinematics)
    Loading kinematics ... took 0.07 seconds.
    > digest::sha1_digest(kinematics)
    [1] "6de08aeca2cc66caf0e73c06239699ae4151af47"
    
  • The output filename would be needed in pv_call in order to load old results if they exist. We would need to change the function signature from result <- pv_call(func, data) back to pv_call(result, func, data) and make pv_save part of pv_call. Existing call sites could be converted with a regular expression.
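A rough sketch of what that changed signature could look like, reusing make_key and lookup_or_compute from the sketch above. The helper pv_path and the internal structure of data are assumptions for illustration, not the real paramvalf API:

    # pv_call now receives the result name so it can find the old file.
    pv_call <- function(result_name, func, data) {
        path <- pv_path(result_name)  # hypothetical name-to-file mapping
        old <- if (file.exists(path)) readRDS(path) else NULL  # first run: no file

        # One lookup per parameter row: copy the old value on a hit,
        # run `func` only on a miss.
        keys <- character(nrow(data$param))
        values <- list()
        for (i in seq_len(nrow(data$param))) {
            keys[i] <- make_key(data$param[i, ], data$value[[i]])
            values[[keys[i]]] <- lookup_or_compute(
                keys[i], old, function() func(data$param[i, ], data$value[[i]]))
        }

        result <- list(param = data$param, values = values)
        saveRDS(result, path)  # pv_save moves into pv_call
        invisible(result)
    }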

The crucial part is that we don't miss edge cases and that the scheme is robust. It does not help to save time if there are spurious inconsistencies. I think that we are in good shape; the following cases are taken care of:

  • Changing a parameter will also change the hash for param.
  • Adding or removing a parameter will result in a different hash for the parameter row.
  • Addition, removal or change of a value within one PV that is passed to pv_call will change the hash of value.
  • Joining in another PV will also change the value, and therefore the hash.
  • Parameter files, like those for the fit ranges or bootstrap parameters, are converted into PV objects. If they are not, there still are the # Depends: comments in the code file, so the whole code would be run if that file changes. For fit ranges, the CSV file needs to be converted to a PV object first, but then the pv_call accepting the fit ranges PV object would be able to re-use results.
  • If something cannot be hashed (this does not seem to occur in R, though), we can just assume that it has changed and do the computation as before; see the sketch after this list.
  • If the result file does not exist (first run), we simply create it.
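As a sketch of that fallback: a wrapper around make_key from above that maps hashing failures to a key which always misses the cache, so the computation runs exactly as it does today.

    # If hashing throws, return a per-call unique key that can never be
    # found in the old result, forcing a fresh computation.
    safe_key <- function(param_row, value) {
        tryCatch(
            make_key(param_row, value),
            error = function(e)
                sprintf("unhashable-%d", sample.int(.Machine$integer.max, 1))
        )
    }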

Did I miss something? If not, it does not sound hard conceptually, and we could think about implementing it.
