One long-standing issue was that adding a single fit range would require all other fits to be run again as well. We had discussed this with various people over time, and every time we reached the conclusion that it just wasn't easy to implement incremental updates in a safe and consistent way.
Now, while writing about the shortcomings of paramvalf in my thesis, I was explaining why it is so complicated and why a certain solution would not work, until I realized that it might actually work. So let's have this discussion again, please.
The problem is that the results depend on multiple things:
1. Input parameters
2. Input values
3. R code in paramval
4. R code in R
5. R code in ~/R
We always ignore the fifth, and recently the fourth item has been added to the dependencies that are represented in the Makefile. Currently the first two are tracked via the files that hold the PV objects. If they get updated, then everything that uses them needs to be updated; there is no granularity in that. Number 3 is the easiest, as it is the code that actually needs to be executed. For Number 4 we have file granularity, so it is best to have many small function files such that functions can be edited independently of each other.
For Numbers 1 and 2 we could just hash everything. Basically we would hash the row of the param section and the value list that gets passed to the function. If we can find that hash in the old result file, we just copy its value section and are done instantly. Otherwise we need to do the computation. The resulting file needs to be built up and written to disk; the time stamps would then match.
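A minimal sketch of that lookup, assuming a PV carries its parameters as a param data frame row and its values as a list; the helper names are made up for illustration:

```r
library(digest)

## Hypothetical helpers: compute a key from one param row plus the value
## list that would be passed to the user function, and look it up in the
## previously written results before recomputing.
make_key <- function(param_row, value) {
  digest(list(param = param_row, value = value), algo = "sha256")
}

lookup_or_compute <- function(key, old_keys, old_values, compute) {
  hit <- match(key, old_keys)
  if (!is.na(hit)) {
    old_values[[hit]]   # copy the cached value section, done instantly
  } else {
    compute()           # otherwise do the actual computation
  }
}
```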
There are two issues that I see:
We would need a hash digest that is stable. This might become a pain when environments and references come into play. The serialization to file somehow manages this, but I am not sure whether that is stable. We could always write to RDS, take the hash digest of that file and then delete the file.
I have found the digest package. It seems that loading the same PV twice in separate R sessions gives stable digests!
The output filename would be needed in pv_call in order to load old results if they exist. We would need to change the function signature from result <- pv_call(func, data) back to pv_call(result, func, data) and make pv_save part of pv_call. Existing call sites could be converted with a regular expression.
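Regarding the first issue, a possible safety net, should in-memory hashing ever turn out to be unstable, is to go through a temporary serialized file as mentioned above (a sketch, not part of paramvalf):

```r
## Fallback: serialize to a temporary RDS file, hash that file, delete it.
hash_via_file <- function(x) {
  tmp <- tempfile(fileext = ".rds")
  saveRDS(x, tmp)
  on.exit(unlink(tmp))
  digest::digest(tmp, file = TRUE, algo = "sha256")
}
```

And here is a hedged, simplified sketch of what a cache-aware pv_call(result, func, data) could look like, with pv_save folded in. The file layout, the func(param_row, value) convention and all names are assumptions for illustration, not the actual paramvalf internals:

```r
library(digest)

pv_call_cached <- function(result, func, data, dir = ".") {
  result_name <- deparse(substitute(result))         # gives the output file name
  out_file <- file.path(dir, paste0(result_name, ".rds"))

  old <- list(keys = character(0), value = list())
  if (file.exists(out_file)) old <- readRDS(out_file)  # load old results if any

  ## Anything that cannot be hashed is treated as changed and recomputed.
  safe_key <- function(x) {
    tryCatch(digest(x), error = function(e) paste0("unhashable-", runif(1)))
  }

  n <- nrow(data$param)
  keys <- character(n)
  value <- vector("list", n)
  for (i in seq_len(n)) {
    keys[i] <- safe_key(list(param = data$param[i, ], value = data$value[[i]]))
    hit <- match(keys[i], old$keys)
    if (!is.na(hit)) {
      value[[i]] <- old$value[[hit]]                       # reuse cached result
    } else {
      value[[i]] <- func(data$param[i, ], data$value[[i]]) # recompute
    }
  }

  new_pv <- list(param = data$param, keys = keys, value = value)
  saveRDS(new_pv, out_file)                # pv_save folded into pv_call
  assign(result_name, new_pv, envir = parent.frame())
  invisible(new_pv)
}
```

Call sites would then change from something like result <- pv_call(func, data) to pv_call_cached(result, func, data); a substitution along the lines of s/(\w+) *<- *pv_call\(/pv_call(\1, / is the kind of regular expression that could do that rewrite mechanically.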
The crucial part is that we don't miss edge cases and that it is robust. It does not help to save time if there are spurious inconsistencies. I think we are good here; the following cases are taken care of:
Changing a parameter will also change the hash for param.
Adding or removing a parameter will result in a different hash for the parameter row.
Addition, removal or change of a value within one PV that is passed to pv_call will change the hash of value.
Joining in another PV will also change the value, and therefore the hash.
Parameter files, like the ones for the fit ranges or bootstrap parameters, are converted into PV objects. If they are not, there still are the # Depends: comments in the code file, so the whole code would be re-run if that file changes. For the fit ranges, the CSV file needs to be converted into a PV object first, but then the pv_call accepting the fit-range PV object would be able to reuse results (a possible conversion is sketched after this list).
If something cannot be hashed (seems not to occur in R, though), we can just assume that it has changed and still do it as before.
If the result file does not exist (first run), we simply create it.
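Two of the points above can be checked directly with the digest package; the parameter names here are made up:

```r
library(digest)

p1 <- data.frame(L = 24, beta = 3.9)
p2 <- transform(p1, beta = 4.1)   # changed parameter
p3 <- cbind(p1, mu = 0.003)       # added parameter

digest(p1) == digest(p2)   # FALSE: changing a parameter changes the hash
digest(p1) == digest(p3)   # FALSE: adding a parameter changes the hash
```

And a hedged sketch of the fit-range conversion mentioned above, assuming a PV is essentially a param data frame plus a parallel value list; the file name and column handling are placeholders:

```r
## Hypothetical: turn a fit-range CSV into a PV-like object so that the
## caching pv_call can match the individual rows by hash.
fit_ranges_to_pv <- function(path = "fit_ranges.csv") {
  fr <- read.csv(path, stringsAsFactors = FALSE)
  list(
    param = fr,                                          # one row per fit range
    value = replicate(nrow(fr), list(), simplify = FALSE)
  )
}
```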
Did I miss something? If not, it does not seem hard conceptually and we could think about implementing it.