-
Notifications
You must be signed in to change notification settings - Fork 190
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Checkpointer revival #326
Checkpointer revival #326
Conversation
I think |
I’m happy with output_writers.jl; I think all the code in that file is appropriately related. The checkpointer could provide its own constructor to avoid excess memory allocation. For GPU problems I don’t think there is an issue: checkpointed arrays can b loaded into CPU memory rather than GPU memory, and then the data can be copied into the fields allocated by the model constructor. So at first glance the excess memory allocation does not seem like a major issue on modern CPUs. I am particularly concerned about the maintainability of the checkpointer, since it will need to be updated every time a new feature is added. Let’s make sure the design is easy to maintain before merging. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some work to do I think but this is a good start.
src/output_writers.jl
Outdated
|
||
function savesubfields!(file, model, name, flds=propertynames(getproperty(model, name))) | ||
for f in flds | ||
file["$name/$f"] = Array(getproperty(getproperty(model, name), f).data.parent) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Put an if
statement that does not save a field if its type is Function
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds like a good check. Won't be needed as long we maintain checkpointed_fieldsets
but good check either way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that we should allow users to supply the modelfields
that are to be checkpointed. In that case, such a check will be important.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Continuing from my comment below, the if
-statement can also emit a warning that field x
will not be saved.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ali-ramadhan was this resolved?
src/output_writers.jl
Outdated
output_frequency :: Int | ||
end | ||
|
||
function Checkpointer(; output_frequency, dir=".", prefix="checkpoint", force=false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The kwarg 'force' is not used here. It should either be used, or removed from the function signature.
In the JLD2OutputWriter
, the kwarg force
indicates whether file creation should be 'forced' (it corresponds to the same keyword passed to mkpath
.
src/output_writers.jl
Outdated
checkpointed_fieldsets = [:velocities, :tracers, :G, :Gp] | ||
|
||
function write_output(model, c::Checkpointer) | ||
@warn "Checkpointer will not save forcing functions, output writers, or diagnostics. They will need to be " * |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it make sense to have a warning that is always printed? Perhaps is it simply better to document this aspect of the checkpointer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm, might then be good to just print the warning if there is a non-zero forcing function, or an output writer or diagnostic included.
Then yeah warning may be removed if the checkpointer is well-documented.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is what I propose:
-
Allow users to set the
modelfields
that are to be checkpoints as an argument to theCheckpointer
constructor -
If one of those fields that the user has asked to be saved contains a function, emit a warning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ali-ramadhan was this resolved?
src/output_writers.jl
Outdated
_arr(::GPU, a) = CuArray(a) | ||
|
||
function restore_from_checkpoint(filepath) | ||
@warn "Checkpointer cannot restore forcing functions, output writers, or diagnostics. They will need to be " * |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See above. What we need is documentation about proper checkpointing, which will include information about how to restore a model from a checkpoint that includes functions. In fact, we could even provide features in the checkpointer that streamline this process (by indicating parts of the model structure that are associated with functions, and asking the user to provide those functions during checkpoint restoration).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah warning doesn't have be in restore_from_checkpoint
. Passing forcing functions is a good idea.
k1, k2 = round(Int, Nz/4), round(Int, 3Nz/4) | ||
true_model.tracers.T.data[i1:i2, j1:j2, k1:k2] .+= 0.01 | ||
|
||
checkpointed_model = deepcopy(true_model) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is it necessary to copy the model before checkpointing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's creating a second model that will be checkpointed as opposed to true_model
which isn't.
Probably over paranoid but I wanted the two models to be time-stepped separately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you worried that checkpointing will modify model
?
Since this behavior is undesirable / unexpected, perhaps it should simply be tested for, so that we can assume that the model is not modified.
src/output_writers.jl
Outdated
end | ||
|
||
checkpointed_structs = [:arch, :boundary_conditions, :grid, :clock, :eos, :constants, :closure] | ||
checkpointed_fieldsets = [:velocities, :tracers, :G, :Gp] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
G
and Gp
are not fields of model
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True. This was before PR #325 was merged and included the tendencies with AdamsBashforthTimestepper
. Will update.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ali-ramadhan was this resolved?
src/output_writers.jl
Outdated
_arr(::CPU, a) = a | ||
_arr(::GPU, a) = CuArray(a) | ||
|
||
function restore_from_checkpoint(filepath) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function must have a ; kwargs...
argument which is then merged with the retrieved kwargs
from the checkpoint file before being passed to the model constructor. This is needed to restore models with forcing functions or non-default boundary conditions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Second review: add a field (perhaps checkpointed_fields
) to the Checkpointer
that allows the user to control which subfields of model
are checkpointed. Use a kwarg to set the default subfields to current list. The subfields that contain Field
s (like velocities
) should be included the same list as subfields like constants
(for simplicity, one list is best), and a function should be used to properly save a struct
depending on whether it contains Field
s, Functions
, or neither.
src/output_writers.jl
Outdated
# Checkpointing model properties that we can just serialize. | ||
[file["$p"] = getproperty(model, p) for p in checkpointed_structs] | ||
|
||
# Checkpointing structs containing fields. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps this step can be combined with the first, along with an if statement that inspects the content of the struct to be saved and determines whether it contains Fields
or not.
src/output_writers.jl
Outdated
return nothing | ||
end | ||
|
||
checkpointed_structs = [:arch, :boundary_conditions, :grid, :clock, :eos, :constants, :closure] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These fields could be made properties of the Checkpointer
. This will allow the user to add/remove fields from checkpointing, which may be generally useful in the future and is currently useful for omitting boundary conditions from checkpointing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ali-ramadhan was this resolved?
I'll work on this until we're happy with the interace but will leave restoring massive models (that fill up memory) properly for another PR.
I still find it a little messy and hard to navigate. Maybe a good solution for now would be to reorganize the file into sections, and shared functions can be moved to the top.
Eventually the
Might make more sense. In general, the checkpointer should be flexible but you'd probably only need to change what's checkpointed for esoteric cases. E.g. if you remove something important, you'll |
This is not the only issue. Another issue is if new subfields of model are added that contain non-checkpointable elements, like forcing functions. In that sense the To address this, we need a generalized checkpointer that works in a wide range of circumstances, is relatively automated, modular, and can be user-modified. I'm happy to help develop these features. One key is a function that can query a struct recursively to determine if it contains any references to |
Codecov Report
@@ Coverage Diff @@
## master #326 +/- ##
===========================================
- Coverage 74.74% 61.43% -13.31%
===========================================
Files 22 24 +2
Lines 1176 1281 +105
===========================================
- Hits 879 787 -92
- Misses 297 494 +197
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #326 +/- ##
==========================================
+ Coverage 74.74% 76.06% +1.31%
==========================================
Files 22 22
Lines 1176 1224 +48
==========================================
+ Hits 879 931 +52
+ Misses 297 293 -4
Continue to review full report at Codecov.
|
I've changed things a bit in response to your suggestions. Mainly, the structs and fieldsets to checkpoint can be specified through the @glwagner Lemme know if this looks okay to merge. |
src/output_writers.jl
Outdated
mkpath(dir) | ||
return Checkpointer(dir, prefix, output_frequency) | ||
end | ||
|
||
function savesubfields!(file, model, name, flds=propertynames(getproperty(model, name))) | ||
for f in flds | ||
file["$name/$f"] = Array(getproperty(getproperty(model, name), f).data.parent) | ||
if name ∉ (:forcing) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit confused about how this function works:
- why restrict this functionality to
Field
(as it looks to me), but provide a hook whenname=:forcing
? - can't we check the
type
of the object being saved? Other fields (like boundary conditions, or future fields that we add to model) may be functions.
I think it'd be better to use a function that dispatches on the type of the object being saved, eg
savefield(file, location, field) = file[location] = field
savefield(file, location, field::AbstractArray) = file[location] = Array(field)
savefield(file, location, field::Field) = file[location] = Array(field.data.parent)
savefield(file, location, field::Function) = warn("Cannot checkpoint Function object into $location!")
Using such a function may be more robust and general.
This reverts commit cbd4a8d.
Good suggestion. Actually, this should help fix one problem I've been having with JLD2 output: Julia ranges like You might want to serialize the grid when checkpointing to easily restore from a checkpoint file. But when saving the grid to a JLD2 output file, which may be read by a language other than Julia, the grid properties should be saved in a language-agnostic manner. Same for boundary conditions. So I changed the way structs are saved to disk for both the JLD2 output writer and the checkpointer. It's all done recursively via multiple dispatch so it should be flexible enough to work for all current When saving stuff to disk like a JLD2 file, When checkpointing, We checkpoint structs that are important for timestepping. Diagnostics and output writers are not checkpointed, as they are not essential and can be added in any time after model constructions. But if one or more boundary conditions contain a function, There is one mess bits associated with restoring from a checkpoint:
@glwagner Let me know what you think. This PR has been open for a while so I'd like to merge it ASAP and work on more pressing issues. |
Checkpointer revival Former-commit-id: 35a6a05
This PR reintroduces checkpointing. This time the checkpointer saves the bare essentials to a JLD2 file.
Checkpointer will not save forcing functions as before (see #141).
Need to test GPU checkpointing, checkpointing with LES closures (I assumed
model.diffusivities
does not need to be checkpointed), and checkpointing with fancier boundary conditions.A major issue with this checkpointer is that it can't restore the largest models because it creates a
Model
and then copies the data from the checkpoint file into the model fields. But if the model is using up all memory, then there's no room for reading data from disk to memory.It should feed the data directly through the Model constructor, but this would require refactoring some of the model and field code.
This can be addressed in this PR or in a future PR, although it's kind of useless because checkpointing becomes more important the larger the model/simulation...
Resolves #324