feat: det.keras.DeterminedCallback #10075
Conversation
Force-pushed from a54c57c to b7c500a
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## searcher-context-removal #10075 +/- ##
=============================================================
- Coverage 58.24% 45.87% -12.38%
=============================================================
Files 742 165 -577
Lines 101672 15946 -85726
Branches 3599 0 -3599
=============================================================
- Hits 59220 7315 -51905
+ Misses 42319 8631 -33688
+ Partials 133 0 -133
Flags with carried forward coverage won't be shown.
docs/model-dev-guide/api-guides/apis-howto/api-core-ug-basic.rst (outdated; resolved)
docs/model-dev-guide/api-guides/apis-howto/api-core-ug-basic.rst (outdated; resolved)
checkpoint=checkpoint,
continue_id=continue_id,
# Iris epochs are very short, so we don't even bother to save checkpoints until we finish.
checkpoint_epochs=0,
if checkpointing is when preemption gets checked, doesn't this basically mean this example won't really work with asha at all?
Preemption is checked every epoch, always. The logic is a little more sophisticated than when I last walked you through it.
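For reference, a minimal sketch of the behavior being described, assuming a Keras callback holding a Core API context (this is not the PR's implementation; the class and attribute names are made up):

import tensorflow as tf


class PreemptionCheckSketch(tf.keras.callbacks.Callback):
    """Check for preemption once per epoch, independent of checkpoint_epochs."""

    def __init__(self, core_context):
        super().__init__()
        self._core = core_context

    def on_epoch_end(self, epoch, logs=None):
        # should_preempt() is part of Determined's Core API; it returns True once the
        # master has asked this trial to shut down (for example, when the experiment is paused).
        if self._core.preempt.should_preempt():
            # Stop training cleanly; the real callback would also need to persist enough
            # state here so that training can resume later.
            self.model.stop_training = True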
self.model.stop_training = True

# Remember how many batches we trained, for next time.
if self._is_chief and self.params["verbose"] != 0:
why only do this if self.params["verbose"] != 0? looks like we save a checkpoint with self._training_length?
Because self._training_length is purely a progress-printing detail.
eh, it is slightly safer to always update self._training_length and self._validation_length, since we always save/restore them.
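A hypothetical sketch of that suggestion, with the bookkeeping decoupled from the printing (names such as _is_chief and the print format are placeholders, not the PR's code):

import tensorflow as tf


class LengthTrackingSketch(tf.keras.callbacks.Callback):
    def __init__(self, is_chief: bool):
        super().__init__()
        self._is_chief = is_chief
        self._training_length = 0

    def on_train_batch_end(self, batch, logs=None):
        # Always remember how many batches we trained, since this value is part of the
        # state that gets saved and restored with checkpoints.
        self._training_length = batch + 1
        # Only the cosmetic progress printing is gated on verbosity.
        if self._is_chief and self.params["verbose"] != 0:
            print(f"trained {self._training_length} batches so far")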
return

# Don't report more often than 10% increments.
percent_10 = int((batches / total) * 10) * 10
nit: could this logic be in the callers, and could it just be based off the batch? like
if batch % 10 == 0:
    self._print_progress(...)
What you wrote prints every 10 batches, and what I have prints every 10 percent of progress, they are not equivalent.
In particular, I recall this was tricky to write (I copied it here from our _DeterminedProgress callback for TFKerasTrial). What made it tricky was making the output look good with both very many and very few batches per epoch, which is why I track percent_10 and percent separately.
I don't see how pulling anything into the caller would be better.
What you wrote prints every 10 batches, and what I have prints every 10 percent of progress, they are not equivalent.
i'm not really sure what i was thinking. like i knew this, but i swear i still thought my comment made sense somehow lol. maybe i thought why bother with the percents, to which your explanation answers.
anyway, i think it's fine what you have. i only wanted to move it out of the method because it seemed like "print progress" should just print progress, instead of doing some checks as to whether it should print progress.
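For context, here is a small self-contained sketch of the 10%-increment idea under discussion (not the PR's exact code; the function name and print format are made up):

def report_progress(batches: int, total: int, last_percent_10: int) -> int:
    """Print at most once per 10% of progress through the epoch.

    With many batches per epoch this collapses long runs of batches into a handful of
    lines; with very few batches per epoch, each batch that crosses a new decile still
    produces sensible output. Returns the last decile that was reported.
    """
    # Progress rounded down to the nearest 10%, e.g. 37% -> 30.
    percent_10 = int((batches / total) * 10) * 10
    if percent_10 <= last_percent_10:
        return last_percent_10
    # The exact percent is still printed, so short epochs don't show misleading numbers.
    percent = int((batches / total) * 100)
    print(f"total batches trained: {batches}/{total} ({percent}%)")
    return percent_10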
.. note::
Determined requires you to launch training jobs by submitting them with an
:ref:`experiment-configuration` that tells the Determined master how to start your container. For
:ref:`experiment-configuration` that tells the Determined master how to start your container. For
:ref:`experiment-configuration` which tells the Determined master how to start your container. For
question: don't you need a comma before "which", but not before "that"?
.. note::
Determined requires you to launch training jobs by submitting them with an
:ref:`experiment-configuration` that tells the Determined master how to start your container. For
Keras training, you will always want to wrap your training script in Determined's :ref:`TensorFlow
Keras training, you will always want to wrap your training script in Determined's :ref:`TensorFlow
Keras training, you should wrap your training script in Determined's :ref:`TensorFlow
I'm saying "should always" because dropping the word "always" loses the significance of the statement.
The point is that you should do this always, without concern for what kind of training you are doing. If we wrote "you should" then a reader could say, "oh well they're assuming that I want dtrain but I personally don't care so I can ignore this statement".
the Determined cluster. It reports train and test metrics, it reports progress, it saves
checkpoints, and it uploads them to checkpoint storage. It also handles preemption signals from the
Determined master (such as if you pause your experiment), shutting down training, then it restores
training from where it left off when the experiment continues.
The :class:`~determined.keras.DeterminedCallback`
automatically integrates your training with the Determined cluster. It reports both train and test metrics, reports progress, saves checkpoints, and uploads them to checkpoint storage. Additionally, it manages preemption signals from the Determined master (for example, when you pause your experiment), gracefully halting training and resuming from where it left off.
I'm going to say:
gracefully halting training and *later* resuming from where it left off
Because otherwise it sounds like, "wait why are you stopping and resuming as a result of a preemption signal? Isn't the right action to just stop"? The word "later" serves the purpose of my original "when the experiment continues" to indicate that there is some sort of trigger in between shutting down training and resuming.
makes sense
now there are redundant paragraphs
doh.
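For readers of this thread, a rough usage sketch of what the docs paragraph describes (argument names are taken from the example diff earlier in this PR; build_model and load_data are hypothetical helpers, and the exact signature should be checked against the API reference):

import determined as det
from determined import core
from determined.keras import DeterminedCallback


def main(core_context: core.Context) -> None:
    model = build_model()           # hypothetical: build and compile a tf.keras model
    x_train, y_train = load_data()  # hypothetical: load the training data

    info = det.get_cluster_info()
    assert info is not None, "this sketch assumes it runs on a Determined cluster"

    det_cb = DeterminedCallback(
        core_context,
        checkpoint=info.latest_checkpoint,
        continue_id=info.trial.trial_id,
        checkpoint_epochs=1,
    )
    model.fit(x_train, y_train, epochs=10, callbacks=[det_cb])


if __name__ == "__main__":
    with core.init() as core_context:
        main(core_context)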
***********
In training jobs, a value for ``checkpoint`` should be obtained from
``det.get_cluster_info().latest_checkpoint``, which will automatically be populated with the latest
checkpoint saved by this trial, if there is one.
is it "a value" or "the value"?
e.g.,
In training jobs, the value for checkpoint
should be retrieved from det.get_cluster_info().latest_checkpoint
. This field will automatically contain the most recent checkpoint saved by the trial, if one exists.
It is "a value", because users can put whatever checkpoint they want to restore from in there.
For instance, a user might have a starting checkpoint in mind but also want to support pausing and resuming; in that case they might want to use:
info.latest_checkpoint or my_starting_checkpoint
as the value. I'll elaborate on this sentence; it probably belongs in the user guide.
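A short sketch of that pattern (my_starting_checkpoint is a hypothetical checkpoint UUID):

import determined as det

info = det.get_cluster_info()
assert info is not None, "this sketch assumes it runs on a Determined cluster"

# Hypothetical checkpoint UUID to warm-start from on the very first run.
my_starting_checkpoint = "c0ffee00-0000-0000-0000-000000000000"

# Prefer the trial's own latest checkpoint (so pausing and resuming works), falling
# back to the starting checkpoint when no checkpoint has been saved yet.
checkpoint = info.latest_checkpoint or my_starting_checkpoint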
docs/model-dev-guide/hyperparameter/search-methods/hp-adaptive-asha.rst (outdated; resolved)
reviewed rst files
Mark them as deprecated, add to release notes. Then we can take them out the next release if we remember / have time.
Mark them as deprecated. "The following APIs have been deprecated as of 0.38.0." Then they can be removed from docs with a future release.
Force-pushed from 028b3bc to 76988c0
It's everything you've ever wanted. Or, it's everything _I've_ ever wanted, at least.
Should I actually rip out all of the TFKerasTrial docs? Or just leave them marked as deprecated?
Force-pushed from 76988c0 to bb79598
we've been waiting for this for years 🔥 🚀
Merged 406a7c9 into searcher-context-removal
[ ] release note / deprecations