
feat: det.keras.DeterminedCallback #10075

Merged: 2 commits merged into searcher-context-removal from rb/keras-cb, Oct 22, 2024

Conversation

@rb-determined-ai (Contributor) commented Oct 17, 2024

It's everything you've ever wanted.

Or, it's everything I've ever wanted, at least.

Should I actually rip out all of the TFKerasTrial docs? Or just leave them marked as deprecated?

[ ] release note / deprecations

@rb-determined-ai rb-determined-ai requested a review from a team as a code owner October 17, 2024 06:20
@cla-bot cla-bot bot added the cla-signed label Oct 17, 2024
@determined-ci determined-ci requested a review from a team October 17, 2024 06:20
@determined-ci determined-ci added the documentation Improvements or additions to documentation label Oct 17, 2024

codecov bot commented Oct 17, 2024

Codecov Report

Attention: Patch coverage is 21.00457% with 346 lines in your changes missing coverage. Please review.

Project coverage is 45.87%. Comparing base (a1959b9) to head (bb79598).

| Files with missing lines | Patch % | Lines |
|--------------------------|---------|-------|
| harness/determined/keras/_callback.py | 19.00% | 162 Missing ⚠️ |
| harness/tests/experiment/keras/test_callback.py | 21.07% | 161 Missing ⚠️ |
| harness/determined/core/_distributed.py | 21.73% | 18 Missing ⚠️ |
| harness/determined/keras/callbacks.py | 25.00% | 3 Missing ⚠️ |
| harness/determined/keras/_load.py | 50.00% | 1 Missing ⚠️ |
| harness/determined/keras/_tensorboard_callback.py | 50.00% | 1 Missing ⚠️ |

❗ There is a different number of reports uploaded between BASE (a1959b9) and HEAD (bb79598).

HEAD has 5 uploads less than BASE:

| Flag | BASE (a1959b9) | HEAD (bb79598) |
|------|----------------|----------------|
| harness | 7 | 3 |
| web | 1 | 0 |
Additional details and impacted files
@@                      Coverage Diff                      @@
##           searcher-context-removal   #10075       +/-   ##
=============================================================
- Coverage                     58.24%   45.87%   -12.38%     
=============================================================
  Files                           742      165      -577     
  Lines                        101672    15946    -85726     
  Branches                       3599        0     -3599     
=============================================================
- Hits                          59220     7315    -51905     
+ Misses                        42319     8631    -33688     
+ Partials                        133        0      -133     
| Flag | Coverage Δ |
|------|------------|
| harness | 46.15% <21.00%> (-24.45%) ⬇️ |
| web | ? |

Flags with carried forward coverage won't be shown.

| Files with missing lines | Coverage Δ |
|--------------------------|------------|
| harness/determined/keras/__init__.py | 100.00% <100.00%> (ø) |
| harness/determined/keras/_tf_keras_trial.py | 66.94% <100.00%> (-14.78%) ⬇️ |
| harness/determined/keras/_load.py | 26.00% <50.00%> (-63.59%) ⬇️ |
| harness/determined/keras/_tensorboard_callback.py | 77.77% <50.00%> (+2.77%) ⬆️ |
| harness/determined/keras/callbacks.py | 73.91% <25.00%> (-17.60%) ⬇️ |
| harness/determined/core/_distributed.py | 33.83% <21.73%> (-58.91%) ⬇️ |
| harness/tests/experiment/keras/test_callback.py | 21.07% <21.07%> (ø) |
| harness/determined/keras/_callback.py | 19.00% <19.00%> (ø) |

... and 667 files with indirect coverage changes

checkpoint=checkpoint,
continue_id=continue_id,
# Iris epochs are very short, so we don't even bother to save checkpoints until we finish.
checkpoint_epochs=0,
Contributor:

if checkpoint = check preemption, this basically means this example won't really work with asha at all?

Contributor Author:

Preemption is checked every epoch, always. The logic is a little more sophisticated than when I last walked you through it.
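To make that concrete: a minimal sketch (not the actual implementation; the callback class name is hypothetical) of an every-epoch preemption check built on the Core API's should_preempt(), which is what lets training cooperate with pausing and with searchers like ASHA even when checkpoint_epochs=0:

    import tensorflow as tf

    class PreemptionCheckSketch(tf.keras.callbacks.Callback):
        """Hypothetical illustration only; the real DeterminedCallback is more involved."""

        def __init__(self, core_context):
            super().__init__()
            self._core_context = core_context

        def on_epoch_end(self, epoch, logs=None):
            # Ask the Determined master if this trial should give up its slots
            # (e.g. the experiment was paused, or a searcher closed the trial).
            if self._core_context.preempt.should_preempt():
                # Stop the fit() loop cleanly; state is saved so training can
                # later resume from where it left off.
                self.model.stop_training = True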

harness/determined/keras/_callback.py (outdated; review thread resolved)
harness/determined/keras/_callback.py (outdated; review thread resolved)
harness/determined/keras/_callback.py (review thread resolved)
self.model.stop_training = True

# Remember how many batches we trained, for next time.
if self._is_chief and self.params["verbose"] != 0:
Contributor:

why only do this if self.params["verbose"] != 0? looks like we save a checkpoint with self._training_length?

Contributor Author:

Because self._training_length is purely a progress-printing detail.

Contributor Author:

eh, it is slightly safer to always update self._training_length and self._validation_length, since we always save/restore them.

harness/determined/keras/_callback.py (review thread resolved)
harness/determined/keras/_callback.py (review thread resolved)
return

# Don't report more often than 10% increments.
percent_10 = int((batches / total) * 10) * 10
Contributor:

nit: could this logic be in the callers, and could it just be based off the batch? like

if batch % 10 == 0:
    self._print_progress(...)

Contributor Author:

What you wrote prints every 10 batches, while what I have prints every 10 percent of progress; they are not equivalent.

In particular, I recall this was tricky to write (I copied it here from our _DeterminedProgress callback for TFKerasTrial): making the output look good both with very many batches per epoch and with very few was hard, which is why I track percent_10 and percent separately.

I don't see how pulling anything into the caller would be better.

Contributor:

> What you wrote prints every 10 batches, while what I have prints every 10 percent of progress; they are not equivalent.

i'm not really sure what i was thinking. like i knew this, but i swear i still thought my comment made sense somehow lol. maybe i thought "why bother with the percents", which your explanation answers.

anyway, i think it's fine what you have. i only wanted to move it out of the method because it seemed like "print progress" should just print progress, instead of doing some checks as to whether it should print progress.
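For context, a rough sketch of the percent-increment logic discussed above (hypothetical names; the real code lives in harness/determined/keras/_callback.py):

    class ProgressSketch:
        def __init__(self):
            self._percent_10_reported = -1  # the last 10%-bucket we printed

        def _print_progress(self, batches: int, total: int) -> None:
            if total <= 0:
                return
            # Report at 10% increments of progress, not every N batches, so
            # the output reads well for both very long and very short epochs.
            percent_10 = int((batches / total) * 10) * 10
            if percent_10 <= self._percent_10_reported:
                return
            self._percent_10_reported = percent_10
            # Track the exact percent separately so the printed number is precise.
            percent = int((batches / total) * 100)
            print(f"total batches trained: {batches}/{total} ({percent}%)")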


.. note::
Determined requires you to launch training jobs by submitting them with an
:ref:`experiment-configuration` that tells the Determined master how to start your container. For
Contributor:

Suggested change:
- :ref:`experiment-configuration` that tells the Determined master how to start your container. For
+ :ref:`experiment-configuration` which tells the Determined master how to start your container. For

Contributor Author:

question: don't you need a comma before "which", but not before "that"?

.. note::
Determined requires you to launch training jobs by submitting them with an
:ref:`experiment-configuration` that tells the Determined master how to start your container. For
Keras training, you will always want to wrap your training script in Determined's :ref:`TensorFlow
Contributor:

Suggested change:
- Keras training, you will always want to wrap your training script in Determined's :ref:`TensorFlow
+ Keras training, you should wrap your training script in Determined's :ref:`TensorFlow

Contributor Author:

I'm saying "should always" because dropping the word "always" loses the significance of the statement.

The point is that you should do this always, without concern for what kind of training you are doing. If we wrote "you should" then a reader could say, "oh well they're assuming that I want dtrain but I personally don't care so I can ignore this statement".

the Determined cluster. It reports train and test metrics, it reports progress, it saves
checkpoints, and it uploads them to checkpoint storage. It also handles preemption signals from the
Determined master (such as if you pause your experiment), shutting down training, then it restores
training from where it left off when the experiment continues.
Contributor:

The :class:`~determined.keras.DeterminedCallback` automatically integrates your training with the Determined cluster. It reports both train and test metrics, reports progress, saves checkpoints, and uploads them to checkpoint storage. Additionally, it manages preemption signals from the Determined master (for example, when you pause your experiment), gracefully halting training and resuming from where it left off.

Contributor Author:

I'm going to say:

gracefully halting training and *later* resuming from where it left off

Because otherwise it sounds like, "wait why are you stopping and resuming as a result of a preemption signal? Isn't the right action to just stop"? The word "later" serves the purpose of my original "when the experiment continues" to indicate that there is some sort of trigger in between shutting down training and resuming.

Contributor:

makes sense

Contributor:

now there are redundant paragraphs

Contributor Author:

doh.
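Since the docs passage above describes what the callback does, here is a hedged usage sketch assembled from the keyword arguments visible in this PR's diff (the exact signature, including the core_context positional argument and the continue_id value, is an assumption here, not a definitive API reference):

    import numpy as np
    import tensorflow as tf

    import determined as det
    from determined import core, keras

    def main(core_context: core.Context) -> None:
        # Toy model and data so the sketch is self-contained.
        model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
        model.compile(optimizer="sgd", loss="mse")
        x, y = np.random.rand(64, 4), np.random.rand(64, 1)

        info = det.get_cluster_info()
        assert info is not None, "this sketch assumes it runs on-cluster"
        det_cb = keras.DeterminedCallback(
            core_context,                       # assumed positional argument
            checkpoint=info.latest_checkpoint,  # restore from the trial's latest checkpoint
            continue_id=1,                      # assumed value; identifies the training to continue
            checkpoint_epochs=1,                # save a checkpoint every epoch
        )
        model.fit(x, y, epochs=5, callbacks=[det_cb])

    if __name__ == "__main__":
        with core.init() as core_context:
            main(core_context)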

***********
In training jobs, a value for ``checkpoint`` should be obtained from
``det.get_cluster_info().latest_checkpoint``, which will automatically be populated with the latest
checkpoint saved by this trial, if there is one.
Contributor:

is it "a value" or "the value"?

e.g.,

In training jobs, the value for checkpoint should be retrieved from det.get_cluster_info().latest_checkpoint. This field will automatically contain the most recent checkpoint saved by the trial, if one exists.

Contributor Author:

It is "a value", because users can put whatever checkpoint they want to restore from in there.

For instance, a user might have a particular starting checkpoint in mind but also want to support pausing and resuming; in that case, they might use:

info.latest_checkpoint or my_starting_checkpoint

as the value. I'll elaborate on this sentence; it probably belongs in the user guide.
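A sketch of that pattern (my_starting_checkpoint is a hypothetical placeholder UUID):

    import determined as det

    info = det.get_cluster_info()
    assert info is not None, "this sketch assumes it runs on a Determined cluster"

    # Hypothetical checkpoint UUID to warm-start from on a fresh run.
    my_starting_checkpoint = "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee"

    # Prefer the trial's own latest checkpoint (populated when resuming after a
    # pause or preemption); fall back to the warm-start checkpoint otherwise.
    checkpoint = info.latest_checkpoint or my_starting_checkpoint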

@tara-hpe (Contributor) left a comment:

reviewed rst files

@tara-hpe (Contributor):

> Should I actually rip out all of the TFKerasTrial docs? Or just leave them marked as deprecated?

Mark them as deprecated, add to release notes. Then we can take them out the next release if we remember / have time.

@tara-hpe tara-hpe closed this Oct 17, 2024
@tara-hpe tara-hpe reopened this Oct 17, 2024
@determined-ci determined-ci requested a review from a team October 17, 2024 21:15
@tara-hpe (Contributor):

> Should I actually rip out all of the TFKerasTrial docs? Or just leave them marked as deprecated?

Mark them as deprecated. "The following APIs have been deprecated as of 0.38.0." then they can be removed from docs with a future release.

@rb-determined-ai rb-determined-ai force-pushed the rb/keras-cb branch 2 times, most recently from 028b3bc to 76988c0, October 21, 2024 18:28
@azhou-determined (Contributor) left a comment:

we've been waiting for this for years 🔥 🚀

@rb-determined-ai rb-determined-ai merged commit 406a7c9 into searcher-context-removal Oct 22, 2024
38 of 69 checks passed
@rb-determined-ai rb-determined-ai deleted the rb/keras-cb branch October 22, 2024 17:33
rb-determined-ai and azhou-determined added commits referencing this pull request, October 22–25, 2024.
Labels: cla-signed, documentation