
feat: det.keras.DeterminedCallback #10075

Merged: 2 commits merged into searcher-context-removal from rb/keras-cb, Oct 22, 2024

Conversation

@rb-determined-ai (Contributor) commented Oct 17, 2024

It's everything you've ever wanted.

Or, it's everything I've ever wanted, at least.

Should I actually rip out all of the TFKerasTrial docs? Or just leave them marked as deprecated?

[ ] release note / deprecations

@rb-determined-ai rb-determined-ai requested a review from a team as a code owner October 17, 2024 06:20
@cla-bot cla-bot bot added the cla-signed label Oct 17, 2024
@determined-ci determined-ci requested a review from a team October 17, 2024 06:20
@determined-ci determined-ci added the documentation Improvements or additions to documentation label Oct 17, 2024

codecov bot commented Oct 17, 2024

Codecov Report

Attention: Patch coverage is 21.00457% with 346 lines in your changes missing coverage. Please review.

Project coverage is 45.87%. Comparing base (a1959b9) to head (bb79598).

| Files with missing lines | Patch % | Lines |
|--------------------------|---------|-------|
| harness/determined/keras/_callback.py | 19.00% | 162 Missing ⚠️ |
| harness/tests/experiment/keras/test_callback.py | 21.07% | 161 Missing ⚠️ |
| harness/determined/core/_distributed.py | 21.73% | 18 Missing ⚠️ |
| harness/determined/keras/callbacks.py | 25.00% | 3 Missing ⚠️ |
| harness/determined/keras/_load.py | 50.00% | 1 Missing ⚠️ |
| harness/determined/keras/_tensorboard_callback.py | 50.00% | 1 Missing ⚠️ |

❗ There is a different number of reports uploaded between BASE (a1959b9) and HEAD (bb79598).

HEAD has 5 uploads less than BASE:

| Flag | BASE (a1959b9) | HEAD (bb79598) |
|------|----------------|----------------|
| harness | 7 | 3 |
| web | 1 | 0 |
Additional details and impacted files
@@                      Coverage Diff                      @@
##           searcher-context-removal   #10075       +/-   ##
=============================================================
- Coverage                     58.24%   45.87%   -12.38%     
=============================================================
  Files                           742      165      -577     
  Lines                        101672    15946    -85726     
  Branches                       3599        0     -3599     
=============================================================
- Hits                          59220     7315    -51905     
+ Misses                        42319     8631    -33688     
+ Partials                        133        0      -133     
| Flag | Coverage Δ |
|------|------------|
| harness | 46.15% <21.00%> (-24.45%) ⬇️ |
| web | ? |

Flags with carried forward coverage won't be shown.

| Files with missing lines | Coverage Δ |
|--------------------------|------------|
| harness/determined/keras/__init__.py | 100.00% <100.00%> (ø) |
| harness/determined/keras/_tf_keras_trial.py | 66.94% <100.00%> (-14.78%) ⬇️ |
| harness/determined/keras/_load.py | 26.00% <50.00%> (-63.59%) ⬇️ |
| harness/determined/keras/_tensorboard_callback.py | 77.77% <50.00%> (+2.77%) ⬆️ |
| harness/determined/keras/callbacks.py | 73.91% <25.00%> (-17.60%) ⬇️ |
| harness/determined/core/_distributed.py | 33.83% <21.73%> (-58.91%) ⬇️ |
| harness/tests/experiment/keras/test_callback.py | 21.07% <21.07%> (ø) |
| harness/determined/keras/_callback.py | 19.00% <19.00%> (ø) |

... and 667 files with indirect coverage changes

checkpoint=checkpoint,
continue_id=continue_id,
# Iris epochs are very short, so we don't even bother to save checkpoints until we finish.
checkpoint_epochs=0,
Contributor:

if checkpoint = check preemption, this basically means this example won't really work with asha at all?

Contributor Author:

Preemption is checked every epoch, always. The logic is a little more sophisticated than when I last walked you through it.
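To make that concrete: a minimal sketch (not the actual implementation; the callback class name is hypothetical) of an every-epoch preemption check built on the Core API's should_preempt(), which is what lets training cooperate with pausing and with searchers like ASHA even when checkpoint_epochs=0:

    import tensorflow as tf

    class PreemptionCheckSketch(tf.keras.callbacks.Callback):
        """Hypothetical illustration only; the real DeterminedCallback is more involved."""

        def __init__(self, core_context):
            super().__init__()
            self._core_context = core_context

        def on_epoch_end(self, epoch, logs=None):
            # Ask the Determined master if this trial should give up its slots
            # (e.g. the experiment was paused, or a searcher closed the trial).
            if self._core_context.preempt.should_preempt():
                # Stop the fit() loop cleanly; state is saved so training can
                # later resume from where it left off.
                self.model.stop_training = True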

harness/determined/keras/_callback.py (outdated; review thread resolved)
harness/determined/keras/_callback.py (outdated; review thread resolved)
harness/determined/keras/_callback.py (review thread resolved)
self.model.stop_training = True

# Remember how many batches we trained, for next time.
if self._is_chief and self.params["verbose"] != 0:
Contributor:

why only do this if self.params["verbose"] != 0? looks like we save a checkpoint with self._training_length?

Contributor Author:

Because self._training_length is purely a progress-printing detail.

Contributor Author:

eh, it is slightly safer to always update self._training_length and self._validation_length, since we always save/restore them.

harness/determined/keras/_callback.py (review thread resolved)
harness/determined/keras/_callback.py (review thread resolved)
return

# Don't report more often than 10% increments.
percent_10 = int((batches / total) * 10) * 10
Contributor:

nit: could this logic be in the callers, and could it just be based off the batch? like

if batch % 10 == 0:
    self._print_progress(...)

Contributor Author:

What you wrote prints every 10 batches, while what I have prints every 10 percent of progress; they are not equivalent.

In particular, I recall this was tricky to write (I copied it here from our _DeterminedProgress callback for TFKerasTrial): making the output look good both with very many batches per epoch and with very few was hard, which is why I track percent_10 and percent separately.

I don't see how pulling anything into the caller would be better.

Contributor:

> What you wrote prints every 10 batches, while what I have prints every 10 percent of progress; they are not equivalent.

i'm not really sure what i was thinking. like i knew this, but i swear i still thought my comment made sense somehow lol. maybe i thought "why bother with the percents", which your explanation answers.

anyway, i think it's fine what you have. i only wanted to move it out of the method because it seemed like "print progress" should just print progress, instead of doing some checks as to whether it should print progress.
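For context, a rough sketch of the percent-increment logic discussed above (hypothetical names; the real code lives in harness/determined/keras/_callback.py):

    class ProgressSketch:
        def __init__(self):
            self._percent_10_reported = -1  # the last 10%-bucket we printed

        def _print_progress(self, batches: int, total: int) -> None:
            if total <= 0:
                return
            # Report at 10% increments of progress, not every N batches, so
            # the output reads well for both very long and very short epochs.
            percent_10 = int((batches / total) * 10) * 10
            if percent_10 <= self._percent_10_reported:
                return
            self._percent_10_reported = percent_10
            # Track the exact percent separately so the printed number is precise.
            percent = int((batches / total) * 100)
            print(f"total batches trained: {batches}/{total} ({percent}%)")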


.. note::
Determined requires you to launch training jobs by submitting them with an
:ref:`experiment-configuration` that tells the Determined master how to start your container. For
Contributor:

Suggested change:
- :ref:`experiment-configuration` that tells the Determined master how to start your container. For
+ :ref:`experiment-configuration` which tells the Determined master how to start your container. For

Contributor Author:

question: don't you need a comma before "which", but not before "that"?

.. note::
Determined requires you to launch training jobs by submitting them with an
:ref:`experiment-configuration` that tells the Determined master how to start your container. For
Keras training, you will always want to wrap your training script in Determined's :ref:`TensorFlow
Contributor:

Suggested change:
- Keras training, you will always want to wrap your training script in Determined's :ref:`TensorFlow
+ Keras training, you should wrap your training script in Determined's :ref:`TensorFlow

Contributor Author:

I'm saying "should always" because dropping the word "always" loses the significance of the statement.

The point is that you should do this always, without concern for what kind of training you are doing. If we wrote "you should" then a reader could say, "oh well they're assuming that I want dtrain but I personally don't care so I can ignore this statement".

the Determined cluster. It reports train and test metrics, it reports progress, it saves
checkpoints, and it uploads them to checkpoint storage. It also handles preemption signals from the
Determined master (such as if you pause your experiment), shutting down training, then it restores
training from where it left off when the experiment continues.
Contributor:

The :class:`~determined.keras.DeterminedCallback` automatically integrates your training with the Determined cluster. It reports both train and test metrics, reports progress, saves checkpoints, and uploads them to checkpoint storage. Additionally, it manages preemption signals from the Determined master (for example, when you pause your experiment), gracefully halting training and resuming from where it left off.

Contributor Author:

I'm going to say:

gracefully halting training and *later* resuming from where it left off

Because otherwise it sounds like, "wait why are you stopping and resuming as a result of a preemption signal? Isn't the right action to just stop"? The word "later" serves the purpose of my original "when the experiment continues" to indicate that there is some sort of trigger in between shutting down training and resuming.

Contributor:

makes sense

Contributor:

now there are redundant paragraphs

Contributor Author:

doh.
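Since the docs passage above describes what the callback does, here is a hedged usage sketch assembled from the keyword arguments visible in this PR's diff (the exact signature, including the core_context positional argument and the continue_id value, is an assumption here, not a definitive API reference):

    import numpy as np
    import tensorflow as tf

    import determined as det
    from determined import core, keras

    def main(core_context: core.Context) -> None:
        # Toy model and data so the sketch is self-contained.
        model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
        model.compile(optimizer="sgd", loss="mse")
        x, y = np.random.rand(64, 4), np.random.rand(64, 1)

        info = det.get_cluster_info()
        assert info is not None, "this sketch assumes it runs on-cluster"
        det_cb = keras.DeterminedCallback(
            core_context,                       # assumed positional argument
            checkpoint=info.latest_checkpoint,  # restore from the trial's latest checkpoint
            continue_id=1,                      # assumed value; identifies the training to continue
            checkpoint_epochs=1,                # save a checkpoint every epoch
        )
        model.fit(x, y, epochs=5, callbacks=[det_cb])

    if __name__ == "__main__":
        with core.init() as core_context:
            main(core_context)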

***********
In training jobs, a value for ``checkpoint`` should be obtained from
``det.get_cluster_info().latest_checkpoint``, which will automatically be populated with the latest
checkpoint saved by this trial, if there is one.
Contributor:

is it "a value" or "the value"?

e.g.,

In training jobs, the value for checkpoint should be retrieved from det.get_cluster_info().latest_checkpoint. This field will automatically contain the most recent checkpoint saved by the trial, if one exists.

Contributor Author:

It is "a value", because users can put whatever checkpoint they want to restore from in there.

For instance, a user might have a particular starting checkpoint in mind but also want to support pausing and resuming; in that case, they might use:

info.latest_checkpoint or my_starting_checkpoint

as the value. I'll elaborate on this sentence; it probably belongs in the user guide.
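A sketch of that pattern (my_starting_checkpoint is a hypothetical placeholder UUID):

    import determined as det

    info = det.get_cluster_info()
    assert info is not None, "this sketch assumes it runs on a Determined cluster"

    # Hypothetical checkpoint UUID to warm-start from on a fresh run.
    my_starting_checkpoint = "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee"

    # Prefer the trial's own latest checkpoint (populated when resuming after a
    # pause or preemption); fall back to the warm-start checkpoint otherwise.
    checkpoint = info.latest_checkpoint or my_starting_checkpoint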

@tara-hpe (Contributor) left a comment:

reviewed rst files

@tara-hpe (Contributor):

> Should I actually rip out all of the TFKerasTrial docs? Or just leave them marked as deprecated?

Mark them as deprecated, add to release notes. Then we can take them out the next release if we remember / have time.

@tara-hpe tara-hpe closed this Oct 17, 2024
@tara-hpe tara-hpe reopened this Oct 17, 2024
@determined-ci determined-ci requested a review from a team October 17, 2024 21:15
@tara-hpe (Contributor):

> Should I actually rip out all of the TFKerasTrial docs? Or just leave them marked as deprecated?

Mark them as deprecated. "The following APIs have been deprecated as of 0.38.0." then they can be removed from docs with a future release.

@rb-determined-ai rb-determined-ai force-pushed the rb/keras-cb branch 2 times, most recently from 028b3bc to 76988c0, October 21, 2024 18:28
@azhou-determined (Contributor) left a comment:

we've been waiting for this for years 🔥 🚀

@rb-determined-ai rb-determined-ai merged commit 406a7c9 into searcher-context-removal Oct 22, 2024
38 of 69 checks passed
@rb-determined-ai rb-determined-ai deleted the rb/keras-cb branch October 22, 2024 17:33
rb-determined-ai and azhou-determined added commits referencing this pull request, October 22–25, 2024.
Labels: cla-signed, documentation