chore: remove searcher context from DetCallback (#10083)
The det.transformers.DetCallback was written to work only on-cluster, and
derived much of its behavior by directly inspecting ClusterInfo.

It also required the user to provide a transformers-style training length
that matched the searcher's max_length.

And it did not respect min_validation_period, min_checkpoint_period, or any
other training-loop detail found in the experiment config.

The end result is that, with no API changes at all, we can move to the
searcher-context-removal paradigm with almost no breakage.

The only case that would break is if the user was _relying_ on the ASHA
searcher to tell the Hugging Face Trainer when to checkpoint. Our system no
longer supports that behavior, so that breakage is unavoidable.
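
For orientation, a minimal sketch of how the callback is typically wired into a Hugging Face Trainer after this change. Here `model`, `train_ds`, and `eval_ds` are hypothetical placeholders, the exact `DetCallback` constructor arguments may vary by version, and the training length and evaluation/checkpoint cadence come from the `TrainingArguments` rather than from the searcher config:

```python
import determined as det
from determined.transformers import DetCallback
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="/tmp/hf_out",
    max_steps=100,                # training length lives in the Trainer args, not the searcher
    evaluation_strategy="steps",
    eval_steps=50,
    save_strategy="steps",
    save_steps=50,
)

with det.core.init() as core_context:
    # The callback reports metrics and uploads checkpoints through the Core API.
    det_callback = DetCallback(core_context, training_args)
    trainer = Trainer(
        model=model,              # placeholder: a Hugging Face model
        args=training_args,
        train_dataset=train_ds,   # placeholder datasets
        eval_dataset=eval_ds,
        callbacks=[det_callback],
    )
    trainer.train()
```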
rb-determined-ai authored and azhou-determined committed Oct 25, 2024
1 parent 58c6f1e commit 99f03c2
Showing 19 changed files with 717 additions and 207 deletions.
@@ -25,7 +25,7 @@ Resource budget:
 <experiment-configuration_training_units>`). Note that the searcher will expect this metric to
 appear in validation metrics reported by the model. This quantity is domain-specific and should
 roughly reflect the number of minibatches the model must be trained on for it to converge on the
-data set. For users who would like to determine this number experimentally, train a model with
+dataset. For users who would like to determine this number experimentally, train a model with
 reasonable hyperparameters using the ``single`` search method.

 - ``max_trials``: This indicates the total number of hyperparameter settings that will be evaluated
6 changes: 3 additions & 3 deletions docs/reference/experiment-config-reference.rst
@@ -335,8 +335,8 @@ Optional. Specifies the minimum frequency at which validation should be run for
 epochs: 2
 - :class:`~determined.pytorch.deepspeed.DeepSpeedTrial` and
-:class:`~determined.keras.TFKerasTrial`: If this is in the unit of epochs,
-:ref:`records_per_epoch <config-records-per-epoch>` must be specified.
+:class:`~determined.keras.TFKerasTrial`: If this is in the unit of epochs, ``records_per_epoch``
+must be specified.

 .. _experiment-config-perform-initial-validation:

@@ -377,7 +377,7 @@ Optional. Specifies the minimum frequency for running checkpointing for each tri
 - :class:`~determined.pytorch.deepspeed.DeepSpeedTrial` and
 :class:`~determined.keras.TFKerasTrial`: If the unit is in epochs, you must also specify
-:ref:`records_per_epoch <config-records-per-epoch>`.
+``records_per_epoch``.

 ``checkpoint_policy``
 =====================
1 change: 1 addition & 0 deletions docs/reference/training/_index.rst
@@ -15,6 +15,7 @@
 - :ref:`det.pytorch.samplers <pytorch-samplers>`
 - :ref:`det.pytorch.deepspeed <deepspeed-reference>`
 - :ref:`det.keras <keras-reference>`
+- :ref:`det.transformers <transformers-reference>`

 *******************************
 Experiment Configuration File
11 changes: 11 additions & 0 deletions docs/reference/training/api-transformers-reference.rst
@@ -0,0 +1,11 @@
+.. _transformers-reference:
+
+####################################
+``det.transformers`` API Reference
+####################################
+
+*****************************************
+``determined.transformers.DetCallback``
+*****************************************
+
+.. autoclass:: determined.transformers.DetCallback
2 changes: 1 addition & 1 deletion e2e_tests/tests/config.py
@@ -1,6 +1,6 @@
 import os
 import pathlib
-from typing import Any, Dict, List, Union
+from typing import Any, Dict, List

 from determined.common import api, util

2 changes: 0 additions & 2 deletions e2e_tests/tests/fixtures/hpc/embedded-single-quote.yaml
@@ -5,7 +5,5 @@ data:
 searcher:
   name: single
   metric: error
-  max_length:
-    batches: 1000
 max_restarts: 0
 entrypoint: python3 data_validator.py
4 changes: 2 additions & 2 deletions examples/hf_trainer_api/hf_image_classification/adaptive.yaml
@@ -9,8 +9,8 @@ resources:
   slots_per_trial: 2
 searcher:
   name: adaptive_asha
-  max_length:
-    batches: 100
+  time_metric: batches
+  max_time: 100
   max_trials: 64
   max_rungs: 4
   divisor: 4
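
For background on the `time_metric`/`max_time` fields that replace `max_length` in the adaptive configs: ASHA now ranks trials by the configured `metric` and measures training progress by `time_metric`, both of which it expects to find in the validation metrics a trial reports. A rough Core API sketch of such reporting; the loop, metric names, and values are illustrative placeholders, not code from this commit:

```python
import determined as det

with det.core.init() as core_context:
    for batches_trained in range(100, 1100, 100):
        # ... train for another 100 batches, then evaluate (placeholder value below) ...
        eval_loss = 0.1
        core_context.train.report_validation_metrics(
            steps_completed=batches_trained,
            # "eval_loss" matches `metric` in the configs above; "batches" is the kind
            # of quantity a `time_metric` refers to.
            metrics={"eval_loss": eval_loss, "batches": batches_trained},
        )
        # Honor the searcher's decision to stop this trial early (e.g. ASHA pruning).
        if core_context.preempt.should_preempt():
            break
```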
2 changes: 0 additions & 2 deletions examples/hf_trainer_api/hf_image_classification/const.yaml
@@ -9,8 +9,6 @@ resources:
   slots_per_trial: 1
 searcher:
   name: single
-  max_length:
-    batches: 100
   metric: eval_loss
 hyperparameters:
   training_arguments:
@@ -10,8 +10,6 @@ resources:
 records_per_epoch: 1000
 searcher:
   name: single
-  max_length:
-    epochs: 5
   metric: eval_loss
 hyperparameters:
   training_arguments:
@@ -11,8 +11,6 @@ resources:
   slots_per_trial: 2
 searcher:
   name: single
-  max_length:
-    batches: 100
   metric: eval_loss
 hyperparameters:
   deepspeed_config: ds_configs/ds_config_stage_1.json
@@ -9,8 +9,6 @@ resources:
   slots_per_trial: 2
 searcher:
   name: single
-  max_length:
-    batches: 100
   metric: eval_loss
 hyperparameters:
   training_arguments:
4 changes: 2 additions & 2 deletions examples/hf_trainer_api/hf_language_modeling/adaptive.yaml
@@ -9,8 +9,8 @@ resources:
   slots_per_trial: 2
 searcher:
   name: adaptive_asha
-  max_length:
-    batches: 100
+  time_metric: batches
+  max_time: 100
   max_trials: 64
   max_rungs: 4
   divisor: 4
2 changes: 0 additions & 2 deletions examples/hf_trainer_api/hf_language_modeling/const.yaml
@@ -9,8 +9,6 @@ resources:
   slots_per_trial: 1
 searcher:
   name: single
-  max_length:
-    batches: 100
   metric: eval_loss
 hyperparameters:
   training_arguments:
@@ -10,8 +10,6 @@ resources:
 records_per_epoch: 1000
 searcher:
   name: single
-  max_length:
-    epochs: 5
   metric: eval_loss
 hyperparameters:
   training_arguments:
2 changes: 0 additions & 2 deletions examples/hf_trainer_api/hf_language_modeling/deepspeed.yaml
@@ -11,8 +11,6 @@ resources:
   slots_per_trial: 2
 searcher:
   name: single
-  max_length:
-    batches: 100
   metric: eval_loss
 hyperparameters:
   deepspeed_config: ds_configs/ds_config_stage_1.json
2 changes: 0 additions & 2 deletions examples/hf_trainer_api/hf_language_modeling/distributed.yaml
@@ -9,8 +9,6 @@ resources:
   slots_per_trial: 2
 searcher:
   name: single
-  max_length:
-    batches: 100
   metric: eval_loss
 hyperparameters:
   training_arguments: