chore: remove searcher context from DetCallback (#10083)
The det.transformers.DetCallback was written to work only on-cluster, and
derived much of its behavior by directly inspecting ClusterInfo.

It also required the user to provide a transformers-style training length
that matched the searcher's max_length.

And it did not respect min_validation_period, min_checkpoint_period, or any
other training-loop detail found in the experiment config.

The end result is that, with no API changes at all, we can move to the
searcher-context-removal paradigm with almost no breakage.

The only case that would break is if the user was _relying_ on the ASHA
searcher to tell the Hugging Face Trainer when to checkpoint. Our system no
longer supports that behavior, so that breakage is unavoidable.
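
For orientation, a minimal sketch of how the callback is typically wired into a Hugging Face Trainer after this change. Here `model`, `train_ds`, and `eval_ds` are hypothetical placeholders, the exact `DetCallback` constructor arguments may vary by version, and the training length and evaluation/checkpoint cadence come from the `TrainingArguments` rather than from the searcher config:

```python
import determined as det
from determined.transformers import DetCallback
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="/tmp/hf_out",
    max_steps=100,                # training length lives in the Trainer args, not the searcher
    evaluation_strategy="steps",
    eval_steps=50,
    save_strategy="steps",
    save_steps=50,
)

with det.core.init() as core_context:
    # The callback reports metrics and uploads checkpoints through the Core API.
    det_callback = DetCallback(core_context, training_args)
    trainer = Trainer(
        model=model,              # placeholder: a Hugging Face model
        args=training_args,
        train_dataset=train_ds,   # placeholder datasets
        eval_dataset=eval_ds,
        callbacks=[det_callback],
    )
    trainer.train()
```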
rb-determined-ai authored and azhou-determined committed Oct 25, 2024
1 parent 58c6f1e commit 99f03c2
Showing 19 changed files with 717 additions and 207 deletions.
@@ -25,7 +25,7 @@ Resource budget:
 <experiment-configuration_training_units>`). Note that the searcher will expect this metric to
 appear in validation metrics reported by the model. This quantity is domain-specific and should
 roughly reflect the number of minibatches the model must be trained on for it to converge on the
-data set. For users who would like to determine this number experimentally, train a model with
+dataset. For users who would like to determine this number experimentally, train a model with
 reasonable hyperparameters using the ``single`` search method.

 - ``max_trials``: This indicates the total number of hyperparameter settings that will be evaluated
6 changes: 3 additions & 3 deletions docs/reference/experiment-config-reference.rst
@@ -335,8 +335,8 @@ Optional. Specifies the minimum frequency at which validation should be run for
 epochs: 2
 - :class:`~determined.pytorch.deepspeed.DeepSpeedTrial` and
-:class:`~determined.keras.TFKerasTrial`: If this is in the unit of epochs,
-:ref:`records_per_epoch <config-records-per-epoch>` must be specified.
+:class:`~determined.keras.TFKerasTrial`: If this is in the unit of epochs, ``records_per_epoch``
+must be specified.

 .. _experiment-config-perform-initial-validation:

@@ -377,7 +377,7 @@ Optional. Specifies the minimum frequency for running checkpointing for each tri
 - :class:`~determined.pytorch.deepspeed.DeepSpeedTrial` and
 :class:`~determined.keras.TFKerasTrial`: If the unit is in epochs, you must also specify
-:ref:`records_per_epoch <config-records-per-epoch>`.
+``records_per_epoch``.

 ``checkpoint_policy``
 =====================
1 change: 1 addition & 0 deletions docs/reference/training/_index.rst
@@ -15,6 +15,7 @@
 - :ref:`det.pytorch.samplers <pytorch-samplers>`
 - :ref:`det.pytorch.deepspeed <deepspeed-reference>`
 - :ref:`det.keras <keras-reference>`
+- :ref:`det.transformers <transformers-reference>`

 *******************************
 Experiment Configuration File
11 changes: 11 additions & 0 deletions docs/reference/training/api-transformers-reference.rst
@@ -0,0 +1,11 @@
+.. _transformers-reference:
+
+####################################
+``det.transformers`` API Reference
+####################################
+
+*****************************************
+``determined.transformers.DetCallback``
+*****************************************
+
+.. autoclass:: determined.transformers.DetCallback
2 changes: 1 addition & 1 deletion e2e_tests/tests/config.py
@@ -1,6 +1,6 @@
 import os
 import pathlib
-from typing import Any, Dict, List, Union
+from typing import Any, Dict, List

 from determined.common import api, util

2 changes: 0 additions & 2 deletions e2e_tests/tests/fixtures/hpc/embedded-single-quote.yaml
@@ -5,7 +5,5 @@ data:
 searcher:
   name: single
   metric: error
-  max_length:
-    batches: 1000
 max_restarts: 0
 entrypoint: python3 data_validator.py
4 changes: 2 additions & 2 deletions examples/hf_trainer_api/hf_image_classification/adaptive.yaml
@@ -9,8 +9,8 @@ resources:
   slots_per_trial: 2
 searcher:
   name: adaptive_asha
-  max_length:
-    batches: 100
+  time_metric: batches
+  max_time: 100
   max_trials: 64
   max_rungs: 4
   divisor: 4
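
For background on the `time_metric`/`max_time` fields that replace `max_length` in the adaptive configs: ASHA now ranks trials by the configured `metric` and measures training progress by `time_metric`, both of which it expects to find in the validation metrics a trial reports. A rough Core API sketch of such reporting; the loop, metric names, and values are illustrative placeholders, not code from this commit:

```python
import determined as det

with det.core.init() as core_context:
    for batches_trained in range(100, 1100, 100):
        # ... train for another 100 batches, then evaluate (placeholder value below) ...
        eval_loss = 0.1
        core_context.train.report_validation_metrics(
            steps_completed=batches_trained,
            # "eval_loss" matches `metric` in the configs above; "batches" is the kind
            # of quantity a `time_metric` refers to.
            metrics={"eval_loss": eval_loss, "batches": batches_trained},
        )
        # Honor the searcher's decision to stop this trial early (e.g. ASHA pruning).
        if core_context.preempt.should_preempt():
            break
```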
2 changes: 0 additions & 2 deletions examples/hf_trainer_api/hf_image_classification/const.yaml
@@ -9,8 +9,6 @@ resources:
   slots_per_trial: 1
 searcher:
   name: single
-  max_length:
-    batches: 100
   metric: eval_loss
 hyperparameters:
   training_arguments:
@@ -10,8 +10,6 @@ resources:
 records_per_epoch: 1000
 searcher:
   name: single
-  max_length:
-    epochs: 5
   metric: eval_loss
 hyperparameters:
   training_arguments:
@@ -11,8 +11,6 @@ resources:
   slots_per_trial: 2
 searcher:
   name: single
-  max_length:
-    batches: 100
   metric: eval_loss
 hyperparameters:
   deepspeed_config: ds_configs/ds_config_stage_1.json
@@ -9,8 +9,6 @@ resources:
   slots_per_trial: 2
 searcher:
   name: single
-  max_length:
-    batches: 100
   metric: eval_loss
 hyperparameters:
   training_arguments:
4 changes: 2 additions & 2 deletions examples/hf_trainer_api/hf_language_modeling/adaptive.yaml
@@ -9,8 +9,8 @@ resources:
   slots_per_trial: 2
 searcher:
   name: adaptive_asha
-  max_length:
-    batches: 100
+  time_metric: batches
+  max_time: 100
   max_trials: 64
   max_rungs: 4
   divisor: 4
2 changes: 0 additions & 2 deletions examples/hf_trainer_api/hf_language_modeling/const.yaml
@@ -9,8 +9,6 @@ resources:
   slots_per_trial: 1
 searcher:
   name: single
-  max_length:
-    batches: 100
   metric: eval_loss
 hyperparameters:
   training_arguments:
@@ -10,8 +10,6 @@ resources:
 records_per_epoch: 1000
 searcher:
   name: single
-  max_length:
-    epochs: 5
   metric: eval_loss
 hyperparameters:
   training_arguments:
2 changes: 0 additions & 2 deletions examples/hf_trainer_api/hf_language_modeling/deepspeed.yaml
@@ -11,8 +11,6 @@ resources:
   slots_per_trial: 2
 searcher:
   name: single
-  max_length:
-    batches: 100
   metric: eval_loss
 hyperparameters:
   deepspeed_config: ds_configs/ds_config_stage_1.json
2 changes: 0 additions & 2 deletions examples/hf_trainer_api/hf_language_modeling/distributed.yaml
@@ -9,8 +9,6 @@ resources:
   slots_per_trial: 2
 searcher:
   name: single
-  max_length:
-    batches: 100
   metric: eval_loss
 hyperparameters:
   training_arguments: