Commit

merge

MikhailKardash committed Oct 24, 2024
1 parent 7212d0e commit 77351cf
Showing 24 changed files with 1,983 additions and 651 deletions.
Binary file removed docs/assets/images/webui-runs-metadata-filter.png
2 changes: 0 additions & 2 deletions docs/get-started/webui-qs.rst
@@ -20,8 +20,6 @@ You must have a running Determined cluster with the CLI installed.
- To set up a remote cluster, visit the :ref:`Installation Guide <installation-guide>` where you'll
find options for On Prem, AWS, GCP, Kubernetes, and Slurm.

.. _qs-webui-concepts:

**********
Concepts
**********
12 changes: 2 additions & 10 deletions docs/reference/experiment-config-reference.rst
@@ -877,12 +877,12 @@ Optional. The maximum number of trials that can be worked on simultaneously. The

Optional. If specified, the weights of *every* trial in the search will be initialized to the most
recent checkpoint of the given trial ID. This will fail if the source trial's model architecture is
inconsistent with the model architecture of any of the trials in this experiment.
incompatible with the model architecture of any of the trials in this experiment.

``source_checkpoint_uuid``
--------------------------

Optional. Like ``source_trial_id``, but specifies an arbitrary checkpoint from which to initialize
Optional. Like ``source_trial_id`` but specifies an arbitrary checkpoint from which to initialize
weights. At most one of ``source_trial_id`` or ``source_checkpoint_uuid`` should be set.
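
A hedged sketch of how these fields might appear in an experiment configuration, based on the
field names documented above (the trial ID and checkpoint UUID values are placeholders):

.. code:: yaml

   # Initialize weights from the most recent checkpoint of a trial (placeholder ID):
   source_trial_id: 1234

   # Or initialize from an arbitrary checkpoint (placeholder UUID).
   # Set at most one of the two fields:
   # source_checkpoint_uuid: 6e632344-e2ed-4b4d-a4fe-b9f4fb5ed3f1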

Grid
@@ -1502,11 +1502,3 @@ If :ref:`gres_supported <cluster-configuration-slurm>` is set to ``false``, spec
to ensure that ``slots_per_node`` GPUs will be available on the nodes selected for the job using
other configurations such as targeting a specific resource pool with only ``slots_per_node`` GPU
nodes or specifying a PBS constraint in the experiment configuration.
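
A hedged sketch of the resource-pool approach described above, assuming the standard
``resources`` section of the experiment configuration (the pool name and slot count are
illustrative):

.. code:: yaml

   resources:
     slots_per_node: 4
     # Illustrative pool containing only 4-GPU nodes:
     resource_pool: gpu_pool_4x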

******************
Metadata Logging
******************

Determined supports logging arbitrary metadata for experiments. This feature allows users to store
additional context and information about their runs. To learn how to log custom metadata, visit
:ref:`the tutorial <metadata-logging-tutorial>`.
23 changes: 0 additions & 23 deletions docs/tools/webui-if.rst
@@ -241,26 +241,3 @@ Clear the message with the following command:
.. code:: bash

   det master cluster-message clear

********************************
Viewing and Filtering Metadata
********************************

You can use the WebUI to view and filter experiment runs based on logged metadata. For a tutorial on
how to log metadata, visit :ref:`metadata-logging-tutorial`.

- In the Overview tab of the experiment, you can filter and sort runs based on metadata values
using the filter menu.
- In the experiment's Runs view, metadata columns are displayed alongside other experiment
information.
- On the Run details page, you'll find the "Metadata" section under the "Overview" tab, displaying
all logged metadata for that run.
- To download the metadata in JSON format, click the "Download" button.

To filter runs based on metadata:

#. In the Runs view, click on the filter icon.
#. Select a metadata field from the dropdown menu.
#. Choose a condition (is, is not, or contains) and enter a value.

Note: Array-type metadata can be viewed but cannot be used for sorting or filtering.
1 change: 0 additions & 1 deletion docs/tutorials/_index.rst
@@ -46,7 +46,6 @@ Examples let you build off of an existing model that already runs on Determined.
:hidden:

Quickstart for Model Developers <quickstart-mdldev>
Logging Arbitrary Metadata <metadata-logging>
Porting Your PyTorch Model to Determined <pytorch-mnist-tutorial>
Get Started with Detached Mode <detached-mode/_index>
Viewing Epoch-Based Metrics in the WebUI <viewing-epoch-based-metrics>
83 changes: 0 additions & 83 deletions docs/tutorials/metadata-logging.rst

This file was deleted.

10 changes: 0 additions & 10 deletions docs/tutorials/quickstart-mdldev.rst
@@ -352,16 +352,6 @@ This example uses a fixed batch size and searches on dropout size, filters, and
one trial performing at about 98 percent validation accuracy. The hyperparameter search halts
poorly performing trials.

*************************
Logging Custom Metadata
*************************

Determined also supports logging custom metadata during a trial run. This feature allows you to
capture additional context and information about your experiments beyond standard metrics.

To learn more about how to use metadata logging in your experiments, please refer to the
:ref:`metadata-logging-tutorial`.

************
Learn More
************
49 changes: 49 additions & 0 deletions examples/deepspeed/dcgan/README.md
@@ -0,0 +1,49 @@
# DeepSpeed CIFAR Example
This example is adapted from the
[DCGAN example in the DeepSpeedExamples](https://github.com/microsoft/DeepSpeedExamples/tree/master/training/gan)
repository. It is intended to demonstrate a simple use case of DeepSpeed with Determined.

## Files
* **model.py**: The DCGANTrial definition.
* **gan_model.py**: Network definitions for generator and discriminator.
* **data.py**: Dataset loading/downloading code.

### Configuration Files
* **ds_config.json**: The DeepSpeed config file.
* **mnist.yaml**: Determined config to train the model on the `mnist` dataset on a cluster.

## Data
This repo supports the same datasets as the original example: `["imagenet", "lfw", "lsun", "cifar10", "mnist", "fake", "celeba"]`. The `cifar10` and `mnist` datasets will be downloaded as needed, whereas the rest must be mounted on the agent. For `lsun`, the `data_config.classes` setting must be set. The `folder` dataset can be used to load an arbitrary torchvision `ImageFolder` that is mounted on the agent.
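
For example, an `lsun` configuration might look like the following in the experiment config
(the path and class names are illustrative; the field names match those read in `data.py`):

```yaml
data:
  dataset: lsun
  dataroot: /mnt/datasets/lsun
  classes: bedroom,church_outdoor
  image_size: 64
```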

## To Run Locally

It is recommended to run this example from within one of the Determined agent Docker images, found at
https://hub.docker.com/r/determinedai/pytorch-ngc/tags

After installing Docker and pulling an image, launch a container via
`docker run --gpus=all -v ~/path/to/repo:/src/proj -it <container name>`

Install the necessary dependencies via `pip install determined mpi4py`.

Then, run the following command:
```
python trainer.py
```

Any additional configs can be specified in `mnist.yaml` and `ds_config.json` accordingly.

## To Run on Cluster
If you have not yet installed Determined, installation instructions can be found
under `docs/install-admin.html` or at https://docs.determined.ai/latest/index.html.

Run the following command:
```
det experiment create mnist.yaml .
```
The other configurations can be run by specifying the appropriate configuration file in place
of `mnist.yaml`.

## Results
Training `mnist` should yield reasonable-looking fake digit images in the Images tab of TensorBoard after ~5k steps.

Training `cifar10` does not converge as convincingly, but the outputs should look image-like after ~10k steps.
104 changes: 104 additions & 0 deletions examples/deepspeed/dcgan/data.py
@@ -0,0 +1,104 @@
import contextlib
import os
from typing import cast

import filelock
import torch
import torchvision.datasets as dset
import torchvision.transforms as transforms

CHANNELS_BY_DATASET = {
    "imagenet": 3,
    "folder": 3,
    "lfw": 3,
    "lsun": 3,
    "cifar10": 3,
    "mnist": 1,
    "fake": 3,
    "celeba": 3,
}


def get_dataset(data_config: dict) -> torch.utils.data.Dataset:
    """Build the dataset named in ``data_config``, downloading it where supported."""
    if data_config.get("dataroot", None) is None:
        if str(data_config.get("dataset", "")).lower() != "fake":
            raise ValueError(
                '`dataroot` parameter is required for dataset "%s"'
                % data_config.get("dataset", "")
            )
        else:
            context = contextlib.nullcontext()
    else:
        # Ensure that only one local process attempts to download/validate datasets at once.
        context = filelock.FileLock(os.path.join(data_config["dataroot"], ".lock"))
    with context:
        if data_config["dataset"] in ["imagenet", "folder", "lfw"]:
            # folder dataset
            dataset = dset.ImageFolder(
                root=data_config["dataroot"],
                transform=transforms.Compose(
                    [
                        transforms.Resize(data_config["image_size"]),
                        transforms.CenterCrop(data_config["image_size"]),
                        transforms.ToTensor(),
                        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
                    ]
                ),
            )
        elif data_config["dataset"] == "lsun":
            classes = [c + "_train" for c in data_config["classes"].split(",")]
            dataset = dset.LSUN(
                root=data_config["dataroot"],
                classes=classes,
                transform=transforms.Compose(
                    [
                        transforms.Resize(data_config["image_size"]),
                        transforms.CenterCrop(data_config["image_size"]),
                        transforms.ToTensor(),
                        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
                    ]
                ),
            )
        elif data_config["dataset"] == "cifar10":
            dataset = dset.CIFAR10(
                root=data_config["dataroot"],
                download=True,
                transform=transforms.Compose(
                    [
                        transforms.Resize(data_config["image_size"]),
                        transforms.ToTensor(),
                        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
                    ]
                ),
            )
        elif data_config["dataset"] == "mnist":
            dataset = dset.MNIST(
                root=data_config["dataroot"],
                download=True,
                transform=transforms.Compose(
                    [
                        transforms.Resize(data_config["image_size"]),
                        transforms.ToTensor(),
                        transforms.Normalize((0.5,), (0.5,)),
                    ]
                ),
            )
        elif data_config["dataset"] == "fake":
            dataset = dset.FakeData(
                image_size=(3, data_config["image_size"], data_config["image_size"]),
                transform=transforms.ToTensor(),
            )
        elif data_config["dataset"] == "celeba":
            dataset = dset.ImageFolder(
                root=data_config["dataroot"],
                transform=transforms.Compose(
                    [
                        transforms.Resize(data_config["image_size"]),
                        transforms.CenterCrop(data_config["image_size"]),
                        transforms.ToTensor(),
                        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
                    ]
                ),
            )
        else:
            unknown_dataset_name = data_config["dataset"]
            raise Exception(f"Unknown dataset {unknown_dataset_name}")
    return cast(torch.utils.data.Dataset, dataset)
15 changes: 15 additions & 0 deletions examples/deepspeed/dcgan/ds_config.json
@@ -0,0 +1,15 @@
{
    "train_batch_size": 64,
    "optimizer": {
        "type": "Adam",
        "params": {
            "lr": 0.0002,
            "betas": [
                0.5,
                0.999
            ],
            "eps": 1e-8
        }
    },
    "steps_per_print": 10
}
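
DeepSpeed relates the global and per-GPU batch sizes as
`train_batch_size = micro_batch_per_gpu * gradient_accumulation_steps * world_size`; since only
`train_batch_size` is set above, DeepSpeed derives the other two. A small sketch of that
arithmetic (the world size and accumulation steps are illustrative):

```python
import json

# Minimal slice of the config above, parsed as DeepSpeed would receive it.
ds_config = json.loads('{"train_batch_size": 64, "steps_per_print": 10}')

world_size = 2   # illustrative number of data-parallel GPUs
grad_accum = 1   # illustrative gradient accumulation steps

# Per-GPU micro-batch implied by the global batch size.
micro_batch = ds_config["train_batch_size"] // (world_size * grad_accum)
print(micro_batch)  # 32
```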