Commit

merge

MikhailKardash committed Oct 24, 2024
1 parent 7212d0e commit 77351cf
Showing 24 changed files with 1,983 additions and 651 deletions.
Binary file removed docs/assets/images/webui-runs-metadata-filter.png
2 changes: 0 additions & 2 deletions docs/get-started/webui-qs.rst
@@ -20,8 +20,6 @@ You must have a running Determined cluster with the CLI installed.
- To set up a remote cluster, visit the :ref:`Installation Guide <installation-guide>` where you'll
find options for On Prem, AWS, GCP, Kubernetes, and Slurm.

.. _qs-webui-concepts:

**********
Concepts
**********
12 changes: 2 additions & 10 deletions docs/reference/experiment-config-reference.rst
@@ -877,12 +877,12 @@ Optional. The maximum number of trials that can be worked on simultaneously. The

Optional. If specified, the weights of *every* trial in the search will be initialized to the most
recent checkpoint of the given trial ID. This will fail if the source trial's model architecture is
inconsistent with the model architecture of any of the trials in this experiment.
incompatible with the model architecture of any of the trials in this experiment.

``source_checkpoint_uuid``
--------------------------

Optional. Like ``source_trial_id``, but specifies an arbitrary checkpoint from which to initialize
Optional. Like ``source_trial_id`` but specifies an arbitrary checkpoint from which to initialize
weights. At most one of ``source_trial_id`` or ``source_checkpoint_uuid`` should be set.
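
A hedged sketch of how these fields might appear in an experiment configuration, based on the
field names documented above (the trial ID and checkpoint UUID values are placeholders):

.. code:: yaml

   # Initialize weights from the most recent checkpoint of a trial (placeholder ID):
   source_trial_id: 1234

   # Or initialize from an arbitrary checkpoint (placeholder UUID).
   # Set at most one of the two fields:
   # source_checkpoint_uuid: 6e632344-e2ed-4b4d-a4fe-b9f4fb5ed3f1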

Grid
@@ -1502,11 +1502,3 @@ If :ref:`gres_supported <cluster-configuration-slurm>` is set to ``false``, spec
to ensure that ``slots_per_node`` GPUs will be available on the nodes selected for the job using
other configurations such as targeting a specific resource pool with only ``slots_per_node`` GPU
nodes or specifying a PBS constraint in the experiment configuration.
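
A hedged sketch of the resource-pool approach described above, assuming the standard
``resources`` section of the experiment configuration (the pool name and slot count are
illustrative):

.. code:: yaml

   resources:
     slots_per_node: 4
     # Illustrative pool containing only 4-GPU nodes:
     resource_pool: gpu_pool_4x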

******************
Metadata Logging
******************

Determined supports logging arbitrary metadata for experiments. This feature allows users to store
additional context and information about their runs. To learn how to log custom metadata, visit
:ref:`the tutorial <metadata-logging-tutorial>`.
23 changes: 0 additions & 23 deletions docs/tools/webui-if.rst
@@ -241,26 +241,3 @@ Clear the message with the following command:
.. code:: bash

   det master cluster-message clear

********************************
Viewing and Filtering Metadata
********************************

You can use the WebUI to view and filter experiment runs based on logged metadata. For a tutorial on
how to log metadata, visit :ref:`metadata-logging-tutorial`.

- In the Overview tab of the experiment, you can filter and sort runs based on metadata values
using the filter menu.
- In the experiment's Runs view, metadata columns are displayed alongside other experiment
information.
- On the Run details page, you'll find the "Metadata" section under the "Overview" tab, displaying
all logged metadata for that run.
- To download the metadata in JSON format, click the "Download" button.

To filter runs based on metadata:

#. In the Runs view, click on the filter icon.
#. Select a metadata field from the dropdown menu.
#. Choose a condition (is, is not, or contains) and enter a value.

Note: Array-type metadata can be viewed but cannot be used for sorting or filtering.
1 change: 0 additions & 1 deletion docs/tutorials/_index.rst
@@ -46,7 +46,6 @@ Examples let you build off of an existing model that already runs on Determined.
:hidden:

Quickstart for Model Developers <quickstart-mdldev>
Logging Arbitrary Metadata <metadata-logging>
Porting Your PyTorch Model to Determined <pytorch-mnist-tutorial>
Get Started with Detached Mode <detached-mode/_index>
Viewing Epoch-Based Metrics in the WebUI <viewing-epoch-based-metrics>
83 changes: 0 additions & 83 deletions docs/tutorials/metadata-logging.rst

This file was deleted.

10 changes: 0 additions & 10 deletions docs/tutorials/quickstart-mdldev.rst
@@ -352,16 +352,6 @@ This example uses a fixed batch size and searches on dropout size, filters, and
one trial performing at about 98 percent validation accuracy. The hyperparameter search halts
poorly performing trials.

*************************
Logging Custom Metadata
*************************

Determined also supports logging custom metadata during a trial run. This feature allows you to
capture additional context and information about your experiments beyond standard metrics.

To learn more about how to use metadata logging in your experiments, please refer to the
:ref:`metadata-logging-tutorial`.

************
Learn More
************
49 changes: 49 additions & 0 deletions examples/deepspeed/dcgan/README.md
@@ -0,0 +1,49 @@
# DeepSpeed CIFAR Example
This example is adapted from the
[DCGAN example in the DeepSpeedExamples](https://github.com/microsoft/DeepSpeedExamples/tree/master/training/gan)
repository. It is intended to demonstrate a simple use case of DeepSpeed with Determined.

## Files
* **model.py**: The DCGANTrial definition.
* **gan_model.py**: Network definitions for generator and discriminator.
* **data.py**: Dataset loading/downloading code.

### Configuration Files
* **ds_config.json**: The DeepSpeed config file.
* **mnist.yaml**: Determined config to train the model on the `mnist` dataset on a cluster.

## Data
This repo supports the same datasets as the original example: `["imagenet", "lfw", "lsun", "cifar10", "mnist", "fake", "celeba"]`. The `cifar10` and `mnist` datasets will be downloaded as needed, whereas the rest must be mounted on the agent. For `lsun`, the `data_config.classes` setting must be set. The `folder` dataset can be used to load an arbitrary torchvision `ImageFolder` that is mounted on the agent.
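
For example, an `lsun` configuration might look like the following in the experiment config
(the path and class names are illustrative; the field names match those read in `data.py`):

```yaml
data:
  dataset: lsun
  dataroot: /mnt/datasets/lsun
  classes: bedroom,church_outdoor
  image_size: 64
```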

## To Run Locally

It is recommended to run this example from within one of the Determined agent Docker images, found at
https://hub.docker.com/r/determinedai/pytorch-ngc/tags

After installing Docker and pulling an image, launch a container via
`docker run --gpus=all -v ~/path/to/repo:/src/proj -it <container name>`

Install the necessary dependencies via `pip install determined mpi4py`.

Then, run the following command:
```
python trainer.py
```

Any additional configs can be specified in `mnist.yaml` and `ds_config.json` accordingly.

## To Run on Cluster
If you have not yet installed Determined, installation instructions can be found
under `docs/install-admin.html` or at https://docs.determined.ai/latest/index.html.

Run the following command:
```
det experiment create mnist.yaml .
```
The other configurations can be run by specifying the appropriate configuration file in place
of `mnist.yaml`.

## Results
Training `mnist` should yield reasonable-looking fake digit images in the Images tab of TensorBoard after ~5k steps.

Training `cifar10` does not converge as convincingly, but the outputs should look image-like after ~10k steps.
104 changes: 104 additions & 0 deletions examples/deepspeed/dcgan/data.py
@@ -0,0 +1,104 @@
import contextlib
import os
from typing import cast

import filelock
import torch
import torchvision.datasets as dset
import torchvision.transforms as transforms

CHANNELS_BY_DATASET = {
    "imagenet": 3,
    "folder": 3,
    "lfw": 3,
    "lsun": 3,
    "cifar10": 3,
    "mnist": 1,
    "fake": 3,
    "celeba": 3,
}


def get_dataset(data_config: dict) -> torch.utils.data.Dataset:
    """Build the dataset named in ``data_config``, downloading it where supported."""
    if data_config.get("dataroot", None) is None:
        if str(data_config.get("dataset", "")).lower() != "fake":
            raise ValueError(
                '`dataroot` parameter is required for dataset "%s"'
                % data_config.get("dataset", "")
            )
        else:
            context = contextlib.nullcontext()
    else:
        # Ensure that only one local process attempts to download/validate datasets at once.
        context = filelock.FileLock(os.path.join(data_config["dataroot"], ".lock"))
    with context:
        if data_config["dataset"] in ["imagenet", "folder", "lfw"]:
            # folder dataset
            dataset = dset.ImageFolder(
                root=data_config["dataroot"],
                transform=transforms.Compose(
                    [
                        transforms.Resize(data_config["image_size"]),
                        transforms.CenterCrop(data_config["image_size"]),
                        transforms.ToTensor(),
                        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
                    ]
                ),
            )
        elif data_config["dataset"] == "lsun":
            classes = [c + "_train" for c in data_config["classes"].split(",")]
            dataset = dset.LSUN(
                root=data_config["dataroot"],
                classes=classes,
                transform=transforms.Compose(
                    [
                        transforms.Resize(data_config["image_size"]),
                        transforms.CenterCrop(data_config["image_size"]),
                        transforms.ToTensor(),
                        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
                    ]
                ),
            )
        elif data_config["dataset"] == "cifar10":
            dataset = dset.CIFAR10(
                root=data_config["dataroot"],
                download=True,
                transform=transforms.Compose(
                    [
                        transforms.Resize(data_config["image_size"]),
                        transforms.ToTensor(),
                        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
                    ]
                ),
            )
        elif data_config["dataset"] == "mnist":
            dataset = dset.MNIST(
                root=data_config["dataroot"],
                download=True,
                transform=transforms.Compose(
                    [
                        transforms.Resize(data_config["image_size"]),
                        transforms.ToTensor(),
                        transforms.Normalize((0.5,), (0.5,)),
                    ]
                ),
            )
        elif data_config["dataset"] == "fake":
            dataset = dset.FakeData(
                image_size=(3, data_config["image_size"], data_config["image_size"]),
                transform=transforms.ToTensor(),
            )
        elif data_config["dataset"] == "celeba":
            dataset = dset.ImageFolder(
                root=data_config["dataroot"],
                transform=transforms.Compose(
                    [
                        transforms.Resize(data_config["image_size"]),
                        transforms.CenterCrop(data_config["image_size"]),
                        transforms.ToTensor(),
                        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
                    ]
                ),
            )
        else:
            unknown_dataset_name = data_config["dataset"]
            raise Exception(f"Unknown dataset {unknown_dataset_name}")
    return cast(torch.utils.data.Dataset, dataset)
15 changes: 15 additions & 0 deletions examples/deepspeed/dcgan/ds_config.json
@@ -0,0 +1,15 @@
{
    "train_batch_size": 64,
    "optimizer": {
        "type": "Adam",
        "params": {
            "lr": 0.0002,
            "betas": [
                0.5,
                0.999
            ],
            "eps": 1e-8
        }
    },
    "steps_per_print": 10
}
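
DeepSpeed relates the global and per-GPU batch sizes as
`train_batch_size = micro_batch_per_gpu * gradient_accumulation_steps * world_size`; since only
`train_batch_size` is set above, DeepSpeed derives the other two. A small sketch of that
arithmetic (the world size and accumulation steps are illustrative):

```python
import json

# Minimal slice of the config above, parsed as DeepSpeed would receive it.
ds_config = json.loads('{"train_batch_size": 64, "steps_per_print": 10}')

world_size = 2   # illustrative number of data-parallel GPUs
grad_accum = 1   # illustrative gradient accumulation steps

# Per-GPU micro-batch implied by the global batch size.
micro_batch = ds_config["train_batch_size"] // (world_size * grad_accum)
print(micro_batch)  # 32
```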