Commit

Stale API deprecations and example adjustments (#55)
jaywonchung authored Apr 29, 2024
1 parent a25a533 commit 4002076
Showing 132 changed files with 447 additions and 17,349 deletions.
22 changes: 8 additions & 14 deletions README.md
@@ -85,7 +85,7 @@ Total energy (J):
```console
$ python -m zeus.monitor energy
[2023-08-22 22:44:45,106] [ZeusMonitor](energy.py:157) Monitoring GPU [0, 1, 2, 3].
-[2023-08-22 22:44:46,210] [zeus.util.framework](framework.py:38) PyTorch with CUDA support is available.
+[2023-08-22 22:44:46,210] [zeus.utils.framework](framework.py:38) PyTorch with CUDA support is available.
[2023-08-22 22:44:46,760] [ZeusMonitor](energy.py:329) Measurement window 'zeus.monitor.energy' started.
^C[2023-08-22 22:44:50,205] [ZeusMonitor](energy.py:329) Measurement window 'zeus.monitor.energy' ended.
Total energy (J):
@@ -102,20 +102,14 @@ Zeus is part of [The ML.ENERGY Initiative](https://ml.energy).
```
.
├── zeus/ # ⚡ Zeus Python package
-│   ├── optimizer/ # - GPU energy and time optimizers
-│   ├── run/ # - Tools for running Zeus on real training jobs
-│   ├── policy/ # - Optimization policies and extension interfaces
-│   ├── util/ # - Utility functions and classes
-│   ├── monitor.py # - `ZeusMonitor`: Measure GPU time and energy of any code block
-│   ├── controller.py # - Tools for controlling the flow of training
-│   ├── callback.py # - Base class for Hugging Face-like training callbacks.
-│   ├── simulate.py # - Tools for trace-driven simulation
-│   ├── analyze.py # - Analysis functions for power logs
-│   └── job.py # - Class for job specification
+│   ├── optimizer/ # - A collection of optimizers for time and energy
+│   ├── monitor/ # - Programmatic power and energy measurement tools
+│   ├── utils/ # - Utility functions and classes
+│   ├── _legacy/ # - Legacy code mostly to keep our papers reproducible
+│   ├── device.py # - Abstraction layer over compute devices.
+│   └── callback.py # - Base class for HuggingFace-like training callbacks
-├── zeus_monitor/ # 🔌 GPU power monitor
-│   ├── zemo/ # - A header-only library for querying NVML
-│   └── main.cpp # - Source code of the power monitor
├── docker/ # 🐳 Dockerfiles and Docker Compose files
├── examples/ # 🛠️ Examples of integrating Zeus
3 changes: 1 addition & 2 deletions capriccio/README.md
@@ -30,5 +30,4 @@ data_path = dict(train="9_train.json", validation="9_val.json")
raw_datasets = datasets.load_dataset("json", data_files=data_path)
```

-For a full example, you can use [`examples/ZeusDataLoader/capriccio/train.py`](../examples/ZeusDataLoader/capriccio/train.py) to fine-tune a Huggingface pre-trained language model on a slice of Capriccio.
-Parts relevant to using Capriccio are marked with `# CAPRICCIO` in the script.
+For a full example, please refer to [`examples/batch_size_optimizer/capriccio/train.py`](../examples/batch_size_optimizer/capriccio/train.py).
12 changes: 0 additions & 12 deletions docs/extend.md
@@ -17,17 +17,5 @@ You can find examples of policy implementations in [`zeus._legacy.policy.optimiz

## Plugging it into Zeus

There are two ways to run Zeus: trace-driven and end-to-end.

-### Trace-driven Zeus
-
-The Zeus simulator ([`Simulator`][zeus._legacy.simulate.Simulator]) accepts one [`BatchSizeOptimizer`][zeus._legacy.policy.BatchSizeOptimizer] and [`PowerLimitOptimizer`][zeus._legacy.policy.PowerLimitOptimizer] in its constructor.
-A full-example can be found in [`examples/trace_driven`](https://github.com/ml-energy/zeus/tree/master/examples/trace_driven/).
-
-### End-to-end Zeus
-
-There are two central components in end-to-end Zeus: [`ZeusMaster`][zeus.run.ZeusMaster] and [`ZeusDataLoader`][zeus.run.ZeusDataLoader].
-The former takes charge of driving the entire optimization over recurring jobs, and accepts an instance of [`BatchSizeOptimizer`][zeus._legacy.policy.BatchSizeOptimizer] in its constructor.
-The latter takes charge of JIT-profiling power in the background, determining the optimal power limit, and setting it.
-Hence, the functionality of [`JITPowerLimitOptimizer`][zeus._legacy.policy.optimizer.JITPowerLimitOptimizer] is already tightly integrated into `ZeusDataLoader`.
-Users will have to implement their own [`ZeusDataLoader`][zeus.run.ZeusDataLoader] in order to test another [`PowerLimitOptimizer`][zeus._legacy.policy.PowerLimitOptimizer] policy.
13 changes: 7 additions & 6 deletions docs/getting_started/index.md
@@ -127,11 +127,12 @@ We created [Perseus](../perseus/index.md), which can optimize the energy consump

## Recurring jobs

-The cost-optimal batch size is located *across* multiple job runs using a Multi-Armed Bandit algorithm.
-First, go through the steps for non-recurring jobs.
-[`ZeusDataLoader`][zeus.run.ZeusDataLoader] will transparently optimize the GPU power limit for any given batch size.
-Then, you can use [`ZeusMaster`][zeus.run.ZeusMaster] to drive recurring jobs and batch size optimization.
+In production, it's likely that a DNN is trained and re-trained repetitively to keep it up to date.
+For these kinds of recurring jobs, we can take those recurrences as exploration opportunities to find the cost-optimal training batch size.
+This is done with a Multi-Armed Bandit algorithm.
+See [`BatchSizeOptimizer`][zeus.optimizer.batch_size.client.BatchSizeOptimizer].

-This example will come in handy:
+Two full examples are given for the batch size optimizer:

-- [Running trace-driven simulation on single recurring jobs and the Alibaba GPU cluster trace](https://github.com/ml-energy/zeus/tree/master/examples/trace_driven){.external}
+- [MNIST](https://github.com/ml-energy/zeus/tree/master/examples/batch_size_optimizer/mnist/): Single-GPU and data parallel training, with integration examples with Kubeflow
+- [Sentiment Analysis](https://github.com/ml-energy/zeus/tree/master/examples/batch_size_optimizer/capriccio/): Full training example with HuggingFace transformers using the Capriccio dataset, a sentiment analysis dataset with data drift.
2 changes: 1 addition & 1 deletion docs/index.md
@@ -89,7 +89,7 @@ Total energy (J):
```console
$ python -m zeus.monitor energy
[2023-08-22 22:44:45,106] [ZeusMonitor](energy.py:157) Monitoring GPU [0, 1, 2, 3].
-[2023-08-22 22:44:46,210] [zeus.util.framework](framework.py:38) PyTorch with CUDA support is available.
+[2023-08-22 22:44:46,210] [zeus.utils.framework](framework.py:38) PyTorch with CUDA support is available.
[2023-08-22 22:44:46,760] [ZeusMonitor](energy.py:329) Measurement window 'zeus.monitor.energy' started.
^C[2023-08-22 22:44:50,205] [ZeusMonitor](energy.py:329) Measurement window 'zeus.monitor.energy' ended.
Total energy (J):
4 changes: 2 additions & 2 deletions docs/overview/index.md
@@ -102,8 +102,8 @@ Fortunately, DNN training jobs often **recur** in production GPU clusters,[^9] a

This results in two main components in Zeus:

-- **JIT energy profiler** ([`ZeusDataLoader`][zeus.run.dataloader.ZeusDataLoader]): Finds the optimal power limit via online profiling.
-- **MAB + Thompson Sampling** ([`ZeusMaster`][zeus.run.master.ZeusMaster]): Finds the optimal batch size across recurrences.
+- **JIT energy profiler**: Finds the optimal power limit via online profiling.
+- **MAB + Thompson Sampling**: Finds the optimal batch size across recurrences.


<!-- Abbreviation definitions -->
9 changes: 0 additions & 9 deletions docs/requirements.txt

This file was deleted.

4 changes: 0 additions & 4 deletions examples/ZeusDataLoader/README.md

This file was deleted.

111 changes: 0 additions & 111 deletions examples/ZeusDataLoader/capriccio/README.md

This file was deleted.
