Separate BSO as a server (#34)
Co-authored-by: Jae-Won Chung <jwnchung@umich.edu>
show981111 and jaywonchung authored Apr 28, 2024
1 parent b662a1f commit 611084c
Showing 73 changed files with 6,462 additions and 161 deletions.
1 change: 1 addition & 0 deletions .dockerignore
@@ -10,3 +10,4 @@ zeus.egg-info/
.git/

**/data/
**/versions/*.py
3 changes: 3 additions & 0 deletions .gitignore
@@ -11,4 +11,7 @@ dist/
*.json
**/.DS_Store
.cache/
.env
env/
.pytest_cache/
/envs
File renamed without changes.
12 changes: 12 additions & 0 deletions docker/bso_migration.Dockerfile
@@ -0,0 +1,12 @@
FROM python:3.9

WORKDIR /workspace

ADD . /workspace

# For sqlite
# RUN pip install --no-cache-dir aiosqlite

# For mysql
RUN pip install --no-cache-dir asyncmy
RUN pip install --no-cache-dir '.[migration]'
14 changes: 14 additions & 0 deletions docker/bso_server.Dockerfile
@@ -0,0 +1,14 @@
FROM python:3.9

WORKDIR /workspace

ADD . /workspace

# For sqlite
# RUN pip install --no-cache-dir aiosqlite

# For mysql
RUN pip install --no-cache-dir asyncmy
RUN pip install --no-cache-dir '.[bso-server]'

CMD ["uvicorn", "zeus.optimizer.batch_size.server.router:app", "--host", "0.0.0.0", "--port", "80"]
84 changes: 84 additions & 0 deletions docker/docker-compose.yaml
@@ -0,0 +1,84 @@
version: '3.9'
name: zeus_bso_server

services:
  server:
    image: bso-server
    build:
      context: ../
      dockerfile: ./docker/bso_server.Dockerfile
    container_name: bso
    restart: always
    environment:
      ZEUS_BSO_DATABASE_URL: ${ZEUS_BSO_DATABASE_URL-mysql+asyncmy://${ZEUS_BSO_DB_USER}:${ZEUS_BSO_DB_PASSWORD}@db:3306/Zeus}
      ZEUS_BSO_LOG_LEVEL: ${ZEUS_BSO_LOG_LEVEL}
      ZEUS_BSO_ECHO_SQL: ${ZEUS_BSO_ECHO_SQL}
    ports:
      # Map port 80 to the container.
      - "80:80"
    networks:
      - servernet
    depends_on:
      migration:
        # Start only after the migration has completed.
        condition: service_completed_successfully
    labels:
      # Labels for Kubernetes (Kompose).
      kompose.service.type: nodeport
      # Pull the image only when there is no image locally. Otherwise, use the local one.
      kompose.image-pull-policy: IfNotPresent
      # Set the node port. Should be in the range 30000-32767.
      kompose.service.nodeport.port: ${ZEUS_BSO_SERVER_PORT-30100}

  db:
    image: mysql
    container_name: db
    restart: always
    environment:
      MYSQL_DATABASE: Zeus
      MYSQL_USER: ${ZEUS_BSO_DB_USER}
      MYSQL_ROOT_PASSWORD: ${ZEUS_BSO_ROOT_PASSWORD}
      MYSQL_PASSWORD: ${ZEUS_BSO_DB_PASSWORD}
    expose:
      # Open 3306 on the container to the server & migration containers.
      - 3306
    volumes:
      - ./mysql_data:/var/lib/mysql
    networks:
      - servernet
    healthcheck:
      test: ["CMD", "mysqladmin", "ping", "-h", "localhost"]
      timeout: 3s
      retries: 10
      start_period: 2s
      start_interval: 1s

  migration:
    image: bso-migration
    build:
      context: ../
      dockerfile: ./docker/bso_migration.Dockerfile
    deploy:
      restart_policy:
        condition: on-failure
        max_attempts: 3
    depends_on:
      db:
        # Wait until the db is ready to accept connections.
        condition: service_healthy
    # Generate a revision and upgrade the database. Change the revision message as you want.
    command: >
      bash -c 'cd /workspace/zeus/optimizer/batch_size && alembic revision --autogenerate -m "Baseline: create tables" && alembic upgrade head'
    environment:
      ZEUS_BSO_DATABASE_URL: ${ZEUS_BSO_DATABASE_URL-mysql+asyncmy://${ZEUS_BSO_DB_USER}:${ZEUS_BSO_DB_PASSWORD}@db:3306/Zeus}
    networks:
      - servernet
    volumes:
      # Mount the version scripts we generated.
      - ./zeus/optimizer/batch_size/migrations/versions:/workspace/zeus/optimizer/batch_size/migrations/versions
    labels:
      kompose.image-pull-policy: IfNotPresent

networks:
  servernet:
    driver: bridge
200 changes: 200 additions & 0 deletions docs/batch_size_optimizer/index.md
@@ -0,0 +1,200 @@
# Batch Size Optimizer in Zeus

## What is it?

The batch size optimizer (BSO) chooses the batch size that minimizes the cost, where cost is defined as $\text{cost} = \eta \times \text{Energy consumption to accuracy} + (1-\eta) \times \text{Max power} \times \text{Time to accuracy}$.
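
Here, $\eta \in [0, 1]$ trades off energy against time. As a concrete illustration (not Zeus code), the cost of one recurrence could be computed as follows; the function and variable names are made up for this example:

```Python
def bso_cost(eta: float, energy_to_acc: float, max_power: float, time_to_acc: float) -> float:
    """Illustrative cost: eta * energy-to-accuracy + (1 - eta) * max power * time-to-accuracy."""
    return eta * energy_to_acc + (1 - eta) * max_power * time_to_acc

# Example: eta = 0.5, 120 kJ to reach the target accuracy, 300 W max power, 600 s to accuracy.
print(bso_cost(0.5, 120_000, 300, 600))  # 0.5 * 120000 + 0.5 * 300 * 600 = 150000.0
```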

## How does it work?

The core of BSO is a multi-armed bandit (MAB) built around **recurrent** training. After each training run, the resulting cost is fed back to the MAB, and after a certain number of recurrences the MAB converges to the best batch size. On top of the MAB, we employ early stopping and pruning to handle stragglers. For more details, refer to the [paper](https://www.usenix.org/conference/nsdi23/presentation/you).
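
To make the bandit idea concrete, here is a toy sketch of cost-minimizing Thompson sampling over a fixed set of batch sizes. It is only an illustration, not the server's actual implementation; the class name and its scaling constants are invented for this example:

```Python
# Toy cost-minimizing Thompson-sampling bandit over candidate batch sizes (illustration only).
import random
from collections import defaultdict


class BatchSizeMAB:
    def __init__(self, batch_sizes: list[int]) -> None:
        self.batch_sizes = batch_sizes
        self.costs: dict[int, list[float]] = defaultdict(list)

    def choose(self) -> int:
        """Sample a cost estimate for each arm and pick the batch size with the lowest sample."""
        samples = {}
        for bs in self.batch_sizes:
            observed = self.costs[bs]
            if not observed:
                return bs  # Try every batch size at least once.
            mean = sum(observed) / len(observed)
            std = 1.0 / len(observed) ** 0.5  # Uncertainty shrinks with more recurrences.
            samples[bs] = random.gauss(mean, std)
        return min(samples, key=samples.get)

    def observe(self, batch_size: int, cost: float) -> None:
        """Feed back the cost of one recurrence that used `batch_size`."""
        self.costs[batch_size].append(cost)
```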

## Should I use this?

The key to BSO is recurrent training. If you are training your model periodically or repeatedly, BSO can be a great choice for reducing energy or time consumption.

## Limitations

We currently don't support heterogeneous GPUs or varying configurations across recurrences. The number of GPUs, the GPU model, and the other configurations in `JobSpec` should be identical across recurrent training runs. If your training environment varies from run to run, BSO may not be a good fit.

## Sequence diagram of BSO

```mermaid
sequenceDiagram;
participant BSO server
participant BSO client
loop Every recurrent training
BSO client->>BSO server: Register the training job and ask for the batch size
BSO server->>BSO client: Return the next batch size to use with a trial number
loop Every epoch
BSO client->>BSO server: At the end of each epoch, report the result
BSO server->>BSO client: Compute the cost and tell the client if it should stop the training
end
BSO client->>BSO server: Report the end of the trial on exit
end
```

## Quick start (Server)

1. Clone the repository

```Shell
git clone https://github.com/ml-energy/zeus.git
```

2. Create `.env` under the `docker/` directory. An example `.env` is provided below.

By default, MySQL is used as the database.

```Shell
ZEUS_BSO_DB_USER="me"
ZEUS_BSO_DB_PASSWORD="secret"
ZEUS_BSO_ROOT_PASSWORD="secret*"
ZEUS_BSO_SERVER_PORT=8000
ZEUS_BSO_LOG_LEVEL="INFO"
ZEUS_BSO_ECHO_SQL="True"
```

If you want to use a different database, you need to set `ZEUS_BSO_DATABASE_URL` as an environment variable. See the [Remark](#remark-about-the-server) for details.
Also, if you are running with docker-compose or Kubernetes, you need to change the image under `db` in the docker-compose file accordingly.

3. Running a server

- Using docker-compose

```Shell
cd docker
docker-compose -f ./docker-compose.yaml up
```

This will build an image for each container (db, migration, and the server), then spin up those containers.

- Using Kubernetes.

1. Build an image.

```Shell
# From the root directory
docker build -f ./docker/bso_server.Dockerfile -t bso-server .
docker build -f ./docker/bso_migration.Dockerfile -t bso-migration .
```

2. Create Kubernetes yaml files using Kompose. Kompose is a tool that converts docker-compose files into Kubernetes manifests. For more information, see the [Kompose references](#kompose-references).

```Shell
cd docker
docker-compose config > docker-compose-resolved.yaml && kompose convert -f docker-compose-resolved.yaml -o ./kube/ && rm docker-compose-resolved.yaml
```

This first resolves the env file using docker-compose, then creates Kubernetes yaml files under `docker/kube/`.

3. Apply the manifests to Kubernetes.

```Shell
cd kube
kubectl apply -f .
```

- Using uvicorn.

If you are using uvicorn to spin up the server, you need to create a database and perform the migration before starting the server.

1. Run the database of your choice.
2. Set the environment variables in `.env`

```Shell
ZEUS_BSO_DATABASE_URL="mysql+asyncmy://me:secret@localhost:3306/Zeus"
ZEUS_BSO_LOG_LEVEL="INFO"
ZEUS_BSO_ECHO_SQL="True"
```

3. Run Alembic migration

1. Install dependencies

```Bash
pip install '.[migration]'
```

2. Create the migration script. This will create revision scripts under `./versions`.

```Bash
alembic revision --autogenerate -m "Baseline: create tables"
```

3. Apply migration
1. Online (apply it to the database directly)

```Bash
alembic upgrade head
```

2. Offline (generate SQL)

```Bash
alembic upgrade head --sql
```

4. Run the server using uvicorn.

```Shell
cd zeus/optimizer/batch_size/server
uvicorn router:app --reload
```

Now the server is good to go!

### Remark about the server

The Zeus batch size optimizer server uses SQLAlchemy to support various types of databases. However, you need to install the corresponding async connection driver for your database.
MySQL (via `asyncmy`) is used by default. To use a different driver, add its installation to `bso_migration.Dockerfile` and `bso_server.Dockerfile`; refer to those files for examples.
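
For example, to point the server at SQLite through the `aiosqlite` driver mentioned in those Dockerfiles, the server environment might look like the following; the database file path is just an example:

```Shell
# Example only: SQLite via the aiosqlite driver instead of the default MySQL.
# The URL follows SQLAlchemy's "<dialect>+<async driver>://" format.
ZEUS_BSO_DATABASE_URL="sqlite+aiosqlite:///./zeus_bso.db"
ZEUS_BSO_LOG_LEVEL="INFO"
ZEUS_BSO_ECHO_SQL="True"
```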

## Use BSO in your training script (Client)

1. Install Zeus package.

```Shell
pip install zeus-ml[bso]
```

2. Add [`BatchSizeOptimizer`][zeus.optimizer.batch_size.client.BatchSizeOptimizer] to your training script.

```Python
# Initialization
bso = BatchSizeOptimizer(
    monitor=monitor,
    server_url="http://127.0.0.1:8000",
    job=JobParams(
        job_id_prefix="mnist-dev",
        default_batch_size=256,
        batch_sizes=[32, 64, 256, 512, 1024, 4096, 2048],
        max_epochs=100,
    ),
)
# ... other code
# Get the batch size to use from the server
batch_size = bso.get_batch_size()
# ...
# At the beginning of training
bso.on_train_begin()
# ...
# After evaluation
bso.on_evaluate(metric)
```
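
For context, here is a rough sketch of how these calls might sit inside a training loop. `build_dataloader`, `train_one_epoch`, `evaluate`, and `model` are hypothetical placeholders for your own code, not part of Zeus:

```Python
batch_size = bso.get_batch_size()            # Batch size chosen by the server for this trial.
train_loader = build_dataloader(batch_size)  # Hypothetical helper that builds your DataLoader.

bso.on_train_begin()                         # Once, before the first epoch.
for epoch in range(100):                     # Should not exceed JobParams.max_epochs.
    train_one_epoch(model, train_loader)     # Hypothetical: your per-epoch training code.
    metric = evaluate(model)                 # Hypothetical: e.g., validation accuracy.
    bso.on_evaluate(metric)                  # Report the metric; the server may end the trial early.
```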

### Remark about the client

Training can fail if

1. it fails to converge within the configured `max_epochs`, or
2. it exceeds the early-stopping threshold configured by `beta_knob` in `JobSpec`.

In either case, the optimizer raises `ZeusBSOTrainFailError`. This means the chosen batch size was not useful, and the BSO server will not hand out that batch size again. However, the user ***should re-launch the job*** so that the BSO server can give out another batch size. As you launch the job repeatedly, the server learns which batch sizes are useful and converges to the batch size with the least cost.
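
Below is a minimal sketch of handling this error in a launch script, assuming an external scheduler (e.g., cron or a workflow engine) re-submits the job on a nonzero exit. `run_training` is a hypothetical placeholder for your training loop, and the import path of `ZeusBSOTrainFailError` is an assumption; check your installed `zeus-ml` version:

```Python
import sys

# NOTE: the import path below is assumed for illustration; verify where
# ZeusBSOTrainFailError lives in your installed zeus-ml version.
from zeus.optimizer.batch_size.exceptions import ZeusBSOTrainFailError

try:
    run_training(bso)  # Hypothetical: your training loop calling bso.on_train_begin()/on_evaluate().
except ZeusBSOTrainFailError:
    # The chosen batch size was pruned. Exit nonzero so an external scheduler re-launches
    # the job and the server can hand out a different batch size.
    sys.exit(1)
```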

## Kompose references

Refer to [Kompose](https://kompose.io/) and [Kompose labels](https://github.com/kubernetes/kompose/blob/main/docs/user-guide.md) for more information.
13 changes: 6 additions & 7 deletions docs/extend.md
@@ -9,26 +9,25 @@ Users can implement custom policies to optimize batch size and power limits, and

## Interfaces

Zeus defines two abstract classes [`BatchSizeOptimizer`][zeus.policy.BatchSizeOptimizer] and [`PowerLimitOptimizer`][zeus.policy.PowerLimitOptimizer] in [`zeus.policy.interface`][zeus.policy.interface].
Zeus defines two abstract classes [`BatchSizeOptimizer`][zeus._legacy.policy.BatchSizeOptimizer] and [`PowerLimitOptimizer`][zeus._legacy.policy.PowerLimitOptimizer] in [`zeus._legacy.policy.interface`][zeus._legacy.policy.interface].
Each class optimizes the batch size and power limit of a recurring training job respectively.
As in our paper, the batch size optimizer is first invoked to decide which batch size to use, and then the power limit optimizer is invoked with both the job and the batch size chosen to decide which power limit to use.

You can find examples of policy implementations in [`zeus.policy.optimizer`][zeus.policy.optimizer].

You can find examples of policy implementations in [`zeus._legacy.policy.optimizer`][zeus._legacy.policy.optimizer].

## Plugging it into Zeus

There are two ways to run Zeus: trace-driven and end-to-end.

### Trace-driven Zeus

The Zeus simulator ([`Simulator`][zeus.simulate.Simulator]) accepts one [`BatchSizeOptimizer`][zeus.policy.BatchSizeOptimizer] and [`PowerLimitOptimizer`][zeus.policy.PowerLimitOptimizer] in its constructor.
The Zeus simulator ([`Simulator`][zeus._legacy.simulate.Simulator]) accepts one [`BatchSizeOptimizer`][zeus._legacy.policy.BatchSizeOptimizer] and [`PowerLimitOptimizer`][zeus._legacy.policy.PowerLimitOptimizer] in its constructor.
A full example can be found in [`examples/trace_driven`](https://github.com/ml-energy/zeus/tree/master/examples/trace_driven/).

### End-to-end Zeus

There are two central components in end-to-end Zeus: [`ZeusMaster`][zeus.run.ZeusMaster] and [`ZeusDataLoader`][zeus.run.ZeusDataLoader].
The former takes charge of driving the entire optimization over recurring jobs, and accepts an instance of [`BatchSizeOptimizer`][zeus.policy.BatchSizeOptimizer] in its constructor.
The former takes charge of driving the entire optimization over recurring jobs, and accepts an instance of [`BatchSizeOptimizer`][zeus._legacy.policy.BatchSizeOptimizer] in its constructor.
The latter takes charge of JIT-profiling power in the background, determining the optimal power limit, and setting it.
Hence, the functionality of [`JITPowerLimitOptimizer`][zeus.policy.optimizer.JITPowerLimitOptimizer] is already tightly integrated into `ZeusDataLoader`.
Users will have to implement their own [`ZeusDataLoader`][zeus.run.ZeusDataLoader] in order to test another [`PowerLimitOptimizer`][zeus.policy.PowerLimitOptimizer] policy.
Hence, the functionality of [`JITPowerLimitOptimizer`][zeus._legacy.policy.optimizer.JITPowerLimitOptimizer] is already tightly integrated into `ZeusDataLoader`.
Users will have to implement their own [`ZeusDataLoader`][zeus.run.ZeusDataLoader] in order to test another [`PowerLimitOptimizer`][zeus._legacy.policy.PowerLimitOptimizer] policy.
2 changes: 2 additions & 0 deletions docs/gen_ref_pages.py
@@ -26,6 +26,8 @@
for path in sorted(Path("zeus").rglob("*.py")):
    # Path to the generated markdown file.
    doc_path = path.relative_to("zeus").with_suffix(".md")
    if str(doc_path).find("batch_size/migration") != -1:
        continue
    full_doc_path = REF_DIR / doc_path

    module_path = path.with_suffix("")
2 changes: 1 addition & 1 deletion examples/ZeusDataLoader/capriccio/run_zeus.py
@@ -19,7 +19,7 @@
from pathlib import Path

from zeus.job import Job
from zeus.policy import PruningGTSBatchSizeOptimizer
from zeus._legacy.policy import PruningGTSBatchSizeOptimizer
from zeus.run import ZeusMaster
from zeus.util import FileAndConsole

2 changes: 1 addition & 1 deletion examples/ZeusDataLoader/cifar100/run_zeus.py
@@ -19,7 +19,7 @@
from pathlib import Path

from zeus.job import Job
from zeus.policy import PruningGTSBatchSizeOptimizer
from zeus._legacy.policy import PruningGTSBatchSizeOptimizer
from zeus.run import ZeusMaster
from zeus.util import FileAndConsole

2 changes: 1 addition & 1 deletion examples/ZeusDataLoader/imagenet/run_zeus.py
@@ -5,7 +5,7 @@
from pathlib import Path

from zeus.job import Job
from zeus.policy import PruningGTSBatchSizeOptimizer
from zeus._legacy.policy import PruningGTSBatchSizeOptimizer
from zeus.run import ZeusMaster
from zeus.util import FileAndConsole
