Update README - created a Pathways section.
Removed a Pathways arg.
RoshaniN committed Mar 27, 2024
1 parent 24f8d5b commit 86bdd91
Showing 2 changed files with 43 additions and 40 deletions.
82 changes: 43 additions & 39 deletions README.md
@@ -125,15 +125,6 @@ all zones.
--num-slices=4 --spot
```

* Cluster Create for Pathways:
A Pathways-compatible cluster can be created with `--enable-pathways`:
```shell
python3 xpk.py cluster create \
--cluster xpk-pw-test \
--num-slices=4 --on-demand \
--tpu-type=v5litepod-16 \
--enable-pathways
```

* Cluster Create can be called again with the same `--cluster name` to modify
the number of slices or retry failed steps.
@@ -211,36 +202,6 @@ all zones.
--tpu-type=v5litepod-16
```

* Workload Create for Pathways:
A Pathways workload can be submitted with `--use-pathways` on a Pathways-enabled cluster (one created with `--enable-pathways`).

Pathways workload example:
```shell
python3 xpk.py workload create \
--workload xpk-pw-test \
--num-slices=1 \
--tpu-type=v5litepod-16 \
--use-pathways \
--cluster xpk-pw-test \
--docker-name='user-workload' \
--docker-image=<maxtext docker image> \
--command='bash /usr/pathways/ifrt/maxtext_entrypoint.sh base_output_directory=<output directory> dataset_path=<dataset path> per_device_batch_size=1 enable_checkpointing=false enable_profiler=false remat_policy=full global_parameter_scale=4 steps=300 max_target_length=2048 use_iota_embed=true reuse_example_batch=1 dataset_type=synthetic attention=flash gcs_metrics=True run_name=$(USER)-pw-xpk-test-1'
```

A regular workload can also be submitted on a Pathways-enabled cluster (one created with `--enable-pathways`).

Regular workload example:
```shell
python3 xpk.py workload create \
--workload xpk-regular-test \
--num-slices=1 \
--tpu-type=v5litepod-16 \
--cluster xpk-pw-test \
--docker-name='user-workload' \
--docker-image=<maxtext docker image> \
--command='python3 MaxText/train.py MaxText/configs/base.yml base_output_directory=<output directory> dataset_path=<dataset path> per_device_batch_size=1 enable_checkpointing=false enable_profiler=false remat_policy=full global_parameter_scale=4 steps=300 max_target_length=2048 use_iota_embed=true reuse_example_batch=1 dataset_type=synthetic attention=flash gcs_metrics=True run_name=$(USER)-pw-xpk-test-1'
```

### Set `max-restarts` for production jobs

* `--max-restarts <value>`: By default, this is 0. This will restart the job ""
@@ -354,6 +315,49 @@ checkpointing so the job restarts near where it was interrupted.
python3 xpk.py workload list \
--cluster xpk-test --filter-by-job=$USER
```
## Pathways on XPK
* Cluster Create for Pathways:
A Pathways-compatible cluster can be created with `--enable-pathways`:
```shell
python3 xpk.py cluster create \
--cluster xpk-pw-test \
--num-slices=4 --on-demand \
--tpu-type=v5litepod-16 \
--enable-pathways
```
* Workload Create for Pathways:
A Pathways workload can be submitted with `--use-pathways` on a Pathways-enabled cluster (one created with `--enable-pathways`).
Pathways workload example:
```shell
python3 xpk.py workload create \
--workload xpk-pw-test \
--num-slices=1 \
--tpu-type=v5litepod-16 \
--cluster xpk-pw-test \
--use-pathways \
--server-image=<Pathways server image> \
--proxy-server-image=<Pathways proxy server image> \
--docker-name='user-workload' \
--docker-image=<maxtext docker image> \
--command='bash /usr/pathways/ifrt/maxtext_entrypoint.sh base_output_directory=<output directory> dataset_path=<dataset path> per_device_batch_size=1 enable_checkpointing=false enable_profiler=false remat_policy=full global_parameter_scale=4 steps=300 max_target_length=2048 use_iota_embed=true reuse_example_batch=1 dataset_type=synthetic attention=flash gcs_metrics=True run_name=$(USER)-pw-xpk-test-1'
```
A regular workload can also be submitted on a Pathways-enabled cluster (i.e., a cluster created with `--enable-pathways`) by omitting `--use-pathways`.
Regular workload example:
```shell
python3 xpk.py workload create \
--workload xpk-regular-test \
--num-slices=1 \
--tpu-type=v5litepod-16 \
--cluster xpk-pw-test \
--docker-name='user-workload' \
--docker-image=<maxtext docker image> \
--command='python3 MaxText/train.py MaxText/configs/base.yml base_output_directory=<output directory> dataset_path=<dataset path> per_device_batch_size=1 enable_checkpointing=false enable_profiler=false remat_policy=full global_parameter_scale=4 steps=300 max_target_length=2048 use_iota_embed=true reuse_example_batch=1 dataset_type=synthetic attention=flash gcs_metrics=True run_name=$(USER)-pw-xpk-test-1'
```
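The two workload examples above differ only in the Pathways-specific flags and in the command they run. As an illustration outside the README text itself, the sketch below assembles the argument list for both cases. The helper `build_workload_args` is hypothetical and not part of xpk; it uses only flags that appear in the examples above, and it builds the command line without running anything.

```python
import shlex
from typing import List, Optional


def build_workload_args(
    workload: str,
    cluster: str,
    docker_image: str,
    command: str,
    use_pathways: bool = False,
    server_image: Optional[str] = None,
    proxy_server_image: Optional[str] = None,
) -> List[str]:
  """Hypothetical helper: builds argv for `xpk.py workload create`."""
  argv = [
      'python3', 'xpk.py', 'workload', 'create',
      '--workload', workload,
      '--num-slices=1',
      '--tpu-type=v5litepod-16',
      '--cluster', cluster,
  ]
  if use_pathways:
    # Pathways-only flags, taken from the Pathways workload example.
    argv.append('--use-pathways')
    if server_image:
      argv.append(f'--server-image={server_image}')
    if proxy_server_image:
      argv.append(f'--proxy-server-image={proxy_server_image}')
  argv += [
      '--docker-name=user-workload',
      f'--docker-image={docker_image}',
      f'--command={command}',
  ]
  return argv


# Placeholders in angle brackets mirror the README and must be filled in.
pathways_argv = build_workload_args(
    workload='xpk-pw-test',
    cluster='xpk-pw-test',
    docker_image='<maxtext docker image>',
    command='bash /usr/pathways/ifrt/maxtext_entrypoint.sh',
    use_pathways=True,
    server_image='<Pathways server image>',
    proxy_server_image='<Pathways proxy server image>',
)
regular_argv = build_workload_args(
    workload='xpk-regular-test',
    cluster='xpk-pw-test',
    docker_image='<maxtext docker image>',
    command='python3 MaxText/train.py MaxText/configs/base.yml',
)

# shlex.join quotes each argument so the line can be pasted into a shell.
print(shlex.join(pathways_argv))
print(shlex.join(regular_argv))
```

Both calls target the same cluster, `xpk-pw-test`, which matches the README's point that a Pathways-enabled cluster accepts either kind of workload.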
## Inspector
* Inspector provides debug info to understand cluster health, and why workloads are not running.
1 change: 0 additions & 1 deletion xpk.py
@@ -3410,7 +3410,6 @@ def get_pathways_proxy_args(args) -> str:
- --pathways_ifrt_proxy_server_resource_manager={args.workload}-rm-0-0.{args.workload}:38677
- --pathways_ifrt_proxy_server_port=38676
- --pathways_tmp_dir_pattern={args.pathways_gcs_location}
- --pathways_xprof_trace_enable_bulk_upload=true
- --pathways_plaque_network=gcp"""
  if args.use_pathways:
    return yaml.format(args=args)

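To see what the template in `get_pathways_proxy_args` renders to after this change, here is a minimal, self-contained sketch. It rests on several assumptions: the single deleted line in the hunk is taken to be `--pathways_xprof_trace_enable_bulk_upload=true` (the middle line of the seven shown); only the template lines visible in the hunk are reproduced, so the real template may have more lines above them; the return value for non-Pathways workloads is not shown in the diff and is assumed to be an empty string; and the bucket path is a made-up placeholder. `render_proxy_args` is an illustrative stand-in, not the function in xpk.py.

```python
from types import SimpleNamespace

# Only the lines visible in the hunk, with the deleted xprof argument left out.
PROXY_ARGS_TEMPLATE = """- --pathways_ifrt_proxy_server_resource_manager={args.workload}-rm-0-0.{args.workload}:38677
- --pathways_ifrt_proxy_server_port=38676
- --pathways_tmp_dir_pattern={args.pathways_gcs_location}
- --pathways_plaque_network=gcp"""


def render_proxy_args(args) -> str:
  """Illustrative stand-in for get_pathways_proxy_args."""
  if args.use_pathways:
    # str.format resolves attribute lookups such as {args.workload}.
    return PROXY_ARGS_TEMPLATE.format(args=args)
  # Assumption: the non-Pathways branch is not visible in the hunk.
  return ''


# SimpleNamespace stands in for the parsed argparse namespace.
example_args = SimpleNamespace(
    workload='xpk-pw-test',
    pathways_gcs_location='gs://example-bucket/tmp',  # hypothetical path
    use_pathways=True,
)
rendered = render_proxy_args(example_args)
print(rendered)
```

The rendered text substitutes the workload name twice in the resource manager address and no longer contains the xprof bulk-upload argument.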