This is an AIchor demo project. Feel free to fork it if you want to try it out.
It aims to get you up to speed with AIchor by walking through the whole process.
You can find multiple sample manifests in each project's manifests
directory. To try Hugging Face Accelerate, for example, copy one of its manifests to manifest.yaml:
$ cp hugging-face-accelerate/manifests/single_worker/manifest.1-wrkr-1-a100-80gb.yaml manifest.yaml
# also works with
# cp smoke-test/manifests/manifest.kuberay.sample.yaml manifest.yaml
# cp smoke-test/manifests/manifest.pytorch.sample.yaml manifest.yaml
# cp parallel-jobs-demo/manifests/manifest.yaml manifest.yaml
$ git add manifest.yaml
$ git commit -m "exp: eriment" # the commit message must start with "exp: " to trigger an experiment
$ git push
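The commit-message rule above can be checked locally before pushing. The following is a minimal sketch, not AIchor's actual matching logic: the function name `triggers_experiment` is hypothetical, and it assumes the match is a case-sensitive test for the literal prefix "exp: " (including the trailing space).

```python
# Hypothetical helper: assumes AIchor matches the literal, case-sensitive
# prefix "exp: " at the start of the commit message.
EXPERIMENT_PREFIX = "exp: "


def triggers_experiment(commit_message: str) -> bool:
    """Return True if the commit message starts with the experiment prefix."""
    return commit_message.startswith(EXPERIMENT_PREFIX)


if __name__ == "__main__":
    for message in ("exp: eriment", "fix: typo"):
        print(f"{message!r}: {triggers_experiment(message)}")
```

Under that assumption, "exp: eriment" would trigger an experiment and "fix: typo" would not.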
This project works across all AIchor operators. It runs a vanilla experiment that:
- prints the chosen operator's environment variables
- creates a TensorBoard log with the commit message
- sleeps for x seconds
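The three steps above could look roughly like the sketch below. It is an illustration, not the project's actual code: the variable-name prefixes, the `logs` directory, and all function names are assumptions, and the TensorBoard step is skipped when `torch.utils.tensorboard` is not importable.

```python
import os
import time

# Assumed prefixes for illustration only; the real project may select
# operator variables differently.
OPERATOR_PREFIXES = ("MASTER_", "WORLD_SIZE", "RANK", "TF_CONFIG", "RAY_")


def operator_env(prefixes=OPERATOR_PREFIXES):
    """Return environment variables whose names start with one of the prefixes."""
    return {k: v for k, v in sorted(os.environ.items()) if k.startswith(prefixes)}


def log_commit_message(message, logdir="logs"):
    """Write the commit message as a TensorBoard text summary, if a writer is available."""
    try:
        from torch.utils.tensorboard import SummaryWriter
    except ImportError:
        print("TensorBoard writer not available, skipping log")
        return False
    with SummaryWriter(log_dir=logdir) as writer:
        writer.add_text("commit_message", message)
    return True


def run(message, sleep_seconds=0.0):
    """Print operator variables, log the commit message, then sleep."""
    for name, value in operator_env().items():
        print(f"{name}={value}")
    log_commit_message(message)
    time.sleep(sleep_seconds)
```

Calling `run("exp: eriment", sleep_seconds=5)` would print any matching variables, attempt the TensorBoard write, and then sleep for five seconds.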
Uses Hugging Face Accelerate to set up distributed training with the PyTorch operator.
Demo project using JAX distributed, with processes spread across multiple containers.
Runs multiple jobs in parallel in a single AIchor experiment, each job being a container. Uses the TF operator.
Demo project using PyTorch distributed, with processes spread across multiple containers.
Demo project using ray[tune], distributed across multiple containers thanks to KubeRay.
Demo project using XGBoost distributed, with processes spread across multiple containers.