RoFL supports orchestrating servers on AWS out of the box to make experiments easier to reproduce. This document describes how to run experiments using Ansible.
- `cd` into the `ansible` folder
- Install pipenv: `pip install pipenv`
- Set up pipenv: run `pipenv install` in the root folder
- Run `pipenv shell` in the root folder
- First run:
  `ansible-playbook analysis.yml -i inventory/analysis -e "exp=demo run=new"`
- Continue a run (with run id):
  `ansible-playbook analysis.yml -i inventory/analysis -e "exp=demo run=1611332286"`
- Start the microbenchmark, check for a fixed amount of time whether the benchmark has finished, then fetch the results:
  `ansible-playbook microbench.yml -i inventory`
- Only start the microbenchmark (with specific `fp` and `frac`):
  `ansible-playbook microbench.yml -i inventory --tags "start" -e "fp=16 frac=8"`
- When the microbenchmark is running, wait until it has finished and then fetch the results:
  `ansible-playbook microbench.yml -i inventory --tags "result"`
- You can add `--ssh-common-args='-o StrictHostKeyChecking=no'` as an argument so that you don't have to type `yes` when connecting to a newly created EC2 instance.
Each experiment can consist of multiple configurations that are run one after another, defined in the `experiments` key in the config file after `base_experiment`.
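As a sketch, such a config file might be laid out as follows. Only the `experiments` and `base_experiment` keys come from the description above; every parameter name and value inside them is a hypothetical placeholder:

```yaml
# Hypothetical layout; only the base_experiment/experiments keys are from the text.
base_experiment:          # settings shared by all configurations
  clients: 48             # placeholder parameters
  rounds: 10

experiments:              # configurations, run one after the other
  - name: config_a
    quantization_bits: 8
  - name: config_b
    quantization_bits: 16
```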
To start a new experiment, pass `run=new`:
`ansible-playbook e2ebench.yml -i inventory --ssh-common-args='-o StrictHostKeyChecking=no' -e "exp=mnist_e2e run=new"`
This will set up the required machines and the configurations for the experiments. After the first configuration has finished, invoke the same command but this time with the run id of the current experiment:
`ansible-playbook e2ebench.yml -i inventory --ssh-common-args='-o StrictHostKeyChecking=no' -e "exp=mnist_e2e run=<RUN_ID>"`
This will retrieve the results of the first configuration of the experiment.
To launch the next configuration of the experiment, invoke the same command again with the `<RUN_ID>`.
Note: The run id can be found in the `experiment_results` directory and is currently the timestamp of when the experiment was started.
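Since the run id is a Unix timestamp, you can turn it into a human-readable date to identify a run, e.g. with GNU `date` (on macOS, use `date -u -r 1611332286` instead):

```shell
# Convert a run id (Unix timestamp) to a human-readable UTC date (GNU date).
date -u -d @1611332286 +"%Y-%m-%d %H:%M:%S"
```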
- Most job configuration parameters are in the `.yml` config files (e.g. `experiments/mnist_basic.yml`) under the `job` key.
- Make sure the number of clients in the FL setup is divisible by the number of client machines. If this is not the case, the client-to-machine assignment algorithm does not work properly. In the future, this should be easy to fix to allow for an unbalanced division.
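The divisibility constraint above can be checked before launching a run. This is a minimal sketch with hypothetical example numbers (48 clients, 6 machines); substitute your own experiment's values:

```shell
#!/bin/sh
# Sanity check: clients must split evenly across client machines.
num_clients=48     # example value
num_machines=6     # example value

if [ $((num_clients % num_machines)) -ne 0 ]; then
  echo "error: $num_clients clients do not divide evenly over $num_machines machines" >&2
  exit 1
fi
echo "$((num_clients / num_machines)) clients per machine"   # prints "8 clients per machine"
```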
- Other configuration parameters, such as the machine type and optimization (e.g., skylake), can be found in `group_vars/all/main.yml`.
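For orientation, that file might contain entries along these lines. The key names below are guesses; only the notion of a machine type and the skylake optimization are from the text above:

```yaml
# Hypothetical sketch of group_vars/all/main.yml; key names are placeholders.
instance_type: c5.2xlarge   # AWS machine type (placeholder value)
optimization: skylake       # CPU optimization target mentioned above
```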