Add simulator commands to readme

ZiyueXu77 · Sep 21, 2023 · 6af3c75
1 parent 35da789
commit 6af3c75
Showing 1 changed file with 41 additions and 19 deletions.
diff --git a/integration/nemo/examples/supervised_fine_tuning/README.md b/integration/nemo/examples/supervised_fine_tuning/README.md
@@ -71,10 +71,47 @@ python utils/combine_jsonl.py --file_list Data/Processed/alpaca/testing.jsonl Da
 ```
 
 ## Federated learning simulations
-Next, we are using NVFlare's [POC mode](https://nvflare.readthedocs.io/en/main/getting_started.html#setting-up-poc) to simulate 
-each client training on their own dataset locally, centralized training with all three dataset combined, and all three clients training together using the 
+We can use either NVFlare's [FL Simulator](https://nvflare.readthedocs.io/en/main/getting_started.html#the-fl-simulator) or [POC mode](https://nvflare.readthedocs.io/en/main/getting_started.html#setting-up-poc) to simulate federated learning experiments.
+
+First, we create the configuration files and modify them to include the current directory path to access the dataset and pre-trained LLM.
+At this point, we also modify the data path and local number of clients.
+
+We perform five experiments in total: training on each client's own dataset, training on the combined dataset, and all three clients training together using the 
 [FedAvg](https://arxiv.org/abs/1602.05629) algorithm implemented in NVFlare.
 
+### Job configurations 
+For single-site training, run the following in a standard terminal:
+```
+python utils/create_configs.py --job_folder "jobs/gpt_sft_1.3B_alpaca" --num_clients 1 --devices 1 --train_ds_files /workspace/Data/Processed/alpaca/training.jsonl --validation_ds_files /workspace/Data/Processed/combined/validation.jsonl
+python utils/create_configs.py --job_folder "jobs/gpt_sft_1.3B_dolly" --num_clients 1 --devices 1 --train_ds_files /workspace/Data/Processed/dolly/training.jsonl --validation_ds_files /workspace/Data/Processed/combined/validation.jsonl
+python utils/create_configs.py --job_folder "jobs/gpt_sft_1.3B_oasst1" --num_clients 1 --devices 1 --train_ds_files /workspace/Data/Processed/oasst1/training.jsonl --validation_ds_files /workspace/Data/Processed/combined/validation.jsonl
+python utils/create_configs.py --job_folder "jobs/gpt_sft_1.3B_combined" --num_clients 1 --devices 1 --train_ds_files /workspace/Data/Processed/combined/training.jsonl --validation_ds_files /workspace/Data/Processed/combined/validation.jsonl
+```
+and for FedAvg:
+```
+python utils/create_configs.py --job_folder "jobs/gpt_sft_1.3B_fedavg" --num_clients 3 --devices 1 --train_ds_files /workspace/Data/Processed/alpaca/training.jsonl /workspace/Data/Processed/dolly/training.jsonl /workspace/Data/Processed/oasst1/training.jsonl --validation_ds_files /workspace/Data/Processed/combined/validation.jsonl  /workspace/Data/Processed/combined/validation.jsonl  /workspace/Data/Processed/combined/validation.jsonl
+```
+Here, each client performs SFT for one local epoch before sending its local model updates to the server for aggregation. 
+
+Note that we used the combined validation set for all experiments, allowing for a direct comparison across all training settings.
+
+### Use FL Simulator
+We use the NVFlare simulator to run the FL training experiments.
+For local training on each dataset separately and on the combined dataset:
+```
+nvflare simulator jobs/gpt_sft_1.3B_alpaca -w workspace_simulator_alpaca -n 1 -gpu 0
+nvflare simulator jobs/gpt_sft_1.3B_dolly -w workspace_simulator_dolly -n 1 -gpu 0
+nvflare simulator jobs/gpt_sft_1.3B_oasst1 -w workspace_simulator_oasst1 -n 1 -gpu 0
+nvflare simulator jobs/gpt_sft_1.3B_combined -w workspace_simulator_combined -n 1 -gpu 0
+```
+For FedAvg:
+```
+nvflare simulator jobs/gpt_sft_1.3B_fedavg -w workspace_simulator_fedavg -n 3 -gpu 0,0,0
+```
+
+### Use POC mode
+Alternatively, we can use NVFlare's [POC mode](https://nvflare.readthedocs.io/en/main/getting_started.html#setting-up-poc) to simulate the federated learning experiments.
+
 #### 1. Local and Centralized SFT
 For single-site and centralized training experiments, we create the poc workspaces:
 ```
@@ -89,17 +126,6 @@ For better usability, open a new terminal and start the [admin command prompt](h
 nvflare poc start -p admin@nvidia.com
 ```
 
-We create the configuration files and modify them to include the current directory path to access the dataset and pre-trained LLM.
-At this point, we also modify the data path and local number of clients.
-
-In a standard terminal, run
-```
-python utils/create_configs.py --job_folder "jobs/gpt_sft_1.3B_alpaca" --num_clients 1 --devices 1 --train_ds_files /workspace/Data/Processed/alpaca/training.jsonl --validation_ds_files /workspace/Data/Processed/combined/validation.jsonl
-python utils/create_configs.py --job_folder "jobs/gpt_sft_1.3B_dolly" --num_clients 1 --devices 1 --train_ds_files /workspace/Data/Processed/dolly/training.jsonl --validation_ds_files /workspace/Data/Processed/combined/validation.jsonl
-python utils/create_configs.py --job_folder "jobs/gpt_sft_1.3B_oasst1" --num_clients 1 --devices 1 --train_ds_files /workspace/Data/Processed/oasst1/training.jsonl --validation_ds_files /workspace/Data/Processed/combined/validation.jsonl
-python utils/create_configs.py --job_folder "jobs/gpt_sft_1.3B_combined" --num_clients 1 --devices 1 --train_ds_files /workspace/Data/Processed/combined/training.jsonl --validation_ds_files /workspace/Data/Processed/combined/validation.jsonl
-```
-Note that we used the combined validation set for all experiments, allowing for a direct comparison across all training settings.
 
 Next, copy the jobs to temp workspace.
 ```
@@ -130,16 +156,12 @@ For better usability, open a new terminal and start the [admin command prompt](h
 nvflare poc start -p admin@nvidia.com
 ```
 
-First, create and modify the configuration files again. 
-Here, each client performs SFT for one local epoch before sending their local model updates to the server for aggregation. 
-```
-python3 utils/create_configs.py --job_folder "jobs/gpt_sft_1.3B_fedavg" --num_clients 3 --devices 1 --train_ds_files /workspace/Data/Processed/alpaca/training.jsonl /workspace/Data/Processed/dolly/training.jsonl /workspace/Data/Processed/oasst1/training.jsonl --validation_ds_files /workspace/Data/Processed/combined/validation.jsonl  /workspace/Data/Processed/combined/validation.jsonl  /workspace/Data/Processed/combined/validation.jsonl
-```
+
 Next, simulate the federated SFT using FedAvg, similarly to single-client experiments
 ```
 cp -r jobs/gpt_sft_1.3B_fedavg /tmp/nvflare/poc/example_project/prod_00/admin\@nvidia.com/transfer/
 ```
-and
+and submit the FedAvg job:
 ```
 submit_job gpt_sft_1.3B_fedavg
 ```
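
Review note: the five `create_configs.py` calls and the five `nvflare simulator` calls added in this diff differ only in the experiment name, the number of clients, and the list of training files. The sketch below is one possible way to generate those command lines instead of copying them by hand. It is not part of the commit. The helper names (`create_config_cmd`, `simulator_cmd`, `all_commands`) are hypothetical, the script only prints the commands rather than running them, and it assumes the `/workspace/Data/Processed` layout and the flags exactly as they appear in the README text above.

```python
import shlex

# Paths and dataset names taken from the README commands in this diff.
DATA_ROOT = "/workspace/Data/Processed"
DATASETS = ["alpaca", "dolly", "oasst1"]
VALIDATION = f"{DATA_ROOT}/combined/validation.jsonl"


def create_config_cmd(name, train_sets):
    """Build the utils/create_configs.py call for one experiment (hypothetical helper)."""
    num_clients = len(train_sets)
    train_files = [f"{DATA_ROOT}/{d}/training.jsonl" for d in train_sets]
    return [
        "python", "utils/create_configs.py",
        "--job_folder", f"jobs/gpt_sft_1.3B_{name}",
        "--num_clients", str(num_clients),
        "--devices", "1",
        "--train_ds_files", *train_files,
        # Every client validates on the combined validation set.
        "--validation_ds_files", *([VALIDATION] * num_clients),
    ]


def simulator_cmd(name, num_clients):
    """Build the matching `nvflare simulator` call (hypothetical helper)."""
    return [
        "nvflare", "simulator", f"jobs/gpt_sft_1.3B_{name}",
        "-w", f"workspace_simulator_{name}",
        "-n", str(num_clients),
        # One GPU id per client, all on GPU 0 as in the README.
        "-gpu", ",".join(["0"] * num_clients),
    ]


def all_commands():
    """Return the command lists for the four single-site runs and the FedAvg run."""
    experiments = [(d, [d]) for d in DATASETS + ["combined"]]
    experiments.append(("fedavg", DATASETS))
    cmds = []
    for name, train_sets in experiments:
        cmds.append(create_config_cmd(name, train_sets))
        cmds.append(simulator_cmd(name, len(train_sets)))
    return cmds


if __name__ == "__main__":
    # Dry run: print the commands so they can be inspected or pasted into a shell.
    for cmd in all_commands():
        print(shlex.join(cmd))
```

Printing the commands instead of executing them keeps the sketch safe to run outside the training container. Passing each list to `subprocess.run(cmd, check=True)` would be the obvious next step, under the same assumptions.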