Fix TF examples
YuanTingHsieh committed Oct 14, 2024
1 parent fa135de commit 3573a05
Showing 6 changed files with 25 additions and 80 deletions.
20 changes: 6 additions & 14 deletions examples/advanced/job_api/tf/README.md
@@ -7,9 +7,8 @@ All examples in this folder are based on using [TensorFlow](https://tensorflow.o

## Simulated Federated Learning with CIFAR10 Using Tensorflow

-This example shows `Tensorflow`-based classic Federated Learning
-algorithms, namely FedAvg and FedOpt on CIFAR10
-dataset. This example is analogous to [the example using `Pytorch`
+This example demonstrates TensorFlow-based federated learning algorithms on the CIFAR-10 dataset.
+This example is analogous to [the example using `Pytorch`
backend](https://github.com/NVIDIA/NVFlare/tree/main/examples/advanced/cifar10/cifar10-sim)
on the same dataset, where the same experiments
were conducted and analyzed. You should expect the same
@@ -21,7 +20,7 @@ client-side training logics (details in file
and the new
[`FedJob`](https://github.com/NVIDIA/NVFlare/blob/main/nvflare/job_config/api.py)
APIs were used to programmatically set up an
-`nvflare` job to be exported or ran by simulator (details in file
+NVFlare job to be exported or run by the simulator (details in file
[`tf_fl_script_runner_cifar10.py`](tf_fl_script_runner_cifar10.py)),
alleviating the need to write job config files and simplifying the
development process.
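
For context, here is a minimal sketch of such a programmatic job setup with the Job API. It is illustrative only, not the contents of `tf_fl_script_runner_cifar10.py`; the class names, import paths, and argument values shown are assumptions based on the NVFlare 2.5-style API.

```python
# Hedged sketch: a programmatic FedAvg job built with the FedJob API.
from nvflare.app_common.workflows.fedavg import FedAvg
from nvflare.job_config.api import FedJob
from nvflare.job_config.script_runner import ScriptRunner

n_clients = 8
job = FedJob(name="cifar10_tf_fedavg")

# Server-side workflow: FedAvg across n_clients sites for 50 rounds.
job.to(FedAvg(num_clients=n_clients, num_rounds=50), "server")

# Client-side training script (implemented with the Client API).
for i in range(n_clients):
    runner = ScriptRunner(
        script="src/cifar10_tf_fl_alpha_split.py",
        script_args="--batch_size 64 --epochs 4",
    )
    job.to(runner, f"site-{i + 1}")

# Export a job config folder, or run the job directly in the FL simulator.
job.export_job("/tmp/nvflare/jobs/job_config")
job.simulator_run("/tmp/nvflare/jobs/workdir", gpu="0")
```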
@@ -50,10 +49,7 @@ described below at once:
bash ./run_jobs.sh
```
The CIFAR10 dataset will be downloaded when running any experiment for
-the first time. `Tensorboard` summary logs will be generated during
-any experiment, and you can use `Tensorboard` to visualize the
-training and validation process as the experiment runs. Data split
-files, summary logs and results will be saved in a workspace
+the first time. Data split files, summary logs and results will be saved in a workspace
directory, which defaults to `/tmp` and can be configured by setting the
`--workspace` argument of the `tf_fl_script_runner_cifar10.py`
script.
@@ -65,12 +61,8 @@ script.
> `export TF_FORCE_GPU_ALLOW_GROWTH=true && export
> TF_GPU_ALLOCATOR=cuda_malloc_async`
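
The same settings can also be applied from inside a Python script, as in this small sketch, provided they are set before TensorFlow is imported:

```python
# Equivalent in-process setup; these variables only take effect if they are
# set before TensorFlow initializes the GPU, i.e. before `import tensorflow`.
import os

os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"
os.environ["TF_GPU_ALLOCATOR"] = "cuda_malloc_async"

import tensorflow as tf  # imported only after the environment is configured

print(tf.config.list_physical_devices("GPU"))
```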
-The set-up of all experiments in this example are kept the same as
-[the example using `Pytorch`
-backend](https://github.com/NVIDIA/NVFlare/tree/main/examples/advanced/cifar10/cifar10-sim). Refer
-to the `Pytorch` example for more details. Similar to the Pytorch
-example, we here also use Dirichelet sampling on CIFAR10 data labels
-to simulate data heterogeneity among data splits for different client
+We use Dirichlet sampling (implementation from [FedMA](https://github.com/IBM/FedMA)) on
+CIFAR10 data labels to simulate data heterogeneity among data splits for different client
sites, controlled by an alpha value, ranging from 0 (not including 0)
to 1. A high alpha value indicates less data heterogeneity, i.e., an
alpha value equal to 1.0 would result in homogeneous data distribution
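To make the alpha-controlled split concrete, here is an illustrative sketch of Dirichlet label sampling. It is a simplified stand-in for, not a copy of, the FedMA-derived implementation used by the data split scripts; the function name and arguments are invented for illustration.

```python
# Illustrative Dirichlet split over class labels: small alpha -> highly skewed
# per-site class distributions, alpha near 1.0 -> close to homogeneous splits.
import numpy as np


def dirichlet_split(labels: np.ndarray, n_clients: int, alpha: float, seed: int = 0):
    """Return one array of sample indices per client site."""
    rng = np.random.default_rng(seed)
    client_indices = [[] for _ in range(n_clients)]
    for cls in np.unique(labels):
        cls_idx = np.where(labels == cls)[0]
        rng.shuffle(cls_idx)
        # Per-class proportions drawn from a symmetric Dirichlet distribution.
        proportions = rng.dirichlet(alpha * np.ones(n_clients))
        cut_points = (np.cumsum(proportions)[:-1] * len(cls_idx)).astype(int)
        for i, chunk in enumerate(np.split(cls_idx, cut_points)):
            client_indices[i].extend(chunk.tolist())
    return [np.array(idx) for idx in client_indices]


# Example: split 50,000 CIFAR10 training labels across 8 sites with alpha=0.1.
# splits = dirichlet_split(train_labels, n_clients=8, alpha=0.1)
```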
12 changes: 6 additions & 6 deletions examples/advanced/job_api/tf/run_jobs.sh
@@ -25,7 +25,7 @@ GPU_INDX=0
WORKSPACE=/tmp

# Run centralized training job
-python ./tf_fl_script_executor_cifar10.py \
+python ./tf_fl_script_runner_cifar10.py \
--algo centralized \
--n_clients 1 \
--num_rounds 25 \
@@ -39,7 +39,7 @@ python ./tf_fl_script_executor_cifar10.py \
# Run FedAvg with different alpha values
for alpha in 1.0 0.5 0.3 0.1; do

-python ./tf_fl_script_executor_cifar10.py \
+python ./tf_fl_script_runner_cifar10.py \
--algo fedavg \
--n_clients 8 \
--num_rounds 50 \
@@ -53,7 +53,7 @@ done


# Run FedOpt job
-python ./tf_fl_script_executor_cifar10.py \
+python ./tf_fl_script_runner_cifar10.py \
--algo fedopt \
--n_clients 8 \
--num_rounds 50 \
@@ -65,7 +65,7 @@ python ./tf_fl_script_executor_cifar10.py \


# Run FedProx job.
-python ./tf_fl_script_executor_cifar10.py \
+python ./tf_fl_script_runner_cifar10.py \
--algo fedprox \
--n_clients 8 \
--num_rounds 50 \
@@ -77,11 +77,11 @@ python ./tf_fl_script_executor_cifar10.py \


# Run scaffold job
-python ./tf_fl_script_executor_cifar10.py \
+python ./tf_fl_script_runner_cifar10.py \
--algo scaffold \
--n_clients 8 \
--num_rounds 50 \
--batch_size 64 \
--epochs 4 \
--alpha 0.1 \
--gpu $GPU_INDX
34 changes: 11 additions & 23 deletions examples/getting_started/tf/README.md
@@ -1,26 +1,21 @@
# Getting Started with NVFlare (TensorFlow)
[![TensorFlow Logo](https://upload.wikimedia.org/wikipedia/commons/a/ab/TensorFlow_logo.svg)](https://tensorflow.org/)

-We provide several examples to quickly get you started using NVFlare's Job API.
+We provide several examples to help you quickly get started with NVFlare.
All examples in this folder are based on using [TensorFlow](https://tensorflow.org/) as the model training framework.

## Simulated Federated Learning with CIFAR10 Using Tensorflow

-This example shows `Tensorflow`-based classic Federated Learning
-algorithms, namely FedAvg and FedOpt on CIFAR10
-dataset. This example is analogous to [the example using `Pytorch`
-backend](https://github.com/NVIDIA/NVFlare/tree/main/examples/advanced/cifar10/cifar10-sim)
-on the same dataset, where same experiments
-were conducted and analyzed. You should expect the same
-experimental results when comparing this example with the `Pytorch` one.
+This example demonstrates TensorFlow-based federated learning algorithms,
+FedAvg and FedOpt, on the CIFAR-10 dataset.

In this example, the latest Client APIs were used to implement
client-side training logic (details in file
[`cifar10_tf_fl_alpha_split.py`](src/cifar10_tf_fl_alpha_split.py)),
and the new
[`FedJob`](https://github.com/NVIDIA/NVFlare/blob/main/nvflare/job_config/api.py)
APIs were used to programmatically set up an
-`nvflare` job to be exported or ran by simulator (details in file
+NVFlare job to be exported or run by the simulator (details in file
[`tf_fl_script_runner_cifar10.py`](tf_fl_script_runner_cifar10.py)),
alleviating the need to write job config files and simplifying the
development process.
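
As a rough illustration of the Client API training-loop pattern used by the script, here is a hedged sketch. The model, the data handling, and the exact shape of the exchanged parameters are assumptions for illustration; the real `cifar10_tf_fl_alpha_split.py` additionally handles the alpha-based data split, evaluation, and logging.

```python
# Hedged sketch of an NVFlare Client API training loop for TensorFlow.
import nvflare.client as flare
import tensorflow as tf
from nvflare.app_common.abstract.fl_model import FLModel

# Tiny stand-in model purely for illustration; the real example defines a CNN under src/.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(32, 32, 3)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy", metrics=["accuracy"])

(x_train, y_train), _ = tf.keras.datasets.cifar10.load_data()

flare.init()
while flare.is_running():
    input_model = flare.receive()  # FLModel carrying the current global weights
    # Assumption: params are keyed by layer name, as in NVFlare's TF examples.
    for name, weights in input_model.params.items():
        model.get_layer(name).set_weights(weights)

    model.fit(x_train, y_train, epochs=1, batch_size=64, verbose=0)

    flare.send(FLModel(params={layer.name: layer.get_weights() for layer in model.layers}))
```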
@@ -49,10 +44,7 @@ described below at once:
bash ./run_jobs.sh
```
The CIFAR10 dataset will be downloaded when running any experiment for
-the first time. `Tensorboard` summary logs will be generated during
-any experiment, and you can use `Tensorboard` to visualize the
-training and validation process as the experiment runs. Data split
-files, summary logs and results will be saved in a workspace
+the first time. Data split files, summary logs and results will be saved in a workspace
directory, which defaults to `/tmp` and can be configured by setting the
`--workspace` argument of the `tf_fl_script_runner_cifar10.py`
script.
@@ -64,12 +56,8 @@ script.
> `export TF_FORCE_GPU_ALLOW_GROWTH=true && export
> TF_GPU_ALLOCATOR=cuda_malloc_async`
-The set-up of all experiments in this example are kept the same as
-[the example using `Pytorch`
-backend](https://github.com/NVIDIA/NVFlare/tree/main/examples/advanced/cifar10/cifar10-sim). Refer
-to the `Pytorch` example for more details. Similar to the Pytorch
-example, we here also use Dirichelet sampling on CIFAR10 data labels
-to simulate data heterogeneity among data splits for different client
+We use Dirichlet sampling (implementation from [FedMA](https://github.com/IBM/FedMA)) on
+CIFAR10 data labels to simulate data heterogeneity among data splits for different client
sites, controlled by an alpha value, ranging from 0 (not including 0)
to 1. A high alpha value indicates less data heterogeneity, i.e., an
alpha value equal to 1.0 would result in homogeneous data distribution
@@ -111,11 +99,11 @@ for alpha in 1.0 0.5 0.3 0.1; do
done
```

-## 2. Results
+## 3. Results

Now let's compare experimental results.

-### 2.1 Centralized training vs. FedAvg for homogeneous split
+### 3.1 Centralized training vs. FedAvg for homogeneous split
Let's first compare FedAvg with homogeneous data split
(i.e. `alpha=1.0`) and centralized training. As can be seen from the
figure and table below, FedAvg can achieve similar performance to
@@ -129,7 +117,7 @@ no difference in data distributions among different clients.

![Central vs. FedAvg](./figs/fedavg-vs-centralized.png)

-### 2.2 Impact of client data heterogeneity
+### 3.2 Impact of client data heterogeneity

Here we compare the impact of data heterogeneity by varying the
`alpha` value, where lower values cause higher heterogeneity. As can
@@ -145,7 +133,7 @@ as data heterogeneity becomes higher.

![Impact of client data
heterogeneity](./figs/fedavg-diff-alphas.png)

> [!NOTE]
> More examples can be found at https://nvidia.github.io/NVFlare.
@@ -254,7 +254,7 @@
"The `FedJob` is used to define how controllers and executors are placed within a federated job using the `to(object, target)` routine.\n",
"\n",
"Here we use a TensorFlow `BaseFedJob`, where we can define the job name and the initial global model.\n",
"The `BaseFedJob` automatically configures components for model persistence, model selection, and TensorBoard streaming for convenience."
"The `BaseFedJob` automatically configures components for model persistence and model selection for convenience."
]
},
{
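For reference, a minimal sketch of the `BaseFedJob` usage the notebook describes. The import path, constructor arguments, and the `TFNet` model module are assumptions for illustration and may differ between NVFlare releases.

```python
# Hedged sketch: a TensorFlow BaseFedJob with a named job and an initial global model.
from nvflare.app_opt.tf.job_config.base_fed_job import BaseFedJob  # assumed import path

from src.tf_net import TFNet  # hypothetical model module used only for illustration

job = BaseFedJob(
    name="cifar10_tf_fedavg",
    initial_model=TFNet(),  # persistence and model selection are configured automatically
)
# Controllers and executors are then placed with job.to(object, target), e.g.:
#   job.to(controller, "server")
#   job.to(executor, "site-1")
```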
35 changes: 0 additions & 35 deletions examples/getting_started/tf/run_jobs.sh
@@ -50,38 +50,3 @@ for alpha in 1.0 0.5 0.3 0.1; do
--workspace $WORKSPACE

done


-# Run FedOpt job
-python ./tf_fl_script_runner_cifar10.py \
-    --algo fedopt \
-    --n_clients 8 \
-    --num_rounds 50 \
-    --batch_size 64 \
-    --epochs 4 \
-    --alpha 0.1 \
-    --gpu $GPU_INDX \
-    --workspace $WORKSPACE


-# Run FedProx job.
-python ./tf_fl_script_runner_cifar10.py \
-    --algo fedprox \
-    --n_clients 8 \
-    --num_rounds 50 \
-    --batch_size 64 \
-    --epochs 4 \
-    --fedprox_mu 1e-5 \
-    --alpha 0.1 \
-    --gpu $GPU_INDX


-# Run scaffold job
-python ./tf_fl_script_runner_cifar10.py \
-    --algo scaffold \
-    --n_clients 8 \
-    --num_rounds 50 \
-    --batch_size 64 \
-    --epochs 4 \
-    --alpha 0.1 \
-    --gpu $GPU_INDX
2 changes: 1 addition & 1 deletion examples/hello-world/hello-tf/README.md
@@ -48,7 +48,7 @@ In scenarios where multiple clients are involved, you have to prevent TensorFlow
by setting the following flags.

```bash
-TF_FORCE_GPU_ALLOW_GROWTH=true TF_GPU_ALLOCATOR=cuda_malloc_async
+TF_FORCE_GPU_ALLOW_GROWTH=true TF_GPU_ALLOCATOR=cuda_malloc_async python3 fedavg_script_runner_tf.py
```

If you possess more GPUs than clients, a good strategy is to run one client on each GPU.
