
Provide more details about the experiments? #1

Open
ddayzzz opened this issue May 1, 2020 · 15 comments

@ddayzzz

ddayzzz commented May 1, 2020

Good job! I am very interested in this work and I tried to run the experiments mentioned in the paper. My questions are:

  • How do I run the experiments (CIFAR-10, MNIST, Shakespeare)? It seems that only the CIFAR-10 experiment is available now.
  • How does FedMA work? The terms retrain and rematching confuse me.

Thank you.

@hwang595

hwang595 commented May 5, 2020

Hi @ddayzzz, glad to hear about your interest!

To answer your questions:

  1. Actually, all of the CIFAR-10, MNIST, and Shakespeare experiments are available. Sorry for not clarifying this in more detail. To run the CIFAR-10 and MNIST experiments, refer to FedMA/run.sh: setting --model=moderate-cnn --dataset=cifar10 and --model=lenet --dataset=mnist will give you the CIFAR-10 and MNIST experiments respectively (example invocations are sketched after this list). We didn't try the iterative FedMA experiments on MNIST, though, since PFNM already gives good performance there. To run the Shakespeare experiments, refer to FedMA/language_modeling/run_language_main.sh and FedMA/language_modeling/run_fedma_with_comm.sh. Running run_language_main.sh completes the very first local training process and FedMA; it then saves three intermediate results, i.e., lstm_matching_assignments, lstm_matching_shapes, and matched_global_weights. With those, you can run run_fedma_with_comm.sh, which loads the intermediate results as the starting point and starts the iterative FedMA process. The scripts are organized this way to avoid training the local models and matching them repeatedly, which also relates to your second question.
  2. To run FedMA, you can just set --comm_type=fedma. The term retrain exists because, at the very beginning of federated learning, participating users need to train their local models first (so --retrain=True needs to be set the first time you run the experiment). In our simulated environment, all trained local models are saved after local training. So after the first run, you can set --retrain= (leaving it blank means False here) to avoid repeating the local training.
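
For concreteness, here are two illustrative invocations for the CIFAR-10 case (the extra flag values simply mirror those quoted later in this thread; they are examples, not prescribed defaults). First run, which trains the local models from scratch and saves them:

python main.py --model=moderate-cnn --dataset=cifar10 --partition=hetero-dir --comm_type=fedma --comm_round=10 --retrain=True --rematching=True

Subsequent runs, which reuse the saved local models:

python main.py --model=moderate-cnn --dataset=cifar10 --partition=hetero-dir --comm_type=fedma --comm_round=10 --retrain= --rematching=True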

Please feel free to report any issues and to create PRs. We will be happy to help and work with you.

@joaolcaas

Hi @hwang595. I'm looking into your repo/paper and I saw that you reported MNIST results using LeNet in the paper published at ICLR. I'm trying to reproduce them by changing some lines in the code, but some tricky problems are happening. How hard is it to reproduce the results shown for matched averaging?

@hwang595

Hi @joaolcaas, thanks a lot for your interest in our work! Can you please provide more details on the issue you encountered with the MNIST experiments? Based on our tests, PFNM already provides good accuracy on MNIST in both homogeneous and heterogeneous settings. Thus, this repo focuses more on the CIFAR-10 and Shakespeare experiments.

But I'm happy to help with resolving the issues in the MNIST experiment. Please also feel free to take a look at the PFNM GitHub repo.

Thanks!

@joaolcaas

Thanks, @hwang595. As you requested, more details below.

First I tried to run the experiment using this command to check how the algorithm behaves:
python main.py --model=lenet --dataset=mnist --lr=0.01 --retrain_lr=0.01 --batch-size=64 --epochs=10 --retrain_epochs=10 --n_nets=3 --partition=hetero-dir --comm_type=fedma --comm_round=10 --retrain=True --rematching=True

The output was the following:
[screenshot of the error output]

I saw that block_patching does not support the LeNet architecture, so I added SimpleLenetContainerConvBlocks, a half-model with just the convolutional layers and their operations from your LeNet implementation. Then I tried to run again and got a different error. In the BB_MAP loop, when layer_index reaches 3, the following error occurs:
[screenshot of the error at layer_index 3]

That was as far as I could get.
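
For reference, a minimal sketch of what such a conv-only container could look like (this is my guess at the shape of the fix; the channel counts and layer sizes are assumptions from a standard LeNet variant, not copied from the repo):

import torch.nn as nn
import torch.nn.functional as F

class SimpleLenetContainerConvBlocks(nn.Module):
    # Conv-only half of a LeNet-style model; channel counts here are assumed, not taken from the repo.
    def __init__(self, num_filters=(32, 64), kernel_size=5):
        super().__init__()
        self.conv1 = nn.Conv2d(1, num_filters[0], kernel_size)
        self.conv2 = nn.Conv2d(num_filters[0], num_filters[1], kernel_size)

    def forward(self, x):
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        return x

The num_filters argument is my guess at how the repo's containers accept matched (possibly wider) layer shapes.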

@wildphoton

(quoting @joaolcaas's comment above)

@hwang595 Hi Hongyi, I met the same issue. I think in the released code, the shape estimator is not defined for LeNet.

@hwang595

Hi @joaolcaas @wildphoton, thanks for providing the detailed error messages. I can replicate your issues. As I mentioned, since this repository focuses more on the CIFAR-10 and Shakespeare experiments, I didn't realize when I made the first commit that the MNIST+LeNet part of the code was not up to date. Sorry about that!

I made the fixes; it should run without problems now. But please keep in mind that we still don't support the multi-round version of MNIST+LeNet, since one round of FedMA already gives >97% accuracy and matches the accuracy we can expect from the ensemble method. I'm happy to make further improvements to support multi-round FedMA if you are interested.

I'm happy to help further with your experiments and provide more detailed information!

@joaolcaas

joaolcaas commented Jul 15, 2020

That was quicker than I expected, huge thanks @hwang595.

I'm testing it now, but are you saying that the experiment will not work if I use comm_round greater than 1?

@hwang595

Hi @joaolcaas, using comm_round greater than 1 will work for any experiment other than MNIST+LeNet. For LeNet, we will need to adjust the code a little to make it work, since we didn't conduct that experiment previously. But please let me know if you're interested in running multi-round FedMA on LeNet, and I will make it work.

@joaolcaas

@hwang595 yeah, it would be really amazing if you could do that!

@jefersonf

Hi @hwang595, I understand that the experiments focus on well-known models and small datasets. What happens if we try large datasets and models with different architectures? I failed when making modifications such as changing the model to, e.g., DenseNet or MobileNet, and adding another dataset with an input shape greater than 32x32.

I've been trying the following scenario: use ModerateCNNContainer and a dataset with its input shape reduced to 224x224, with the config below.

...
--model=moderate-cnn \
--dataset=<large-dataset> \
--epochs=10 \
--retrain_epochs=5 \
--n_nets=3 \
--partition=hetero-dir \
--comm_type=fedma \
--comm_round=10

Note: this dataset has only 10 examples per class, for a total of 5 classes. This setting is used just to test the training pipeline.

But the training process takes too long, and it also gets stuck in the part below.

[screenshot of where the training gets stuck]

Was that expected? Maybe I'm doing something wrong...

I've been concerned about the relationship between input dimensions and the overall complexity that matched averaging adds during communication. So, my questions are:

  1. Is there such a relationship between training input size and FedMA communication time? If so, what can we do about it? (A rough sketch of why input size might matter follows after this list.)
  2. When adding a different model, which parts of the code should I take care of, besides changing, for example, the input dimensions to 1x224x224? (I confess I get lost sometimes when it comes to the data transformations/reshapes and everything else.)
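
To make question 1 concrete, here is my rough back-of-the-envelope reasoning (the channel count and pooling depth below are assumptions for illustration, not read from the FedMA code): conv layers are matched channel by channel, so input resolution shouldn't affect them much, but the first fully connected layer is matched over weight vectors whose length equals the flattened conv feature map, which grows with input size.

# Hypothetical arithmetic: how input resolution inflates the weight vectors
# that the FC-layer matching step has to compare.
def first_fc_input_dim(input_hw, channels=128, num_pools=3):
    # assume each of num_pools max-pool layers halves the spatial resolution
    spatial = input_hw // (2 ** num_pools)
    return channels * spatial * spatial

for hw in (32, 224):
    print(f"{hw}x{hw} input -> {first_fc_input_dim(hw)} inputs per FC weight vector")
# 32x32: 128 * 4 * 4 = 2048; 224x224: 128 * 28 * 28 = 100352, about 49x longer

Under these assumptions, each weight vector handed to the matching step becomes roughly 49x longer at 224x224, which alone might explain why my run crawls.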

@wildphoton

(quoting @hwang595's comment above)

@hwang595 Thanks for your quick response! I tried the updated code, but it gave me the following error when running

python main.py --model=lenet --dataset=mnist --lr=0.01 --retrain_lr=0.01 --batch-size=64 --epochs=2 --retrain_epochs=2 --n_nets=2 --partition=homo --comm_type=fedma --comm_round=1 --retrain=True

[screenshot of the error output]

@joaolcaas

(quoting @hwang595's and @wildphoton's comments above)

Yep, the same error here, unfortunately.

@hwang595

Hi @wildphoton @joaolcaas, thanks a lot for trying it out! Yes, that error is expected, since you already entered the fedma_comm function, which runs multi-round FedMA. I will update the code base to make LeNet+MNIST compatible with multi-round FedMA soon. Please stay tuned!

But even before fedma_comm, you should have already finished one round of FedMA, right? E.g., does the code finish merging the two locally trained models?
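
To make the control flow concrete, here is a schematic of how the pipeline is organized (apart from fedma_comm, the names below are illustrative stubs, not the actual functions in main.py):

# Schematic only; apart from fedma_comm, these names are illustrative.
def train_local_models(args): ...       # local training; skipped when --retrain is left blank
def one_round_fedma(local_models): ...  # layer-wise matching; this alone completes one round of FedMA
def fedma_comm(global_model, local_models, args): ...  # iterative retrain-and-rematch loop

def run_fedma(args):
    local_models = train_local_models(args)
    matched_global = one_round_fedma(local_models)
    # with comm_round=0, only the initial matching would run;
    # any comm_round >= 1 enters the iterative multi-round phase
    if args.comm_round >= 1:
        fedma_comm(matched_global, local_models, args)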

@joaolcaas

@hwang595 oh, I got it. Basically, FedMA with LeNet + MNIST will work only until this line, right? We have to pre-train the models and then do one comm_round of FedMA without entering fedma_comm.

I think that's what happened here; it broke only inside fedma_comm.

@wildphoton

(quoting @hwang595's comment above)

So --comm_round=0 actually means round 1? If I understand it correctly, PFNM is basically one-round FedMA?
