The error in new machine when running distribution/fedavg #68

weiyikang · 2020-11-24T14:38:16Z

I expected FedML to work on another machine simply by cloning the repository, without modifications. However, the following errors occur:

Machine configuration:

4 * RTX 3090, CUDA 11.1
PyTorch 1.7
Environment set up as required by CI-install.sh

Command run:
sh run_fedavg_distributed_pytorch.sh 4 4 1 4 cnn homo 2 1 32 0.0001 digit5 "./../../../data/Digit5" 0
There are 4 clients and 4 workers.

The errors are as follows (Fig. 1: the program's warning; Fig. 2: the placement of the 4 clients on the GPUs may be wrong, since the same process appears on all GPUs):

chaoyanghe · 2020-11-24T16:54:49Z

Are you using the same version of the code? We updated a lot recently.

weiyikang · 2020-11-25T01:05:38Z

I used the previous version. I will try again with the latest version.

weiyikang · 2020-11-25T03:02:37Z

I have now run FedAvg (distributed) with the latest version on the 4 * RTX 3090 machine, as shown in Fig. 1. The previous version, run on 8 * RTX 2080, is shown in Fig. 2.

Why are there many 0MiB FedAvg(distributed) entries in the latest version?
What is the effect of these 0MiB FedAvg(distributed) entries?
Why does the same process (e.g. Process FedAvg(distributed):1) appear on different GPUs? I would expect each process to be loaded on only one GPU.
Is the behavior in Fig. 1 caused by the software environment rather than being a feature of the latest version?

chaoyanghe · 2020-11-25T23:09:24Z

May I know how you set your "init_training_device" function?
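
For readers following the thread: a device-mapping function of this kind typically assigns each process rank to exactly one GPU index. The sketch below is only a hypothetical illustration of that idea, not FedML's actual `init_training_device`. The helper name `assign_gpu_index`, its parameters, and the convention that rank 0 is the server and sits on GPU 0 are all assumptions made for the example.

```python
def assign_gpu_index(process_id: int, worker_num: int, gpus_per_machine: int) -> int:
    """Map a process rank to a single GPU index (hypothetical helper).

    Assumptions, not taken from the FedML source:
      - rank 0 is the server and is placed on GPU 0;
      - ranks 1..worker_num are clients, spread round-robin over the GPUs.
    """
    if gpus_per_machine <= 0:
        raise ValueError("gpus_per_machine must be positive")
    if not 0 <= process_id <= worker_num:
        raise ValueError("process_id must be in the range 0..worker_num")
    if process_id == 0:
        return 0
    # Client ranks start at 1, so shift by one before taking the modulo.
    return (process_id - 1) % gpus_per_machine


if __name__ == "__main__":
    # 1 server + 4 clients on a machine with 4 GPUs.
    for rank in range(5):
        print(f"rank {rank} -> cuda:{assign_gpu_index(rank, 4, 4)}")
```

With a mapping like this, each rank resolves to one device string, so seeing the same process listed on several GPUs would suggest that something other than the mapping (for example, code that touches a different CUDA device before the mapping is applied) is creating contexts on the other GPUs. That explanation is a guess, not a confirmed diagnosis of this issue.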

weiyikang closed this as completed Nov 26, 2020

fedml-alex pushed a commit that referenced this issue Jun 4, 2022

update cl, add fedml logs, fixed #74, #68, #66

95100a1

chaoyanghe pushed a commit that referenced this issue Jun 6, 2022

update cl, add fedml logs, fixed #74, #68, #66

59246d1
