The error in new machine when running distribution/fedavg #68

Closed
weiyikang opened this issue Nov 24, 2020 · 4 comments

Comments

@weiyikang

weiyikang commented Nov 24, 2020

I expected FedML to run on another machine after simply cloning the repository, without any modifications. However, the following errors occur:

Machine configuration:

  1. 4 × RTX 3090, CUDA 11.1
  2. PyTorch 1.7
  3. Environment installed as required by CI-install.sh

Command run:
sh run_fedavg_distributed_pytorch.sh 4 4 1 4 cnn homo 2 1 32 0.0001 digit5 "./../../../data/Digit5" 0
There are 4 clients and 4 workers.

The errors are as follows (Fig.1: the warning printed by the program; Fig.2: the placement of the 4 clients on the GPUs may be wrong, since the same process appears on all GPUs):

image

image

@chaoyanghe
Member

Are you using the same version of the code? We have updated a lot recently.

@weiyikang
Author

I used the previous version. I will try again with the latest version.

@weiyikang
Author

weiyikang commented Nov 25, 2020

I have now run the latest version of FedAvg (distributed) on the 4 × RTX 3090 machine, as shown in Fig.1. For comparison, Fig.2 shows the previous version running on an 8 × RTX 2080 machine.

  1. Why are there so many 0MiB FedAvg(distributed) entries in the latest version?
  2. What is the effect of these 0MiB FedAvg(distributed) entries?
  3. Why does the same process (e.g. Process FedAvg(distributed):1) appear on different GPUs? I think a given process should be loaded on only one GPU.
  4. Is the behavior in Fig.1 caused by the software environment rather than being a feature of the latest version?

image

image
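
One way to check question 3 outside of screenshots is to list, for each PID, the GPUs on which that PID holds a compute context. The sketch below is only an illustration: it assumes `nvidia-smi` is on PATH and accepts `--query-compute-apps=pid,gpu_uuid,used_memory` with `--format=csv,noheader`, and the helper names `gpus_per_pid` and `query_compute_apps` are hypothetical, not part of FedML.

```python
import csv
import io
import subprocess
from collections import defaultdict


def gpus_per_pid(csv_text):
    """Map each PID to the sorted list of GPU UUIDs it appears on.

    Expects headerless CSV rows of the form: pid, gpu_uuid, used_memory.
    """
    seen = defaultdict(set)
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        seen[row[0].strip()].add(row[1].strip())
    return {pid: sorted(gpus) for pid, gpus in seen.items()}


def query_compute_apps():
    """Run nvidia-smi and return its CSV output (assumes nvidia-smi is on PATH)."""
    result = subprocess.run(
        ["nvidia-smi",
         "--query-compute-apps=pid,gpu_uuid,used_memory",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Calling `gpus_per_pid(query_compute_apps())` on the affected machine would return a dictionary in which any PID listed with more than one GPU UUID is a process that has touched several devices.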

@chaoyanghe
Member

May I know how you set your "init_training_device" function?
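
For context, a process-to-GPU mapping of this kind is often a simple round-robin assignment. The following is a minimal sketch under stated assumptions, not FedML's actual `init_training_device`: the name `map_process_to_gpu`, its parameters, and the convention that process 0 is the server are all hypothetical.

```python
def map_process_to_gpu(process_id, gpu_num_per_machine, cuda_available=True):
    """Return a device string for one process (hypothetical helper, not FedML's code)."""
    if not cuda_available or gpu_num_per_machine <= 0:
        return "cpu"
    if process_id == 0:
        # Assumption: process 0 is the server and is placed on GPU 0.
        return "cuda:0"
    # Assumption: workers are processes 1..N, assigned to GPUs round-robin.
    gpu_index = (process_id - 1) % gpu_num_per_machine
    return f"cuda:{gpu_index}"


if __name__ == "__main__":
    # Print the mapping for 1 server process plus 4 worker processes on 4 GPUs.
    for pid in range(5):
        print(pid, map_process_to_gpu(pid, gpu_num_per_machine=4))
```

Each process would pass its own string to `torch.device(...)` and move its model and tensors only to that device. If every process instead ends up creating a context on every GPU, the mapping (or code that runs before it) is a likely place to look.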
