
Adding advanced_pytorch example #1007

Merged (29 commits) on May 5, 2022
Conversation

@cozek (Contributor) commented Jan 16, 2022

Reference Issues/PRs

#803

What does this implement/fix? Explain your changes.

Implements an advanced_pytorch example. Closely mirrors the advanced_tensorflow example.

Any other comments?

I added a --toy flag since my machine can't run the full 10 client simulation.
Please run the full simulation and let me know how it works. Happy to make any required changes.
I had a notebook for prototyping, removed it. Can add it in a later PR if needed.

@danieljanes (Member)

Hi @cozek, thanks for this :) we'll review the PR shortly. In the meantime, could you add a reference to the new advanced_pytorch example in README.md (close to where the advanced_tensorflow example is listed)?

sancarlim added a commit to sancarlim/flower that referenced this pull request Jan 17, 2022

## Reference Issues/PRs
Fix adap#1007

## What does this implement/fix? Explain your changes.
The original example sends the data and model to the device in utils.py but not in client.py, which may cause an error.
@danieljanes added labels on Jan 19, 2022: documentation (Improvements or additions to documentation), enhancement (New feature or request)
@pedropgusmao (Contributor) commented Feb 16, 2022

Hi @cozek, the PR is looking great. Could you just set DEVICE = torch.device("cpu") in:
https://github.com/cozek/flower/blob/d36170c6785ca08716dd6ee2b6a1f184d6015dfb/examples/advanced_pytorch/utils.py#L9
Otherwise we are likely to run out of CUDA memory.
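For context, the suggested change amounts to pinning the module-level device constant to the CPU (a sketch of the relevant lines in utils.py, not the full file):

```python
import torch

# Force CPU for client-side code so that multiple clients launched on one
# machine don't each claim CUDA memory and exhaust it.
DEVICE = torch.device("cpu")
```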

@pedropgusmao (Contributor)

Another thing, each client appears to be downloading the CIFAR10 dataset. This creates problems when launching multiple clients. I suggest moving the download functionality to ./run.sh and simply loading the file later.

@cozek (Contributor, Author) commented Feb 16, 2022

> Another thing, each client appears to be downloading the CIFAR10 dataset. This creates problems when launching multiple clients. I suggest moving the download functionality to ./run.sh and simply loading the file later.

Done.

@pedropgusmao (Contributor)

@cozek , it's looking good.
I just ran a few tests and apparently running on CPU takes about 25 min per round... a bit too much :).
To solve this CUDA_OUT_OF_MEMORY vs. long CPU training times trade-off, I suggest the following:

  • Store the model as CPU type inside the clients and move it to CUDA only during training (remembering to store it back as CPU).
  • Reduce the number of participating clients to 2 per round.
  • Reduce batch_size to 16.

You can then keep an eye on your GPU memory usage which hopefully won't be too much.
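The first suggestion can be sketched roughly as follows (class and method names are illustrative, not the PR's exact client interface):

```python
import torch

class CifarClient:
    """Keep the model on CPU between rounds; borrow the GPU only while
    training, then move the weights back to free CUDA memory."""

    def __init__(self, model: torch.nn.Module):
        self.model = model.cpu()  # stored as CPU type inside the client

    def fit_round(self, trainloader, epochs: int = 1) -> None:
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model.to(device)  # use CUDA only for the duration of training
        # ... the usual training loop over trainloader would run here ...
        self.model.cpu()  # store it back as CPU to release GPU memory
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
```

With the model parked on the CPU between rounds, several clients can share one GPU because each only holds CUDA memory while it is actively training.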

@cozek (Contributor, Author) commented Mar 5, 2022

> @cozek, it's looking good. I just ran a few tests and apparently running on CPU takes about 25 min per round... a bit too much :). To solve this CUDA_OUT_OF_MEMORY vs. long CPU training times trade-off, I suggest the following:
>
>   • Store the model as CPU type inside the clients and move it to CUDA only during training (remembering to store it back as CPU).
>   • Reduce the number of participating clients to 2 per round.
>   • Reduce batch_size to 16.
>
> You can then keep an eye on your GPU memory usage which hopefully won't be too much.

Done.

@pedropgusmao (Contributor) commented Mar 9, 2022

```
Traceback (most recent call last):
  File "server.py", line 108, in <module>
    main()
  File "server.py", line 104, in main
    fl.server.start_server("0.0.0.0:8080", config={"num_rounds": 4}, strategy=strategy)
  File "/home/pedro/repos/advanced_pytorch/src/py/flwr/server/app.py", line 114, in start_server
    force_final_distributed_eval=force_final_distributed_eval,
  File "/home/pedro/repos/advanced_pytorch/src/py/flwr/server/app.py", line 148, in _fl
    hist = server.fit(num_rounds=config["num_rounds"])
  File "/home/pedro/repos/advanced_pytorch/src/py/flwr/server/server.py", line 87, in fit
    res = self.strategy.evaluate(parameters=self.parameters)
  File "/home/pedro/repos/advanced_pytorch/src/py/flwr/server/strategy/fedavg.py", line 178, in evaluate
    eval_res = self.eval_fn(weights)
  File "server.py", line 60, in evaluate
    loss, accuracy = utils.test(model, valset)
  File "/home/pedro/repos/advanced_pytorch/examples/advanced_pytorch/utils.py", line 86, in test
    images, labels = images.to(DEVICE), labels.to(DEVICE)
AttributeError: 'int' object has no attribute 'to'
```

I'm getting the error above. Maybe you are returning the labels as plain Python integers (original CIFAR). Instead, they must be a Long Tensor (int64).
Also, cuda_device in train() is not used.
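One likely cause of that error: iterating over the raw dataset yields (image, int) pairs, whereas iterating a DataLoader yields batched tensors whose labels are already int64 and support `.to(DEVICE)`. A hedged sketch of an evaluation helper along those lines (names are hypothetical, not the PR's exact utils.test):

```python
import torch
from torch.utils.data import DataLoader

def evaluate_model(model, dataset, device=torch.device("cpu"), batch_size=16):
    """Evaluate `model` on `dataset`. The DataLoader collates samples into
    tensors, so `labels` is a LongTensor with a working .to() method."""
    criterion = torch.nn.CrossEntropyLoss()
    loader = DataLoader(dataset, batch_size=batch_size)
    model.to(device).eval()
    loss, correct, total = 0.0, 0, 0
    with torch.no_grad():
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            loss += criterion(outputs, labels).item()
            correct += (outputs.argmax(dim=1) == labels).sum().item()
            total += labels.size(0)
    return loss, correct / total
```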

@pedropgusmao (Contributor)

Hi @cozek,
I left a few suggestions on how to load the model only if needed and how to pass the flag whether or not to use gpus.
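One common shape for such a flag looks roughly like this (the `--use_cuda` name is illustrative, not necessarily the interface suggested in the review):

```python
import argparse
import torch

def parse_args(argv=None):
    """Sketch of a client CLI: a dry-run flag plus an opt-in GPU flag that
    falls back to CPU when no CUDA device is available."""
    parser = argparse.ArgumentParser(description="Flower client")
    parser.add_argument("--toy", action="store_true",
                        help="quick dry run on a small subset of the data")
    parser.add_argument("--use_cuda", action="store_true",
                        help="train on a GPU when one is available")
    args = parser.parse_args(argv)
    args.device = torch.device(
        "cuda" if args.use_cuda and torch.cuda.is_available() else "cpu")
    return args
```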
Let me know what you think.

…or doing a dry run; client no longer stores model as an attribute
@cozek (Contributor, Author) commented Apr 28, 2022

> Hi @cozek, I left a few suggestions on how to load the model only if needed and how to pass the flag whether or not to use gpus. Let me know what you think.

Thanks for the suggestions. Made the changes accordingly.

@pedropgusmao (Contributor)

@cozek @danieljanes looks good; only server-side evaluation is still CPU-only, which makes it a bit slow, but it is working properly.

@danieljanes danieljanes merged commit e0c48c2 into adap:main May 5, 2022
@danieljanes (Member)

Thanks for the PR @cozek & thanks for the review @pedropgusmao 👍
