Add E2E tests for PyTorch and TensorFlow examples #1842
Conversation
@danieljanes @charlesbvll I would like to either rename the top-level directory as mentioned here: #1842 (comment) into
One more thought: we have quite a lot of framework-specific code in the tests, and I wonder if that's the best approach. If we test the examples, that would be covered anyway, so we do not need to duplicate the effort here. I believe we should keep the E2E tests minimal while testing our APIs. It would be more valuable to test failure cases, e.g. providing an invalid loss or NumPy arrays with mismatched shapes. @danieljanes @charlesbvll wdyt?
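To make the "low-level" direction concrete, here is a minimal sketch of what such a test could look like, assuming the `flwr.common` helpers `ndarrays_to_parameters` / `parameters_to_ndarrays`; the test name and shapes are illustrative, not code from this PR:

```python
import numpy as np
import flwr as fl


def test_ndarrays_roundtrip_with_mixed_shapes():
    """Serialize and deserialize NumPy arrays of heterogeneous shapes."""
    # Hypothetical shapes; any mixed list should round-trip unchanged.
    ndarrays = [
        np.random.rand(3, 4),
        np.random.rand(7),
        np.ones((2, 2, 2), dtype=np.float32),
    ]
    parameters = fl.common.ndarrays_to_parameters(ndarrays)
    restored = fl.common.parameters_to_ndarrays(parameters)
    for original, result in zip(ndarrays, restored):
        np.testing.assert_array_equal(original, result)
```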
My perspective is that we should make a clear distinction between testing examples and testing the framework. Examples are written with readability in mind, and that will often lead to situations where the example is not a good E2E test case. If we use examples as E2E tests, we might face difficulties in deciding whether to write them in a way that's good for the reader (our main objective) or good for our testing. I'm in favor of not using examples as E2E tests (and removing the ones we have from the testing). How to test the examples themselves (so that they don't break over time) is an open question, and it seems like we don't have a good answer yet. That being said, in the E2E testing of the framework, we need to cover each framework at least once. We need to ensure that the typical workflow works with each framework: sending parameters to the client node, deserializing them, "using" them in the ML framework (PyTorch, TensorFlow, ...), getting the parameters out of the framework again, serializing them, and sending them back to the server. Once each framework has at least one E2E test, we can cover the rest of the E2E scenarios in a more "low-level" way (for example, by just using NumPy NDArrays).
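As an illustration of that workflow, a minimal sketch of such a client, with the "model" being a single NumPy array (the class name and update rule are made up; a PyTorch or TensorFlow variant would swap in real framework calls):

```python
import numpy as np
import flwr as fl


class MinimalClient(fl.client.NumPyClient):
    """Bare-bones E2E client: the 'model' is one NumPy array."""

    def __init__(self):
        self.model = np.zeros((3, 3))

    def get_parameters(self, config):
        return [self.model]

    def fit(self, parameters, config):
        # Parameters arrive already deserialized as NumPy ndarrays;
        # "train" by adding one, then return them to be serialized
        # and sent back to the server.
        self.model = parameters[0] + 1.0
        return [self.model], 1, {}

    def evaluate(self, parameters, config):
        return 0.0, 1, {"accuracy": 1.0}


if __name__ == "__main__":
    fl.client.start_numpy_client(
        server_address="127.0.0.1:8080", client=MinimalClient()
    )
```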
If we are not considering the examples as E2E tests, I agree with the tests including framework-specific code. I'll review the code more thoroughly.
@charlesbvll When I run the test for TensorFlow I get this one:
INFO flwr 2023-05-24 21:15:19,210 | app.py:165 | Starting Flower server, config: ServerConfig(num_rounds=3, round_timeout=None)
INFO flwr 2023-05-24 21:15:19,222 | app.py:179 | Flower ECE: gRPC server running (3 rounds), SSL is disabled
INFO flwr 2023-05-24 21:15:19,222 | server.py:89 | Initializing global parameters
INFO flwr 2023-05-24 21:15:19,222 | server.py:279 | Requesting initial parameters from one random client
Traceback (most recent call last):
File "...me/.pyenv/versions/3.7.15/lib/python3.7/threading.py", line 300, in wait
gotit = waiter.acquire(True, timeout)
File "...me/development/adap/flower/.venv/lib/python3.7/site-packages/ray/_private/worker.py", line 1727, in sigterm_handler
sys.exit(signum)
SystemExit: 15
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "...me/development/adap/flower/src/py/flwr/server/client_manager.py", line 126, in wait_for
lambda: len(self.clients) >= num_clients, timeout=timeout
File "...me/.pyenv/versions/3.7.15/lib/python3.7/threading.py", line 331, in wait_for
self.wait(waittime)
File "...me/.pyenv/versions/3.7.15/lib/python3.7/threading.py", line 300, in wait
gotit = waiter.acquire(True, timeout)
File "...me/development/adap/flower/.venv/lib/python3.7/site-packages/ray/_private/worker.py", line 1727, in sigterm_handler
sys.exit(signum)
SystemExit: 15
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "server.py", line 7, in <module>
config=fl.server.ServerConfig(num_rounds=3),
File "...me/development/adap/flower/src/py/flwr/server/app.py", line 185, in start_server
config=initialized_config,
File "...me/development/adap/flower/src/py/flwr/server/app.py", line 224, in run_fl
hist = server.fit(num_rounds=config.num_rounds, timeout=config.round_timeout)
File "...me/development/adap/flower/src/py/flwr/server/server.py", line 90, in fit
self.parameters = self._get_initial_parameters(timeout=timeout)
File "...me/development/adap/flower/src/py/flwr/server/server.py", line 280, in _get_initial_parameters
random_client = self._client_manager.sample(1)[0]
File "...me/development/adap/flower/src/py/flwr/server/client_manager.py", line 180, in sample
self.wait_for(min_num_clients)
File "...me/development/adap/flower/src/py/flwr/server/client_manager.py", line 126, in wait_for
lambda: len(self.clients) >= num_clients, timeout=timeout
File "...me/.pyenv/versions/3.7.15/lib/python3.7/threading.py", line 244, in __exit__
return self._lock.__exit__(*args)
RuntimeError: cannot release un-acquired lock
Training had an issue
Can you check if you can reproduce it?
@tanertopal I am unable to reproduce it but this is very weird for a few reasons:
The most plausible explanation might be that you have a
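For what it's worth, the shape of the failure can be read off the traceback alone: ray installs a SIGTERM handler that calls sys.exit(signum), and the SystemExit fires while ClientManager.wait_for is blocked inside Condition.wait_for; if the exception unwinds before the condition's lock is re-acquired, the enclosing with block fails with "cannot release un-acquired lock". A hypothetical, simplified reconstruction of that pattern (not Flower code):

```python
import signal
import sys
import threading

clients = {}
cv = threading.Condition()


def sigterm_handler(signum, frame):
    # Mirrors ray/_private/worker.py: raising SystemExit from a
    # signal handler unwinds whatever frame the main thread is in.
    sys.exit(signum)


signal.signal(signal.SIGTERM, sigterm_handler)


def wait_for(num_clients, timeout=86400):
    # Simplified ClientManager.wait_for: if SIGTERM lands while
    # cv.wait_for is blocked acquiring its internal waiter lock, the
    # exception can propagate before cv's underlying lock is
    # re-acquired, and `with cv:` then fails to release it.
    with cv:
        return cv.wait_for(
            lambda: len(clients) >= num_clients, timeout=timeout
        )
```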
lgtm! Great work!
Just reviewed everything again and the requested changes were applied.
What does this implement/fix? Explain your changes.
Improve the current tests for the PyTorch and TensorFlow examples by validating them end to end.
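A driver for such an end-to-end run could look like the following sketch (this is an assumption about the setup, not the script added in this PR; `server.py` and `client.py` are the per-framework entry points, and the failure message echoes the log above):

```python
import subprocess
import sys
import time

# Start the server, give it time to bind, run two clients, and fail
# if any process exits non-zero.
server = subprocess.Popen([sys.executable, "server.py"])
time.sleep(3)  # give the gRPC server time to start

clients = [subprocess.Popen([sys.executable, "client.py"]) for _ in range(2)]

exit_codes = [p.wait() for p in clients] + [server.wait()]
if any(code != 0 for code in exit_codes):
    raise SystemExit("Training had an issue")
```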