
Add E2E tests for PyTorch and TensorFlow examples #1842

Merged (65 commits into main, Jun 27, 2023)
Conversation

@charlesbvll (Member)

What does this implement/fix? Explain your changes.

Improve the current tests for the PyTorch and TensorFlow examples by validating them end to end.
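
For context, the server side of such an E2E test is tiny. Judging from the traceback quoted later in this thread (server.py passes fl.server.ServerConfig(num_rounds=3)), it looks roughly like the sketch below; the server address is an assumption, not the PR's actual file:

```python
# server.py -- a rough sketch reconstructed from the traceback in this thread;
# the address is an assumption and the actual file in the PR may differ.
import flwr as fl

if __name__ == "__main__":
    fl.server.start_server(
        server_address="0.0.0.0:8080",
        config=fl.server.ServerConfig(num_rounds=3),
    )
```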

@charlesbvll charlesbvll marked this pull request as ready for review May 7, 2023 16:28
@tanertopal (Member) commented May 19, 2023

@danieljanes @charlesbvll I would like to either rename the top-level directory as mentioned here: #1842 (comment) to e2e, or move all E2E tests into tests/e2e. IMO it's essential to indicate that the other tests are in the src directory, as I want to avoid creating the impression that these are all the tests we have (or confusing contributors about where to put a test).

@tanertopal (Member)

One more thought: we have quite a lot of framework-specific code in the tests, and I wonder if that's the best approach. If we test the examples, that would be covered, so we do not need to duplicate the effort here. I believe we should keep the E2E tests minimal while testing our APIs. It would be more valuable to test failure cases, such as providing an invalid loss or NumPy arrays with different shapes. @danieljanes @charlesbvll wdyt?

@danieljanes (Member)

> One more thought: we have quite a lot of framework-specific code in the tests, and I wonder if that's the best approach. […] @danieljanes @charlesbvll wdyt?

My perspective is that we should make a clear distinction between testing examples and testing the framework. Examples are written with readability in mind, and that will often lead to situations where the example is not a good E2E test case. If we use examples as E2E tests, we might face difficulties in deciding whether to write them in a way that's good for the reader (our main objective) or good for our testing. I'm in favor of not using examples as E2E tests (and removing the ones we have from testing). How to test the examples themselves (so that they don't break over time) is an open question, and it seems like we don't have a good answer yet.

That being said, in the E2E testing of the framework, we need to cover each framework at least once. We need to ensure that the typical workflow works with each framework: sending parameters to the client node, deserializing them, "using" them in the ML framework (PyTorch, TensorFlow, ...), getting the parameters out of the framework again, serializing them, and sending them back to the server. Once each framework has at least one E2E test, we can cover the rest of the E2E scenarios in a more "low-level" way (for example, by just using NumPy NDArrays).
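
To make the "low-level" NumPy-only case concrete, here is a minimal sketch of such a bare E2E client. The shapes, values, and address are illustrative assumptions, not code from this PR:

```python
# A minimal "bare" E2E client sketch: parameters travel server -> client,
# are deserialized, "used", re-serialized, and sent back, with no ML
# framework involved. All values here are illustrative assumptions.
import flwr as fl
import numpy as np


class BareClient(fl.client.NumPyClient):
    def get_parameters(self, config):
        # Initial parameters the server requests from one random client
        return [np.zeros((2, 3))]

    def fit(self, parameters, config):
        # "Use" the received NDArrays, then return them to the server
        return [p + 1.0 for p in parameters], 1, {}

    def evaluate(self, parameters, config):
        # Return (loss, num_examples, metrics)
        return 0.0, 1, {}


fl.client.start_numpy_client(server_address="127.0.0.1:8080", client=BareClient())
```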

@tanertopal (Member)

> One more thought: we have quite a lot of framework-specific code in the tests, and I wonder if that's the best approach. […]

> My perspective is that we should make a clear distinction between testing examples and testing the framework. […] Once each framework has at least one E2E test, we can cover the rest of the E2E scenarios in a more "low-level" way (for example, by just using NumPy NDArrays).

If we are not treating the examples as E2E tests, I agree that the tests should include framework-specific code. I'll review the code more thoroughly.
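
For illustration, a framework-specific E2E client along the lines discussed above could look like the following sketch. The model, data, and address are assumptions, not the PR's actual e2e/tensorflow code:

```python
# A minimal framework-specific E2E client sketch (TensorFlow); the model,
# data, and address are illustrative assumptions, not this PR's code.
import flwr as fl
import tensorflow as tf

# A tiny model keeps the E2E round trip fast
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
model.compile(optimizer="sgd", loss="mse")

x = tf.random.uniform((8, 4))
y = tf.random.uniform((8, 1))


class TFClient(fl.client.NumPyClient):
    def get_parameters(self, config):
        return model.get_weights()

    def fit(self, parameters, config):
        model.set_weights(parameters)            # deserialize into the framework
        model.fit(x, y, epochs=1, verbose=0)     # "use" the parameters
        return model.get_weights(), len(x), {}   # serialize and send back

    def evaluate(self, parameters, config):
        model.set_weights(parameters)
        loss = model.evaluate(x, y, verbose=0)
        return float(loss), len(x), {}


fl.client.start_numpy_client(server_address="127.0.0.1:8080", client=TFClient())
```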

@tanertopal (Member) commented May 24, 2023

@charlesbvll When I run the test for TensorFlow I get the following error:

INFO flwr 2023-05-24 21:15:19,210 | app.py:165 | Starting Flower server, config: ServerConfig(num_rounds=3, round_timeout=None)
INFO flwr 2023-05-24 21:15:19,222 | app.py:179 | Flower ECE: gRPC server running (3 rounds), SSL is disabled
INFO flwr 2023-05-24 21:15:19,222 | server.py:89 | Initializing global parameters
INFO flwr 2023-05-24 21:15:19,222 | server.py:279 | Requesting initial parameters from one random client
Traceback (most recent call last):
  File "...me/.pyenv/versions/3.7.15/lib/python3.7/threading.py", line 300, in wait
    gotit = waiter.acquire(True, timeout)
  File "...me/development/adap/flower/.venv/lib/python3.7/site-packages/ray/_private/worker.py", line 1727, in sigterm_handler
    sys.exit(signum)
SystemExit: 15

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "...me/development/adap/flower/src/py/flwr/server/client_manager.py", line 126, in wait_for
    lambda: len(self.clients) >= num_clients, timeout=timeout
  File "...me/.pyenv/versions/3.7.15/lib/python3.7/threading.py", line 331, in wait_for
    self.wait(waittime)
  File "...me/.pyenv/versions/3.7.15/lib/python3.7/threading.py", line 300, in wait
    gotit = waiter.acquire(True, timeout)
  File "...me/development/adap/flower/.venv/lib/python3.7/site-packages/ray/_private/worker.py", line 1727, in sigterm_handler
    sys.exit(signum)
SystemExit: 15

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "server.py", line 7, in <module>
    config=fl.server.ServerConfig(num_rounds=3),
  File "...me/development/adap/flower/src/py/flwr/server/app.py", line 185, in start_server
    config=initialized_config,
  File "...me/development/adap/flower/src/py/flwr/server/app.py", line 224, in run_fl
    hist = server.fit(num_rounds=config.num_rounds, timeout=config.round_timeout)
  File "...me/development/adap/flower/src/py/flwr/server/server.py", line 90, in fit
    self.parameters = self._get_initial_parameters(timeout=timeout)
  File "...me/development/adap/flower/src/py/flwr/server/server.py", line 280, in _get_initial_parameters
    random_client = self._client_manager.sample(1)[0]
  File "...me/development/adap/flower/src/py/flwr/server/client_manager.py", line 180, in sample
    self.wait_for(min_num_clients)
  File "...me/development/adap/flower/src/py/flwr/server/client_manager.py", line 126, in wait_for
    lambda: len(self.clients) >= num_clients, timeout=timeout
  File "...me/.pyenv/versions/3.7.15/lib/python3.7/threading.py", line 244, in __exit__
    return self._lock.__exit__(*args)
RuntimeError: cannot release un-acquired lock
Training had an issue

Can you check if you can reproduce it?

@charlesbvll (Member, Author) commented May 24, 2023

> @charlesbvll When I run the test for TensorFlow I get the following error: […] Can you check if you can reproduce it?

@tanertopal I am unable to reproduce it, but this is very weird for a few reasons:

  • You shouldn't be able to call the script from outside the e2e/tensorflow folder; it should error out because it won't find ../test.sh.
  • There shouldn't be any interaction with Ray, as this doesn't use the simulation engine at all and Ray is not part of the dependencies.
  • I have had issues before when other Flower processes were running and I forgot to kill them; that might be a possible cause of your issue (even though I don't really understand the first two errors).

The most plausible explanation is that you have a test.sh script sitting at the root of the Flower dir which is being called by the symlink.

(Review threads on e2e/bare/test.sh, e2e/pytorch/test.sh, and e2e/tensorflow/test.sh were marked outdated and resolved.)
@tanertopal tanertopal enabled auto-merge (squash) June 27, 2023 15:16
@tanertopal (Member) left a comment:

lgtm! Great work!

@tanertopal tanertopal dismissed danieljanes’s stale review June 27, 2023 16:43

Just reviewed everything again; the requested changes were applied.

@tanertopal tanertopal merged commit 4f621c6 into main Jun 27, 2023
@tanertopal tanertopal deleted the add-e2e-tests branch June 27, 2023 16:43