
Add E2E tests for PyTorch and TensorFlow examples #1842

Merged (65 commits into main, Jun 27, 2023)
Conversation

@charlesbvll (Member)

What does this implement/fix? Explain your changes.

Improve the current tests for the PyTorch and TensorFlow examples by validating them end to end.
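
For context, the server side of such an E2E test is tiny. Judging from the traceback quoted later in this thread (server.py passes fl.server.ServerConfig(num_rounds=3)), it looks roughly like the sketch below; the server address is an assumption, not the PR's actual file:

```python
# server.py -- a rough sketch reconstructed from the traceback in this thread;
# the address is an assumption and the actual file in the PR may differ.
import flwr as fl

if __name__ == "__main__":
    fl.server.start_server(
        server_address="0.0.0.0:8080",
        config=fl.server.ServerConfig(num_rounds=3),
    )
```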

@charlesbvll charlesbvll marked this pull request as ready for review May 7, 2023 16:28
@tanertopal (Member) commented May 19, 2023

@danieljanes @charlesbvll I would like to either rename the top-level directory as mentioned here: #1842 (comment) to e2e, or move all E2E tests into tests/e2e. IMO it's essential to indicate that the other tests are in the src directory, as I want to avoid creating the impression that these are all the tests we have (or confusing contributors about where to put a test).

@tanertopal (Member)

One more thought: we have quite a lot of framework-specific code in the tests, and I wonder if that's the best approach. If we test the examples, that would be covered, so we do not need to duplicate the effort here. I believe we should keep the E2E tests minimal while testing our APIs. It would be more valuable to test failure cases, such as providing an invalid loss or NumPy arrays with different shapes. @danieljanes @charlesbvll wdyt?

@danieljanes (Member)

> One more thought: we have quite a lot of framework-specific code in the tests, and I wonder if that's the best approach. […] @danieljanes @charlesbvll wdyt?

My perspective is that we should make a clear distinction between testing examples and testing the framework. Examples are written with readability in mind, and that will often lead to situations where the example is not a good E2E test case. If we use examples as E2E tests, we might face difficulties in deciding whether to write them in a way that's good for the reader (our main objective) or good for our testing. I'm in favor of not using examples as E2E tests (and removing the ones we have from testing). How to test the examples themselves (so that they don't break over time) is an open question, and it seems like we don't have a good answer yet.

That being said, in the E2E testing of the framework, we need to cover each framework at least once. We need to ensure that the typical workflow works with each framework: sending parameters to the client node, deserializing them, "using" them in the ML framework (PyTorch, TensorFlow, ...), getting the parameters out of the framework again, serializing them, and sending them back to the server. Once each framework has at least one E2E test, we can cover the rest of the E2E scenarios in a more "low-level" way (for example, by just using NumPy NDArrays).
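
To make the "low-level" NumPy-only case concrete, here is a minimal sketch of such a bare E2E client. The shapes, values, and address are illustrative assumptions, not code from this PR:

```python
# A minimal "bare" E2E client sketch: parameters travel server -> client,
# are deserialized, "used", re-serialized, and sent back, with no ML
# framework involved. All values here are illustrative assumptions.
import flwr as fl
import numpy as np


class BareClient(fl.client.NumPyClient):
    def get_parameters(self, config):
        # Initial parameters the server requests from one random client
        return [np.zeros((2, 3))]

    def fit(self, parameters, config):
        # "Use" the received NDArrays, then return them to the server
        return [p + 1.0 for p in parameters], 1, {}

    def evaluate(self, parameters, config):
        # Return (loss, num_examples, metrics)
        return 0.0, 1, {}


fl.client.start_numpy_client(server_address="127.0.0.1:8080", client=BareClient())
```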

@tanertopal (Member)

> One more thought: we have quite a lot of framework-specific code in the tests, and I wonder if that's the best approach. […]

> My perspective is that we should make a clear distinction between testing examples and testing the framework. […] Once each framework has at least one E2E test, we can cover the rest of the E2E scenarios in a more "low-level" way (for example, by just using NumPy NDArrays).

If we are not treating the examples as E2E tests, I agree that the tests should include framework-specific code. I'll review the code more thoroughly.
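
For illustration, a framework-specific E2E client along the lines discussed above could look like the following sketch. The model, data, and address are assumptions, not the PR's actual e2e/tensorflow code:

```python
# A minimal framework-specific E2E client sketch (TensorFlow); the model,
# data, and address are illustrative assumptions, not this PR's code.
import flwr as fl
import tensorflow as tf

# A tiny model keeps the E2E round trip fast
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
model.compile(optimizer="sgd", loss="mse")

x = tf.random.uniform((8, 4))
y = tf.random.uniform((8, 1))


class TFClient(fl.client.NumPyClient):
    def get_parameters(self, config):
        return model.get_weights()

    def fit(self, parameters, config):
        model.set_weights(parameters)            # deserialize into the framework
        model.fit(x, y, epochs=1, verbose=0)     # "use" the parameters
        return model.get_weights(), len(x), {}   # serialize and send back

    def evaluate(self, parameters, config):
        model.set_weights(parameters)
        loss = model.evaluate(x, y, verbose=0)
        return float(loss), len(x), {}


fl.client.start_numpy_client(server_address="127.0.0.1:8080", client=TFClient())
```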

@tanertopal (Member) commented May 24, 2023

@charlesbvll When I run the test for TensorFlow I get the following error:

INFO flwr 2023-05-24 21:15:19,210 | app.py:165 | Starting Flower server, config: ServerConfig(num_rounds=3, round_timeout=None)
INFO flwr 2023-05-24 21:15:19,222 | app.py:179 | Flower ECE: gRPC server running (3 rounds), SSL is disabled
INFO flwr 2023-05-24 21:15:19,222 | server.py:89 | Initializing global parameters
INFO flwr 2023-05-24 21:15:19,222 | server.py:279 | Requesting initial parameters from one random client
Traceback (most recent call last):
  File "...me/.pyenv/versions/3.7.15/lib/python3.7/threading.py", line 300, in wait
    gotit = waiter.acquire(True, timeout)
  File "...me/development/adap/flower/.venv/lib/python3.7/site-packages/ray/_private/worker.py", line 1727, in sigterm_handler
    sys.exit(signum)
SystemExit: 15

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "...me/development/adap/flower/src/py/flwr/server/client_manager.py", line 126, in wait_for
    lambda: len(self.clients) >= num_clients, timeout=timeout
  File "...me/.pyenv/versions/3.7.15/lib/python3.7/threading.py", line 331, in wait_for
    self.wait(waittime)
  File "...me/.pyenv/versions/3.7.15/lib/python3.7/threading.py", line 300, in wait
    gotit = waiter.acquire(True, timeout)
  File "...me/development/adap/flower/.venv/lib/python3.7/site-packages/ray/_private/worker.py", line 1727, in sigterm_handler
    sys.exit(signum)
SystemExit: 15

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "server.py", line 7, in <module>
    config=fl.server.ServerConfig(num_rounds=3),
  File "...me/development/adap/flower/src/py/flwr/server/app.py", line 185, in start_server
    config=initialized_config,
  File "...me/development/adap/flower/src/py/flwr/server/app.py", line 224, in run_fl
    hist = server.fit(num_rounds=config.num_rounds, timeout=config.round_timeout)
  File "...me/development/adap/flower/src/py/flwr/server/server.py", line 90, in fit
    self.parameters = self._get_initial_parameters(timeout=timeout)
  File "...me/development/adap/flower/src/py/flwr/server/server.py", line 280, in _get_initial_parameters
    random_client = self._client_manager.sample(1)[0]
  File "...me/development/adap/flower/src/py/flwr/server/client_manager.py", line 180, in sample
    self.wait_for(min_num_clients)
  File "...me/development/adap/flower/src/py/flwr/server/client_manager.py", line 126, in wait_for
    lambda: len(self.clients) >= num_clients, timeout=timeout
  File "...me/.pyenv/versions/3.7.15/lib/python3.7/threading.py", line 244, in __exit__
    return self._lock.__exit__(*args)
RuntimeError: cannot release un-acquired lock
Training had an issue

Can you check if you can reproduce it?

@charlesbvll (Member, Author) commented May 24, 2023

> @charlesbvll When I run the test for TensorFlow I get the following error: […] Can you check if you can reproduce it?

@tanertopal I am unable to reproduce it, but this is very weird for a few reasons:

  • You shouldn't be able to call the script from outside the e2e/tensorflow folder; it should error out because it won't find ../test.sh.
  • There shouldn't be any interaction with Ray, as this doesn't use the simulation engine at all and Ray is not part of the dependencies.
  • I have had issues before when other Flower processes were running and I forgot to kill them; that might be a possible cause of your issue (even though I don't really understand the first two errors).

The most plausible explanation is that you have a test.sh script sitting at the root of the Flower dir which is being called by the symlink.

(Review threads on e2e/bare/test.sh, e2e/pytorch/test.sh, and e2e/tensorflow/test.sh were marked outdated and resolved.)
@tanertopal tanertopal enabled auto-merge (squash) June 27, 2023 15:16
@tanertopal (Member) left a comment:

lgtm! Great work!

@tanertopal tanertopal dismissed danieljanes’s stale review June 27, 2023 16:43

Just reviewed everything again; the requested changes were applied.

@tanertopal tanertopal merged commit 4f621c6 into main Jun 27, 2023
@tanertopal tanertopal deleted the add-e2e-tests branch June 27, 2023 16:43