Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

code_update_test break in CI #2625

Closed
achamayou opened this issue May 27, 2021 · 6 comments
Closed

code_update_test break in CI #2625

achamayou opened this issue May 27, 2021 · 6 comments
Assignees
Labels
bug ci Issues related to continuous integration

Comments

@achamayou
Copy link
Member

achamayou commented May 27, 2021

https://dev.azure.com/MSRC-CCF/CCF/_build/results?buildId=26078&view=logs&j=8f3dc89c-3708-5926-47e7-27120a268dab&t=efdbc34d-c6b6-5dc6-e05c-1751d4c47c20&l=23732

57: Traceback (most recent call last):
57:   File "/mnt/build/1/s/tests/code_update.py", line 214, in <module>
57:     run(args)
57:   File "/mnt/build/1/s/tests/code_update.py", line 203, in run
57:     test_update_all_nodes(network, args)
57:   File "/mnt/build/1/s/tests/suite/test_requirements.py", line 20, in wrapper
57:     return func(*args, **kwargs)
57:   File "/mnt/build/1/s/tests/code_update.py", line 149, in test_update_all_nodes
57:     primary, _ = network.find_nodes()
57:   File "/mnt/build/1/s/tests/infra/network.py", line 809, in find_nodes
57:     backups = self.find_backups(primary=primary, timeout=timeout)
57:   File "/mnt/build/1/s/tests/infra/network.py", line 790, in find_backups
57:     return [n for n in self.get_joined_nodes() if n != primary]
57:   File "/mnt/build/1/s/tests/infra/network.py", line 790, in <listcomp>
57:     return [n for n in self.get_joined_nodes() if n != primary]
57:   File "/mnt/build/1/s/tests/infra/node.py", line 120, in __eq__
57:     return self.local_node_id == other.local_node_id
57: AttributeError: 'NoneType' object has no attribute 'local_node_id'
51/64 Test #57: code_update_test ..................***Failed   49.50 sec
@achamayou achamayou added the bug label May 27, 2021
@achamayou
Copy link
Member Author

Similar issue in https://dev.azure.com/MSRC-CCF/CCF/_build/results?buildId=26080&view=logs&j=5435e0ac-25e5-5426-50be-61b0d0ea8d34&t=63d660c3-6f76-573a-0154-fd0bcb096ded

61: Traceback (most recent call last):
61:   File "/mnt/build/1/s/tests/code_update.py", line 214, in <module>
61:     run(args)
61:   File "/mnt/build/1/s/tests/code_update.py", line 206, in run
61:     test_verify_quotes(network, args)
61:   File "/mnt/build/1/s/tests/suite/test_requirements.py", line 20, in wrapper
61:     return func(*args, **kwargs)
61:   File "/mnt/build/1/s/tests/code_update.py", line 44, in test_verify_quotes
61:     check_can_progress(primary)
61:   File "/mnt/build/1/s/tests/infra/checker.py", line 42, in check_can_progress
61:     with node.client() as c:
61: AttributeError: 'NoneType' object has no attribute 'client'
50/66 Test #61: code_update_test ..................***Failed   49.17 sec
test 69

@eddyashton
Copy link
Member

Looks like primary is None, coming from find_primary and specifically:

    def _get_node_by_service_id(self, node_id):
        return next((node for node in self.nodes if node.node_id == node_id), None)

So primary_id is not None, but we no longer have a node with this ID in the self.nodes collection.

@eddyashton
Copy link
Member

eddyashton commented May 27, 2021

Specifically, primary_id is a valid node ID, but its a node that has now been removed.

07:29:53.875 | INFO     | infra.network:wait_for_new_primary:913 - New primary is 1 (24fee0625aff5e559abb6730df75a2c3fad523b7be2b3b8a7bf16895251bec7d) in term 3
07:29:54.054 | INFO     | infra.consortium:retire_node:284 - Retiring node 1
07:29:57.596 | INFO     | infra.node:stop:295 - Stopping node 1
07:29:57.741 | INFO     | infra.network:find_primary:783 - [2||] GET /node/network
07:29:57.741 | INFO     | infra.network:find_primary:783 - 200 @3.81 {"current_view":3,"primary_id":"24fee0625aff5e559abb6730df75a2c3fad523b7be2b3b8a7bf16895251bec7d","service_certificate":"-----BEGIN CERTIFICATE-----\nMIIBuDCCAT6gAwIBAgIRAP0IMc6LEZP+hL6DCL40S2MwCgYIKoZIzj0EAwMwFjEU\nMBIGA1UEAwwLQ0NGIE5ldHdvcmswHhcNMjEwMzEx + 559 chars

So node 2 still thinks node 1 is primary, shortly after 1 was removed from network.nodes. This seems like a valid situation, but our infra clearly doesn't handle it and we must not have seen it previously. Still trying to work out what has changed to trigger this.

@eddyashton
Copy link
Member

Another error noticed in these logs, that doesn't explain the crash but should be fixed, is that a retired node can know (and report) the new, correct primary, but reports the incorrect term.

61: 07:27:30.734 | INFO     | infra.network:wait_for_new_primary:912 - [3||] GET /node/network
61: 07:27:30.734 | INFO     | infra.network:wait_for_new_primary:912 - 200 @3.80 {"current_view":3,"primary_id":"7aeaf291fee26196a2993750855a685c75076a9b178f8db00ec45251bbf1c25e","service_certificate":"-----BEGIN CERTIFICATE-----\nMIIBtzCCAT6gAwIBAgIRAJI7cpqqOIz8sp4uE5zFg8cwCgYIKoZIzj0EAwMwFjEU\nMBIGA1UEAwwLQ0NGIE5ldHdvcmswHhcNMjEwMzEx + 559 chars
61: 07:27:30.734 | INFO     | infra.network:wait_for_new_primary:913 - New primary is 1 (7aeaf291fee26196a2993750855a685c75076a9b178f8db00ec45251bbf1c25e) in term 3

This is correct. We ask node 3 who the primary is, they correctly say it is node 1 in term 3.

61: 07:27:33.997 | INFO     | infra.network:wait_for_new_primary:912 - [1||] GET /node/network
61: 07:27:33.998 | INFO     | infra.network:wait_for_new_primary:912 - 200 @3.84 {"current_view":3,"primary_id":"f40b8aedf3ba7dfc6cc03290cffac48569bfc2de241fbf5a22e4be827190d45b","service_certificate":"-----BEGIN CERTIFICATE-----\nMIIBtzCCAT6gAwIBAgIRAJI7cpqqOIz8sp4uE5zFg8cwCgYIKoZIzj0EAwMwFjEU\nMBIGA1UEAwwLQ0NGIE5ldHdvcmswHhcNMjEwMzEx + 559 chars
61: 07:27:33.998 | INFO     | infra.network:wait_for_new_primary:913 - New primary is 4 (f40b8aedf3ba7dfc6cc03290cffac48569bfc2de241fbf5a22e4be827190d45b) in term 3

Shortly afterwards, after node 1 is retired, we ask node 1 who the primary is. They report node 4 (correct, they've won an election in the new term once 1 is retired), but in the wrong term (reported as term 3, really term 4).

@achamayou
Copy link
Member Author

@eddyashton yes, that's because Retired does "too little" now and should act more like "Follower" as suggested in #2606. I'm working on this.

@eddyashton eddyashton added the ci Issues related to continuous integration label Jun 14, 2021
@achamayou
Copy link
Member Author

Waiting on #2801

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug ci Issues related to continuous integration
Projects
None yet
Development

No branches or pull requests

2 participants