Safe node removal #4008

Merged
merged 33 commits into from
Jul 28, 2022

achamayou
Member

@achamayou achamayou commented Jul 1, 2022

Fix for #1713

  • Add a retired_committed node flag in the KV. The primary sets this flag on a node once it observes that the node's RETIRED state has been committed.
  • Add a GET /node/network/removable_nodes endpoint that returns the nodes that are RETIRED && retired_committed. This lets the operator find out which nodes can be safely shut down.
  • Add an endpoint to flush RETIRED && retired_committed nodes from the KV, to avoid KV growth.

We may choose to go with a "fresh read" for step 2 instead, which would remove the need for step 3 and let us elide the retired_committed step and go straight to removal. This involves turning GET /node/network/removable_nodes into a write, after which operators would have to remember to check tx/status. This seems unergonomic right now.

Once we have blocking transactions though, we probably want to revisit this.
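
To make the ergonomic cost of the "fresh read" alternative concrete, here is a minimal sketch of the extra polling step an operator script would need after a write. It is not code from this PR. The get_status callable and the status strings "Pending", "Committed" and "Invalid" are assumptions made for illustration; a real script would obtain the status from the service's transaction status endpoint.

```python
import time


def wait_for_commit(get_status, timeout_s=3.0, poll_interval_s=0.1):
    """Poll get_status() until the transaction is committed or invalid.

    get_status is a caller-supplied function (hypothetical here) that
    returns the current status string for one transaction.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status == "Committed":
            return True
        if status == "Invalid":
            return False
        # Any other status is treated as "not decided yet".
        if time.monotonic() >= deadline:
            raise TimeoutError(f"transaction still {status!r} after {timeout_s}s")
        time.sleep(poll_interval_s)
```

Only a True result would mean the list of removable nodes can be acted on. With the retired_committed flag, a plain read is enough and this loop is not needed.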

  • Update operator documentation to explain safe removal flow.
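
The flow described above can be sketched with a toy in-memory model. This is an illustration only, not the implementation in this PR: the class names, method names and the use of a Python dict in place of the KV are all assumptions. Only the RETIRED status, the retired_committed flag and the "RETIRED && retired_committed" condition come from the description.

```python
from dataclasses import dataclass
from enum import Enum


class NodeStatus(Enum):
    # Illustrative subset of node statuses.
    TRUSTED = "Trusted"
    RETIRED = "Retired"


@dataclass
class NodeInfo:
    status: NodeStatus
    # Set by the primary once the RETIRED status is known to be committed.
    retired_committed: bool = False


class NodesTable:
    """Toy stand-in for the nodes table in the KV (hypothetical)."""

    def __init__(self):
        self.nodes = {}

    def retire(self, node_id):
        # Step 0: the node is marked RETIRED, but this may not be committed yet.
        self.nodes[node_id].status = NodeStatus.RETIRED

    def on_retirement_committed(self, node_id):
        # Step 1: the primary observes that the RETIRED write is committed.
        node = self.nodes[node_id]
        if node.status is NodeStatus.RETIRED:
            node.retired_committed = True

    def removable_nodes(self):
        # Step 2: what the query would report, RETIRED && retired_committed.
        return sorted(
            node_id
            for node_id, node in self.nodes.items()
            if node.status is NodeStatus.RETIRED and node.retired_committed
        )

    def flush(self, node_id):
        # Step 3: delete a removable node's entry to avoid KV growth.
        if node_id not in self.removable_nodes():
            raise ValueError(f"{node_id} is not safe to remove yet")
        del self.nodes[node_id]
```

In this model a node that has been retired but whose retirement is not yet committed never shows up in removable_nodes(), which is the property that makes shutting it down safe.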

@ghost

ghost commented Jul 1, 2022

safe_node_removal@48093 aka 20220728.7 vs main ewma over 20 builds from 47853 to 48083

main

build_id build_number tpcc_sgx_cft^ tpcc_sgx_cft_mem ls_sgx_cft^ ls_sgx_cft_mem ls_jwt_sgx_cft^ ls_jwt_sgx_cft_mem ls_js_sgx_cft^ ls_js_sgx_cft_mem ls_v8_sgx_cft^ ls_v8_sgx_cft_mem ls_full_js_sgx_cft^ ls_full_js_sgx_cft_mem ls_full_v8_sgx_cft^ ls_full_v8_sgx_cft_mem ls_js_jwt_sgx_cft^ ls_js_jwt_sgx_cft_mem hist_sgx_cft^ RB put (/s)^ CHAMP put (/s)^ RB get (/s)^ CHAMP get (/s)^
47853 20220720.9 6475.54 8.40213e+07 20535.6 1.58639e+07 5649.05 1.56017e+07 2507.96 1.00967e+07 1639.93 1.64762e+08 2126.48 9.31027e+06 1426.87 9.76528e+07 1936.18 9.04813e+06 18213.7 905586 1.36989e+06 9.22514e+06 3.58042e+07
47861 20220720.12 6509.14 8.48077e+07 20721.5 1.63882e+07 5694.92 1.56017e+07 2412.68 1.00967e+07 1645.05 1.64762e+08 2121.19 9.31027e+06 1433.2 9.81771e+07 1929.28 9.04813e+06 24288.4 905908 1.38433e+06 9.02596e+06 3.57417e+07
47872 20220720.17 6465.82 8.42835e+07 20517.4 1.63882e+07 5622.72 1.50774e+07 2557.56 1.03588e+07 1617.88 1.66597e+08 2162.39 9.31027e+06 1412.22 9.79149e+07 1950.88 9.04813e+06 18062.2 899732 1.40551e+06 9.26265e+06 3.65708e+07
47881 20220720.21 6369.6 8.40213e+07 20382.1 1.69124e+07 5643.45 1.53396e+07 2557.82 1.00967e+07 1616.96 1.645e+08 2120.13 9.57242e+06 1409.7 9.84392e+07 1931.12 8.78598e+06 22172 893135 1.39882e+06 9.19615e+06 3.59273e+07
47897 20220720.28 6377.58 8.3497e+07 20748.1 1.69124e+07 5738.23 1.56017e+07 2569.08 1.03588e+07 1637.76 1.66335e+08 2128.14 1.21939e+07 1440.04 9.81771e+07 1934.55 9.04813e+06 20951.1 883890 1.36386e+06 9.17958e+06 3.54933e+07
47905 20220721.2 6708.02 8.32349e+07 20841.9 1.58639e+07 5644.58 1.56017e+07 2562.93 1.00967e+07 1621.37 1.66597e+08 2123.05 9.31027e+06 1427.49 9.81771e+07 1978.68 8.78598e+06 23318.4 898512 1.35198e+06 9.30901e+06 3.55556e+07
47909 20220721.4 6659.57 8.3497e+07 20657 1.58639e+07 5693.2 1.53396e+07 2558.29 1.00967e+07 1632.89 1.66859e+08 2128.53 9.31027e+06 1443.32 9.87014e+07 1918.81 9.04813e+06 22526.7 854860 1.36351e+06 9.24179e+06 3.58669e+07
47923 20220721.10 6395.09 8.3497e+07 20406.3 1.58639e+07 5573.28 1.53396e+07 2520.68 1.2456e+07 1609.11 1.66859e+08 2114.86 9.31027e+06 1421.32 9.84392e+07 1875.17 9.04813e+06 22713.6 904947 1.37182e+06 9.26688e+06 3.58663e+07
47926 20220722.1 6559.88 8.32349e+07 19974.9 1.6126e+07 5745.67 1.50774e+07 2554.17 1.00967e+07 1666.79 1.65024e+08 2132.5 9.31027e+06 1439.08 9.84392e+07 1940.42 8.78598e+06 18130.5 889049 1.35819e+06 9.27121e+06 3.58042e+07
47939 20220722.7 5985.98 8.37592e+07 19798.5 1.58639e+07 5651.98 1.53396e+07 2551.84 1.00967e+07 1635.45 1.67121e+08 2125.5 9.31027e+06 1420.71 9.81771e+07 1976.48 8.78598e+06 18489.4 888720 1.37669e+06 9.36002e+06 3.58036e+07
47944 20220722.9 6418.19 8.37592e+07 20099.7 1.6126e+07 5601.76 1.56017e+07 2549.59 1.00967e+07 1589.25 1.66859e+08 2126.97 9.31027e+06 1416.11 9.81771e+07 1934.91 9.04813e+06 18668.2 905590 1.36016e+06 9.31328e+06 3.59292e+07
47950 20220722.11 6346.72 8.42835e+07 20498 1.66503e+07 5696.14 1.48153e+07 2547.54 1.00967e+07 1593.79 1.66597e+08 2087.82 9.31027e+06 1417.11 9.76528e+07 1929.17 8.78598e+06 20113.6 900215 1.37311e+06 9.21273e+06 3.56794e+07
47964 20220725.3 6445.58 8.45456e+07 20264.1 1.6126e+07 5715.87 1.53396e+07 2543.74 1.00967e+07 1612.58 1.66597e+08 2108.59 9.31027e+06 1435.99 9.76528e+07 1937.49 1.0621e+07 18993.5 902517 1.33865e+06 9.39876e+06 3.48893e+07
47991 20220726.3 6424.54 8.29727e+07 20484.1 1.58639e+07 5660.34 1.53396e+07 2552.74 1.00967e+07 1590.25 1.66597e+08 2122.4 9.04813e+06 1429.67 9.81771e+07 1926.78 8.78598e+06 20464.4 890933 1.38154e+06 9.48144e+06 3.62478e+07
47999 20220726.6 6463.07 8.3497e+07 20411.6 1.58639e+07 5679.05 1.56017e+07 2544.17 1.00967e+07 1621.25 1.66859e+08 2164.94 9.31027e+06 1426.71 9.84392e+07 2024.37 9.04813e+06 17785.5 908399 1.36251e+06 9.39872e+06 3.65714e+07
48019 20220726.14 6543.72 8.42835e+07 20091.2 1.56017e+07 5601.14 1.50774e+07 2499.24 9.83456e+06 1600.03 1.645e+08 2122.92 9.31027e+06 1420.35 9.79149e+07 2014.48 9.04813e+06 19942.2 903392 1.39861e+06 9.02994e+06 3.59298e+07
48039 20220726.21 6604.8 8.45456e+07 20061.9 1.63882e+07 5657.52 1.53396e+07 2462.46 1.29803e+07 1599.33 1.67121e+08 2133.23 9.04813e+06 1444.67 9.79149e+07 1932.82 9.04813e+06 20529.4 898398 1.37154e+06 9.20445e+06 3.59292e+07
48056 20220727.3 7376.83 8.48077e+07 20455.4 1.58639e+07 5645.94 1.53396e+07 2553.28 1.00967e+07 1612.2 1.66597e+08 2117.56 9.31027e+06 1424.76 9.81771e+07 1932.75 8.78598e+06 18180.7 905868 1.38004e+06 9.22514e+06 3.58669e+07
48076 20220728.1 6377.5 8.45456e+07 20423.6 1.6126e+07 5695.92 1.53396e+07 2536.06 1.03588e+07 1631.52 1.66859e+08 2129.7 9.31027e+06 1436.56 9.81771e+07 1937.91 9.04813e+06 19899.1 861326 1.3754e+06 9.20855e+06 3.55549e+07
48083 20220728.4 6537.16 8.32349e+07 20590.5 1.6126e+07 5720.58 1.56017e+07 2631.03 9.83456e+06 1599.78 1.64762e+08 2126.15 9.31027e+06 1454.7 9.81771e+07 1946.79 9.04813e+06 22317.3 890155 1.36259e+06 9.20855e+06 3.58669e+07

safe_node_removal

build_id build_number tpcc_sgx_cft^ tpcc_sgx_cft_mem ls_sgx_cft^ ls_sgx_cft_mem ls_jwt_sgx_cft^ ls_jwt_sgx_cft_mem ls_js_sgx_cft^ ls_js_sgx_cft_mem ls_v8_sgx_cft^ ls_v8_sgx_cft_mem ls_full_js_sgx_cft^ ls_full_js_sgx_cft_mem ls_full_v8_sgx_cft^ ls_full_v8_sgx_cft_mem ls_js_jwt_sgx_cft^ ls_js_jwt_sgx_cft_mem hist_sgx_cft^ RB put (/s)^ CHAMP put (/s)^ RB get (/s)^ CHAMP get (/s)^
48062 20220727.5 6518.06 8.37592e+07 20221.7 1.66503e+07 5634.52 1.53396e+07 2486.51 9.83456e+06 1603.58 1.66335e+08 2119.87 1.29803e+07 1419.45 9.79149e+07 1934.54 8.78598e+06 20410.3 890662 1.35944e+06 9.20441e+06 3.58663e+07
48068 20220727.7 6477.66 8.3497e+07 20443.6 1.58639e+07 5622.92 1.53396e+07 2542.52 1.0621e+07 1582.88 1.645e+08 2091.43 1.19317e+07 1417.01 9.84392e+07 1933.21 9.04813e+06 23835.2 889696 1.37292e+06 9.24184e+06 3.65062e+07
48074 20220727.9 6603.53 8.3497e+07 20302.5 1.58639e+07 5616.88 1.56017e+07 2513.23 1.32424e+07 1629.13 1.66597e+08 2128.94 9.31027e+06 1433.28 9.81771e+07 1901.15 9.04813e+06 20274 901641 1.38079e+06 9.21688e+06 3.57417e+07
48090 20220728.6 6598.33 8.42835e+07 20468.5 1.56017e+07 5606.93 1.50774e+07 2550.58 1.03588e+07 1632.61 1.64762e+08 2124.35 9.31027e+06 1452.72 9.87014e+07 1944.46 9.04813e+06 18116.8 884812 1.38705e+06 9.25018e+06 3.65062e+07
48093 20220728.7 6473.96 8.37592e+07 20464.4 1.58639e+07 5637.9 1.56017e+07 2555.91 1.00967e+07 1600.3 1.66335e+08 2126.34 9.31027e+06 1417.12 9.73907e+07 1931.8 8.78598e+06 18197.5 906146 1.36196e+06 9.26688e+06 3.58663e+07
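
For readers unfamiliar with the "ewma" baseline mentioned in the comment title, an exponentially weighted moving average can be computed as in the sketch below. The smoothing factor alpha is an assumption, since the value used by the benchmark bot is not stated here. The sample values are the first five tpcc_sgx_cft^ readings from the main table above.

```python
def ewma(values, alpha=0.1):
    """Exponentially weighted moving average of a sequence, oldest first.

    alpha is the weight given to each new value (assumed, not taken from
    the benchmark tooling).
    """
    if not values:
        raise ValueError("need at least one value")
    smoothed = values[0]
    for value in values[1:]:
        smoothed = alpha * value + (1 - alpha) * smoothed
    return smoothed


# First five tpcc_sgx_cft^ values on main, builds 47853 to 47897.
main_tpcc = [6475.54, 6509.14, 6465.82, 6369.6, 6377.58]
baseline = ewma(main_tpcc)

# Latest branch build (48093) for the same metric.
branch_tpcc = 6473.96
relative_change = (branch_tpcc - baseline) / baseline
```

A baseline of this kind smooths out build-to-build noise, so a single unusually fast or slow build on main has limited effect on the comparison.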

@achamayou
Member Author

achamayou commented Jul 22, 2022

Some notes for Monday:

Interestingly, the LTS compatibility test fails with:

Exception in virtual void aft::Aft<consensus::LedgerEnclave>::recv_message(const ccf::NodeId &, const uint8_t *, size_t) [LedgerProxy = consensus::LedgerEnclave]

This occurs in a 2.0.4 node, which suggests that a new node is sending messages that the old nodes cannot handle. I initially suspected that the nodes were being removed from the table too early, causing signature verification to fail, but taking out the node removal call shows that this is not the case!

@achamayou
Member Author

The LTS break is caused by the enum extension, which older nodes cannot deserialise. The fix, discussed with @eddyashton, is to remove the additional state and use a boolean flag instead.
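
The difference between the two approaches can be illustrated with a small sketch. This is not the serialisation code used by the project: the JSON encoding, the "RetiredCommitted" value and the old_reader function are assumptions, and the sketch also assumes that an older reader ignores fields it does not know about.

```python
import json
from enum import Enum


class OldNodeStatus(Enum):
    # The statuses an older reader knows about (illustrative subset).
    TRUSTED = "Trusted"
    RETIRED = "Retired"


def old_reader(raw):
    """Parse a node record the way a hypothetical older node might."""
    record = json.loads(raw)
    # Raises ValueError if the status string is not one this reader knows.
    return OldNodeStatus(record["status"])


def can_parse(raw):
    try:
        old_reader(raw)
        return True
    except ValueError:
        return False


# Option A: extend the enum with a new state. Older readers reject it.
with_new_state = json.dumps({"status": "RetiredCommitted"})

# Option B: keep the existing status and add a separate boolean field.
# Older readers skip the unknown field and still parse the record.
with_flag = json.dumps({"status": "Retired", "retired_committed": True})
```

Here can_parse(with_new_state) is False while can_parse(with_flag) is True, which mirrors why the flag is the safer choice in a network that mixes old and new nodes.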

@achamayou achamayou marked this pull request as ready for review July 25, 2022 17:44
@achamayou achamayou requested a review from a team July 25, 2022 17:44
@achamayou achamayou added the 2.x-todo PRs which should be backported to 2.x label Jul 26, 2022
@jumaffre
Contributor

This needs a changelog entry and a new section in the operation docs here and here.

@achamayou
Member Author

This needs a changelog entry and a new section in the operation docs here and here.

Yes, I've updated those.

@achamayou achamayou requested review from jumaffre and eddyashton July 27, 2022 20:15
@achamayou achamayou merged commit dfe98cb into microsoft:main Jul 28, 2022
@github-actions

💔 All backports failed

Status Branch Result
release/2.x Backport failed because of merge conflicts

You might need to backport the following PRs to release/2.x:
- Raft: count election quorums in all active configurations (#4018)

Manual backport

To create the backport manually run:

backport --pr 4008

Questions ?

Please refer to the Backport tool documentation and see the Github Action logs for details

achamayou added a commit to achamayou/CCF that referenced this pull request Jul 28, 2022
achamayou added a commit that referenced this pull request Jul 29, 2022
@jumaffre jumaffre added the backported This PR was successfully backported to LTS branch label Aug 1, 2022
Labels
2.x-todo: PRs which should be backported to 2.x
auto-backport: Automatically backport this PR to LTS branch
backported: This PR was successfully backported to LTS branch