Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Channels fix: Resolve lts_compatibility test failures #3035

Merged
merged 4 commits into from
Oct 1, 2021

Conversation

eddyashton
Copy link
Member

Since #2801 we've had a high rate of failures in the lts_compatibility build. After investigating the logs I believe I've found the root cause. Opening as a PR with a forced repro initially, and then I'll add the resolution.

A healthy join of a new-node to old-node looks like this:

35:10.928 |03|       Node has now joined the network as node n[3=1735]: all domains (node_state.h:578)
35:10.967 |00| Initiating node channel with 3=1735. (channels.h:713)
35:11.035 |00| Node channel with 3=1735 is now established. (channels.h:690)
35:11.059 |03|       Node channel with n[0=2f90] is now established. (channels.h:732)

The failures we're seeing look like this:

35:17.470 |04|         Node has now joined the network as node n[4=e75f]: all domains (node_state.h:578)
35:17.508 |00| Initiating node channel with 4=e75f. (channels.h:713)
35:17.614 |00| Initiating node channel with 4=e75f. (channels.h:713)
35:17.625 |00| Node channel with 4=e75f is now established. (channels.h:690)
35:17.636 |04|         <- n[0=2f90] (WAITING_FOR_FINAL): Peer certificate verification failed (channels.h:548)
35:17.636 |04|         Node channel with n[0=2f90] cannot receive authenticated message: not established, status=WAITING_FOR_FINAL (channels.h:833)
35:17.636 |04|         Dropped invalid message from n[0=2f90] (raft.h:1001)

The key diagnostic is that repeated Initiating node channel from Node 0. It attempts to initiate many times before the node is ready, but if it makes multiple attempts late enough (when the target is actually listening and processing them), then they get out-of-sync.

Looking at another repro with verbose logs, this is what's happening on the receiving node (trimmed and [DESCRIBED] for readability):

09:37.515 |03|       <- n[1=29e7] (INACTIVE): recv_key_exchange_init(945 bytes, false) (channels.h:337)
09:37.518 |03|       <- n[1=29e7] (INACTIVE): New peer certificate: [CERT FOR NODE 0] (channels.h:661)
09:37.518 |03|       <- n[1=29e7] (INACTIVE): Verifying peer signature with peer certificate serial [CERT FOR NODE 0] (channels.h:676)
09:37.523 |03|       <- n[1=29e7] (INACTIVE): recv_key_exchange_init: version=1 ks=[NODE 0'S KEY SHARE] sig=[SIG A] pc=[CERT FOR NODE 0] salt=[SALT FROM NODE 0] (channels.h:420)
09:37.523 |03|       [Channel to n[1=29e7]] Advancing to state WAITING_FOR_FINAL (from INACTIVE) (state_machine.h:44)
09:37.525 |03|       -> n[1=29e7] (WAITING_FOR_FINAL): send_key_exchange_response: oks=[MY KEY SHARE A], serialised_signed_share=[SIGNED [MY KEY SHARE A]] (channels.h:263)
09:37.525 |03|       <- n[1=29e7] (WAITING_FOR_FINAL): recv_key_exchange_init(946 bytes, false) (channels.h:337)
09:37.528 |03|       <- n[1=29e7] (WAITING_FOR_FINAL): New peer certificate: [CERT FOR NODE 0] (channels.h:661)
09:37.528 |03|       <- n[1=29e7] (WAITING_FOR_FINAL): Verifying peer signature with peer certificate serial [CERT FOR NODE 0] (channels.h:676)
09:37.533 |03|       <- n[1=29e7] (WAITING_FOR_FINAL): recv_key_exchange_init: version=1 ks=[NODE 0'S KEY SHARE] sig=[SIG B] pc=[CERT FOR NODE 0] salt=[SALT FROM NODE 0] (channels.h:420)
09:37.533 |03|       [Channel to n[1=29e7]] Advancing to state WAITING_FOR_FINAL (from WAITING_FOR_FINAL) (state_machine.h:44)
09:37.536 |03|       -> n[1=29e7] (WAITING_FOR_FINAL): send_key_exchange_response: oks=[MY KEY SHARE B], serialised_signed_share=[SIGNED [MY KEY SHARE B]] (channels.h:263)

In short, the peer makes multiple initiation attempts with the same key share (ie, same context), while the new code resets its context (regenerates its own key share) every time it receives. The fix is to stick with the old context for local state, potentially using a new share from their end, and only ingest it destructively when we come to deriving the shared secret.

src/node/node_state.h Outdated Show resolved Hide resolved
@ghost
Copy link

ghost commented Sep 30, 2021

lts_compat_fixup@33604 aka 20211001.7 vs main ewma over 20 builds from 33210 to 33585

Click to see table
build_id build_number tpcc_sgx_cft^ tpcc_sgx_cft_mem tpcc_sgx_bft^ tpcc_sgx_bft_mem ls_sgx_cft^ ls_sgx_cft_mem ls_jwt_sgx_cft^ ls_jwt_sgx_cft_mem ls_js_sgx_cft^ ls_js_sgx_cft_mem ls_full_js_sgx_cft^ ls_full_js_sgx_cft_mem ls_js_jwt_sgx_cft^ ls_js_jwt_sgx_cft_mem CHAMP put (/s)^ CHAMP get (/s)^ Historical query (/s)
33210 20210922.21 6572.27 9.26721e+07 2138.21 1.52965e+08 24738 1.87475e+07 4366.94 1.53396e+07 2791.68 1.19317e+07 2588.76 1.00967e+07 1810.41 8.52384e+06 1.472e+06 3.61837e+07 nan
33220 20210923.1 6655.7 9.13613e+07 2091.76 1.90976e+08 23782.2 1.7961e+07 4381.4 1.6126e+07 2765.93 1.2456e+07 2597.45 1.29803e+07 1801.18 1.03588e+07 1.45546e+06 3.58036e+07 nan
33225 20210923.3 6404.89 9.21478e+07 2214.17 1.4746e+08 24410.3 1.84853e+07 4339 1.6126e+07 2795.25 1.40289e+07 2664.78 1.53396e+07 1845.28 8.52384e+06 1.45838e+06 3.58669e+07 nan
33232 20210923.5 6722.52 9.16235e+07 2111.1 1.50082e+08 24556.7 1.84853e+07 4348.28 1.6126e+07 2731.5 1.4291e+07 2672.09 1.48153e+07 1833.19 8.2617e+06 1.4755e+06 3.62471e+07 nan
33250 20210923.10 6703.78 9.10992e+07 2216.08 1.45363e+08 24733.8 1.84853e+07 4422.71 1.58639e+07 2812.87 1.37667e+07 2649.07 1.45532e+07 1847.12 8.52384e+06 1.45176e+06 3.58669e+07 nan
33274 20210923.17 6513.88 9.24099e+07 2146.82 1.50868e+08 24649.4 1.82232e+07 4420.79 1.58639e+07 2752.64 1.03588e+07 2636.23 1.0621e+07 1757.44 8.52384e+06 1.45816e+06 3.58036e+07 nan
33290 20210924.1 6778.1 9.13613e+07 2153.03 1.42217e+08 24718.2 1.7961e+07 4454.64 1.58639e+07 2778.31 1.50774e+07 2650.82 1.0621e+07 1847.15 8.52384e+06 1.44336e+06 3.55549e+07 nan
33323 20210924.8 6648.29 9.10992e+07 2325.31 1.37499e+08 24492.1 1.84853e+07 4435.33 1.58639e+07 2754.79 1.32424e+07 2662.83 1.50774e+07 1842.5 8.2617e+06 1.45495e+06 3.65714e+07 nan
33347 20210927.1 6795.6 9.05749e+07 2224.82 1.86782e+08 24442.3 1.87475e+07 4366.67 1.58639e+07 2713.81 1.45532e+07 2608.99 1.08831e+07 1833.75 8.52384e+06 1.45672e+06 3.58669e+07 nan
33409 20210928.1 6578.77 9.05749e+07 2314.14 1.40382e+08 24021.5 1.84853e+07 4153 1.58639e+07 2713.83 1.40289e+07 2617.3 1.0621e+07 1821.2 8.52384e+06 1.45269e+06 3.58042e+07 nan
33414 20210928.3 6823.35 9.26721e+07 2115.43 1.54538e+08 23482.4 1.84853e+07 4462.24 1.56017e+07 2814.96 1.03588e+07 2663.44 1.45532e+07 1841.21 8.2617e+06 1.46389e+06 3.58042e+07 nan
33439 20210928.9 6594.27 9.26721e+07 2029.8 1.52179e+08 24509.1 1.7961e+07 4252 1.58639e+07 2769.02 1.27181e+07 2619.68 1.27181e+07 1829.44 8.2617e+06 1.48319e+06 3.61831e+07 nan
33463 20210928.16 6708.27 9.03128e+07 1960.42 1.56111e+08 24671.4 1.92717e+07 4374.95 1.58639e+07 2721.99 1.2456e+07 2661.72 1.32424e+07 1782.22 8.2617e+06 1.48083e+06 3.65714e+07 nan
33473 20210929.1 6719.57 9.08371e+07 2168.04 1.50082e+08 24512.7 1.84853e+07 4579.37 1.58639e+07 2767.98 1.4291e+07 2681.94 1.45532e+07 1799.29 8.2617e+06 1.45712e+06 3.59923e+07 nan
33488 20210929.6 6683.77 9.31964e+07 2210.7 1.43528e+08 24425.2 1.87475e+07 4578.76 1.58639e+07 2755.82 1.27181e+07 2630.47 1.45532e+07 1809.53 8.52384e+06 1.44847e+06 3.58042e+07 nan
33520 20210930.1 6822.64 9.08371e+07 2090.59 1.50868e+08 24554.6 1.82232e+07 4564.43 1.58639e+07 2783.98 1.45532e+07 2683.75 1.48153e+07 1813.26 8.52384e+06 1.47412e+06 3.66369e+07 nan
33535 20210930.6 6695.29 9.16235e+07 2014.27 1.58732e+08 24593.4 1.92717e+07 4536.47 1.56017e+07 2795.35 1.29803e+07 2673.57 1.08831e+07 1821.18 7.99955e+06 1.45972e+06 3.56174e+07 2088.74
33549 20210930.10 6538.65 9.10992e+07 1752.16 1.50868e+08 24704.4 1.84853e+07 4484.44 1.53396e+07 2751.75 1.00967e+07 2628.74 1.48153e+07 1767.93 8.2617e+06 1.4683e+06 3.58669e+07 2271.86
33556 20210930.12 6343.73 9.34585e+07 2026.17 1.61878e+08 23494 1.84853e+07 4529.14 1.56017e+07 2790.84 1.2456e+07 2680.92 1.37667e+07 1824.25 8.2617e+06 1.4531e+06 3.50085e+07 2256.22
33585 20211001.1 6790.56 9.16235e+07 1986.75 1.52965e+08 24322.5 1.7961e+07 4368 1.58639e+07 2780.5 1.4291e+07 2582.07 1.4291e+07 1824.07 8.2617e+06 1.47794e+06 3.54933e+07 2286.49

images

@achamayou
Copy link
Member

Looks reasonable to me.

@eddyashton eddyashton marked this pull request as ready for review October 1, 2021 12:26
@eddyashton eddyashton requested a review from a team October 1, 2021 12:26
@eddyashton eddyashton changed the title [DRAFT] Channels fix: Resolve lts_compatibility test failures Channels fix: Resolve lts_compatibility test failures Oct 1, 2021
@achamayou achamayou merged commit 0ecad5c into microsoft:main Oct 1, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants