Return errors when session consistency would be broken #4351

eddyashton · 2022-10-14T14:44:52Z

Resolves #3952.

This currently implements a softer variant of what was previously discussed. Rather than killing sessions, we return HTTP errors. Even after returning such an error, we will not kill the session - the user may ask us multiple times over the same session and get repeated errors, or even ask us a pure command (non-transactional and thus not inconsistent) and get a real response back.

I've added an end-to-end test that I think covers everything, but I'm considering a stochastic test for broader coverage: spammy clients confirming they see a consistently ratcheting TxID (or this new error), while we load the service/cause elections/kill nodes.
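The per-client check in such a stochastic test could be as small as the sketch below. It assumes TxIDs are rendered as `<view>.<seqno>` (as in the `5.16` examples later in this thread) and that "ratcheting" means neither component ever decreases within a session. `TxID`, `parse_txid` and `SessionObserver` are hypothetical names for illustration, not CCF APIs.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>

// Hypothetical TxID representation: "<view>.<seqno>", e.g. "5.16".
struct TxID
{
  uint64_t view;
  uint64_t seqno;
};

// Parses a non-empty run of decimal digits; rejects anything else.
std::optional<uint64_t> parse_u64(const std::string& s)
{
  // Up to 19 decimal digits always fit in a uint64_t
  if (s.empty() || s.size() > 19)
  {
    return std::nullopt;
  }
  uint64_t v = 0;
  for (const char c : s)
  {
    if (c < '0' || c > '9')
    {
      return std::nullopt;
    }
    v = v * 10 + static_cast<uint64_t>(c - '0');
  }
  return v;
}

// Parses "<view>.<seqno>"; returns nullopt on malformed input.
std::optional<TxID> parse_txid(const std::string& s)
{
  const auto dot = s.find('.');
  if (dot == std::string::npos)
  {
    return std::nullopt;
  }
  const auto view = parse_u64(s.substr(0, dot));
  const auto seqno = parse_u64(s.substr(dot + 1));
  if (!view.has_value() || !seqno.has_value())
  {
    return std::nullopt;
  }
  return TxID{*view, *seqno};
}

// Tracks the TxIDs seen on one session (assumed semantics: neither the
// view nor the seqno may go backwards between consecutive responses).
class SessionObserver
{
  std::optional<TxID> last;

public:
  // Returns false if `tx` is behind the previously observed TxID.
  bool observe(const TxID& tx)
  {
    if (last.has_value() && (tx.view < last->view || tx.seqno < last->seqno))
    {
      return false;
    }
    last = tx;
    return true;
  }
};
```

This is a necessary condition only: it would catch a TxID that moves backwards, but not a rolled-back transaction whose seqno was later reused in a higher view, so a real test would still need to treat the new error response as a valid outcome.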


ccf-bot · 2022-10-14T15:01:58Z

kill_session_on_consistency_loss@51658 aka 20221020.5 vs main ewma over 20 builds from 51225 to 51650


main

build_id	build_number	tpcc_virtual_cft^	ls_virtual_cft^	tpcc_sgx_cft^	tpcc_sgx_cft_mem	ls_jwt_virtual_cft^	ls_js_virtual_cft^	ls_full_js_virtual_cft^	ls_sgx_cft^	ls_sgx_cft_mem	ls_js_jwt_virtual_cft^	ls_jwt_sgx_cft^	ls_jwt_sgx_cft_mem	ls_js_sgx_cft^	ls_js_sgx_cft_mem	ls_full_js_sgx_cft^	ls_full_js_sgx_cft_mem	hist_sgx_cft^	ls_js_jwt_sgx_cft^	ls_js_jwt_sgx_cft_mem	RB put (/s)^	CHAMP put (/s)^	RB get (/s)^	CHAMP get (/s)^
51225	20221014.5	10867.1	39581.7	6372.16	8.32349e+07	10231	4666.19	3424.22	20080.7	1.66503e+07	3231.91	5542.74	1.56017e+07	2420.92	9.83456e+06	2035.03	9.31027e+06	29800.6	1867.36	9.31027e+06	917146	1.37633e+06	9.13466e+06	3.53707e+07
51268	20221014.24	10451.7	39975	6175.05	8.29727e+07	10156.6	4589.1	3598.76	19398.9	1.69124e+07	3219.98	5576.74	1.53396e+07	2422.77	9.57242e+06	2057.44	9.57242e+06	26259.1	1870.29	9.04813e+06	900135	1.40408e+06	9.39445e+06	3.65714e+07
51312	20221014.44	10700	40972.9	6144.6	8.32349e+07	10237.2	4689.8	3562.02	19380	1.69124e+07	3414.58	5583.26	1.53396e+07	2409.54	9.57242e+06	2026.31	9.31027e+06	25408.5	1865.88	9.04813e+06	891980	1.37652e+06	9.26688e+06	3.44781e+07
51323	20221017.3	11697.2	43984.1	6253.13	8.3497e+07	10061.3	4390.77	3444.48	19369.6	1.6126e+07	3280.7	5600.96	1.56017e+07	2434.56	9.57242e+06	2039.86	9.31027e+06	23812.4	1907.97	9.04813e+06	891786	1.36989e+06	9.20028e+06	3.58042e+07
51356	20221017.18	10476.1	42021.6	6294.69	8.29727e+07	10516.1	4260.13	3424.86	19384.8	1.6126e+07	3349.05	5564.24	1.56017e+07	2427.01	9.83456e+06	2032.63	9.57242e+06	28408.8	1879.5	9.31027e+06	902005	1.36871e+06	9.25445e+06	3.58042e+07
51369	20221017.23	10815.5	43886	6182.92	8.3497e+07	10496.8	4265.57	3591.34	20141.6	1.63882e+07	3287.98	5526.98	1.50774e+07	2428.79	9.57242e+06	2033.49	1.35046e+07	31812.7	1869.38	9.04813e+06	892565	1.36669e+06	9.15508e+06	3.48887e+07
51401	20221017.36	11574	42017.4	6401.49	8.32349e+07	10681.1	4325.12	3465.53	19869.8	1.6126e+07	3295.72	5662.52	1.53396e+07	2430.97	9.57242e+06	2063.81	9.31027e+06	24574.1	1835.87	9.31027e+06	903431	1.36769e+06	9.25436e+06	3.55556e+07
51413	20221018.3	11013.5	43526	6278.03	8.21863e+07	10297.9	4667.91	3685.52	19757.1	1.6126e+07	3264.67	5570.73	1.56017e+07	2430.11	1.00967e+07	2027.11	9.31027e+06	31329.2	1869.16	9.04813e+06	902039	1.34692e+06	9.26693e+06	3.57411e+07
51429	20221018.9	11110.8	40278	6224.69	8.29727e+07	10337.7	4266.39	3578.13	19543	1.63882e+07	3221.39	5628.46	1.58639e+07	2438.33	9.57242e+06	2036.01	9.31027e+06	26283.5	1849.43	9.31027e+06	907997	1.36879e+06	9.21269e+06	3.49488e+07
51448	20221018.17	10970.3	40387.1	6358.96	8.29727e+07	10141	4816.31	3583.18	20736.7	1.6126e+07	3365.91	6546.68	1.58639e+07	2491.75	9.83456e+06	2114.99	9.57242e+06	27143.6	2043.52	9.04813e+06	896039	1.36332e+06	9.17555e+06	3.58042e+07
51462	20221018.22	11528.7	40554.5	6364.65	8.21863e+07	10737.9	4313.46	3591.18	20872.7	1.63882e+07	3248.26	6477.21	1.53396e+07	2504.11	9.57242e+06	2088.99	9.31027e+06	27237.4	2108.59	9.04813e+06	905588	1.38284e+06	9.26269e+06	3.58669e+07
51484	20221018.31	11594.7	40594.8	6322.76	8.29727e+07	10515.9	4413.7	3471.44	20910.3	1.6126e+07	3375.31	6470.15	1.58639e+07	2541.74	1.00967e+07	2082.94	9.57242e+06	28596.7	1993.75	9.31027e+06	886076	1.34268e+06	9.00216e+06	3.5128e+07
51516	20221018.43	11003.7	41607.1	6379.93	8.37592e+07	10624.7	4282.55	3524.01	20709.1	1.71746e+07	3224.96	6410.82	1.53396e+07	2492.92	9.83456e+06	2081.91	9.57242e+06	23047.8	1992.92	9.04813e+06	885966	1.395e+06	9.17974e+06	3.58669e+07
51542	20221018.51	11060	40657.4	6396.41	8.3497e+07	10727.4	4326.02	3603.46	20985.3	1.6126e+07	3255.85	6446.03	1.58639e+07	2499.1	9.83456e+06	2154.7	9.57242e+06	25612.3	1998.32	9.04813e+06	881385	1.36551e+06	9.27532e+06	3.58663e+07
51552	20221018.55	8720.44	34911.7	6425.99	8.3497e+07	9261.82	4085.74	3347.54	20823.3	1.63882e+07	3197.54	6505.01	1.58639e+07	2485.32	1.00967e+07	2095.18	9.31027e+06	24649.6	2010.82	9.31027e+06	917639	1.37977e+06	9.27952e+06	3.63121e+07
51560	20221019.3	11252.2	37041	6359.85	8.3497e+07	10152.8	4368.85	3403.13	20749	1.63882e+07	3365.06	6418.94	1.58639e+07	2490.69	9.83456e+06	2105.81	9.31027e+06	28778.6	2036.1	9.04813e+06	908117	1.37624e+06	9.41168e+06	3.66362e+07
51568	20221019.7	10865.4	37837.3	5745.73	8.37592e+07	10050.4	4574.09	3426.37	17946.8	1.71746e+07	3507.21	6045.94	1.53396e+07	2356.59	9.83456e+06	1999.41	1.08831e+07	21129.7	1975.93	9.31027e+06	905101	1.37374e+06	9.29629e+06	3.54933e+07
51596	20221019.17	9898.13	36578.3	5697	8.29727e+07	9638.42	4265.05	3362.88	17157.3	1.69124e+07	3199.59	6161.39	1.56017e+07	2411.73	9.57242e+06	2017.64	9.31027e+06	23518.8	1931.15	9.31027e+06	911715	1.37145e+06	9.17148e+06	3.58669e+07
51626	20221019.27	11126.6	42087	5640.83	8.29727e+07	10128.4	4248.36	3590.06	17643	1.58639e+07	3264.35	6155.07	1.53396e+07	2345.98	9.57242e+06	1981.41	9.57242e+06	27871.7	1930.93	9.04813e+06	911634	1.36897e+06	9.06993e+06	3.51884e+07
51650	20221020.3	11472.4	41541.8	5786.4	8.27106e+07	10031.5	4301.53	3539.77	18023.8	1.58639e+07	3209.56	6135.44	1.56017e+07	2369.1	9.57242e+06	1991.09	9.57242e+06	28253.1	1934.73	9.04813e+06	905029	1.39168e+06	9.22934e+06	3.56174e+07

kill_session_on_consistency_loss

build_id	build_number	tpcc_virtual_cft^	ls_virtual_cft^	tpcc_sgx_cft^	tpcc_sgx_cft_mem	ls_jwt_virtual_cft^	ls_js_virtual_cft^	ls_full_js_virtual_cft^	ls_sgx_cft^	ls_sgx_cft_mem	ls_js_jwt_virtual_cft^	ls_jwt_sgx_cft^	ls_jwt_sgx_cft_mem	ls_js_sgx_cft^	ls_js_sgx_cft_mem	ls_full_js_sgx_cft^	ls_full_js_sgx_cft_mem	hist_sgx_cft^	ls_js_jwt_sgx_cft^	ls_js_jwt_sgx_cft_mem	RB put (/s)^	CHAMP put (/s)^	RB get (/s)^	CHAMP get (/s)^
51274	20221014.26	11260.7	43151.2	7117	8.37592e+07	10485.9	4423.25	3581.22	19402.8	1.6126e+07	3232.39	5643.84	1.56017e+07	2418.25	9.57242e+06	1958.97	9.31027e+06	32502	1870.25	9.04813e+06	907634	1.36624e+06	9.30905e+06	3.56174e+07
51333	20221017.7	10899.9	41282.1	6283.41	8.27106e+07	10516.5	4301.41	3425.97	19400.7	1.63882e+07	3195.27	5560.89	1.56017e+07	2475.64	9.57242e+06	1913.8	9.57242e+06	23431.3	1880.5	9.04813e+06	903791	1.36342e+06	9.25855e+06	3.58036e+07
51641	20221019.33	11263.5	41378.1	5598.25	8.29727e+07	10035	4384.74	3407.19	17380	1.63882e+07	3232.94	6043.72	1.56017e+07	2451.62	9.57242e+06	1993.54	1.27181e+07	23539.8	1914.53	9.31027e+06	813856	1.38809e+06	9.22103e+06	3.65714e+07
51658	20221020.5	11022.6	39407.3	5776.65	8.29727e+07	9935	4347.67	3439.07	17874.8	1.6126e+07	3237.11	6169.49	1.56017e+07	2355.4	9.57242e+06	2055.53	9.31027e+06	30578.5	1939.14	9.31027e+06	890468	1.37569e+06	9.21273e+06	3.56794e+07


achamayou · 2022-10-19T15:30:27Z

src/http/http_endpoint.h

+ auto header_begin = std::search(
+ response.begin(), response.end(), target.begin(), target.end());
+ auto header_name_end = std::find(header_begin, response.end(), ':');
+ auto header_value_end = std::find(header_name_end, response.end(), '\r');


That seems unfortunate at best, and possibly insecure at worst if it's possible to add http::headers::CCF_TX_ID inline as a header value, followed by a bogus value. We should at least look for "\r{}", http::headers::CCF_TX_ID, I think.

x-my-evil-header: x-ms-ccf-transaction-id: 9.99\r x-ms-ccf-transaction-id: 2.1
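One way to make the search stricter is sketched below: only accept the header name when it starts a line inside the header section and is immediately followed by a colon. This is an illustration under assumptions, not the code that was merged: `find_header_value` is a hypothetical helper, it anchors on `"\r\n"` rather than the bare `"\r"` suggested above, and it compares names case-sensitively for brevity even though HTTP header names are case-insensitive.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical helper: returns the value of header `name`, but only if the
// name appears at the start of a line within the header section.
std::optional<std::string> find_header_value(
  std::string_view response, std::string_view name)
{
  // Restrict the search to the header section (everything before the blank
  // line), so a matching string in the body is never considered.
  const auto headers = response.substr(0, response.find("\r\n\r\n"));

  // Anchor the needle on the preceding line break and the trailing colon,
  // so the name cannot match inside another header's value.
  std::string needle = "\r\n";
  needle.append(name);
  needle.push_back(':');

  const auto pos = headers.find(needle);
  if (pos == std::string_view::npos)
  {
    return std::nullopt;
  }

  const auto value_begin = pos + needle.size();
  auto value_end = headers.find('\r', value_begin);
  if (value_end == std::string_view::npos)
  {
    value_end = headers.size();
  }

  // Trim optional whitespace around the value
  auto value = headers.substr(value_begin, value_end - value_begin);
  while (!value.empty() && (value.front() == ' ' || value.front() == '\t'))
  {
    value.remove_prefix(1);
  }
  while (!value.empty() && (value.back() == ' ' || value.back() == '\t'))
  {
    value.remove_suffix(1);
  }
  return std::string(value);
}
```

With a response whose headers are `x-my-evil-header: x-ms-ccf-transaction-id: 9.99` followed by a genuine `x-ms-ccf-transaction-id: 2.1` line, this returns `2.1`, because the first occurrence is preceded by `": "` rather than a line break. It does not defend against header values that themselves contain a raw line break; that would have to be rejected when the headers are set.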

tests/partitions_test.py

achamayou · 2022-10-19T15:37:37Z

I'm a bit skeptical about the tradeoffs involved in going above and beyond when we know for sure there has been an election, to still attempt to accurately report status. It seems to me that immediately returning an error and shutting down the connection is cleaner to implement and to reason about in terms of availability.

A batching client cannot avoid having to implement a backtracking procedure, as far as I can tell anyway.

achamayou · 2022-10-19T15:40:52Z

Perhaps something we could do in that situation is respond with errors to all further requests, generically at first, and then with the last committed transaction id in the last term the session wrote to.

POST -> COMMITTED 5.16
POST -> PENDING 5.17
*** ELECTION ***
POST -> ERROR
POST -> ERROR
POST -> ERROR, LAST COMMIT WAS 5.16 - NOW ON 6.21
CLOSE CONNECTION
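The sequence above can be read as a small per-session state machine. The sketch below is one possible interpretation, assuming the node knows its current view and, once known, a TxID in the new view to report. `Session`, `respond` and the response strings are hypothetical and only mirror the example; they are not part of CCF.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>

struct TxID
{
  uint64_t view;
  uint64_t seqno;
};

// Hypothetical per-session bookkeeping, for illustration only.
struct Session
{
  std::optional<TxID> last_committed; // last TxID reported as COMMITTED
  uint64_t last_written_view = 0; // view of the session's most recent write
  bool closed = false;
};

std::string to_string(const TxID& tx)
{
  return std::to_string(tx.view) + "." + std::to_string(tx.seqno);
}

// Decides the response for the next request on `s`.
// `now` is the TxID to report in the new view, or nullopt while the
// rollback point is still unknown.
std::string respond(
  Session& s, uint64_t current_view, const std::optional<TxID>& now)
{
  if (s.closed)
  {
    return "CLOSED";
  }
  if (current_view == s.last_written_view)
  {
    // No election since the session last wrote: proceed as normal
    return "OK";
  }
  if (!now.has_value() || !s.last_committed.has_value())
  {
    // Election detected, but nothing precise to report yet
    return "ERROR";
  }
  // Report where the session stands, then close the connection
  s.closed = true;
  return "ERROR, LAST COMMIT WAS " + to_string(*s.last_committed) +
    " - NOW ON " + to_string(*now);
}
```

In this reading the session only closes once the detailed error has been delivered, which is the part that requires waiting for a further request from the client.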

achamayou · 2022-10-20T08:32:44Z

Summary of discussion with @heidihoward yesterday: the tradeoff for trying to do better than just dropping the connection is unclear, particularly in an environment where many client libraries use connection pools and will not expose sessions directly. It is unlikely that a client will write logic that correctly handles these responses.

eddyashton · 2022-10-20T09:43:00Z

> I'm a bit skeptical about the tradeoffs involved in going above and beyond when we know for sure there has been an election, to still attempt to accurately report status. It seems to me that immediately returning an error and shutting down the connection is cleaner to implement and to reason about in terms of availability.
>
> A batching client cannot avoid having to implement a backtracking procedure, as far as I can tell anyway.

> Summary of discussion with @heidihoward yesterday: the tradeoff for trying to do better than just dropping the connection is unclear, particularly in an environment where many client libraries use connection pools and will not expose sessions directly. It is unlikely that a client will write logic that correctly handles these responses.

My view is that this approach was easier to implement than closing the session (application code returning an application error, rather than affecting session lifetimes), that it makes little difference to clients which don't handle this behaviour (is your session closed aggressively or do you get permanent errors? Could you or do you want to handle one of these under the hood?), and that it's much nicer in the case where there's a human-in-the-loop request flow (harmless elections are invisible and allow you to proceed, you get a readable error if your state was lost that helps you backtrack).

EDIT - On reflection, there are really two changes here from what we initially envisaged. A is whether this is "an election affects all active sessions", or "per-request, we look at the reported TxIDs" (with a middle ground of "after an election, we do FOO for every new request on previously active sessions"). B is whether we close the connection immediately, send an error response and then close the connection (which requires waiting for the next request), or keep the session open but continue to return errors.

achamayou · 2022-10-20T09:51:30Z

@eddyashton in the case of an automatic client with a connection pool, keeping the connection open but returning errors for every subsequent request is clearly suboptimal, even if it's for a while until we can provide information about where the rollback took place. It results in a much higher rate of failure going forward than closing the connection.

eddyashton · 2022-10-20T15:10:21Z

Parking this PR for now, and will try a simpler approach of aggressively killing sessions on election.

eddyashton added 9 commits October 13, 2022 10:48

Initial implementation

ab0134d

Sketching a test

be9073e

Insert a way to record txid in forwarded responses

d791e33

An expensive off-by-one error

1db2911

Add helper method to resize network

05ba86b

A complete test

555a8b4

Cleanup and tweak comment

f4f7972

doot

c7edf25

doot

4f59d36

eddyashton requested a review from a team as a code owner October 14, 2022 14:44

Merge branch 'main' of github.com:microsoft/CCF into kill_session_on_…

3b4e320

…consistency_loss

eddyashton and others added 3 commits October 17, 2022 08:36

Oops

82c6027

Merge branch 'main' of github.com:microsoft/CCF into kill_session_on_…

030f48c

…consistency_loss

Merge branch 'main' into kill_session_on_consistency_loss

a49f29e

achamayou reviewed Oct 19, 2022

tests/partitions_test.py (outdated)

achamayou reviewed Oct 19, 2022

tests/partitions_test.py (outdated)

eddyashton added 2 commits October 20, 2022 08:08

Stricter search for txID header

b160dcd

Restore disabled tests

cee91ef

eddyashton closed this Oct 20, 2022
