Adjust ringbuffer sizes #5866

Merged
merged 8 commits into from
Dec 13, 2023

@achamayou achamayou commented Dec 12, 2023

Resolves #5672. A fourfold increase is in line with the change previously made in 5ecf0a1.

As expected, there is not a substantial impact on performance, but this will allow bigger messages.
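As a rough illustration of why larger limits reduce fragmentation, the sketch below counts how many fragments a message needs for a given maximum fragment size. The sizes used are placeholders chosen for the arithmetic, not CCF's actual defaults, and `fragments_needed` is a hypothetical helper rather than anything in the codebase.

```python
def fragments_needed(message_size: int, max_fragment_size: int) -> int:
    """Return how many fragments are needed to carry message_size bytes."""
    if message_size < 0:
        raise ValueError("message_size must be non-negative")
    if max_fragment_size <= 0:
        raise ValueError("max_fragment_size must be positive")
    # Integer ceiling division; even an empty message occupies one fragment.
    return max(1, -(-message_size // max_fragment_size))


# Placeholder sizes (assumptions for illustration, not CCF defaults).
OLD_FRAGMENT_SIZE = 64 * 1024
NEW_FRAGMENT_SIZE = 4 * OLD_FRAGMENT_SIZE  # the fourfold increase
MESSAGE_SIZE = 1_000_000

old_count = fragments_needed(MESSAGE_SIZE, OLD_FRAGMENT_SIZE)
new_count = fragments_needed(MESSAGE_SIZE, NEW_FRAGMENT_SIZE)
print(old_count, new_count)
```

With these placeholder numbers the same message is split into a quarter as many fragments after the increase.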

The js batched stress test fails most of the time on debug builds. Superficially, this is because the assert for the node crashing is not triggered, but in fact an earlier election causes the test to wait on a node that is by then a backup and is not expected to crash. The election itself is triggered by very long I/O pauses, such as:

2023-12-13T13:02:10.058005Z        100 [info ] ../src/host/time_bound_logger.h:54   | Operation took too long (  1.866s): Reading ledger entries from 30 to 30
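To spot such pauses in a node log, something like the following could be used. It is a minimal sketch that assumes the message format matches the single line quoted above; the regular expression and the `parse_slow_operation` helper are illustrative and not part of CCF's tooling.

```python
import re
from typing import Optional, Tuple

# Pattern inferred from the one log line quoted above (an assumption).
SLOW_OPERATION = re.compile(
    r"Operation took too long \(\s*(?P<seconds>\d+(?:\.\d+)?)s\): (?P<what>.*)$"
)


def parse_slow_operation(line: str) -> Optional[Tuple[float, str]]:
    """Return (duration in seconds, description), or None if the line does not match."""
    match = SLOW_OPERATION.search(line)
    if match is None:
        return None
    return float(match.group("seconds")), match.group("what")


LOG_LINE = (
    "2023-12-13T13:02:10.058005Z        100 [info ] "
    "../src/host/time_bound_logger.h:54   | "
    "Operation took too long (  1.866s): Reading ledger entries from 30 to 30"
)

print(parse_slow_operation(LOG_LINE))
```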

I suspect this is limited to debug builds because of the very large amount of logging, which probably causes IOPS throttling. But I am not sure why the larger RB/Fragment/Msg sizes and the reduction in fragmentation have apparently worsened this.

Edit: what happens is that instead of crashing very quickly, the primary gets further in writing out large transactions, producing a bigger ledger. The writes are eventually so slow that they time out, causing an election. At that point we try to submit an even bigger batch on a backup, crashing it as expected.

But then our logic expects the first primary (nodes[0]) to have crashed, which has not happened and will not happen.
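One way a test could avoid depending on a fixed index is to check which node actually stopped, rather than assuming it is `nodes[0]`. The sketch below uses a hypothetical `Node` type with a `running` flag; it does not use the real test infrastructure API and only illustrates the idea.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    """Hypothetical stand-in for a test-infrastructure node handle."""
    node_id: int
    running: bool = True


def crashed_nodes(nodes: List[Node]) -> List[Node]:
    """Return every node that is no longer running."""
    return [node for node in nodes if not node.running]


def expect_single_crash(nodes: List[Node]) -> Node:
    """Assert that exactly one node crashed, whichever it is, and return it."""
    crashed = crashed_nodes(nodes)
    assert len(crashed) == 1, f"expected exactly one crashed node, found {len(crashed)}"
    return crashed[0]


# After an election, the node that received the oversized batch may be a former backup.
nodes = [Node(0), Node(1, running=False), Node(2)]
print(expect_single_crash(nodes).node_id)
```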

ghost commented Dec 12, 2023

rb_adjustments@79445 aka 20231213.11 vs main ewma over 20 builds from 79105 to 79434

main

build_id build_number tlc_3node_fixed_duration_s tlc_3node_fixed_states tlc_atomic_reconfig_duration_s tlc_atomic_reconfig_states tlc_reconfig_duration_s tlc_reconfig_states tlc_sim_traces tlc_sim_levelmean Commit latency factor tpcc_virtual_cft^ ls_virtual_cft^ pi_ls_virtual_cft^ pi_basic_virtual_cft^ pi_basic_js_virtual_cft^ ls_jwt_virtual_cft^ pi_ls_jwt_virtual_cft^ ls_js_virtual_cft^ pi_basic_mt_virtual_cft^ ls_full_js_virtual_cft^ ls_js_jwt_virtual_cft^ hist_sgx_cft^ tpcc_sgx_cft^ tpcc_sgx_cft_mem pi_basic_mt_sgx_cft^ pi_basic_mt_sgx_cft_mem ls_sgx_cft^ ls_sgx_cft_mem pi_ls_sgx_cft^ pi_ls_sgx_cft_mem pi_basic_sgx_cft^ pi_basic_sgx_cft_mem pi_basic_js_sgx_cft^ pi_basic_js_sgx_cft_mem ls_jwt_sgx_cft^ ls_jwt_sgx_cft_mem pi_ls_jwt_sgx_cft^ pi_ls_jwt_sgx_cft_mem ls_js_sgx_cft^ ls_js_sgx_cft_mem ls_full_js_sgx_cft^ ls_full_js_sgx_cft_mem ls_js_jwt_sgx_cft^ ls_js_jwt_sgx_cft_mem RB put (/s)^ CHAMP put (/s)^ RB get (/s)^ CHAMP get (/s)^
79105 20231204.2 6 86496 428 1.2541e+07 239 6.31473e+06 2385 403 0.815873 17179.5 45756.3 47733.8 54949.6 4389.4 17055.4 19057.1 17280.4 73754 14880.9 10265.5 46153.5 5602.98 8.59996e+07 27999.5 2.30851e+07 14008.7 1.88908e+07 14125.9 1.05021e+07 15572.3 1.25993e+07 1437 1.05021e+07 6851.44 1.88908e+07 6973.1 6.30784e+06 5750.64 1.67936e+07 5733.89 1.88908e+07 3965.95 1.67936e+07 840283 1.17731e+06 8.13547e+06 3.07046e+07
79118 20231204.7 6 86496 433 1.2541e+07 239 6.31473e+06 2246 403 0.84316 17330.1 45759.3 47756.9 54515.7 4374.4 17110.2 19464.3 14917.2 75841.7 14816.1 10346.3 46677 5635.13 8.80968e+07 27945 2.30851e+07 14068.5 1.88908e+07 14171.9 1.05021e+07 15653.9 1.46964e+07 1432.9 1.25993e+07 6802.32 1.67936e+07 6974.7 6.30784e+06 5792.64 1.67936e+07 5741.3 1.67936e+07 3997.22 1.67936e+07 825189 1.18013e+06 8.15465e+06 3.09674e+07
79134 20231204.11 6 86496 419 1.2541e+07 238 6.31473e+06 2306 403 0.815447 17141.9 45772 48406.7 55596 4381.4 17174.3 19191.6 17496.7 78037.6 14891.7 9761.62 41803.1 5641.99 8.59996e+07 28102.5 2.51822e+07 14020.8 1.88908e+07 14079.7 1.05021e+07 15626.3 1.46964e+07 1427.2 1.25993e+07 7228.64 1.67936e+07 6997.8 6.30784e+06 5768.48 1.67936e+07 5727.26 1.67936e+07 3996.59 1.67936e+07 828515 1.17548e+06 8.154e+06 3.14453e+07
79141 20231204.13 7 86496 437 1.2541e+07 236 6.31473e+06 2347 403 0.803932 17326.7 45803.4 48363.9 54732.4 4380.2 17105.9 19712.6 17469.6 68327.2 14760 9752.86 42083.3 5596.8 8.59996e+07 27783.3 2.51822e+07 14021.8 1.88908e+07 14166.7 1.05021e+07 15588.9 1.46964e+07 1424.7 1.25993e+07 6827.57 1.67936e+07 6930.3 6.30784e+06 5758.06 1.67936e+07 5444.11 1.67936e+07 3958.22 1.67936e+07 837008 1.16971e+06 8.15053e+06 3.08527e+07
79155 20231205.3 7 86496 445 1.2541e+07 245 6.31473e+06 2148 403 0.815693 17317.4 45707.7 48471.2 55405.6 4408 17078.1 19050.9 17418.2 62066.6 14930.7 10292.2 43564.2 5640.34 8.59996e+07 28082.1 2.51822e+07 14015.6 1.88908e+07 14121.4 1.05021e+07 15576.6 1.25993e+07 1421.7 1.25993e+07 7225.51 1.67936e+07 6929.8 6.30784e+06 5803.52 1.67936e+07 5488.73 1.67936e+07 3996.47 1.67936e+07 828148 1.18265e+06 8.14427e+06 3.06798e+07
79174 20231205.9 7 86496 427 1.2541e+07 247 6.31473e+06 2348 403 0.832702 17291.4 45646.2 48780.5 54658 4392.8 17097.2 19587.1 17435.5 59992.5 14830.4 10290.4 42419.1 5588.32 8.59996e+07 28158.4 2.30851e+07 14034.2 1.88908e+07 14128.5 1.05021e+07 15569 1.25993e+07 1437.6 1.25993e+07 7243.3 1.67936e+07 6934 6.30784e+06 5805.18 1.67936e+07 5487.13 1.67936e+07 3984.99 1.67936e+07 818484 1.18115e+06 8.15014e+06 3.08411e+07
79191 20231206.3 6 86496 425 1.2541e+07 232 6.31473e+06 2348 403 0.807565 17240.9 45694.4 47829.7 55088.2 4390.5 17224.7 19225.6 17518.6 80506.2 16586.7 9748.83 44489 5586.2 8.59996e+07 27937.7 2.51822e+07 13984 1.88908e+07 14096.6 1.05021e+07 15550.9 1.46964e+07 1433.3 1.05021e+07 6873.92 1.88908e+07 6934.2 6.30784e+06 5757.97 1.67936e+07 5477.84 1.67936e+07 3988.75 1.67936e+07 811835 1.17162e+06 8.15462e+06 3.06807e+07
79212 20231206.9 6 86496 422 1.2541e+07 237 6.31473e+06 2341 403 0.787436 17165.4 45737.6 43246.8 55022.1 4400.7 17059.5 19190.5 17109.9 76936.9 14889.3 9772.13 45135.1 5550.7 8.59996e+07 27757.3 2.30851e+07 13991.9 1.88908e+07 14088.3 1.05021e+07 15563.8 1.46964e+07 1431.7 1.25993e+07 6829.75 1.67936e+07 7031.6 6.30784e+06 5794.64 1.67936e+07 5485.33 1.67936e+07 3993.8 1.67936e+07 836211 1.18495e+06 8.16066e+06 3.20255e+07
79229 20231207.3 6 86496 427 1.2541e+07 240 6.31473e+06 2349 403 0.776247 17255.2 45542.6 48633.4 53726.7 4386.5 17128.4 19449.2 17242 75875.6 14480.2 10294.2 43978.5 5576.6 8.59996e+07 27926.5 2.30851e+07 13955.9 1.88908e+07 14087.4 1.05021e+07 15602.3 1.46964e+07 1420.3 1.25993e+07 7247.5 1.67936e+07 7153.1 6.30784e+06 5768.83 1.67936e+07 5480.6 1.67936e+07 3988.57 1.67936e+07 807709 1.17338e+06 8.14541e+06 3.0748e+07
79234 20231207.5 6 86496 421 1.2541e+07 235 6.31473e+06 2368 403 0.787179 17330.6 45876.2 49246.5 53613.6 4405.8 17443.4 19927.1 17334.1 83093.4 14833.1 9850.31 42217.5 5559.41 8.59996e+07 27907.2 2.30851e+07 14029 1.88908e+07 14143.7 1.05021e+07 15552.2 1.46964e+07 1426.4 1.25993e+07 6881.89 1.67936e+07 6972.6 6.30784e+06 5798.24 1.67936e+07 5492.31 1.88908e+07 3987.77 1.67936e+07 834683 1.17911e+06 8.13234e+06 3.12109e+07
79265 20231207.15 6 86496 418 1.2541e+07 232 6.31473e+06 2369 403 0.787657 17117.4 45634.9 49763.9 53593.9 4364.3 17149.3 19160.8 17062.5 83400.4 15007.2 9901.93 45208.2 5571.27 8.59996e+07 28128.3 2.51822e+07 14013.5 1.88908e+07 14038.2 1.05021e+07 15575 1.46964e+07 1429.5 1.25993e+07 7214.04 1.67936e+07 6968.4 6.30784e+06 5796.77 1.67936e+07 5482.18 1.67936e+07 3997.21 1.67936e+07 844021 1.18032e+06 8.15491e+06 3.079e+07
79268 20231208.2 6 86496 436 1.2541e+07 232 6.31473e+06 2275 403 0.785068 17335.8 45512 48794.1 53682.9 4409.2 17151.7 19765.2 17409.1 88746.9 14956.2 9763.8 44756.5 5622.4 8.59996e+07 28108.2 2.51822e+07 14132.8 1.67936e+07 14180.7 1.05021e+07 15657.7 1.46964e+07 1434 1.25993e+07 6845.5 1.67936e+07 6976.9 6.30784e+06 5785.97 1.67936e+07 5729.53 1.67936e+07 3991.09 1.67936e+07 834932 1.18108e+06 8.15261e+06 3.08053e+07
79292 20231208.9 6 86496 427 1.2541e+07 237 6.31473e+06 2408 403 0.797384 17338.6 45729.5 48178.4 54133 4443 17002.7 19108 17651.6 80975.3 14964 10050.8 35885.4 5531.68 8.59996e+07 27684.3 2.51822e+07 14006.4 1.88908e+07 14091.2 1.05021e+07 15504.4 1.46964e+07 1436.4 1.25993e+07 6801.38 1.67936e+07 7135.7 6.30784e+06 5797.37 1.67936e+07 5478.95 1.67936e+07 3957.99 1.67936e+07 819873 1.17941e+06 8.13551e+06 3.12467e+07
79308 20231208.12 7 86496 420 1.2541e+07 235 6.31473e+06 2391 403 0.801386 17436.2 45781 49272.7 54718.5 4426.2 17304.6 19277.9 17480.2 78390.1 15002.8 9762.56 45705.6 5607.69 8.59996e+07 28212 2.30851e+07 14051.2 1.67936e+07 14091 1.05021e+07 15501.9 1.46964e+07 1434 1.25993e+07 7256.97 1.67936e+07 7079.1 6.30784e+06 5800.38 1.67936e+07 5451.23 1.67936e+07 3992.9 1.67936e+07 834092 1.18295e+06 8.14395e+06 3.07526e+07
79317 20231208.15 6 86496 420 1.2541e+07 241 6.31473e+06 2262 403 0.808619 17307.4 53155.6 56932.4 61359.2 4636.8 20997.1 21096.2 17801.8 79114.7 17657.1 11585.4 41880.6 5551.82 8.59996e+07 27807.3 2.30851e+07 14015.7 1.88908e+07 14114.4 1.05021e+07 15543.7 1.46964e+07 1435.1 1.25993e+07 7188.29 1.67936e+07 6913.9 6.30784e+06 5771.41 1.67936e+07 5421.15 1.67936e+07 3989.04 1.67936e+07 839369 1.18189e+06 8.15225e+06 3.04359e+07
79332 20231211.2 7 86496 432 1.2541e+07 231 6.31473e+06 2304 403 0.823766 17433.7 52927 56003.9 61022.8 4599.2 21051.1 21582 17445.1 76565.5 17594.9 11551.9 45474 5612.64 8.59996e+07 27810 2.30851e+07 14003.1 1.88908e+07 14079.5 1.05021e+07 15466.3 1.25993e+07 1435.6 1.25993e+07 6811.41 1.67936e+07 6925.1 6.30784e+06 5768.4 1.67936e+07 5468.54 1.67936e+07 3978.67 1.67936e+07 839138 1.18428e+06 8.08027e+06 3.07757e+07
79357 20231212.4 6 86496 432 1.2541e+07 242 6.31473e+06 2399 403 0.797083 17429.2 52984.5 56319 61280.2 4532.6 20801.4 22012.4 17643 90921.5 17170 11561.7 45359.7 5638.77 8.59996e+07 27519.2 2.51822e+07 14115.2 1.67936e+07 14088 1.05021e+07 15713.4 1.25993e+07 1435.1 1.05021e+07 6838.13 1.67936e+07 6976.5 6.30784e+06 5804.09 1.67936e+07 5497.14 1.88908e+07 3997.04 1.67936e+07 834796 1.18343e+06 8.15332e+06 3.14801e+07
79380 20231212.12 6 86496 430 1.2541e+07 235 6.31473e+06 2276 403 0.773242 17262.8 53208.5 55601.6 61170.8 4550.4 20701.5 21695.3 17743.3 77796 17512.3 11736.7 45417.5 5592.9 8.59996e+07 27820.3 2.51822e+07 14015.4 1.88908e+07 14143.2 1.05021e+07 15551.8 1.46964e+07 1430.7 1.25993e+07 6835.22 1.67936e+07 7038.9 6.30784e+06 5795.89 1.67936e+07 5455.38 1.67936e+07 3995.96 1.67936e+07 840421 1.1803e+06 8.14742e+06 3.0842e+07
79417 20231213.3 7 86496 442 1.2541e+07 238 6.31473e+06 2231 403 0.799203 17331.5 54074.6 55998.3 60854.9 4549.1 20830.3 21405.9 17583.1 89315.5 17498.7 11795.5 40941.2 5526.57 8.59996e+07 27977.8 2.51822e+07 13978 1.88908e+07 14050.6 1.05021e+07 15412.9 1.46964e+07 1422.3 1.25993e+07 7246.14 1.88908e+07 6986.5 6.30784e+06 5779.44 1.67936e+07 5496.45 1.67936e+07 3973.3 1.67936e+07 829829 1.18044e+06 8.14858e+06 3.06005e+07
79434 20231213.8 6 86496 421 1.2541e+07 237 6.31473e+06 2233 403 0.816491 17275 53213.7 57513.6 61723.7 4644.4 20960 22217.8 17470.1 92614.6 17491.8 11851.5 45577.4 5530.83 8.59996e+07 27808 2.51822e+07 13984.3 1.88908e+07 14104.9 1.05021e+07 15480.1 1.25993e+07 1431.7 1.25993e+07 7211.96 1.67936e+07 6886.8 6.30784e+06 5788.56 1.67936e+07 5440.74 1.67936e+07 3983.96 1.67936e+07 827161 1.18375e+06 8.14508e+06 3.08313e+07

rb_adjustments

build_id build_number pi_basic_mt_sgx_cft^ pi_basic_mt_sgx_cft_mem Commit latency factor tpcc_sgx_cft^ tpcc_sgx_cft_mem pi_basic_mt_virtual_cft^ tpcc_virtual_cft^ ls_sgx_cft^ ls_sgx_cft_mem pi_ls_sgx_cft^ pi_ls_sgx_cft_mem pi_basic_sgx_cft^ pi_basic_sgx_cft_mem ls_virtual_cft^ pi_ls_virtual_cft^ pi_basic_virtual_cft^ pi_basic_js_virtual_cft^ tlc_3node_fixed_duration_s tlc_3node_fixed_states tlc_atomic_reconfig_duration_s tlc_atomic_reconfig_states tlc_reconfig_duration_s tlc_reconfig_states ls_jwt_virtual_cft^ pi_ls_jwt_virtual_cft^ pi_basic_js_sgx_cft^ pi_basic_js_sgx_cft_mem ls_js_virtual_cft^ ls_jwt_sgx_cft^ ls_jwt_sgx_cft_mem ls_full_js_virtual_cft^ pi_ls_jwt_sgx_cft^ pi_ls_jwt_sgx_cft_mem ls_js_jwt_virtual_cft^ ls_js_sgx_cft^ ls_js_sgx_cft_mem ls_full_js_sgx_cft^ ls_full_js_sgx_cft_mem ls_js_jwt_sgx_cft^ ls_js_jwt_sgx_cft_mem hist_sgx_cft^ RB put (/s)^ CHAMP put (/s)^ RB get (/s)^ CHAMP get (/s)^ tlc_sim_traces tlc_sim_levelmean
79401 20231212.19 28101.1 2.51822e+07 0.81927 5623.13 8.59996e+07 89314.1 17391.1 14074.4 1.88908e+07 14177.4 1.05021e+07 15720.5 1.25993e+07 52800.4 57220.3 61516.1 4453.2 6 86496 425 1.2541e+07 239 6.31473e+06 21164.5 17938.8 1427.3 1.25993e+07 17648.6 7237.41 1.88908e+07 17493.6 7035.6 6.30784e+06 11675.2 5749.97 1.67936e+07 5479.83 1.67936e+07 4002.9 1.67936e+07 47942 838171 1.18105e+06 8.14797e+06 3.06669e+07 2199 403
79407 20231212.21 28004 2.51822e+07 0.825778 5542.23 8.59996e+07 98853.4 17380.5 14046.9 1.88908e+07 14082.7 1.05021e+07 15530.2 1.25993e+07 53100 55073.7 61133.1 4564.8 6 86496 438 1.2541e+07 244 6.31473e+06 20936.7 21483.7 1433.5 1.25993e+07 17320.6 6829.33 1.88908e+07 17457 6987.7 6.30784e+06 11793.4 5750.5 1.67936e+07 5483.58 1.67936e+07 3801.42 1.67936e+07 45259.5 823049 1.18387e+06 8.1529e+06 3.09141e+07 2357 403
79432 20231213.7 27970.3 2.30851e+07 0.826558 5576.32 8.59996e+07 69346.9 17365 14041.3 1.88908e+07 14143.8 1.05021e+07 15622 1.25993e+07 52785.9 48070.9 60522.5 4562.9 6 86496 431 1.2541e+07 244 6.31473e+06 21284.6 21533.2 1425.6 1.25993e+07 17510.9 7244.35 1.67936e+07 17579.1 7040.3 6.30784e+06 11870.9 5766.58 1.67936e+07 5449.09 1.67936e+07 3989.65 1.67936e+07 42200.5 832052 1.17715e+06 8.15303e+06 3.05736e+07 2401 403
79439 20231213.9 27848.9 2.30851e+07 0.81029 5618.82 8.59996e+07 92036.8 17303.6 14081.4 1.88908e+07 14183.5 1.05021e+07 15724 1.25993e+07 53122.5 57217.7 60780.9 4600.2 6 86496 441 1.2541e+07 240 6.31473e+06 21195.9 22116.1 1435 1.25993e+07 17202.6 7197.35 1.67936e+07 17454 7130.1 6.30784e+06 11800.3 5752.07 1.67936e+07 5488.37 1.88908e+07 3983.84 1.67936e+07 43527.9 839602 1.18161e+06 8.1304e+06 3.07148e+07 2174 403
79445 20231213.11 27787.6 2.30851e+07 0.813949 5615.6 8.59996e+07 76057.1 17358.9 14067.9 1.88908e+07 14195.2 1.05021e+07 15719.2 1.25993e+07 53104.6 57141.9 61049.1 4624.8 6 86496 429 1.2541e+07 239 6.31473e+06 20914.3 21153.7 1431 1.25993e+07 17817.2 6887.07 1.67936e+07 17380.8 7053.4 6.30784e+06 11633.8 5768.86 1.67936e+07 5486.14 1.88908e+07 3976.97 1.67936e+07 45910.5 842935 1.18045e+06 8.13499e+06 3.15528e+07 2376 403
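
The comment compares the branch build against an exponentially weighted moving average (EWMA) of the last 20 builds on main. The bot's exact formula is not given here, so the sketch below is only an assumption: it seeds the average with the oldest value and uses an arbitrary smoothing factor `alpha`. The input values are the "Commit latency factor" column from the main table and the 79445 row of the rb_adjustments table.

```python
from typing import Sequence


def ewma(values: Sequence[float], alpha: float = 0.3) -> float:
    """Exponentially weighted moving average, oldest value first.

    alpha is an assumed smoothing factor; higher values weight recent builds more.
    """
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = values[0]
    for value in values[1:]:
        smoothed = alpha * value + (1 - alpha) * smoothed
    return smoothed


# "Commit latency factor" on main, builds 79105 to 79434 (oldest first).
main_commit_latency = [
    0.815873, 0.84316, 0.815447, 0.803932, 0.815693,
    0.832702, 0.807565, 0.787436, 0.776247, 0.787179,
    0.787657, 0.785068, 0.797384, 0.801386, 0.808619,
    0.823766, 0.797083, 0.773242, 0.799203, 0.816491,
]
# Same metric for rb_adjustments@79445.
branch_commit_latency = 0.813949

baseline = ewma(main_commit_latency)
relative_change = (branch_commit_latency - baseline) / baseline
print(f"baseline={baseline:.4f} change={relative_change:+.2%}")
```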

@achamayou achamayou marked this pull request as ready for review December 13, 2023 15:28
@achamayou achamayou requested a review from a team December 13, 2023 15:28
@achamayou achamayou merged commit 085a22c into microsoft:main Dec 13, 2023
22 checks passed
maxtropets added a commit to maxtropets/CCF that referenced this pull request Jul 17, 2024
The default was increased in microsoft#5866, which made the destruction test
fail more often because of an unexpected destruction flow. The test was
intended to fail on SGX due to out-of-memory and on virtual due to
max ringbuffer message size overflow.

After increasing the message limit, the primary becomes too slow (in
debug builds, at least), so an election happens and disrupts the
network, and the test can fail on a missing primary or other states
it doesn't expect.

We mitigate the issue here by rolling back the message limit, but the proper
fix might be to rewrite the test to make it more robust.
maxtropets added a commit to maxtropets/CCF that referenced this pull request Jul 18, 2024
The default was increased in microsoft#5866, which made the destruction test
fail more often because of an unexpected destruction flow. The test was
intended to fail on SGX due to out-of-memory and on virtual due to
max ringbuffer message size overflow.

After increasing the message limit, the primary becomes too slow (in
debug builds, at least), so an election happens and disrupts the
network, and the test can fail on a missing primary or other states
it doesn't expect.

We mitigate the issue here by rolling back the message limit, but the proper
fix might be to rewrite the test to make it more robust.
Successfully merging this pull request may close these issues.

Revisit default ringbuffer limits