
With light simulated load, validators have difficulty creating blocks in time #5507

Open
arirubinstein opened this issue Jun 3, 2022 · 7 comments

Describe the bug

With a light simulated load (10 ag-solos making AMM trades), many validators have trouble processing the data and participating in consensus, needing multiple timeouts and rounds to create a block. It's possible that the default tendermint round timeouts need to be increased to accommodate the expected load from swingset, as it appears the default limits are being reached, resulting in base + delta × round timeout increases. Alternatively, it may be necessary to do less work per block on the swingset side in order to beat the timeout.
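
For reference, tendermint's per-round timeouts grow linearly with the round number. A minimal sketch of that growth (assuming the stock config.toml defaults of timeout_propose = 3s and timeout_propose_delta = 500ms; the actual values on this network may differ):

```js
// Illustration of tendermint's escalating round timeouts; the base and
// delta values are assumed defaults, not measured from this network.
const baseMs = 3000; // timeout_propose
const deltaMs = 500; // timeout_propose_delta
for (let round = 0; round < 5; round += 1) {
  // Each failed round stretches the next proposal window: base + delta * round
  console.log(`round ${round}: proposal timeout ${baseMs + deltaMs * round}ms`);
}
```

Each extra round only widens the window by half a second, so a chain that needs many rounds per block is losing far more time to the failed rounds themselves than to the growing timeout.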

To Reproduce

Steps to reproduce the behavior:

  1. go to https://devnet.explorer.agoric.net/agoric/uptime
  2. observe the large number of missed blocks on all validators

Expected behavior

With nominal load, validators should have no issues voting on blocks.

[Screenshot, 2022-06-03: devnet explorer uptime page showing missed blocks across validators]

@arirubinstein arirubinstein added bug Something isn't working cosmic-swingset package: cosmic-swingset labels Jun 3, 2022
@Tartuffo Tartuffo added the SwingSet package: SwingSet label Jun 8, 2022

warner commented Jun 20, 2022

We looked at a run from the last weekend, with the #5152 timestamps in the slog. In one particular block, two (of the five) validators needed a massive 1000s to complete the block. Looking into the slogfiles, we found a 1000s+ delay in two different deliveries (validator 1 had this delay in delivery N, validator 2 had it in a different delivery M). In one of these two deliveries, the delay occurred after the slogfile recorded deliver-result arriving at the kernel, and before it recorded crank-finish:

{"type":"syscall-result","crankNum":1877999,"vatID":"v14","deliveryNum":79364,"syscallNum":30,"replay":false,"ksr":["ok",null],"vsr":["ok",null],"time":1655696811.177324,"monotime":184524.70294137407}
{"type":"clist","crankNum":1877999,"mode":"export","vatID":"v14","kobj":"ko53942","vobj":"o+10/12786","time":1655696811.178376,"monotime":184524.70399383403}
{"type":"clist","crankNum":1877999,"mode":"drop","vatID":"v14","kobj":"kp239132","vobj":"p-28033","time":1655696811.178857,"monotime":184524.70447407317}
{"type":"syscall","crankNum":1877999,"vatID":"v14","deliveryNum":79364,"syscallNum":31,"replay":false,"ksc":["resolve","v14",[["kp239132",false,{"body":"{\"@qclass\":\"slot\",\"iface\":\"Alleged: BLD payment\",\"index\":0}","slots":["ko53942"]}]]],"vsc":["resolve",[["p-28033",false,{"body":"{\"@qclass\":\"slot\",\"iface\":\"Alleged: BLD payment\",\"index\":0}","slots":["o+10/12786"]}]]],"time":1655696811.17927,"monotime":184524.70488823295}
{"type":"syscall-result","crankNum":1877999,"vatID":"v14","deliveryNum":79364,"syscallNum":31,"replay":false,"ksr":["ok",null],"vsr":["ok",null],"time":1655696811.180703,"monotime":184524.70632053496}
{"type":"deliver-result","crankNum":1877999,"vatID":"v14","deliveryNum":79364,"replay":false,"dr":["ok",null,{"meterType":"xs-meter-14","currentHeapCount":170287,"compute":464821,"allocate":46268448,"timestamps":[1655696811.133889,1655696811.134471,1655696811.135465,1655696811.138214,1655696811.139035,1655696811.13971,1655696811.141204,1655696811.145195,1655696811.146256,1655696811.146424,1655696811.147452,1655696811.147547,1655696811.148417,1655696811.148527,1655696811.153673,1655696811.153829,1655696811.154585,1655696811.154723,1655696811.15551,1655696811.155651,1655696811.156327,1655696811.156477,1655696811.157194,1655696811.157275,1655696811.158055,1655696811.158148,1655696811.15889,1655696811.159212,1655696811.160024,1655696811.160486,1655696811.161386,1655696811.161589,1655696811.162284,1655696811.162383,1655696811.163111,1655696811.163263,1655696811.164098,1655696811.164257,1655696811.164993,1655696811.165132,1655696811.165775,1655696811.165872,1655696811.166512,1655696811.166636,1655696811.167436,1655696811.167829,1655696811.168731,1655696811.169383,1655696811.170305,1655696811.170418,1655696811.171149,1655696811.171242,1655696811.171991,1655696811.172174,1655696811.172982,1655696811.173122,1655696811.173919,1655696811.174091,1655696811.175057,1655696811.175525,1655696811.176414,1655696811.176575,1655696811.177444,1655696811.177629,1655696811.180835,1655696811.181082]}],"time":1655696811.188952,"monotime":184524.71457022213}
{"type":"crank-finish","crankNum":1877999,"crankhash":"1b21056353eaa96fa6d2c4253e85c396246bfc20a50c049869124e8f0806df91","activityhash":"b4a68f0b93aa9b9759a12fe6c8b924b8240baf233eb2a93b6c336bd8b2446132","time":1655697969.622659,"monotime":185683.14827987814}

Running this through `jq -c '. as $foo | [$foo.monotime, $foo]'` shows the jump in monotime between the last two lines:

[184524.70488823295,{"type":"syscall","crankNum":1877999,"vatID":"v14","deliveryNum":79364,"syscallNum":31,"replay":false,"ksc":["resolve","v14",[["kp239132",false,{"body":"{\"@qclass\":\"slot\",\"iface\":\"Alleged: BLD payment\",\"index\":0}","slots":["ko53942"]}]]],"vsc":["resolve",[["p-28033",false,{"body":"{\"@qclass\":\"slot\",\"iface\":\"Alleged: BLD payment\",\"index\":0}","slots":["o+10/12786"]}]]],"time":1655696811.17927,"monotime":184524.70488823295}]
[184524.70632053496,{"type":"syscall-result","crankNum":1877999,"vatID":"v14","deliveryNum":79364,"syscallNum":31,"replay":false,"ksr":["ok",null],"vsr":["ok",null],"time":1655696811.180703,"monotime":184524.70632053496}]
[184524.71457022213,{"type":"deliver-result","crankNum":1877999,"vatID":"v14","deliveryNum":79364,"replay":false,"dr":["ok",null,{"meterType":"xs-meter-14","currentHeapCount":170287,"compute":464821,"allocate":46268448,"timestamps":[1655696811.133889,1655696811.134471,1655696811.135465,1655696811.138214,1655696811.139035,1655696811.13971,1655696811.141204,1655696811.145195,1655696811.146256,1655696811.146424,1655696811.147452,1655696811.147547,1655696811.148417,1655696811.148527,1655696811.153673,1655696811.153829,1655696811.154585,1655696811.154723,1655696811.15551,1655696811.155651,1655696811.156327,1655696811.156477,1655696811.157194,1655696811.157275,1655696811.158055,1655696811.158148,1655696811.15889,1655696811.159212,1655696811.160024,1655696811.160486,1655696811.161386,1655696811.161589,1655696811.162284,1655696811.162383,1655696811.163111,1655696811.163263,1655696811.164098,1655696811.164257,1655696811.164993,1655696811.165132,1655696811.165775,1655696811.165872,1655696811.166512,1655696811.166636,1655696811.167436,1655696811.167829,1655696811.168731,1655696811.169383,1655696811.170305,1655696811.170418,1655696811.171149,1655696811.171242,1655696811.171991,1655696811.172174,1655696811.172982,1655696811.173122,1655696811.173919,1655696811.174091,1655696811.175057,1655696811.175525,1655696811.176414,1655696811.176575,1655696811.177444,1655696811.177629,1655696811.180835,1655696811.181082]}],"time":1655696811.188952,"monotime":184524.71457022213}]
[185683.14827987814,{"type":"crank-finish","crankNum":1877999,"crankhash":"1b21056353eaa96fa6d2c4253e85c396246bfc20a50c049869124e8f0806df91","activityhash":"b4a68f0b93aa9b9759a12fe6c8b924b8240baf233eb2a93b6c336bd8b2446132","time":1655697969.622659,"monotime":185683.14827987814}]
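
To surface such gaps without eyeballing the jq output, a small one-off script (hypothetical, not part of the SwingSet tooling) can scan a slogfile for adjacent entries whose monotime jumps by more than a threshold:

```js
// scan-slog-gaps.js -- hypothetical helper, not part of SwingSet.
// Usage: node scan-slog-gaps.js path/to/slogfile
import { createReadStream } from 'fs';
import { createInterface } from 'readline';

const THRESHOLD_SECONDS = 10; // flag any inter-entry gap larger than this

async function findStalls(path) {
  const lines = createInterface({ input: createReadStream(path) });
  let prev;
  for await (const line of lines) {
    if (!line) continue; // skip blank lines
    const entry = JSON.parse(line);
    if (prev && entry.monotime - prev.monotime > THRESHOLD_SECONDS) {
      const gap = (entry.monotime - prev.monotime).toFixed(1);
      console.log(
        `${gap}s gap between ${prev.type} and ${entry.type} (crank ${entry.crankNum ?? '?'})`,
      );
    }
    prev = entry;
  }
}

findStalls(process.argv[2]).catch(err => {
  console.error(err);
  process.exit(1);
});
```

Run over the slogfiles above, this would flag exactly the deliver-result → crank-finish gap at crank 1877999.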

This exonerates the worker, at least for this particular stall. Between deliver-result and crank-finish, the kernel doesn't do anything CPU-intensive: it checks metering (not enabled here), checks for vat termination (which didn't happen here), commits the crank buffer (which does some number of writes to a pending LMDB transaction, generally proportional to the number of syscalls that occurred, only 31 in this case), and then maybe saves a heap snapshot.

We can't yet rule out a horribly slow heap snapshot write: we should land #5437 to capture that data in the slogfile.

Assuming it's not a slow heap snapshot, our hypothesis is that the Node.js/agd process experienced some horrible GC-related slowdown. One data point which supports this hypothesis is that the 1000s stalls were observed in the two validators which had not been rebooted since the testnet was started. The vague idea is that some kind of memory pressure is building up in the agd process, relieved by a reboot (or not), and then reaches some critical threshold at nearly (but not exactly) the same time in multiple validators, during some kernel activity which triggers a sweep that, somehow, takes over 15 minutes.

We also observe the resident memory usage on the validators growing from testnet start to a plateau about 12-24 hours later, then remaining flat for quite a while. This points to some sort of memory leak.
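
A trivial way to watch for that plateau from inside the kernel process (hypothetical instrumentation, not something agd currently logs):

```js
// Hypothetical instrumentation: log memory stats once a minute so the
// grow-then-plateau pattern is visible in the process logs.
const mb = n => (n / 1024 / 1024).toFixed(1);
setInterval(() => {
  const { rss, heapUsed, external } = process.memoryUsage();
  console.log(`rss=${mb(rss)}MB heapUsed=${mb(heapUsed)}MB external=${mb(external)}MB`);
}, 60_000).unref(); // unref() keeps this timer from holding the process open
```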

We're slightly suspicious of the intricate interaction between the golang and Node.js sides of agd; it seems like an opportunity for leaks. I don't believe the kernel is as likely to be involved; however, there is definitely a second bug in which the kernel object table (and almost certainly the vat c-lists) keeps growing even though the generated load is not supposed to be accumulative. All the kernel data is stored in the DB, so lots of kernel objects should not mean lots of kernel memory. But it's still worth considering, and we're investigating that second bug independently.

warner added a commit that referenced this issue Jun 21, 2022
refs #5507

This removes the `Promise.race([vatCancelled, result])` calls which
might be triggering the V8 bug in which heavily-used-but-unresolved
promises accumulate state forever.

warner commented Jun 22, 2022

The V8 heap snapshot analysis revealed a large number of promiseReaction records being retained. We found that a pattern of using Promise.race([vatCancelled, result]) (specifically in xsnap.js) triggers what is arguably a bug in V8 (and probably most spec-conformant engines), in which a long-lived Promise used as the usual loser in Promise.race causes every result of the winner to be retained in memory until the long-lived Promise finally settles.

Upstream tickets exist for Node.js (nodejs/node#17469) and V8 (https://bugs.chromium.org/p/v8/issues/detail?id=9858), with the most detailed explanation in nodejs/node#17469 (comment).

In agd, this was retaining the memory of every message ever received by every online xsnap worker. We think this was causing the memory footprint of the V8 engine to grow horrendously, making GC in the kernel/agd process take forever (over 1000s after 12-24 hours of load).
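
A minimal sketch of the leaky pattern (illustrative names, not the actual xsnap.js code):

```js
// Illustrative only -- not the actual xsnap.js code.
const vatCancelled = new Promise(() => {}); // long-lived, never settles

async function issueCommand(worker, msg) {
  const result = worker.send(msg); // settles when the worker replies
  // Every call here attaches a fresh promiseReaction to vatCancelled.
  // Because vatCancelled never settles, the engine keeps each reaction --
  // and the winner's value it captures -- alive forever.
  return Promise.race([vatCancelled, result]);
}
```

Each delivery funnels its result through one of these races, so the retained set grows with every message the worker ever handles.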

There is a way to rewrite Promise.race that avoids the worst of the problem (retaining the winner's results); @mhofman is developing/testing that fix. I think there's still a leak of the promiseReaction records, though, so memory retention is still O(N) in the number of deliveries we make.

The tickets speculate on an alternate way of managing Promises in the engine that might help, but I don't see any progress being made on them.

I think we need to understand this engine limitation and adjust our expectations about memory performance of long-lived Promises. We might need to adopt some guidelines like "avoid ever calling .then more than once per Promise", or "don't use Promise.race", or "avoid long-lived Promises". It's unfortunate.

Early testing suggests that fixing this problem makes the horrendous stalls go away.

Other mitigations would include (and may still be necessary, given the deeper bug):

  • evict vat workers every once in a while, whether the LRU cache likes them or not
  • reboot the validator process every few hours
  • change the vatCancelled logic in xsnap.js somehow


mhofman commented Jun 22, 2022

> There is a way to rewrite Promise.race that avoids the worst of the problem (retaining the winner's results); @mhofman is developing/testing that fix. I think there's still a leak of the promiseReaction records, though, so memory retention is still O(N) in the number of deliveries we make.

The alternative racePromises in endojs/endo#1222 does not leak reactions either (well, technically it leaks one, not N).
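
The gist of the leak-free approach (a paraphrase of the idea, not the actual endo code): memoize a single settlement reaction per contender promise, and have every race reuse it rather than attaching a new one.

```js
// Paraphrase of the one-reaction-per-promise idea; not the endo code.
const memo = new WeakMap(); // promise -> shared settlement record

function getRecord(p) {
  let rec = memo.get(p);
  if (rec === undefined) {
    rec = { settled: false, value: undefined, rejected: false, waiters: new Set() };
    // The only reaction this promise ever gets, however many races it joins:
    Promise.resolve(p).then(
      v => settle(rec, v, false),
      e => settle(rec, e, true),
    );
    memo.set(p, rec);
  }
  return rec;
}

function settle(rec, value, rejected) {
  rec.settled = true;
  rec.value = value;
  rec.rejected = rejected;
  for (const notify of rec.waiters) notify(value, rejected);
  rec.waiters.clear();
}

function memoRace(contenders) {
  const recs = [...contenders].map(getRecord);
  return new Promise((resolve, reject) => {
    let done = false;
    const onSettle = (value, rejected) => {
      if (done) return;
      done = true;
      // Completed races unhook themselves, so waiters stay O(pending races):
      for (const rec of recs) rec.waiters.delete(onSettle);
      if (rejected) reject(value);
      else resolve(value);
    };
    for (const rec of recs) {
      if (rec.settled) {
        onSettle(rec.value, rec.rejected);
        break;
      }
      rec.waiters.add(onSettle);
    }
  });
}
```

With this shape, the long-lived loser accumulates exactly one reaction for its whole lifetime, matching the "it leaks one, not N" behavior described above.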

@Tartuffo Tartuffo added this to the Mainnet 1 milestone Jul 19, 2022
@dckc dckc added the performance Performance related issues label Jul 20, 2022
Tartuffo (Contributor) commented:

@arirubinstein do you know if we validated that @mhofman's PR fixed this?


mhofman commented Aug 26, 2022

Re-reading this issue, it seems that we uncovered the Promise.race leak while investigating this, but the original report is completely orthogonal. While the Promise.race bug was fixed, the original issue that validators time out still exists. I'm not convinced trying to do less Swingset work is the right approach, and I'm not sure what the consequences of increasing the tendermint voting timeouts are. Regardless, this probably shouldn't be in Review/QA.


mhofman commented Jan 24, 2023

We still haven't characterized the cause of these validator timeouts. I believe I was assigned to this when we realized we had excessive kernel memory usage because of the buggy Promise.race, which caused slowdowns. We have since fixed that, but not gone back to the original issue. Un-assigning myself.

@mhofman mhofman removed their assignment Jan 24, 2023
@ivanlei ivanlei added this to the Vaults RC0 milestone Feb 1, 2023
@ivanlei ivanlei removed the MUST-HAVE label Mar 20, 2023

mhofman commented Sep 12, 2024

Large loads still seem to occur in production, in particular related to oracle price pushes. While it's unknown why some of these (in particular one coming in late for a previous round) cause abnormally large wall-clock time to be spent in swingset's end block (issue TBD to investigate), they do cause apparent stalls of the mainnet chain.

We do have a hypothesis on the source of the chain slowdown, which would be tested by #10075.
