
With light simulated load, validators have difficulty creating blocks in time #5507

Open
arirubinstein opened this issue Jun 3, 2022 · 7 comments

Describe the bug

With a light simulated load (10 ag-solos making AMM trades), many validators have trouble processing the data and participating in consensus, needing multiple timeouts and rounds to create a block. It's possible that the default tendermint round timeouts need to be increased to accommodate the expected load from swingset, as it appears the default limits are being reached, resulting in base + delta × round timeout increases. Alternatively, it may be necessary to do less work per block on the swingset side in order to beat the timeout.
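
For reference, tendermint's per-round timeouts grow linearly with the round number. A minimal sketch of that growth (assuming the stock config.toml defaults of timeout_propose = 3s and timeout_propose_delta = 500ms; the actual values on this network may differ):

```js
// Illustration of tendermint's escalating round timeouts; the base and
// delta values are assumed defaults, not measured from this network.
const baseMs = 3000; // timeout_propose
const deltaMs = 500; // timeout_propose_delta
for (let round = 0; round < 5; round += 1) {
  // Each failed round stretches the next proposal window: base + delta * round
  console.log(`round ${round}: proposal timeout ${baseMs + deltaMs * round}ms`);
}
```

Each extra round only widens the window by half a second, so a chain that needs many rounds per block is losing far more time to the failed rounds themselves than to the growing timeout.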

To Reproduce

Steps to reproduce the behavior:

  1. go to https://devnet.explorer.agoric.net/agoric/uptime
  2. observe the large number of missed blocks on all validators

Expected behavior

With nominal load, validators should have no issues voting on blocks.

[Screenshot, 2022-06-03: devnet explorer uptime page showing missed blocks across validators]

@arirubinstein arirubinstein added bug Something isn't working cosmic-swingset package: cosmic-swingset labels Jun 3, 2022
@Tartuffo Tartuffo added the SwingSet package: SwingSet label Jun 8, 2022

warner commented Jun 20, 2022

We looked at a run from the last weekend, with the #5152 timestamps in the slog. In one particular block, two (of the five) validators needed a massive 1000s to complete the block. Looking into the slogfiles, we found a 1000s+ delay in two different deliveries (validator 1 had this delay in delivery N, validator 2 had it in a different delivery M). In one of these two deliveries, the delay occurred after the slogfile recorded deliver-result arriving at the kernel, and before it recorded crank-finish:

{"type":"syscall-result","crankNum":1877999,"vatID":"v14","deliveryNum":79364,"syscallNum":30,"replay":false,"ksr":["ok",null],"vsr":["ok",null],"time":1655696811.177324,"monotime":184524.70294137407}
{"type":"clist","crankNum":1877999,"mode":"export","vatID":"v14","kobj":"ko53942","vobj":"o+10/12786","time":1655696811.178376,"monotime":184524.70399383403}
{"type":"clist","crankNum":1877999,"mode":"drop","vatID":"v14","kobj":"kp239132","vobj":"p-28033","time":1655696811.178857,"monotime":184524.70447407317}
{"type":"syscall","crankNum":1877999,"vatID":"v14","deliveryNum":79364,"syscallNum":31,"replay":false,"ksc":["resolve","v14",[["kp239132",false,{"body":"{\"@qclass\":\"slot\",\"iface\":\"Alleged: BLD payment\",\"index\":0}","slots":["ko53942"]}]]],"vsc":["resolve",[["p-28033",false,{"body":"{\"@qclass\":\"slot\",\"iface\":\"Alleged: BLD payment\",\"index\":0}","slots":["o+10/12786"]}]]],"time":1655696811.17927,"monotime":184524.70488823295}
{"type":"syscall-result","crankNum":1877999,"vatID":"v14","deliveryNum":79364,"syscallNum":31,"replay":false,"ksr":["ok",null],"vsr":["ok",null],"time":1655696811.180703,"monotime":184524.70632053496}
{"type":"deliver-result","crankNum":1877999,"vatID":"v14","deliveryNum":79364,"replay":false,"dr":["ok",null,{"meterType":"xs-meter-14","currentHeapCount":170287,"compute":464821,"allocate":46268448,"timestamps":[1655696811.133889,1655696811.134471,1655696811.135465,1655696811.138214,1655696811.139035,1655696811.13971,1655696811.141204,1655696811.145195,1655696811.146256,1655696811.146424,1655696811.147452,1655696811.147547,1655696811.148417,1655696811.148527,1655696811.153673,1655696811.153829,1655696811.154585,1655696811.154723,1655696811.15551,1655696811.155651,1655696811.156327,1655696811.156477,1655696811.157194,1655696811.157275,1655696811.158055,1655696811.158148,1655696811.15889,1655696811.159212,1655696811.160024,1655696811.160486,1655696811.161386,1655696811.161589,1655696811.162284,1655696811.162383,1655696811.163111,1655696811.163263,1655696811.164098,1655696811.164257,1655696811.164993,1655696811.165132,1655696811.165775,1655696811.165872,1655696811.166512,1655696811.166636,1655696811.167436,1655696811.167829,1655696811.168731,1655696811.169383,1655696811.170305,1655696811.170418,1655696811.171149,1655696811.171242,1655696811.171991,1655696811.172174,1655696811.172982,1655696811.173122,1655696811.173919,1655696811.174091,1655696811.175057,1655696811.175525,1655696811.176414,1655696811.176575,1655696811.177444,1655696811.177629,1655696811.180835,1655696811.181082]}],"time":1655696811.188952,"monotime":184524.71457022213}
{"type":"crank-finish","crankNum":1877999,"crankhash":"1b21056353eaa96fa6d2c4253e85c396246bfc20a50c049869124e8f0806df91","activityhash":"b4a68f0b93aa9b9759a12fe6c8b924b8240baf233eb2a93b6c336bd8b2446132","time":1655697969.622659,"monotime":185683.14827987814}

Running this through `jq -c '. as $foo | [$foo.monotime, $foo]'` shows the jump in monotime between the last two lines:

[184524.70488823295,{"type":"syscall","crankNum":1877999,"vatID":"v14","deliveryNum":79364,"syscallNum":31,"replay":false,"ksc":["resolve","v14",[["kp239132",false,{"body":"{\"@qclass\":\"slot\",\"iface\":\"Alleged: BLD payment\",\"index\":0}","slots":["ko53942"]}]]],"vsc":["resolve",[["p-28033",false,{"body":"{\"@qclass\":\"slot\",\"iface\":\"Alleged: BLD payment\",\"index\":0}","slots":["o+10/12786"]}]]],"time":1655696811.17927,"monotime":184524.70488823295}]
[184524.70632053496,{"type":"syscall-result","crankNum":1877999,"vatID":"v14","deliveryNum":79364,"syscallNum":31,"replay":false,"ksr":["ok",null],"vsr":["ok",null],"time":1655696811.180703,"monotime":184524.70632053496}]
[184524.71457022213,{"type":"deliver-result","crankNum":1877999,"vatID":"v14","deliveryNum":79364,"replay":false,"dr":["ok",null,{"meterType":"xs-meter-14","currentHeapCount":170287,"compute":464821,"allocate":46268448,"timestamps":[1655696811.133889,1655696811.134471,1655696811.135465,1655696811.138214,1655696811.139035,1655696811.13971,1655696811.141204,1655696811.145195,1655696811.146256,1655696811.146424,1655696811.147452,1655696811.147547,1655696811.148417,1655696811.148527,1655696811.153673,1655696811.153829,1655696811.154585,1655696811.154723,1655696811.15551,1655696811.155651,1655696811.156327,1655696811.156477,1655696811.157194,1655696811.157275,1655696811.158055,1655696811.158148,1655696811.15889,1655696811.159212,1655696811.160024,1655696811.160486,1655696811.161386,1655696811.161589,1655696811.162284,1655696811.162383,1655696811.163111,1655696811.163263,1655696811.164098,1655696811.164257,1655696811.164993,1655696811.165132,1655696811.165775,1655696811.165872,1655696811.166512,1655696811.166636,1655696811.167436,1655696811.167829,1655696811.168731,1655696811.169383,1655696811.170305,1655696811.170418,1655696811.171149,1655696811.171242,1655696811.171991,1655696811.172174,1655696811.172982,1655696811.173122,1655696811.173919,1655696811.174091,1655696811.175057,1655696811.175525,1655696811.176414,1655696811.176575,1655696811.177444,1655696811.177629,1655696811.180835,1655696811.181082]}],"time":1655696811.188952,"monotime":184524.71457022213}]
[185683.14827987814,{"type":"crank-finish","crankNum":1877999,"crankhash":"1b21056353eaa96fa6d2c4253e85c396246bfc20a50c049869124e8f0806df91","activityhash":"b4a68f0b93aa9b9759a12fe6c8b924b8240baf233eb2a93b6c336bd8b2446132","time":1655697969.622659,"monotime":185683.14827987814}]
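
To surface such gaps without eyeballing the jq output, a small one-off script (hypothetical, not part of the SwingSet tooling) can scan a slogfile for adjacent entries whose monotime jumps by more than a threshold:

```js
// scan-slog-gaps.js -- hypothetical helper, not part of SwingSet.
// Usage: node scan-slog-gaps.js path/to/slogfile
import { createReadStream } from 'fs';
import { createInterface } from 'readline';

const THRESHOLD_SECONDS = 10; // flag any inter-entry gap larger than this

async function findStalls(path) {
  const lines = createInterface({ input: createReadStream(path) });
  let prev;
  for await (const line of lines) {
    if (!line) continue; // skip blank lines
    const entry = JSON.parse(line);
    if (prev && entry.monotime - prev.monotime > THRESHOLD_SECONDS) {
      const gap = (entry.monotime - prev.monotime).toFixed(1);
      console.log(
        `${gap}s gap between ${prev.type} and ${entry.type} (crank ${entry.crankNum ?? '?'})`,
      );
    }
    prev = entry;
  }
}

findStalls(process.argv[2]).catch(err => {
  console.error(err);
  process.exit(1);
});
```

Run over the slogfiles above, this would flag exactly the deliver-result → crank-finish gap at crank 1877999.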

This exonerates the worker, at least for this particular stall. Between deliver-result and crank-finish, the kernel doesn't do anything CPU-intensive: it checks metering (not enabled here), checks for vat termination (which didn't happen here), commits the crank buffer (which does some number of writes to a pending LMDB transaction, generally proportional to the number of syscalls that occurred, only 31 in this case), and then maybe saves a heap snapshot.

We can't yet rule out a horribly slow heap snapshot write: we should land #5437 to capture that data in the slogfile.

Assuming it's not a slow heap snapshot, our hypothesis is that the Node.js/agd process experienced some horrible GC-related slowdown. One data point which supports this hypothesis is that the 1000s stalls were observed in the two validators which had not been rebooted since the testnet was started. The vague idea is that some kind of memory pressure is building up in the agd process, relieved by a reboot (or not), and then reaches some critical threshold at nearly (but not exactly) the same time in multiple validators, during some kernel activity which triggers a sweep that, somehow, takes over 15 minutes.

We also observe the resident memory usage on the validators growing from testnet start to a plateau about 12-24 hours later, then remaining flat for quite a while. This points to some sort of memory leak.
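
A trivial way to watch for that plateau from inside the kernel process (hypothetical instrumentation, not something agd currently logs):

```js
// Hypothetical instrumentation: log memory stats once a minute so the
// grow-then-plateau pattern is visible in the process logs.
const mb = n => (n / 1024 / 1024).toFixed(1);
setInterval(() => {
  const { rss, heapUsed, external } = process.memoryUsage();
  console.log(`rss=${mb(rss)}MB heapUsed=${mb(heapUsed)}MB external=${mb(external)}MB`);
}, 60_000).unref(); // unref() keeps this timer from holding the process open
```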

We're slightly suspicious of the intricate interaction between the golang and Node.js sides of agd; it seems like an opportunity for leaks. I don't believe the kernel is as likely to be involved; however, there is definitely a second bug in which the kernel object table (and almost certainly the vat c-lists) keeps growing even though the generated load is not supposed to be accumulative. All the kernel data is stored in the DB, so lots of kernel objects should not mean lots of kernel memory. But it's still worth considering, and we're investigating that second bug independently.

warner added a commit that referenced this issue Jun 21, 2022
refs #5507

This removes the `Promise.race([vatCancelled, result])` calls which
might be triggering the V8 bug in which heavily-used-but-unresolved
promises accumulate state forever.

warner commented Jun 22, 2022

The V8 heap snapshot analysis revealed a large number of promiseReaction records being retained. We found that a pattern of using Promise.race([vatCancelled, result]) (specifically in xsnap.js) triggers what is arguably a bug in V8 (and probably most spec-conformant engines), in which a long-lived Promise used as the usual loser in Promise.race causes every result of the winner to be retained in memory until the long-lived Promise finally settles.

Upstream tickets exist for Node.js (nodejs/node#17469) and V8 (https://bugs.chromium.org/p/v8/issues/detail?id=9858), with the most detailed explanation in nodejs/node#17469 (comment).

In agd, this was retaining the memory of every message ever received by every online xsnap worker. We think this was causing the memory footprint of the V8 engine to grow horrendously, making GC in the kernel/agd process take forever (over 1000s after 12-24 hours of load).
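
A minimal sketch of the leaky pattern (illustrative names, not the actual xsnap.js code):

```js
// Illustrative only -- not the actual xsnap.js code.
const vatCancelled = new Promise(() => {}); // long-lived, never settles

async function issueCommand(worker, msg) {
  const result = worker.send(msg); // settles when the worker replies
  // Every call here attaches a fresh promiseReaction to vatCancelled.
  // Because vatCancelled never settles, the engine keeps each reaction --
  // and the winner's value it captures -- alive forever.
  return Promise.race([vatCancelled, result]);
}
```

Each delivery funnels its result through one of these races, so the retained set grows with every message the worker ever handles.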

There is a way to rewrite Promise.race that avoids the worst of the problem (retaining the winner's results); @mhofman is developing/testing that fix. I think there's still a leak of the promiseReaction records, though, so memory retention is still O(N) in the number of deliveries we make.

The tickets speculate on an alternate way of managing Promises in the engine that might help, but I don't see any progress being made on them.

I think we need to understand this engine limitation and adjust our expectations about memory performance of long-lived Promises. We might need to adopt some guidelines like "avoid ever calling .then more than once per Promise", or "don't use Promise.race", or "avoid long-lived Promises". It's unfortunate.

Early testing suggests that fixing this problem makes the horrendous stalls go away.

Other mitigations would include (and may still be necessary, given the deeper bug):

  • evict vat workers every once in a while, whether the LRU cache likes them or not
  • reboot the validator process every few hours
  • change the vatCancelled logic in xsnap.js somehow


mhofman commented Jun 22, 2022

> There is a way to rewrite Promise.race that avoids the worst of the problem (retaining the winner's results); @mhofman is developing/testing that fix. I think there's still a leak of the promiseReaction records, though, so memory retention is still O(N) in the number of deliveries we make.

The alternative racePromises in endojs/endo#1222 does not leak reactions either (well, technically it leaks one, not N).
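
The gist of the leak-free approach (a paraphrase of the idea, not the actual endo code): memoize a single settlement reaction per contender promise, and have every race reuse it rather than attaching a new one.

```js
// Paraphrase of the one-reaction-per-promise idea; not the endo code.
const memo = new WeakMap(); // promise -> shared settlement record

function getRecord(p) {
  let rec = memo.get(p);
  if (rec === undefined) {
    rec = { settled: false, value: undefined, rejected: false, waiters: new Set() };
    // The only reaction this promise ever gets, however many races it joins:
    Promise.resolve(p).then(
      v => settle(rec, v, false),
      e => settle(rec, e, true),
    );
    memo.set(p, rec);
  }
  return rec;
}

function settle(rec, value, rejected) {
  rec.settled = true;
  rec.value = value;
  rec.rejected = rejected;
  for (const notify of rec.waiters) notify(value, rejected);
  rec.waiters.clear();
}

function memoRace(contenders) {
  const recs = [...contenders].map(getRecord);
  return new Promise((resolve, reject) => {
    let done = false;
    const onSettle = (value, rejected) => {
      if (done) return;
      done = true;
      // Completed races unhook themselves, so waiters stay O(pending races):
      for (const rec of recs) rec.waiters.delete(onSettle);
      if (rejected) reject(value);
      else resolve(value);
    };
    for (const rec of recs) {
      if (rec.settled) {
        onSettle(rec.value, rec.rejected);
        break;
      }
      rec.waiters.add(onSettle);
    }
  });
}
```

With this shape, the long-lived loser accumulates exactly one reaction for its whole lifetime, matching the "it leaks one, not N" behavior described above.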

@Tartuffo Tartuffo added this to the Mainnet 1 milestone Jul 19, 2022
@dckc dckc added the performance Performance related issues label Jul 20, 2022
Tartuffo (Contributor) commented:

@arirubinstein do you know if we validated that @mhofman's PR fixed this?


mhofman commented Aug 26, 2022

Re-reading this issue, it seems that we uncovered the Promise.race leak while investigating this, but the original report is completely orthogonal. While the Promise.race bug was fixed, the original issue that validators time out still exists. I'm not convinced trying to do less Swingset work is the right approach, and I'm not sure what the consequences of increasing the tendermint voting timeouts are. Regardless, this probably shouldn't be in Review/QA.


mhofman commented Jan 24, 2023

We still haven't characterized the cause of these validator timeouts. I believe I was assigned to this when we realized we had excessive kernel memory usage because of the buggy Promise.race, which caused slowdowns. We have since fixed that, but not gone back to the original issue. Un-assigning myself.

@mhofman mhofman removed their assignment Jan 24, 2023
@ivanlei ivanlei added this to the Vaults RC0 milestone Feb 1, 2023
@ivanlei ivanlei removed the MUST-HAVE label Mar 20, 2023

mhofman commented Sep 12, 2024

Large loads still seem to occur in production, in particular related to oracle price pushes. While it's unknown why some of these (in particular one coming in late for a previous round) cause abnormally large wall-clock time to be spent in swingset's end block (issue TBD to investigate), they do cause apparent stalls of the mainnet chain.

We do have a hypothesis on the source of the chain slowdown, which would be tested by #10075.
