With light simulated load, validators have difficulty creating blocks in time #5507
Comments
We looked at a run from last weekend, with the #5152 timestamps in the slog. In one particular block, two of the five validators needed a massive 1000s to complete the block. Looking into the slogfiles, we found a 1000s+ delay in two different deliveries (validator 1 had this delay in delivery N, validator 2 had it in a different delivery M). In one of these two deliveries, the slogfile showed the delay occurring after the worker had already reported its part of the delivery, which exonerates the worker, at least for this particular stall. We can't yet rule out a horribly slow heap snapshot write: we should land #5437 to capture that data in the slogfile. Assuming it's not a slow heap snapshot, our hypothesis is that the stall happens on the Node.js side. We also observe the resident memory usage on the validator growing from testnet start to a plateau about 12-24 hours later, then remaining flat for quite a while. This points to some sort of memory leak. We're slightly suspicious of the intricate interaction between the golang and Node.js sides of the validator process.
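For reference, a rough sketch (not taken from the issue) of how gaps like these can be found in a slogfile; it assumes slog entries are newline-delimited JSON objects with `time` (seconds, float) and `type` fields, so adjust the field names if the format differs.

```js
// Hypothetical helper for spotting long stalls between consecutive slog entries.
import { createInterface } from 'node:readline';
import { createReadStream } from 'node:fs';

async function findStalls(slogPath, thresholdSeconds = 100) {
  const rl = createInterface({ input: createReadStream(slogPath) });
  let prev;
  for await (const line of rl) {
    const entry = JSON.parse(line);
    if (prev && entry.time - prev.time > thresholdSeconds) {
      console.log(
        `${(entry.time - prev.time).toFixed(1)}s gap between`,
        prev.type, 'and', entry.type,
      );
    }
    prev = entry;
  }
}

findStalls(process.argv[2]).catch(err => console.error(err));
```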
refs #5507 This removes the `Promise.race([vatCancelled, result])` calls which might be triggering the V8 bug in which heavily-used-but-unresolved promises accumulate state forever.
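To make the failure mode concrete, here is a minimal illustrative sketch (not the actual kernel code) of the pattern being removed: every race against a promise that never settles leaves another reaction attached to it.

```js
// Illustrative only -- not the real kernel code.
// `vatCancelled` stands in for a promise that stays pending for the
// lifetime of the vat.
const vatCancelled = new Promise(() => {}); // never settles

async function runDelivery(deliver) {
  // Each Promise.race() call attaches a fresh reaction to `vatCancelled`.
  // Because `vatCancelled` never settles, V8 never releases those
  // reactions (or anything they close over), so memory grows with the
  // number of deliveries -- the "heavily-used-but-unresolved promise"
  // accumulation described above.
  return Promise.race([vatCancelled, deliver()]);
}
```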
The V8 heap snapshot analysis revealed a large number of objects being retained by unresolved promises. Upstream tickets exist for Node.js (nodejs/node#17469) and V8 (https://bugs.chromium.org/p/v8/issues/detail?id=9858), with the most detailed explanation in nodejs/node#17469 (comment). There is a way to rewrite our use of `Promise.race` to avoid the accumulation. The tickets speculate on an alternate way of managing Promises in the engine that might help, but I don't see any progress being made on them. I think we need to understand this engine limitation and adjust our expectations about the memory performance of long-lived Promises. We might need to adopt guidelines like "avoid ever calling `Promise.race` against a promise that might never settle". Early testing suggests that fixing this problem makes the horrendous stalls go away. Other mitigations would include (and may still be necessary, given the deeper bug):
The alternative
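Purely as an illustration (and not necessarily the approach taken in the referenced PR), one way to structure a leak-avoiding race helper is to give the long-lived promise exactly one reaction, installed up front, so each individual race only attaches handlers to the short-lived contender:

```js
// Hypothetical sketch of a leak-avoiding race helper; the actual fix in
// the referenced PR may differ. `longLived` receives a single .then()
// here, instead of one per race.
function makeRacer(longLived) {
  const pending = new Set(); // deferreds for races still in flight
  longLived.then(
    value => { for (const d of pending) d.resolve(value); pending.clear(); },
    err => { for (const d of pending) d.reject(err); pending.clear(); },
  );
  return function raceAgainst(contender) {
    let deferred;
    const winner = new Promise((resolve, reject) => {
      deferred = { resolve, reject };
    });
    pending.add(deferred);
    Promise.resolve(contender)
      .then(deferred.resolve, deferred.reject)
      .finally(() => pending.delete(deferred)); // let the deferred be GCed
    return winner;
  };
}

// Usage: race each delivery against the vat's cancellation promise
// without accumulating reactions on it.
// const raceAgainst = makeRacer(vatCancelled);
// const result = await raceAgainst(deliver());
```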
@arirubinstein do you know if we validated that @mhofman's PR fixed this?
Re-reading this issue, it seems that we uncovered the Promise.race leak while investigating this, but the original report is completely orthogonal. While the Promise.race bug was fixed, the original issue, that validators time out, still exists. I'm not convinced that trying to do less Swingset work is the right approach, and I'm not sure what the consequences of increasing the tendermint voting timeouts would be. Regardless, this probably shouldn't be in Review/QA.
We still haven't characterized the cause of these validator timeouts. I believe I was assigned to this when we realized we had excessive kernel memory usage because of the buggy Promise.race usage, which caused slowdowns. We have since fixed that, but have not gone back to the original issue. Un-assigning myself.
Large loads still seem to occur in production, in particular related to oracle price pushes. While it's unknown why some of these (in particular one coming in late for a previous round) cause abnormally large amounts of wall clock time to be spent in swingset's end block (issue TBD to investigate), they do cause apparent stalls of the mainnet chain. We do have a hypothesis on the source of the chain slowdown, which would be tested by #10075.
Describe the bug
With a light simulated load (10 ag-solos making AMM trades), many validators have issues processing the data and participating in consensus, with multiple timeouts and rounds needed to create a block. It's possible that the default tendermint round timeouts need to be increased to accommodate the expected load from swingset, as it appears the default limits are being reached, resulting in base + delta*round timeout increases across rounds. Alternatively, it may be necessary to do less work per block on the swingset side in order to beat the timeout.
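For context on the base + delta*round behaviour: tendermint retries a failed round with a slightly longer proposal timeout each time, so a block that consistently takes longer than the base timeout to propose forces multiple rounds. A tiny sketch of the growth (the 3s base and 500ms delta are assumed defaults here, not values read from this testnet's config.toml):

```js
// Tendermint grows the proposal timeout linearly with the round number.
const timeoutPropose = (baseMs, deltaMs, round) => baseMs + deltaMs * round;

console.log(timeoutPropose(3000, 500, 0)); // 3000 ms in round 0
console.log(timeoutPropose(3000, 500, 4)); // 5000 ms by round 4
```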
To Reproduce
Steps to reproduce the behavior:
Expected behavior
With nominal load, all validators should be able to vote on blocks without any issues.