Mainnet block 9074082 took 2-6 minutes to process #7185

Open
mhofman opened this issue Mar 18, 2023 · 0 comments
Labels
bug (Something isn't working) · chain-incident · cosmic-swingset (package: cosmic-swingset) · SwingSet (package: SwingSet) · telemetry · v1_triaged (DO NOT USE) · vaults_triage (DO NOT USE) · xsnap (the XS execution tool)

mhofman commented Mar 18, 2023

Describe the bug

Around 2023-03-16T11:19:00Z, block 9074082 took almost 6 minutes to execute (because of how blockTime is decided, the long block seems to be attributed to block 9074083 instead, but the slow block is actually the previous one).
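
For illustration, here is a minimal sketch (with a hypothetical record shape, not our actual dashboard query) of how the gap between consecutive blockTimes should be attributed to the earlier block rather than the later one:

```ts
// Sketch: attribute the gap between consecutive blockTimes to the *earlier*
// block, since block N+1 can only be produced after block N has finished
// executing. The field names below are hypothetical.
interface BlockRecord {
  height: number;
  blockTime: number; // unix seconds (assumed representation)
}

function attributeProcessingTime(blocks: BlockRecord[]): Map<number, number> {
  const durations = new Map<number, number>(); // height -> seconds
  for (let i = 0; i + 1 < blocks.length; i += 1) {
    const gap = blocks[i + 1].blockTime - blocks[i].blockTime;
    durations.set(blocks[i].height, gap);
  }
  return durations;
}

// A ~340s gap between the blockTimes of 9074082 and 9074083 is charged here
// to 9074082, even though a dashboard keyed on blockTime shows it under 9074083.
```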

Investigating the trace produced by our follower node, we notice that the culprit is a Provision transaction taking 314s on the follower node, in a block that took 340s overall, without hitting the computron limit (see #6857).

[screenshot: trace of block 9074082 showing the 314s Provision transaction span]

A wallet provision normally takes 384 deliveries. This trace has 389 because of a bringOutYourDead delivery followed by a few drop/retire exports, but these are not responsible for the slowdown, contributing "only" 6s.

Most deliveries take a few dozen or a few hundred milliseconds, but some go up to 27s! It's not clear what is taking so long in these deliveries (we need #6399 to help answer that question). No heap snapshots were taken in this run.
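
Until #6399 lands, a slogfile pass is the most direct way to surface the slowest deliveries. A rough sketch below; the entry shapes (`deliver` / `deliver-result` lines carrying a `time` field in seconds) are assumptions and may not match the real slog format exactly:

```ts
import { createInterface } from 'node:readline';
import { createReadStream } from 'node:fs';

// Rough sketch: pair each `deliver` entry with its `deliver-result` and report
// the slowest deliveries for a given vat. Field names are assumptions.
async function slowestDeliveries(slogPath: string, vatID: string, topN = 10) {
  const starts = new Map<number, number>(); // deliveryNum -> start time (s)
  const durations: Array<{ deliveryNum: number; seconds: number }> = [];
  const rl = createInterface({ input: createReadStream(slogPath) });
  for await (const line of rl) {
    let entry: any;
    try {
      entry = JSON.parse(line);
    } catch {
      continue; // skip malformed lines
    }
    if (entry.vatID !== vatID) continue;
    if (entry.type === 'deliver') {
      starts.set(entry.deliveryNum, entry.time);
    } else if (entry.type === 'deliver-result') {
      const start = starts.get(entry.deliveryNum);
      if (start !== undefined) {
        durations.push({ deliveryNum: entry.deliveryNum, seconds: entry.time - start });
      }
    }
  }
  return durations.sort((a, b) => b.seconds - a.seconds).slice(0, topN);
}
```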

[screenshot: per-delivery durations for the Provision transaction, some reaching 27s]

In this block, the vast majority of the time (320s) is spent in vat18, which is the smart wallet.
[screenshot: per-vat time breakdown for block 9074082, dominated by vat18]

In vat18, most of the time (314s) in the block is spent on notify deliveries.
[screenshot: per-delivery-type time breakdown for vat18, dominated by notify deliveries]

Looking at the list of recent provision transactions, we see that while these always take some time to process, this occurrence is definitely an outlier.

| Height  | Time              | Duration | Computrons    | Deliveries | Syscalls |
| ------- | ----------------- | -------- | ------------- | ---------- | -------- |
| 9074082 | 2023-03-16T11:19Z | 314s     | 6,193,733,400 | 389        | 1639     |
| 9061393 | 2023-03-15T12:29Z | 96s      | 6,191,167,300 | 390        | 1467     |
| 9054290 | 2023-03-14T23:53Z | 61s      | 6,182,391,200 | 384        | 1332     |
| 9046118 | 2023-03-14T09:20Z | 44s      | 6,186,293,400 | 384        | 1331     |
| 9044598 | 2023-03-14T06:38Z | 36s      | 6,182,160,800 | 384        | 1326     |
| 9040445 | 2023-03-13T23:16Z | 31s      | 6,182,478,900 | 384        | 1324     |

While there is slight variance in deliveries, syscalls, and computrons, it does not account for the variance in wall-clock time. The variance in deliveries is explained by a BOYD call triggering in the most recent cases. The variance in syscalls can be explained by the LRU cache, which currently spans deliveries (see #6693). The variance in computrons can be explained by the previous two variances and slightly different payloads.
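
A quick back-of-the-envelope comparison of the first and last rows above makes this concrete (plain arithmetic from the table, not telemetry output):

```ts
// Computrons per wall-clock second for the outlier vs a typical recent case.
// Roughly the same work, but about a 10x throughput gap.
const outlier = 6_193_733_400 / 314; // ≈ 19.7M computrons/s (block 9074082)
const typical = 6_182_478_900 / 31;  // ≈ 199.4M computrons/s (block 9040445)
console.log((typical / outlier).toFixed(1)); // ≈ 10.1x slower per computron
```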

While we're aware that not all computrons currently result in equivalent wall-clock time, our running assumption is that a transaction doing roughly the same execution will take roughly the same amount of time.

One cause of wall-clock variance could be accumulated state. There are two forms of this: heap state, and virtualized state included in syscalls.

During the period above, it seems vat18's heap size as measured by snapshots was around 350MB. This is a very large heap, well outside the XS design parameters, and any organic GC would be very costly. However, the size was roughly stable and at first sight does not seem to correlate sufficiently.

We would need to look at slog files to get a sense of the size of serialized syscall payloads, and thus whether virtualized state had an impact. This information is currently not available through OpenTelemetry. We also do not have durations for syscalls themselves (see #6399).
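
A similar slogfile pass could approximate how many serialized bytes go over syscalls per delivery, as a proxy for virtualized state traffic. Again a sketch with assumed entry shapes (`syscall` type, `vatID`, `deliveryNum`), using the JSON line length as the size estimate:

```ts
import { createInterface } from 'node:readline';
import { createReadStream } from 'node:fs';

// Rough sketch: approximate serialized syscall bytes per delivery for one vat.
// Field names and entry types are assumptions about the slog format.
async function syscallBytesPerDelivery(slogPath: string, vatID: string) {
  const bytes = new Map<number, number>(); // deliveryNum -> approx bytes
  const rl = createInterface({ input: createReadStream(slogPath) });
  for await (const line of rl) {
    let entry: any;
    try {
      entry = JSON.parse(line);
    } catch {
      continue;
    }
    if (entry.vatID !== vatID || entry.type !== 'syscall') continue;
    const prev = bytes.get(entry.deliveryNum) ?? 0;
    bytes.set(entry.deliveryNum, prev + Buffer.byteLength(line, 'utf8'));
  }
  return bytes;
}
```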

Given the lack of notable variance in the execution itself, the cause of the observed wall-clock slowdown is likely the same as the slowdown seen in large worker processes that stay in memory versus being restarted from snapshots (i.e. for the exact same deliveries). The current hypothesis is that such workers suffer from fragmented memory (see #6661). This should be mitigated both by reducing the size of the worker through more effective virtualization of state, and by forcing restarts of the workers on snapshots (#6943).

We should verify that other factors can be ruled out, such as the size of serialized state going over syscalls growing over time, whether in vatstore or in vstorage. We should also capture/report more information on the execution of deliveries, such as allocation and occurrences of garbage collection, and attempt to compare delivery and syscall timing between these transactions (this is unfortunately complicated by the current LRU cache and the BOYD schedule).

@mhofman mhofman added bug Something isn't working SwingSet package: SwingSet cosmic-swingset package: cosmic-swingset xsnap the XS execution tool telemetry labels Mar 18, 2023
@ivanlei ivanlei added the vaults_triage DO NOT USE label Mar 20, 2023
@ivanlei ivanlei added this to the Vaults Validation milestone Apr 28, 2023
@ivanlei ivanlei removed this from the Vaults Validation milestone May 8, 2023
@ivanlei ivanlei added the v1_triaged DO NOT USE label May 22, 2023