Mainnet block 9074082 took 2-6 minutes to process #7185
Labels: bug, chain-incident, cosmic-swingset, SwingSet, telemetry, v1_triaged, vaults_triage, xsnap
Describe the bug
Around 2023-03-16T11:19:00Z, block 9074082 took almost 6 minutes to execute (because of how blockTime is decided, the long block appears to be attributed to block 9074083, but the slow block is actually the previous one).
Investigating the trace produced by our follower node, we noticed that the culprit is a `Provision` transaction taking 314s on the follower node, in a block that overall took 340s, without hitting the computron limit (see #6857).
A wallet provision normally takes 384 deliveries. This trace has 389 because of a `bringOutYourDead` delivery followed by a few drop/retire exports, but these are not responsible for the slowdown, "only" contributing 6s.
Most deliveries take a few dozen or a few hundred milliseconds, but some go up to 27s! It's not clear what is taking the time in these deliveries (we need #6399 to help answer that question). No heap snapshots are taken in this run.
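To narrow down which deliveries are responsible, a slog-scanning script along the lines of the sketch below could help. This is a hypothetical helper, not something in the repo: it assumes the slog is JSON lines where `deliver` and `deliver-result` entries share a `vatID`/`deliveryNum` pair and carry a `time` field in seconds, which may differ between SwingSet versions.

```ts
// slowest-deliveries.ts (hypothetical helper, not part of the repo)
// Scans a slog file and prints the slowest deliveries for one vat.
// Assumes "deliver"/"deliver-result" entries share vatID + deliveryNum
// and carry a "time" field in seconds; field names may vary by version.
import { createReadStream } from 'node:fs';
import { createInterface } from 'node:readline';

type SlogEntry = {
  type?: string;
  vatID?: string;
  deliveryNum?: number;
  time?: number;
};

const main = async (slogPath: string, vatID: string) => {
  const starts = new Map<number, number>(); // deliveryNum -> start time
  const durations: Array<[deliveryNum: number, seconds: number]> = [];
  const rl = createInterface({ input: createReadStream(slogPath) });
  for await (const line of rl) {
    let entry: SlogEntry;
    try {
      entry = JSON.parse(line);
    } catch {
      continue; // skip malformed lines
    }
    if (entry.vatID !== vatID || typeof entry.time !== 'number') continue;
    if (entry.type === 'deliver' && entry.deliveryNum !== undefined) {
      starts.set(entry.deliveryNum, entry.time);
    } else if (entry.type === 'deliver-result' && entry.deliveryNum !== undefined) {
      const start = starts.get(entry.deliveryNum);
      if (start !== undefined) {
        durations.push([entry.deliveryNum, entry.time - start]);
      }
    }
  }
  durations.sort((a, b) => b[1] - a[1]);
  for (const [deliveryNum, seconds] of durations.slice(0, 20)) {
    console.log(`delivery ${deliveryNum}: ${seconds.toFixed(3)}s`);
  }
};

// "vat18" in the issue text is assumed to be kernel vatID "v18" here.
const [, , slogPath = 'chain.slog', vatID = 'v18'] = process.argv;
main(slogPath, vatID).catch(err => {
  console.error(err);
  process.exit(1);
});
```

Pointed at the follower's slog for the block in question, this would print the 20 slowest deliveries in the chosen vat.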
In this block, the vast majority of the time (320s) is spent in vat18, which is the smart wallet.
In vat18, most of the time (314s) in the block is spent on notify deliveries.
Looking at the list of recent provision transactions, while these always take some time to process, this occurrence is definitely an outlier.
While there is slight variance in deliveries, syscalls, and computrons, it does not reflect the variance in wall-clock time. The variance in deliveries is explained by a BOYD call triggering in the latest cases. The variance in syscalls can be explained by the LRU cache, which currently spans deliveries (see #6693). The variance in computrons can be explained by the previous two variances and slightly different payloads.
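As a rough cross-check that computrons do not track the wall-clock variance here, the same slog could be mined to pair each delivery's duration with its metering usage, as in the sketch below. It assumes the computron count sits at `dr[2].compute` in `deliver-result` entries; that location is an assumption about the slog shape and may not hold for every SwingSet version.

```ts
// computrons-vs-time.ts (hypothetical helper, same slog caveats as above)
// Emits one CSV row per delivery: computrons, seconds, computrons per second,
// to eyeball whether computrons track wall-clock time.
import { createReadStream } from 'node:fs';
import { createInterface } from 'node:readline';

type SlogEntry = {
  type?: string;
  vatID?: string;
  deliveryNum?: number;
  time?: number;
  // Assumed shape: [status, problem, meterUsage] with usage.compute as computrons.
  dr?: [string, unknown, { compute?: number } | null];
};

const main = async (slogPath: string, vatID: string) => {
  const starts = new Map<number, number>(); // deliveryNum -> start time
  const rl = createInterface({ input: createReadStream(slogPath) });
  for await (const line of rl) {
    let entry: SlogEntry;
    try {
      entry = JSON.parse(line);
    } catch {
      continue; // skip malformed lines
    }
    if (entry.vatID !== vatID || typeof entry.time !== 'number') continue;
    if (entry.type === 'deliver' && entry.deliveryNum !== undefined) {
      starts.set(entry.deliveryNum, entry.time);
    } else if (entry.type === 'deliver-result' && entry.deliveryNum !== undefined) {
      const start = starts.get(entry.deliveryNum);
      const compute = entry.dr?.[2]?.compute;
      if (start !== undefined && typeof compute === 'number') {
        const seconds = entry.time - start;
        console.log(`${compute},${seconds.toFixed(3)},${(compute / seconds).toFixed(0)}`);
      }
    }
  }
};

const [, , slogPath = 'chain.slog', vatID = 'v18'] = process.argv;
main(slogPath, vatID).catch(err => {
  console.error(err);
  process.exit(1);
});
```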
While we're aware that not all computrons currently result in equivalent wall time, our running assumption is that a transaction doing roughly the same execution overall will take roughly the same amount of time.
One cause for wall-clock variance could be accumulated state. There are two forms of this: heap state, and virtualized state included in syscalls.
During the period above, it seems vat18's heap size as measured by snapshots was around 350MB. This is a very large heap, well outside the XS design parameters, and any organic GC would be very costly. However, the size was roughly stable and at first sight does not seem to correlate sufficiently.
We would need to look at slog files to get a sense of the size of serialized syscall payloads, and thus of whether virtualized state had an impact. This information is currently not available through OpenTelemetry. We also do not have durations for the syscalls themselves (see #6399).
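In the absence of that telemetry, a crude proxy for syscall payload size is the serialized length of each `syscall` slog entry. The sketch below, under the same slog-format assumptions as above, sums those byte lengths per vat; the actual payload lives in fields such as `ksc`/`vsc`, whose exact shape varies, so treat the result as an upper bound rather than a precise measure.

```ts
// syscall-sizes.ts (hypothetical helper, same slog caveats as above)
// Sums the serialized length of "syscall" slog entries per vat as a crude
// proxy for how much virtualized state crosses the vat/kernel boundary.
import { createReadStream } from 'node:fs';
import { createInterface } from 'node:readline';

const main = async (slogPath: string) => {
  const bytesPerVat = new Map<string, number>();
  const rl = createInterface({ input: createReadStream(slogPath) });
  for await (const line of rl) {
    let entry: { type?: string; vatID?: string };
    try {
      entry = JSON.parse(line);
    } catch {
      continue; // skip malformed lines
    }
    if (entry.type === 'syscall' && entry.vatID) {
      // The full entry length over-counts a little (it includes metadata),
      // but it is a stable proxy for the size of the syscall payload.
      const prev = bytesPerVat.get(entry.vatID) ?? 0;
      bytesPerVat.set(entry.vatID, prev + Buffer.byteLength(line));
    }
  }
  const sorted = [...bytesPerVat.entries()].sort((a, b) => b[1] - a[1]);
  for (const [vatID, bytes] of sorted) {
    console.log(`${vatID}: ${(bytes / (1024 * 1024)).toFixed(1)} MiB in syscall payloads`);
  }
};

const [, , slogPath = 'chain.slog'] = process.argv;
main(slogPath).catch(err => {
  console.error(err);
  process.exit(1);
});
```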
Given the lack of notable variance in the execution itself, the cause of the observed wall-clock slowdown is likely the same as what makes large worker processes slower when they stay in memory than when they are restarted from snapshots (i.e. for the exact same deliveries). The current hypothesis is that such workers suffer from fragmented memory (see #6661). This should be mitigated both by reducing the size of the worker through more effective virtualization of state, and by forcing worker restarts on snapshots (#6943).
We should verify that other factors can be ruled out, for example that the size of serialized state passed over syscalls does not grow over time, whether in vatstore or in vstorage. We should also capture/report more information on the execution of deliveries, such as allocations and occurrences of garbage collection, and attempt to compare delivery and syscall timings between these transactions (this is unfortunately complicated by the current LRU cache and BOYD schedule).