Mainnet block 9074082 took 2-6 minutes to process #7185
Labels: bug, chain-incident, cosmic-swingset, SwingSet, telemetry, v1_triaged, vaults_triage, xsnap
Describe the bug
Around 2023-03-16T11:19:00Z, block 9074082 took almost 6 minutes to execute (because of how blockTime is decided, the long block appears to be attributed to block 9074083, but the slow block is actually the previous one).
Investigating the trace produced by our follower node, we noticed that the culprit is a `Provision` transaction taking 314s on the follower node, in a block that overall took 340s, without hitting the computron limit (see #6857).
A wallet provision normally takes 384 deliveries. This trace has 389 because of a `bringOutYourDead` delivery followed by a few drop/retire exports, but these are not responsible for the slowdown, "only" contributing 6s.
Most deliveries take a few dozen or a few hundred milliseconds, but some go up to 27s! It's not clear what is taking the time in these deliveries (we need #6399 to help answer that question). No heap snapshots are taken in this run.
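To narrow down which deliveries are responsible, a slog-scanning script along the lines of the sketch below could help. This is a hypothetical helper, not something in the repo: it assumes the slog is JSON lines where `deliver` and `deliver-result` entries share a `vatID`/`deliveryNum` pair and carry a `time` field in seconds, which may differ between SwingSet versions.

```ts
// slowest-deliveries.ts (hypothetical helper, not part of the repo)
// Scans a slog file and prints the slowest deliveries for one vat.
// Assumes "deliver"/"deliver-result" entries share vatID + deliveryNum
// and carry a "time" field in seconds; field names may vary by version.
import { createReadStream } from 'node:fs';
import { createInterface } from 'node:readline';

type SlogEntry = {
  type?: string;
  vatID?: string;
  deliveryNum?: number;
  time?: number;
};

const main = async (slogPath: string, vatID: string) => {
  const starts = new Map<number, number>(); // deliveryNum -> start time
  const durations: Array<[deliveryNum: number, seconds: number]> = [];
  const rl = createInterface({ input: createReadStream(slogPath) });
  for await (const line of rl) {
    let entry: SlogEntry;
    try {
      entry = JSON.parse(line);
    } catch {
      continue; // skip malformed lines
    }
    if (entry.vatID !== vatID || typeof entry.time !== 'number') continue;
    if (entry.type === 'deliver' && entry.deliveryNum !== undefined) {
      starts.set(entry.deliveryNum, entry.time);
    } else if (entry.type === 'deliver-result' && entry.deliveryNum !== undefined) {
      const start = starts.get(entry.deliveryNum);
      if (start !== undefined) {
        durations.push([entry.deliveryNum, entry.time - start]);
      }
    }
  }
  durations.sort((a, b) => b[1] - a[1]);
  for (const [deliveryNum, seconds] of durations.slice(0, 20)) {
    console.log(`delivery ${deliveryNum}: ${seconds.toFixed(3)}s`);
  }
};

// "vat18" in the issue text is assumed to be kernel vatID "v18" here.
const [, , slogPath = 'chain.slog', vatID = 'v18'] = process.argv;
main(slogPath, vatID).catch(err => {
  console.error(err);
  process.exit(1);
});
```

Pointed at the follower's slog for the block in question, this would print the 20 slowest deliveries in the chosen vat.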
In this block, the vast majority of the time (320s) is spent in vat18, which is the smart wallet.
In vat18, most of the time (314s) in the block is spent on notify deliveries.
Looking at the list of recent provision transactions, while these always take some time to process, this occurrence is definitely an outlier.
While there is slight variance in deliveries, syscalls, and computrons, it does not reflect the variance in wall-clock time. The variance in deliveries is explained by a BOYD call triggering in the latest cases. The variance in syscalls can be explained by the LRU cache, which currently spans deliveries (see #6693). The variance in computrons can be explained by the previous two variances and slightly different payloads.
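As a rough cross-check that computrons do not track the wall-clock variance here, the same slog could be mined to pair each delivery's duration with its metering usage, as in the sketch below. It assumes the computron count sits at `dr[2].compute` in `deliver-result` entries; that location is an assumption about the slog shape and may not hold for every SwingSet version.

```ts
// computrons-vs-time.ts (hypothetical helper, same slog caveats as above)
// Emits one CSV row per delivery: computrons, seconds, computrons per second,
// to eyeball whether computrons track wall-clock time.
import { createReadStream } from 'node:fs';
import { createInterface } from 'node:readline';

type SlogEntry = {
  type?: string;
  vatID?: string;
  deliveryNum?: number;
  time?: number;
  // Assumed shape: [status, problem, meterUsage] with usage.compute as computrons.
  dr?: [string, unknown, { compute?: number } | null];
};

const main = async (slogPath: string, vatID: string) => {
  const starts = new Map<number, number>(); // deliveryNum -> start time
  const rl = createInterface({ input: createReadStream(slogPath) });
  for await (const line of rl) {
    let entry: SlogEntry;
    try {
      entry = JSON.parse(line);
    } catch {
      continue; // skip malformed lines
    }
    if (entry.vatID !== vatID || typeof entry.time !== 'number') continue;
    if (entry.type === 'deliver' && entry.deliveryNum !== undefined) {
      starts.set(entry.deliveryNum, entry.time);
    } else if (entry.type === 'deliver-result' && entry.deliveryNum !== undefined) {
      const start = starts.get(entry.deliveryNum);
      const compute = entry.dr?.[2]?.compute;
      if (start !== undefined && typeof compute === 'number') {
        const seconds = entry.time - start;
        console.log(`${compute},${seconds.toFixed(3)},${(compute / seconds).toFixed(0)}`);
      }
    }
  }
};

const [, , slogPath = 'chain.slog', vatID = 'v18'] = process.argv;
main(slogPath, vatID).catch(err => {
  console.error(err);
  process.exit(1);
});
```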
While we're aware that not all computrons currently result in equivalent wall time, our running assumption is that a transaction doing roughly the same execution overall will take roughly the same amount of time.
One cause for wall-clock variance could be accumulated state. There are two forms of this: heap state, and virtualized state included in syscalls.
During the period above, it seems vat18's heap size as measured by snapshots was around 350MB. This is a very large heap, well outside the XS design parameters, and any organic GC would be very costly. However, the size was roughly stable and at first sight does not seem to correlate sufficiently.
We would need to look at slog files to get a sense of the size of serialized syscall payloads, and thus of whether virtualized state had an impact. This information is currently not available through OpenTelemetry. We also do not have durations for the syscalls themselves (see #6399).
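In the absence of that telemetry, a crude proxy for syscall payload size is the serialized length of each `syscall` slog entry. The sketch below, under the same slog-format assumptions as above, sums those byte lengths per vat; the actual payload lives in fields such as `ksc`/`vsc`, whose exact shape varies, so treat the result as an upper bound rather than a precise measure.

```ts
// syscall-sizes.ts (hypothetical helper, same slog caveats as above)
// Sums the serialized length of "syscall" slog entries per vat as a crude
// proxy for how much virtualized state crosses the vat/kernel boundary.
import { createReadStream } from 'node:fs';
import { createInterface } from 'node:readline';

const main = async (slogPath: string) => {
  const bytesPerVat = new Map<string, number>();
  const rl = createInterface({ input: createReadStream(slogPath) });
  for await (const line of rl) {
    let entry: { type?: string; vatID?: string };
    try {
      entry = JSON.parse(line);
    } catch {
      continue; // skip malformed lines
    }
    if (entry.type === 'syscall' && entry.vatID) {
      // The full entry length over-counts a little (it includes metadata),
      // but it is a stable proxy for the size of the syscall payload.
      const prev = bytesPerVat.get(entry.vatID) ?? 0;
      bytesPerVat.set(entry.vatID, prev + Buffer.byteLength(line));
    }
  }
  const sorted = [...bytesPerVat.entries()].sort((a, b) => b[1] - a[1]);
  for (const [vatID, bytes] of sorted) {
    console.log(`${vatID}: ${(bytes / (1024 * 1024)).toFixed(1)} MiB in syscall payloads`);
  }
};

const [, , slogPath = 'chain.slog'] = process.argv;
main(slogPath).catch(err => {
  console.error(err);
  process.exit(1);
});
```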
Given the lack of notable variance in the execution itself, the cause of the observed wall-clock slowdown is likely the same as what makes large worker processes slower when they stay in memory than when they are restarted from snapshots (i.e. for the exact same deliveries). The current hypothesis is that such workers suffer from fragmented memory (see #6661). This should be mitigated both by reducing the size of the worker through more effective virtualization of state, and by forcing worker restarts on snapshots (#6943).
We should verify that other factors can be ruled out, for example that the size of serialized state passed over syscalls does not grow over time, whether in vatstore or in vstorage. We should also capture/report more information on the execution of deliveries, such as allocations and occurrences of garbage collection, and attempt to compare delivery and syscall timings between these transactions (this is unfortunately complicated by the current LRU cache and BOYD schedule).