feat(cosmic-swingset): Add memory stats to new commit block slog event #5637

Merged
merged 9 commits into master from mhofman/memory-sleuth on Jun 30, 2022

Conversation


@mhofman mhofman commented Jun 21, 2022

refs: #5507

Description

  • Cleans up various otel spans, handling events that were previously not handled and removing data that is not actionable in raw form (to be extended in a follow-up PR).
  • Adds new cosmic-swingset-commit-block-* slog events, each with its own timing:
    • commit-block-start and commit-block-finish measure the time taken by cosmic-swingset to save/flush its data.
    • after-commit-block is an event containing post-block cleanup stats related to Node.js/V8 memory.
  • Adds a NODE_HEAP_SNAPSHOTS environment option to control the generation of Node.js heap snapshots after a block commit (see the sketch after this list):
    • -1 (or any negative number) to disable, 0 to trigger only on a large block interval (hardcoded at 30 seconds), and any positive number to generate a snapshot at that interval.
    • The snapshot is generated after Node.js has informed the Go side of the block commit completion, which means it should happen while Node.js is idle waiting for the next block (unless the cosmos node is already behind).
  • Forces a GC after every commit to remove variance caused by the local GC schedule.
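A minimal sketch of how the NODE_HEAP_SNAPSHOTS option could be interpreted after a block commit; the helper name and its wiring are hypothetical, only the -1 / 0 / positive semantics come from the description above.

// Illustrative sketch only; maybeWriteHeapSnapshot and its wiring are assumed.
import { writeHeapSnapshot } from 'v8';

const LARGE_BLOCK_INTERVAL_MS = 30_000; // hardcoded "large block" threshold
const snapshotInterval = Number(process.env.NODE_HEAP_SNAPSHOTS ?? '-1');

let lastSnapshotMs = 0;

function maybeWriteHeapSnapshot(blockDurationMs, nowMs = Date.now()) {
  if (snapshotInterval < 0) return; // any negative value disables snapshots
  if (snapshotInterval === 0) {
    // 0: only snapshot after an unusually long block
    if (blockDurationMs < LARGE_BLOCK_INTERVAL_MS) return;
  } else if (nowMs - lastSnapshotMs < snapshotInterval * 1000) {
    // positive value: at most one snapshot per that many seconds
    return;
  }
  lastSnapshotMs = nowMs;
  // Runs after the commit reply has been sent to the Go side, so Node.js is
  // normally idle while the snapshot file is written.
  writeHeapSnapshot();
}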

Security Considerations

This adds an option to dump the Node.js heap, which shouldn't contain any sensitive data since it's all derived from consensus execution.

Documentation Considerations

TBD

Testing Considerations

Manual local testing with the loadgen deployment test and the ingest-slog-to-otel script.

@michaelfig michaelfig left a comment


LGTM! Nice and tidy! Please rerequest a review when you're ready for it.

@michaelfig michaelfig self-requested a review June 21, 2022 13:27
@mhofman mhofman changed the title from Mhofman/memory sleuth to feat(cosmic-swingset): Add memory stats to new commit block slog event on Jun 21, 2022
@mhofman mhofman marked this pull request as ready for review June 22, 2022 16:48
-.map(([key, value]) => [`agoric.${key}`, cleanValue(value, key)])
-.filter(([_key, value]) => value !== undefined && value !== null),
-),
+...serializeInto(attrs, 'agoric'),
Member Author

@michaelfig this does cause some data to be serialized deeply, such as slots being written as .slots.0=ko123 etc instead of .slots=["ko123"]
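For illustration, a minimal sketch (not the actual serializeInto implementation) of the kind of flattening being described, where nested values become dotted attribute keys:

// Hypothetical flattener, for illustration only.
const flatten = (value, prefix, out = {}) => {
  if (value === null || typeof value !== 'object') {
    out[prefix] = value;
  } else {
    for (const [key, child] of Object.entries(value)) {
      flatten(child, `${prefix}.${key}`, out);
    }
  }
  return out;
};

flatten({ slots: ['ko123', 'ko456'] }, 'agoric.message');
// => { 'agoric.message.slots.0': 'ko123', 'agoric.message.slots.1': 'ko456' }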

Member

That's perfectly fine, IMHO.

Member

That sounds like it makes the number of attributes per (span?) larger and variable. If some crank uses a thousand slots, is that going to cause problems on the telemetry provider?

While it's nice to be able to recover the list of slots, I'm hard pressed to imagine a honeycomb-side search that could use it (slogfile-processing tools are probably more useful there). Unless it's possible to do a honeycomb search like "find me all spans WHERE any .slots.* attributes have value ko123", which would be kinda cool for causality tracing (currently I do this with a lot of jq work and/or ad-hoc python tools).

Member Author

Right, a base slots attribute which is a comma-separated string of values may be more useful, and could be used with contains in Honeycomb queries.

Member Author

I explicitly handled slots below to avoid splitting them into separate fields.

const slots =
  message.msg.methargs?.slots ?? message.msg.args.slots ?? [];
attrs['message.msg.args.slots'] = slots.join(',');
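For example (values illustrative), a delivery referencing two krefs now produces a single attribute rather than one attribute per slot:

// attrs['message.msg.args.slots'] === 'ko123,ko456'
// Honeycomb query (illustrative): WHERE message.msg.args.slots contains ko123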
Member Author

@warner is there a risk that , would ever be used in krefs?

Member

Not anytime soon; I have a bunch of analysis tools that also use , as a joiner. vrefs are more interesting, but krefs have no punctuation (purely k[opd]\d+) and there are no plans for them to acquire any.
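A quick illustrative check of that claim: krefs have the shape k[opd] followed by digits, so a comma can never appear inside one and joining/splitting round-trips cleanly.

// Illustrative only.
const KREF_RE = /^k[opd]\d+$/;
const krefs = ['ko123', 'kp40', 'kd7'];
krefs.every(ref => KREF_RE.test(ref)); // true: all match the kref shape
krefs.join(',').split(',');            // round-trips to the same array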

Member

I guess the reason that this code is only ever seeing krefs is that it's fed by crank-start events, rather than type: 'deliver' events? If it were the latter, something would need to extract the kernel-format delivery (.kd), and I don't see that here.

Member Author

I plan on re-thinking the crank vs delivery slog events.

Comment on lines +564 to +585
if (spans.topName() === `intra-block`) {
spans.pop(`intra-block`);
}
Member Author

This introduces a new span to capture the duration between "commit-block" and the next "begin-block", and layers it as a child of the "previous" block, which causes the block's duration to now measure block-to-block time.

Before, we were not capturing this gap anywhere, causing blocks to appear shorter than they really were.

cc @mfig @warner @arirubinstein
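A rough sketch of the layering being described; the slog event names follow the PR description, while the handler wiring and the spans.push call are assumptions extrapolated from the snippet above.

// Rough sketch only; handler names and push/pop shape are assumed.
const slogHandlers = {
  'cosmic-swingset-commit-block-finish': () => {
    // Open a child span covering the gap until the next block begins.
    spans.push('intra-block');
  },
  'cosmic-swingset-begin-block': () => {
    if (spans.topName() === 'intra-block') {
      spans.pop('intra-block');
    }
    spans.pop('block');  // previous block span now measures block-to-block time
    spans.push('block'); // start the span for the new block
  },
};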


mhofman commented Jun 28, 2022

@michaelfig @warner, PTAL (recommend reviewing commit-by-commit)

I've rebased and simplified the commit-block slog events (sorry for invalidating previous reviews).

I've also included some preliminary cleanup to the slog-to-otel in advance of a more thorough rewrite of spans.

@warner warner left a comment


Ok, I think that's good. I can't visualize the collection of spans and how they line up, so I'm not confident that I understand how this changes the spans, but I know you've got another change in the works which will include docs and a list/diagram of spans, so I'll re-examine this code once those docs are available as a reference.

@@ -230,23 +234,27 @@ export function makeSnapStore(
toDelete.add(hash);
}

-function commitDeletes(ignoreErrors = false) {
+async function commitDeletes(ignoreErrors = false) {
Member

Looks sound, although I don't get why an async version is better, or why these changes are part of this PR (this doesn't change the swingstore API, right? Just some churn in the necessary IO endowments?).

Member Author

It was needed in an earlier version to allow executing JS while the delete I/Os were taking place. This version is strictly superior though, since it parallelizes the deletes.
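A sketch of that parallelization, assuming a promise-based unlink endowment and a hashPath helper; this is not the PR's exact code, just the Promise.all shape being described.

// Sketch only: hashPath and unlink stand in for the snapStore's real endowments.
async function commitDeletes(ignoreErrors = false) {
  const errors = [];
  await Promise.all(
    [...toDelete].map(async hash => {
      try {
        await unlink(hashPath(hash)); // all deletes issued concurrently
        toDelete.delete(hash);
      } catch (error) {
        if (!ignoreErrors) errors.push(error);
      }
    }),
  );
  if (errors.length) {
    throw new AggregateError(errors, 'failed to delete snapshots');
  }
}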

Member Author

And you're correct, at this point it became a drive-by change.

@@ -1011,7 +1011,12 @@ export default function buildKernel(
async function processDeliveryMessage(message) {
kdebug(`processQ ${JSON.stringify(message)}`);
kdebug(legibilizeMessage(message));
kernelSlog.write({ type: 'crank-start', message });
kernelSlog.write({
Member

I was thinking of using delivery-crank-start and routing-crank-start instead of a separate field (likewise delivery-crank-finish/routing-crank-finish). I figured that my jq tools would just grow a select(.type=="delivery-crank-start") instead of select(.type=="crank-start" and .crankType=="delivery"); it's slightly easier to type, and it's rare that I'd want to pay attention to both types at the same time.

Also, I'm kinda wavering on the idea that the routing step should be called a "crank" (it might be better to find a new name), and having separate type: values would make that easier to change.

OTOH, if they share a crankNum numberspace then having separate type values is a tiny bit weird.

OK, let's leave it as crankType for now, but think about it for later.

Member Author

Yeah, the shared numbering space was one motivation for the decision. The other is that some "delivery" cranks still do not deliver anything. I was also concerned about breaking existing slog tooling which targets crank-start.
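For reference, the shape under discussion looks roughly like this; type, crankType, and the shared crankNum come from the thread, everything else is an assumption.

// Roughly the shape discussed above; exact payload is assumed.
function slogCrankStart(kernelSlog, crankNum, message, crankType) {
  kernelSlog.write({
    type: 'crank-start',
    crankType, // 'delivery' or 'routing'
    crankNum,  // shared numbering space across both crank types
    message,
  });
}

// jq filter from the discussion above:
//   select(.type == "crank-start" and .crankType == "delivery")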

});

controller.writeSlogObject({
type: 'cosmic-swingset-commit-block-finish',
Member

For the type: name, be aware that the commit() this wraps is only the swingset/swing-store commit: the rest of cosmos-sdk does not start its IAVL commit until after this returns. The name is probably ok (we can pretend that the cosmic-swingset- prefix implies that we're only talking about the swingset host app's commit of swingset state), but if we ever find a way to grab the golang/IAVL commit timing and write it to the slog, we might need to find a more precise name.

Member Author

Right, I expect we'll need to overhaul the layering of spans if/when we ever manage to report data from the go side.

let heapSnapshotTime;

const t0 = performance.now();
engineGC();
Member

Ok, so this is forcing a Node.js GC every block. Normally I'd be concerned about interfering with the carefully-designed V8 GC algorithm (we shouldn't presume to know better), but given that this is only happening every 5-10 seconds, and we're measuring how much overhead we're introducing, I'm ok with it.

But let's pay attention to the results, and be willing to remove it if it appears to be overhead that could be handled better through the normal incremental GC. I see that afterCommitCallback implies that this might happen in parallel with golang-side work, but I never believe parallelism actually happens without proof.

Member Author

Correct, that's why I added the measurement compared to the previous iteration.
I couldn't 100% confirm that it happened in parallel with the go side, but from what I could tell, it did.
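A sketch of the measurement being referred to; the forced GC and the after-commit-block event come from the PR description, while the event payload and field names here are assumptions.

import v8 from 'v8';
import { performance } from 'perf_hooks';

// Sketch only: payload field names are assumed, not the PR's exact shape.
function recordAfterCommitStats(controller, engineGC) {
  const t0 = performance.now();
  engineGC(); // forced full GC, so memory stats are comparable block to block
  const fullGcDuration = performance.now() - t0;

  controller.writeSlogObject({
    type: 'cosmic-swingset-after-commit-block',
    fullGcDuration,
    memoryUsage: process.memoryUsage(),
    heapStats: v8.getHeapStatistics(),
  });
}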

for await (const line of lines) {
lineCount += 1;
const obj = JSON.parse(line);
-  const update = obj.time >= progress.lastSlogTime;
+  update ||= obj.time >= progress.lastSlogTime;
Member

Huh, what prompted this? Did slogfiles have non-monotonically-increasing timestamps?

I see that there's no way for update to ever become false again (within a single run of this program), but that's ok, as the feature is meant to allow the program to be restarted after a crash and pick up where it left off.

Member Author

I had some weird case where some slog lines were skipped, and this was the only explanation I found. I unfortunately didn't save the slog file.
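For reference, the resume-after-crash latch described above, sketched; ingest and replay are stand-ins for the real processing, not the script's actual names.

// Sketch of the latch: once one line reaches the saved progress point, every
// later line is ingested, even if a later timestamp happens to go backwards.
async function replay(lines, progress, ingest) {
  let update = false;
  for await (const line of lines) {
    const obj = JSON.parse(line);
    update ||= obj.time >= progress.lastSlogTime;
    if (!update) continue; // still skipping lines ingested in a previous run
    await ingest(obj);
  }
}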

@michaelfig michaelfig left a comment


LGTM! Nice work, Mathieu.

@mhofman mhofman added the automerge:rebase Automatically rebase updates, then merge label Jun 30, 2022
@mergify mergify bot merged commit cd52fca into master Jun 30, 2022
@mergify mergify bot deleted the mhofman/memory-sleuth branch June 30, 2022 18:45