
xsnap message timestamp logging/reporting #5152

Open
warner opened this issue Apr 19, 2022 · 6 comments
Labels: enhancement (New feature or request), SwingSet (package: SwingSet), xsnap (the XS execution tool)

Comments

warner commented Apr 19, 2022

What is the Problem Being Solved?

Our "slogfile" currently records kernel-side timestamps of messages going to and from vats: deliveries, syscall requests, syscall responses, and delivery results.

To support investigations like #5139, as well as to collect data for performance improvements, it would be nice if we also had worker-side timestamps for those same messages. We expect the cross-process messaging to be quick, but we don't actually have enough data to prove it. And we expect the kernel to be idle and thus react immediately to a worker's syscall request, but if the kernel were stuck for some reason (VM thrashing, waiting for some IO write), then a late timestamp on the receipt of a syscall request would look identical to the worker spending a long time before emitting the syscall.

Description of the Design

I'm thinking that xsnap.c should have an array (maybe pre-allocated) of timestamps. Whenever it receives the end of a netstring frame, it should append a gettimeofday (ideally as 64-bit seconds-plus-microseconds-since-epoch) to the array. When it sends a string to the parent process, it should also append a timestamp. We might need to include a tiny bit (maybe just one character) of the message so we can distinguish between console messages and syscalls.
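For reference, the framing the worker watches for is the netstring format (`<length>:<payload>,`). A minimal Python sketch of that framing (illustrative only; the real parsing is C in xsnap.c):

```python
def encode_netstring(payload: bytes) -> bytes:
    """Frame a payload as a netstring, e.g. b"hello" -> b"5:hello,"."""
    return b"%d:%s," % (len(payload), payload)

def decode_netstring(buf: bytes) -> tuple[bytes, bytes]:
    """Parse one netstring from buf; return (payload, remainder)."""
    length, _sep, rest = buf.partition(b":")
    n = int(length)
    payload, comma, remainder = rest[:n], rest[n:n + 1], rest[n + 1:]
    assert comma == b",", "malformed netstring"
    return payload, remainder

frame = encode_netstring(b"hello")
assert frame == b"5:hello,"
payload, rest = decode_netstring(frame + b"2:ok,")
assert payload == b"hello" and rest == b"2:ok,"
```

The timestamp would be sampled at the moment the trailing comma of a frame is consumed.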

When the delivery is complete and the worker is serializing the results, it should fetch one last timestamp, then serialize the entire array and include it in the metering results. Then it should reset the array length to zero in preparation for the next delivery to begin.
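A Python model of the per-delivery collection described above (illustrative names only; the real implementation would be C in xsnap.c, using `gettimeofday()` where this sketch uses `time.time()`):

```python
import time

class DeliveryTimestamps:
    """Collect one timestamp per message event; reset per delivery."""
    def __init__(self):
        self.timestamps = []

    def record(self):
        # the C code would call gettimeofday(); time.time() is the
        # Python analog (float seconds since epoch)
        self.timestamps.append(time.time())

    def finish(self):
        # fetch one last timestamp just before serializing the
        # metering results, then reset for the next delivery
        self.record()
        result = list(self.timestamps)
        self.timestamps.clear()
        return result

dt = DeliveryTimestamps()
dt.record()          # delivery message received
dt.record()          # syscall request sent
dt.record()          # syscall response received
stamps = dt.finish() # delivery result about to be sent
assert len(stamps) == 4
assert stamps == sorted(stamps)
assert dt.timestamps == []
```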

On the kernel side, we already include the full metering record in the slogfile. As long as the kernel doesn't attempt to interpret the additional data, we need no additional modifications.

With this information in the slogfile, we can write post-mortem tools that correlate the sends and receives of each syscall request and response, and build a picture of how long the worker spent working, as opposed to how long it spent waiting for the kernel.

Security Considerations

None, the additional metering data is not shared elsewhere in the kernel, and is not included in consensus.

Test Plan

Unit tests in packages/xsnap/ asserting the presence of the timestamp array and its correct length, and that the values are sorted and bounded by parent-side timestamps captured before and after the worker interaction.
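The bounding assertion could look like the following sketch (the actual tests are in JS under packages/xsnap/; this just illustrates the invariant):

```python
import time

def check_worker_timestamps(before, timestamps, after):
    """Assert worker timestamps are sorted, and bracketed by
    parent-side samples taken before and after the interaction."""
    assert timestamps == sorted(timestamps), "not monotonic"
    assert before <= timestamps[0], "first stamp precedes parent send"
    assert timestamps[-1] <= after, "last stamp follows parent receive"

before = time.time()
stamps = [time.time(), time.time(), time.time()]
after = time.time()
check_worker_timestamps(before, stamps, after)
```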

warner added the enhancement and xsnap labels Apr 19, 2022
Tartuffo added the SwingSet label May 12, 2022
@Tartuffo

@warner Is this needed for MN-1 or can it go into the Product Backlog?

warner self-assigned this Jun 9, 2022
warner commented Jun 9, 2022

This would help with the perf investigations called for by #5507 and #5361.

Tartuffo added this to the Mainnet 1 milestone Jun 9, 2022
warner added a commit to agoric-labs/xsnap-pub that referenced this issue Jun 10, 2022
This does not change the engine behavior, but it augments the metering
results with a "timestamps" field (an array of Numbers), which
captures the times at which:
* the primary delivery message was received
* each syscall ("issueCommand" to parent) was sent
* the response from each syscall was received
* the delivery response was sent (actually just before the metering
results are serialized)

The parent process can correlate these timestamps with similar ones
collected on the parent side, to determine how long it took to send
messages over the pipe. If an unexpected stall is observed, this can
help distinguish between the stall happening on the worker process or
on the parent process.

refs Agoric/agoric-sdk#5152
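Given the ordering described in that commit message, a slogfile consumer could unpack the flat array like this (a sketch, assuming every syscall contributed exactly one send/receive pair):

```python
def unpack_timestamps(timestamps, num_syscalls):
    """Split the worker's flat timestamp array into named events:
    [rx_delivery, (tx_syscall, rx_result) per syscall, tx_result]."""
    expected = 2 + 2 * num_syscalls
    assert len(timestamps) == expected, "unexpected array length"
    rx_delivery = timestamps[0]
    tx_result = timestamps[-1]
    syscalls = [(timestamps[1 + 2 * i], timestamps[2 + 2 * i])
                for i in range(num_syscalls)]
    return rx_delivery, syscalls, tx_result

ts = [10.0, 10.1, 10.2, 10.3, 10.4, 10.5]
rx, sc, tx = unpack_timestamps(ts, 2)
assert rx == 10.0 and tx == 10.5
assert sc == [(10.1, 10.2), (10.3, 10.4)]
```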
warner added a commit that referenced this issue Jun 15, 2022
This moves to the version of `xsnap-worker.c` which includes
receipt/transmit timestamps in the metering results.

Tests are updated to expect 'timestamps' in xsnap metering results,
and to assert that they correctly interleave with the times of
upstream/downstream commands as seen by the parent process.

refs #5152
warner added a commit that referenced this issue Jun 15, 2022
Unix provides multiple time APIs. `gettimeofday()` reports
seconds-since-epoch with microsecond resolution, and can jump around,
speed up, or even go backwards if/when someone changes the clock on
the host computer. `clock_gettime(CLOCK_MONOTONIC)` never goes
backwards, has nanosecond resolution, and in practice reports
seconds-since-last-reboot (of the host computer). `CLOCK_MONOTONIC` is
implemented (at least on linux) by initializing an offset (to the
current time) at boot, and reporting `gettimeofday() - offset`. If
someone calls `settimeofday()`, the kernel measures the delta and
subtracts it from the offset, so the `CLOCK_MONOTONIC` value does not
change even though the `gettimeofday()` value does. However, both
`gettimeofday()` and `CLOCK_MONOTONIC` will reflect the gentler
frequency adjustments that NTP makes (through `adjtime()`) to keep the
clock in sync without timequakes.

Previously, our slogfile's `.time` property recorded
`performance.timeOrigin + performance.now()`, which is effectively the
sum of `gettimeofday()` at process start, plus the delta in
`CLOCK_MONOTONIC` between one sample at process start and another when
`now()` is called, plus a tiny offset because `gettimeofday()` and the
initial sample cannot happen exactly simultaneously. This has the
benefit of never decreasing, and preserves the accuracy of
intervals (timestamp subtractions) that span a `settimeofday()`
timequake. But it has the downside of becoming unique per-process when
timequakes happen: if you compare this value against a similar one
built in a different process, they will be different by the sum of any
timequakes that occurred between the first process launching and the
second one launching. If a process restarts, the timestamps before and
after the restart will be offset by a similar amount.

This changes the slogfile to use `gettimeofday()`, via a small C
module named `microtime`. This matches what #5152 changes in
`xsnap-worker.c` to record delivery/syscall send+receive
timestamps. By using `gettimeofday()` on both sides, the timestamps
remain comparable independent of when the processes are started or
restarted. The drawback is that spans which cross a timequake will be
corrupted, even becoming negative if `settimeofday()` causes the
system clock to go backwards far enough. However we only really pay
attention to one span at a time (a single block, or delivery, or
syscall), so the damage is limited. I care more about comparability of
timestamps across long-running processes (potentially across multiple
hopefully-NTP-synced machines) than I'm worried about the occasional
timequake-corrupted span.

If we manage to make swingset fast enough that the microsecond
resolution of `gettimeofday()` feels insufficient, we should probably
change both the slogger and `xsnap-worker.c` to use the earlier
approach: `timeOrigin + CLOCK_MONOTONIC` (which will involve writing
new code on the `xsnap-worker.c` side).

refs #5152
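The clock trade-off described above maps directly onto Python's clock APIs, which can illustrate it: `time.time()` is the wall-clock (gettimeofday-style) source, and `time.monotonic()` behaves like `CLOCK_MONOTONIC` (a sketch, not part of the actual change):

```python
import time

# wall-clock: comparable across processes and machines, but subject
# to timequakes if someone calls settimeofday()
t1 = time.time()
# monotonic: never goes backwards, but its origin is per-process
# (often seconds since boot), so raw values are not comparable
# across processes
m1 = time.monotonic()

t2 = time.time()
m2 = time.monotonic()

# intervals measured on either clock should roughly agree in the
# absence of a timequake
wall_span = t2 - t1
mono_span = m2 - m1
assert mono_span >= 0                    # monotonic never decreases
assert abs(wall_span - mono_span) < 0.5  # spans agree (no timequake)
```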
warner added a commit that referenced this issue Jun 15, 2022
Unix provides multiple time APIs. `gettimeofday()` reports
seconds-since-epoch with microsecond resolution, and can jump around,
speed up, or even go backwards if/when someone changes the clock on
the host computer. `clock_gettime(CLOCK_MONOTONIC)` never goes
backwards, has nanosecond resolution, and in practice reports
seconds-since-last-reboot (of the host computer). `CLOCK_MONOTONIC` is
implemented (at least on linux) by initializing an offset (to the
current time) at boot, and reporting `gettimeofday() - offset`. If
someone calls `settimeofday()`, the kernel measures the delta and
subtracts it from the offset, so the `CLOCK_MONOTONIC` value does not
change even though the `gettimeofday()` value does. However, both
`gettimeofday()` and `CLOCK_MONOTONIC` will reflect the gentler
frequency adjustments that NTP makes (through `adjtime()`) to keep the
clock in sync without timequakes.

Previously, our slogfile's `.time` property recorded
`performance.timeOrigin + performance.now()`, which is effectively the
sum of `gettimeofday()` at process start, plus the delta in
`CLOCK_MONOTONIC` between one sample at process start and another when
`now()` is called, plus a tiny offset because `gettimeofday()` and the
initial sample cannot happen exactly simultaneously. This has the
benefit of never decreasing, and preserves the accuracy of
intervals (timestamp subtractions) that span a `settimeofday()`
timequake. But it has the downside of becoming unique per-process when
timequakes happen: if you compare this value against a similar one
built in a different process, they will be different by the sum of any
timequakes that occurred between the first process launching and the
second one launching. If a process restarts, the timestamps before and
after the restart will be offset by a similar amount.

This changes the slogfile to use `gettimeofday()` for the `.time`
property, via a small C module named `microtime`. This matches what
issue #5152 changes in `xsnap-worker.c` to record delivery/syscall
send+receive timestamps. By using `gettimeofday()` on both sides, the
timestamps remain comparable independent of when the processes are
started or restarted. The drawback is that spans which cross a
timequake will be corrupted, even becoming negative if
`settimeofday()` causes the system clock to go backwards far
enough. However we only really pay attention to one span at a time (a
single block, or delivery, or syscall), so the damage is limited. I
care more about comparability of timestamps across long-running
processes (potentially across multiple hopefully-NTP-synced machines)
than I'm worried about the occasional timequake-corrupted span.

A new `.monotime` property records just `performance.now()`, which is
CLOCK_MONOTONIC relative to process start, and should have nanosecond
resolution (despite returning a number that absurdly claims
femtoseconds, probably because of base-2 representation limits). This
is good for higher-resolution interval computation, and is immune to
timequakes, but cannot be easily compared to anything outside the
process.

The slogfile lines were also rearranged, to put `type` at the front,
and the two timestamps at the end. This should make the file more
legible to humans who are just trying to eyeball the start/end of each
delivery.

If, some day, we manage to make swingset fast enough that the
microsecond resolution of `gettimeofday()` feels insufficient, we
should probably change both the slogger and `xsnap-worker.c` to use
the earlier approach: `timeOrigin + CLOCK_MONOTONIC` (which will
involve writing new code on the `xsnap-worker.c` side).

refs #5152
warner added a commit that referenced this issue Jun 17, 2022
warner commented Jun 19, 2022

Here's a (Python) tool to scan a slogfile with the new meterResults.timestamps data and produce a report:

import sys, json, gzip

class Syscall:
    def __init__(self, vsc, num, rx_request):
        self.num = num
        self.vsc = vsc
        self.tx_request = None
        self.rx_request = rx_request
        self.tx_result = None
        self.rx_result = None
        if vsc[0] == "send":
            target = vsc[1]
            msg = vsc[2]
            methargsCD = msg["methargs"]
            method = json.loads(methargsCD["body"])[0]
            result = msg.get("result")
            self.description = "(%s).%s -> %s" % (target, method, result)
        elif vsc[0] == "callNow":
            target = vsc[1]
            method = vsc[2]
            self.description = "D(%s).%s" % (target, method)
        elif vsc[0] == "subscribe":
            vpid = vsc[1]
            self.description = "subscribe(%s)" % vpid
        elif vsc[0] == "resolve":
            vpid_strings = []
            for [vpid, reject, data] in vsc[1]:
                r = "!" if reject else ""
                vpid_strings.append(r + vpid)
            self.description = "notify %s" % ",".join(vpid_strings)
        elif vsc[0] in ["dropImports", "retireImports", "retireExports"]:
            self.description = "%s(%s)" % (vsc[0], ",".join(vsc[1]))
        else:
            self.description = vsc[0]
    def got_result(self, when):
        self.tx_result = when
    def set_tx_request(self, when):
        self.tx_request = when
    def set_rx_result(self, when):
        self.rx_result = when

class Delivery:
    def __init__(self, cranknum, vd, when):
        self.cranknum = cranknum
        self.syscalls = []
        self.tx_delivery = when
        self.rx_delivery = None
        self.tx_result = None
        self.rx_result = None
        self.vd = vd
        if vd[0] == "message":
            target = vd[1]
            msg = vd[2]
            methargsCD = msg["methargs"]
            method = json.loads(methargsCD["body"])[0]
            result = msg.get("result")
            self.description = "(%s).%s -> %s" % (target, method, result)
        elif vd[0] == "notify":
            vpid_strings = []
            for [vpid, reject, data] in vd[1]:
                r = "!" if reject else ""
                vpid_strings.append(r + vpid)
            self.description = "notify %s" % ",".join(vpid_strings)
        else:
            self.description = vd[0]
    def add_syscall(self, vsc, when):
        num = len(self.syscalls)
        s = Syscall(vsc, num, when)
        self.syscalls.append(s)
    def num_syscalls(self):
        return len(self.syscalls)
    def get_syscall(self, which):
        return self.syscalls[which]
    def set_rx_result(self, when):
        self.rx_result = when
    def set_rx_delivery(self, when):
        self.rx_delivery = when
    def set_tx_result(self, when):
        self.tx_result = when
        
    
fn = sys.argv[1]
vatID = sys.argv[2] if len(sys.argv) > 2 else None

deliveries = []

opener = gzip.open if fn.endswith(".gz") else open
with opener(fn) as f:
    for line in f:
        if isinstance(line, bytes):
            line = line.decode("utf-8")
        if not line.strip():
            continue

        data = json.loads(line.strip())
        if vatID and data.get("vatID") != vatID:
            continue
        type = data["type"]
        when = data["time"]

        if type == 'deliver':
            cranknum = data["crankNum"]
            deliverynum = data["deliveryNum"]
            d = Delivery(cranknum, data["vd"], when)
            deliveries.append(d)
        if type == 'syscall':
            deliveries[-1].add_syscall(data['vsc'], when)
        if type == 'syscall-result':
            deliveries[-1].get_syscall(-1).got_result(when)
        if type == 'deliver-result':
            deliveries[-1].set_rx_result(when)
            dr = data["dr"]
            mr = dr[2]
            if mr and "timestamps" in mr:
                timestamps = mr["timestamps"]
                d = deliveries[-1]
                if len(timestamps):
                    d.set_rx_delivery(timestamps.pop(0))
                for syscallnum in range(d.num_syscalls()):
                    s = d.get_syscall(syscallnum)
                    if len(timestamps):
                        s.set_tx_request(timestamps.pop(0))
                    if len(timestamps):
                        s.set_rx_result(timestamps.pop(0))
                if len(timestamps):
                    d.set_tx_result(timestamps.pop(0))

fill = "                         "
prev_d = None
for d in deliveries:
    total_worker = 0
    total_kernel = 0
    total_pipe = 0
    if prev_d:
        kerneltime = d.tx_delivery - prev_d.rx_result
        print("                                          %.6f kern between-cranks" % kerneltime)
    prev_d = d
    print("c%d %s" % (d.cranknum, d.description))
    if d.rx_delivery:
        print("                                                          k -> %.6f -> w   (send delivery)" % (d.rx_delivery - d.tx_delivery))
        #d.tx_result - d.rx_delivery,
        total_pipe += (d.rx_delivery - d.tx_delivery)
    else:
        print("                                             k -> w   (send delivery)")
    prev_s = None
    for s in d.syscalls:
        if not prev_s and d.rx_delivery and s.tx_request:
            print("%s%.6f worker" % (fill, s.tx_request - d.rx_delivery))
            total_worker += (s.tx_request - d.rx_delivery)
        if prev_s and prev_s.rx_result and s.tx_request:
            print("%s%.6f worker" % (fill, s.tx_request - prev_s.rx_result))
            total_worker += (s.tx_request - prev_s.rx_result)
        prev_s = s
        if s.tx_request and s.rx_result:
            print("   w -> %.6f -> k,                    %.6f kern,  k -> %.6f -> w : %s" % (
                s.rx_request - s.tx_request,
                s.tx_result - s.rx_request,
                s.rx_result - s.tx_result,
                s.description))
            total_pipe += (s.rx_request - s.tx_request)
            total_kernel += (s.tx_result - s.rx_request)
            total_pipe += (s.rx_result - s.tx_result)
        else:
            print("  w -> k,                    %.6f kern,  k -> w : %s" % (
                s.tx_result - s.rx_request,
                s.description))
    if prev_s and d.tx_result:
        print("%s%.6f worker" % (fill, d.tx_result - prev_s.rx_result))
        total_worker += (d.tx_result - prev_s.rx_result)
    if not prev_s and d.tx_result:
        print("%s%.6f worker" % (fill, d.tx_result - d.rx_delivery))
        total_worker += (d.tx_result - d.rx_delivery)
    if d.tx_result:
        print("   w -> %.6f -> k                                                          (return result)" % (d.rx_result - d.tx_result))
        total_pipe += (d.rx_result - d.tx_result)
    else:
        print("  w -> k                                              (return result)")
    k_to_k = d.rx_result - d.tx_delivery
    if d.tx_result:
        print("   pipe %.6f, worker %.6f, kernel %.6f    total  (k->k %.6f)" % (
            total_pipe, total_worker, total_kernel, k_to_k))
    else:
        print("         total  (k->k %.6f)" % k_to_k)
    

A sample report looks like:

c74 retireExports
                                                          k -> 0.000078 -> w   (send delivery)
                         0.000148 worker
   w -> 0.000121 -> k                                                          (return result)
   pipe 0.000199, worker 0.000148, kernel 0.000000    total  (k->k 0.000347)
                                          0.000346 kern between-cranks
c75 bringOutYourDead
                                                          k -> 0.000066 -> w   (send delivery)
                         0.001141 worker
   w -> 0.000188 -> k,                    0.000071 kern,  k -> 0.000053 -> w : vatstoreGet
                         0.000067 worker
   w -> 0.000098 -> k,                    0.000145 kern,  k -> 0.000059 -> w : vatstoreGetAfter
                         0.000062 worker
   w -> 0.000187 -> k,                    0.000025 kern,  k -> 0.000057 -> w : dropImports(o-50)
                         0.000060 worker
   w -> 0.000234 -> k,                    0.000025 kern,  k -> 0.000062 -> w : retireImports(o-50)
                         0.000044 worker
   w -> 0.000171 -> k                                                          (return result)
   pipe 0.001175, worker 0.001374, kernel 0.000266    total  (k->k 0.002815)

which shows two deliveries (retireExports in crank 74, then bringOutYourDead in crank 75). The kernel spent 346us thinking between the end of the retireExports and the beginning of the bringOutYourDead. The BOYD started with a 66us delay to send the VatDeliveryObject from the (k)ernel to the (w)orker. Then the worker spent 1.141ms thinking. Then the worker sent a vatstoreGet syscall, which took 188us to get from worker to kernel, 71us in the kernel (doing a DB read), then 53us to get the syscall results back to the worker. Then the worker thought for 67us before sending a vatstoreGetAfter syscall, etc. The worker's VatDeliveryResult object took 171us to be sent from worker to kernel. Overall, the delivery took 2.815ms (from the delivery leaving the kernel to the result arriving back at the kernel), of which 1.374ms was spent in the worker, 266us was spent in the kernel, and 1.175ms was spent sending things back and forth through pipes.
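As a sanity check on the report's arithmetic: for each crank, the pipe, worker, and kernel totals should sum to the kernel-to-kernel span (the values below are copied from the sample report above):

```python
# totals reported for crank 74 (retireExports)
pipe, worker, kernel, k_to_k = 0.000199, 0.000148, 0.000000, 0.000347
assert round(pipe + worker + kernel, 6) == k_to_k

# totals reported for crank 75 (bringOutYourDead)
pipe, worker, kernel, k_to_k = 0.001175, 0.001374, 0.000266, 0.002815
assert round(pipe + worker + kernel, 6) == k_to_k
```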

This dataset came from a unit test, so it's pretty quiet (there's no chain activity going on).

The next step is to adapt this code into slog-to-otel.js and build spans out of all these little chunks of activity.

warner commented Jun 19, 2022

The timestamps we're recording look like this:

[diagram: slog timestamps]

The red dashed boxes are potential sources of the unexpected delays we've observed in testnet data.

@Tartuffo

@mhofman estimate please?

warner commented Jul 7, 2022

The xsnap.c and slog-recording portions have landed. Once slog-to-otel.js is converting an adequate amount of this data into spans for telemetry, we can close this ticket.

warner removed their assignment Jul 7, 2022