
xsnap message timestamp logging/reporting #5152

Open
warner opened this issue Apr 19, 2022 · 6 comments
Labels: enhancement (New feature or request), SwingSet (package: SwingSet), xsnap (the XS execution tool)

Comments

warner commented Apr 19, 2022

What is the Problem Being Solved?

Our "slogfile" currently records kernel-side timestamps of messages going to and from vats: deliveries, syscall requests, syscall responses, and delivery results.

To support investigations like #5139, as well as to collect data for performance improvements, it would be nice if we also had worker-side timestamps for those same messages. We expect the cross-process messaging to be quick, but we don't actually have enough data to prove it. And we expect the kernel to be idle and thus react immediately to a worker's syscall request, but if the kernel were stuck for some reason (VM thrashing, waiting for some IO write), then a late timestamp on the receipt of a syscall request would look identical to the worker spending a long time before emitting the syscall.

Description of the Design

I'm thinking that xsnap.c should have an array (maybe pre-allocated) of timestamps. Whenever it receives the end of a netstring frame, it should append a gettimeofday (ideally as 64-bit seconds-plus-microseconds-since-epoch) to the array. When it sends a string to the parent process, it should also append a timestamp. We might need to include a tiny bit (maybe just one character) of the message so we can distinguish between console messages and syscalls.
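For reference, the framing the worker watches for is the netstring format (`<length>:<payload>,`). A minimal Python sketch of that framing (illustrative only; the real parsing is C in xsnap.c):

```python
def encode_netstring(payload: bytes) -> bytes:
    """Frame a payload as a netstring, e.g. b"hello" -> b"5:hello,"."""
    return b"%d:%s," % (len(payload), payload)

def decode_netstring(buf: bytes) -> tuple[bytes, bytes]:
    """Parse one netstring from buf; return (payload, remainder)."""
    length, _sep, rest = buf.partition(b":")
    n = int(length)
    payload, comma, remainder = rest[:n], rest[n:n + 1], rest[n + 1:]
    assert comma == b",", "malformed netstring"
    return payload, remainder

frame = encode_netstring(b"hello")
assert frame == b"5:hello,"
payload, rest = decode_netstring(frame + b"2:ok,")
assert payload == b"hello" and rest == b"2:ok,"
```

The timestamp would be sampled at the moment the trailing comma of a frame is consumed.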

When the delivery is complete and the worker is serializing the results, it should fetch one last timestamp, then serialize the entire array and include it in the metering results. Then it should reset the array length to zero in preparation for the next delivery to begin.
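A Python model of the per-delivery collection described above (illustrative names only; the real implementation would be C in xsnap.c, using `gettimeofday()` where this sketch uses `time.time()`):

```python
import time

class DeliveryTimestamps:
    """Collect one timestamp per message event; reset per delivery."""
    def __init__(self):
        self.timestamps = []

    def record(self):
        # the C code would call gettimeofday(); time.time() is the
        # Python analog (float seconds since epoch)
        self.timestamps.append(time.time())

    def finish(self):
        # fetch one last timestamp just before serializing the
        # metering results, then reset for the next delivery
        self.record()
        result = list(self.timestamps)
        self.timestamps.clear()
        return result

dt = DeliveryTimestamps()
dt.record()          # delivery message received
dt.record()          # syscall request sent
dt.record()          # syscall response received
stamps = dt.finish() # delivery result about to be sent
assert len(stamps) == 4
assert stamps == sorted(stamps)
assert dt.timestamps == []
```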

On the kernel side, we already include the full metering record in the slogfile. As long as the kernel doesn't attempt to interpret the additional data, we need no additional modifications.

With this information in the slogfile, we can write post-mortem tools that correlate the sends and receives of each syscall request and response, and build a picture of how long the worker spent working, as opposed to how long it spent waiting for the kernel.

Security Considerations

None, the additional metering data is not shared elsewhere in the kernel, and is not included in consensus.

Test Plan

Unit tests in packages/xsnap/ asserting the presence of the timestamp array and its correct length, and that the values are sorted and bounded by parent-side timestamps captured before and after the worker interaction.
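The bounding assertion could look like the following sketch (the actual tests are in JS under packages/xsnap/; this just illustrates the invariant):

```python
import time

def check_worker_timestamps(before, timestamps, after):
    """Assert worker timestamps are sorted, and bracketed by
    parent-side samples taken before and after the interaction."""
    assert timestamps == sorted(timestamps), "not monotonic"
    assert before <= timestamps[0], "first stamp precedes parent send"
    assert timestamps[-1] <= after, "last stamp follows parent receive"

before = time.time()
stamps = [time.time(), time.time(), time.time()]
after = time.time()
check_worker_timestamps(before, stamps, after)
```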

warner added the enhancement and xsnap labels Apr 19, 2022
Tartuffo added the SwingSet label May 12, 2022
@Tartuffo

@warner Is this needed for MN-1 or can it go into the Product Backlog?

warner self-assigned this Jun 9, 2022
warner commented Jun 9, 2022

This would help with the perf investigations called for by #5507 and #5361.

Tartuffo added this to the Mainnet 1 milestone Jun 9, 2022
warner added a commit to agoric-labs/xsnap-pub that referenced this issue Jun 10, 2022
This does not change the engine behavior, but it augments the metering
results with a "timestamps" field (an array of Numbers), which
captures the times at which:
* the primary delivery message was received
* each syscall ("issueCommand" to parent) was sent
* the response from each syscall was received
* the delivery response was sent (actually just before the metering
results are serialized)

The parent process can correlate these timestamps with similar ones
collected on the parent side, to determine how long it took to send
messages over the pipe. If an unexpected stall is observed, this can
help distinguish between the stall happening on the worker process or
on the parent process.

refs Agoric/agoric-sdk#5152
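Given the ordering described in that commit message, a slogfile consumer could unpack the flat array like this (a sketch, assuming every syscall contributed exactly one send/receive pair):

```python
def unpack_timestamps(timestamps, num_syscalls):
    """Split the worker's flat timestamp array into named events:
    [rx_delivery, (tx_syscall, rx_result) per syscall, tx_result]."""
    expected = 2 + 2 * num_syscalls
    assert len(timestamps) == expected, "unexpected array length"
    rx_delivery = timestamps[0]
    tx_result = timestamps[-1]
    syscalls = [(timestamps[1 + 2 * i], timestamps[2 + 2 * i])
                for i in range(num_syscalls)]
    return rx_delivery, syscalls, tx_result

ts = [10.0, 10.1, 10.2, 10.3, 10.4, 10.5]
rx, sc, tx = unpack_timestamps(ts, 2)
assert rx == 10.0 and tx == 10.5
assert sc == [(10.1, 10.2), (10.3, 10.4)]
```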
warner added a commit that referenced this issue Jun 15, 2022
This moves to the version of `xsnap-worker.c` which includes
receipt/transmit timestamps in the metering results.

Tests are updated to expect 'timestamps' in xsnap metering results,
and to assert that they correctly interleave with the times of
upstream/downstream commands as seen by the parent process.

refs #5152
warner added a commit that referenced this issue Jun 15, 2022
Unix provides multiple time APIs. `gettimeofday()` reports
seconds-since-epoch with microsecond resolution, and can jump around,
speed up, or even go backwards if/when someone changes the clock on
the host computer. `clock_gettime(CLOCK_MONOTONIC)` never goes
backwards, has nanosecond resolution, and in practice reports
seconds-since-last-reboot (of the host computer). `CLOCK_MONOTONIC` is
implemented (at least on linux) by initializing an offset (to the
current time) at boot, and reporting `gettimeofday() - offset`. If
someone calls `settimeofday()`, the kernel measures the delta and
subtracts it from the offset, so the `CLOCK_MONOTONIC` value does not
change even though the `gettimeofday()` value does. However, both
`gettimeofday()` and `CLOCK_MONOTONIC` will reflect the gentler
frequency adjustments that NTP makes (through `adjtime()`) to keep the
clock in sync without timequakes.

Previously, our slogfile's `.time` property recorded
`performance.timeOrigin + performance.now()`, which is effectively the
sum of `gettimeofday()` at process start, plus the delta in
`CLOCK_MONOTONIC` between one sample at process start and another when
`now()` is called, plus a tiny offset because `gettimeofday()` and the
initial sample cannot happen exactly simultaneously. This has the
benefit of never decreasing, and preserves the accuracy of
intervals (timestamp subtractions) that span a `settimeofday()`
timequake. But it has the downside of becoming unique per-process when
timequakes happen: if you compare this value against a similar one
built in a different process, they will be different by the sum of any
timequakes that occurred between the first process launching and the
second one launching. If a process restarts, the timestamps before and
after the restart will be offset by a similar amount.

This changes the slogfile to use `gettimeofday()`, via a small C
module named `microtime`. This matches what #5152 changes in
`xsnap-worker.c` to record delivery/syscall send+receive
timestamps. By using `gettimeofday()` on both sides, the timestamps
remain comparable independent of when the processes are started or
restarted. The drawback is that spans which cross a timequake will be
corrupted, even becoming negative if `settimeofday()` causes the
system clock to go backwards far enough. However we only really pay
attention to one span at a time (a single block, or delivery, or
syscall), so the damage is limited. I care more about comparability of
timestamps across long-running processes (potentially across multiple
hopefully-NTP-synced machines) than I'm worried about the occasional
timequake-corrupted span.

If we manage to make swingset fast enough that the microsecond
resolution of `gettimeofday()` feels insufficient, we should probably
change both the slogger and `xsnap-worker.c` to use the earlier
approach: `timeOrigin + CLOCK_MONOTONIC` (which will involve writing
new code on the `xsnap-worker.c` side).

refs #5152
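The clock trade-off described above maps directly onto Python's clock APIs, which can illustrate it: `time.time()` is the wall-clock (gettimeofday-style) source, and `time.monotonic()` behaves like `CLOCK_MONOTONIC` (a sketch, not part of the actual change):

```python
import time

# wall-clock: comparable across processes and machines, but subject
# to timequakes if someone calls settimeofday()
t1 = time.time()
# monotonic: never goes backwards, but its origin is per-process
# (often seconds since boot), so raw values are not comparable
# across processes
m1 = time.monotonic()

t2 = time.time()
m2 = time.monotonic()

# intervals measured on either clock should roughly agree in the
# absence of a timequake
wall_span = t2 - t1
mono_span = m2 - m1
assert mono_span >= 0                    # monotonic never decreases
assert abs(wall_span - mono_span) < 0.5  # spans agree (no timequake)
```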
warner added a commit that referenced this issue Jun 15, 2022
Unix provides multiple time APIs. `gettimeofday()` reports
seconds-since-epoch with microsecond resolution, and can jump around,
speed up, or even go backwards if/when someone changes the clock on
the host computer. `clock_gettime(CLOCK_MONOTONIC)` never goes
backwards, has nanosecond resolution, and in practice reports
seconds-since-last-reboot (of the host computer). `CLOCK_MONOTONIC` is
implemented (at least on linux) by initializing an offset (to the
current time) at boot, and reporting `gettimeofday() - offset`. If
someone calls `settimeofday()`, the kernel measures the delta and
subtracts it from the offset, so the `CLOCK_MONOTONIC` value does not
change even though the `gettimeofday()` value does. However, both
`gettimeofday()` and `CLOCK_MONOTONIC` will reflect the gentler
frequency adjustments that NTP makes (through `adjtime()`) to keep the
clock in sync without timequakes.

Previously, our slogfile's `.time` property recorded
`performance.timeOrigin + performance.now()`, which is effectively the
sum of `gettimeofday()` at process start, plus the delta in
`CLOCK_MONOTONIC` between one sample at process start and another when
`now()` is called, plus a tiny offset because `gettimeofday()` and the
initial sample cannot happen exactly simultaneously. This has the
benefit of never decreasing, and preserves the accuracy of
intervals (timestamp subtractions) that span a `settimeofday()`
timequake. But it has the downside of becoming unique per-process when
timequakes happen: if you compare this value against a similar one
built in a different process, they will be different by the sum of any
timequakes that occurred between the first process launching and the
second one launching. If a process restarts, the timestamps before and
after the restart will be offset by a similar amount.

This changes the slogfile to use `gettimeofday()` for the `.time`
property, via a small C module named `microtime`. This matches what
issue #5152 changes in `xsnap-worker.c` to record delivery/syscall
send+receive timestamps. By using `gettimeofday()` on both sides, the
timestamps remain comparable independent of when the processes are
started or restarted. The drawback is that spans which cross a
timequake will be corrupted, even becoming negative if
`settimeofday()` causes the system clock to go backwards far
enough. However we only really pay attention to one span at a time (a
single block, or delivery, or syscall), so the damage is limited. I
care more about comparability of timestamps across long-running
processes (potentially across multiple hopefully-NTP-synced machines)
than I'm worried about the occasional timequake-corrupted span.

A new `.monotime` property records just `performance.now()`, which is
CLOCK_MONOTONIC relative to process start, and should have nanosecond
resolution (despite returning a number that absurdly claims
femtoseconds, probably because of base-2 representation limits). This
is good for higher-resolution interval computation, and is immune to
timequakes, but cannot be easily compared to anything outside the
process.

The slogfile lines were also rearranged, to put `type` at the front,
and the two timestamps at the end. This should make the file more
legible to humans who are just trying to eyeball the start/end of each
delivery.

If, some day, we manage to make swingset fast enough that the
microsecond resolution of `gettimeofday()` feels insufficient, we
should probably change both the slogger and `xsnap-worker.c` to use
the earlier approach: `timeOrigin + CLOCK_MONOTONIC` (which will
involve writing new code on the `xsnap-worker.c` side).

refs #5152
warner added a commit that referenced this issue Jun 17, 2022
warner commented Jun 19, 2022

Here's a (Python) tool to scan a slogfile with the new meterResults.timestamps data and produce a report:

import sys, json, gzip

class Syscall:
    def __init__(self, vsc, num, rx_request):
        self.num = num
        self.vsc = vsc
        self.tx_request = None
        self.rx_request = rx_request
        self.tx_result = None
        self.rx_result = None
        if vsc[0] == "send":
            target = vsc[1]
            msg = vsc[2]
            methargsCD = msg["methargs"]
            method = json.loads(methargsCD["body"])[0]
            result = msg.get("result")
            self.description = "(%s).%s -> %s" % (target, method, result)
        elif vsc[0] == "callNow":
            target = vsc[1]
            method = vsc[2]
            self.description = "D(%s).%s" % (target, method)
        elif vsc[0] == "subscribe":
            vpid = vsc[1]
            self.description = "subscribe(%s)" % vpid
        elif vsc[0] == "resolve":
            vpid_strings = []
            for [vpid, reject, data] in vsc[1]:
                r = "!" if reject else ""
                vpid_strings.append(r + vpid)
            self.description = "notify %s" % ",".join(vpid_strings)
        elif vsc[0] in ["dropImports", "retireImports", "retireExports"]:
            self.description = "%s(%s)" % (vsc[0], ",".join(vsc[1]))
        else:
            self.description = vsc[0]
    def got_result(self, when):
        self.tx_result = when
    def set_tx_request(self, when):
        self.tx_request = when
    def set_rx_result(self, when):
        self.rx_result = when

class Delivery:
    def __init__(self, cranknum, vd, when):
        self.cranknum = cranknum
        self.syscalls = []
        self.tx_delivery = when
        self.rx_delivery = None
        self.tx_result = None
        self.rx_result = None
        self.vd = vd
        if vd[0] == "message":
            target = vd[1]
            msg = vd[2]
            methargsCD = msg["methargs"]
            method = json.loads(methargsCD["body"])[0]
            result = msg.get("result")
            self.description = "(%s).%s -> %s" % (target, method, result)
        elif vd[0] == "notify":
            vpid_strings = []
            for [vpid, reject, data] in vd[1]:
                r = "!" if reject else ""
                vpid_strings.append(r + vpid)
            self.description = "notify %s" % ",".join(vpid_strings)
        else:
            self.description = vd[0]
    def add_syscall(self, vsc, when):
        num = len(self.syscalls)
        s = Syscall(vsc, num, when)
        self.syscalls.append(s)
    def num_syscalls(self):
        return len(self.syscalls)
    def get_syscall(self, which):
        return self.syscalls[which]
    def set_rx_result(self, when):
        self.rx_result = when
    def set_rx_delivery(self, when):
        self.rx_delivery = when
    def set_tx_result(self, when):
        self.tx_result = when
        
    
fn = sys.argv[1]
vatID = sys.argv[2] if len(sys.argv) > 2 else None

deliveries = []

opener = gzip.open if fn.endswith(".gz") else open
with opener(fn) as f:
    for line in f:
        if isinstance(line, bytes):
            line = line.decode("utf-8")
        if not line.strip():
            continue

        data = json.loads(line.strip())
        if vatID and data.get("vatID") != vatID:
            continue
        type = data["type"]
        when = data["time"]

        if type == 'deliver':
            cranknum = data["crankNum"]
            deliverynum = data["deliveryNum"]
            d = Delivery(cranknum, data["vd"], when)
            deliveries.append(d)
        if type == 'syscall':
            deliveries[-1].add_syscall(data['vsc'], when)
        if type == 'syscall-result':
            deliveries[-1].get_syscall(-1).got_result(when)
        if type == 'deliver-result':
            deliveries[-1].set_rx_result(when)
            dr = data["dr"]
            mr = dr[2]
            if mr and "timestamps" in mr:
                timestamps = mr["timestamps"]
                d = deliveries[-1]
                if len(timestamps):
                    d.set_rx_delivery(timestamps.pop(0))
                for syscallnum in range(d.num_syscalls()):
                    s = d.get_syscall(syscallnum)
                    if len(timestamps):
                        s.set_tx_request(timestamps.pop(0))
                    if len(timestamps):
                        s.set_rx_result(timestamps.pop(0))
                if len(timestamps):
                    d.set_tx_result(timestamps.pop(0))

fill = "                         "
prev_d = None
for d in deliveries:
    total_worker = 0
    total_kernel = 0
    total_pipe = 0
    if prev_d:
        kerneltime = d.tx_delivery - prev_d.rx_result
        print("                                          %.6f kern between-cranks" % kerneltime)
    prev_d = d
    print("c%d %s" % (d.cranknum, d.description))
    if d.rx_delivery:
        print("                                                          k -> %.6f -> w   (send delivery)" % (d.rx_delivery - d.tx_delivery))
        #d.tx_result - d.rx_delivery,
        total_pipe += (d.rx_delivery - d.tx_delivery)
    else:
        print("                                             k -> w   (send delivery)")
    prev_s = None
    for s in d.syscalls:
        if not prev_s and d.rx_delivery and s.tx_request:
            print("%s%.6f worker" % (fill, s.tx_request - d.rx_delivery))
            total_worker += (s.tx_request - d.rx_delivery)
        if prev_s and prev_s.rx_result and s.tx_request:
            print("%s%.6f worker" % (fill, s.tx_request - prev_s.rx_result))
            total_worker += (s.tx_request - prev_s.rx_result)
        prev_s = s
        if s.tx_request and s.rx_result:
            print("   w -> %.6f -> k,                    %.6f kern,  k -> %.6f -> w : %s" % (
                s.rx_request - s.tx_request,
                s.tx_result - s.rx_request,
                s.rx_result - s.tx_result,
                s.description))
            total_pipe += (s.rx_request - s.tx_request)
            total_kernel += (s.tx_result - s.rx_request)
            total_pipe += (s.rx_result - s.tx_result)
        else:
            print("  w -> k,                    %.6f kern,  k -> w : %s" % (
                s.tx_result - s.rx_request,
                s.description))
    if prev_s and d.tx_result:
        print("%s%.6f worker" % (fill, d.tx_result - prev_s.rx_result))
        total_worker += (d.tx_result - prev_s.rx_result)
    if not prev_s and d.tx_result:
        print("%s%.6f worker" % (fill, d.tx_result - d.rx_delivery))
        total_worker += (d.tx_result - d.rx_delivery)
    if d.tx_result:
        print("   w -> %.6f -> k                                                          (return result)" % (d.rx_result - d.tx_result))
        total_pipe += (d.rx_result - d.tx_result)
    else:
        print("  w -> k                                              (return result)")
    k_to_k = d.rx_result - d.tx_delivery
    if d.tx_result:
        print("   pipe %.6f, worker %.6f, kernel %.6f    total  (k->k %.6f)" % (
            total_pipe, total_worker, total_kernel, k_to_k))
    else:
        print("         total  (k->k %.6f)" % k_to_k)
    

A sample report looks like:

c74 retireExports
                                                          k -> 0.000078 -> w   (send delivery)
                         0.000148 worker
   w -> 0.000121 -> k                                                          (return result)
   pipe 0.000199, worker 0.000148, kernel 0.000000    total  (k->k 0.000347)
                                          0.000346 kern between-cranks
c75 bringOutYourDead
                                                          k -> 0.000066 -> w   (send delivery)
                         0.001141 worker
   w -> 0.000188 -> k,                    0.000071 kern,  k -> 0.000053 -> w : vatstoreGet
                         0.000067 worker
   w -> 0.000098 -> k,                    0.000145 kern,  k -> 0.000059 -> w : vatstoreGetAfter
                         0.000062 worker
   w -> 0.000187 -> k,                    0.000025 kern,  k -> 0.000057 -> w : dropImports(o-50)
                         0.000060 worker
   w -> 0.000234 -> k,                    0.000025 kern,  k -> 0.000062 -> w : retireImports(o-50)
                         0.000044 worker
   w -> 0.000171 -> k                                                          (return result)
   pipe 0.001175, worker 0.001374, kernel 0.000266    total  (k->k 0.002815)

which shows two deliveries (retireExports in crank 74, then bringOutYourDead in crank 75). The kernel spent 346us thinking between the end of the retireExports and the beginning of the bringOutYourDead. The BOYD started with a 66us delay to send the VatDeliveryObject from the (k)ernel to the (w)orker. Then the worker spent 1.141ms thinking. Then the worker sent a vatstoreGet syscall, which took 188us to get from worker to kernel, 71us in the kernel (doing a DB read), then 53us to get the syscall results back to the worker. Then the worker thought for 67us before sending a vatstoreGetAfter syscall, etc. The worker's VatDeliveryResult object took 171us to be sent from worker to kernel. Overall, the delivery took 2.815ms (from the delivery leaving the kernel to the result arriving back at the kernel), of which 1.374ms was spent in the worker, 266us was spent in the kernel, and 1.175ms was spent sending things back and forth through pipes.
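As a sanity check on the report's arithmetic: for each crank, the pipe, worker, and kernel totals should sum to the kernel-to-kernel span (the values below are copied from the sample report above):

```python
# totals reported for crank 74 (retireExports)
pipe, worker, kernel, k_to_k = 0.000199, 0.000148, 0.000000, 0.000347
assert round(pipe + worker + kernel, 6) == k_to_k

# totals reported for crank 75 (bringOutYourDead)
pipe, worker, kernel, k_to_k = 0.001175, 0.001374, 0.000266, 0.002815
assert round(pipe + worker + kernel, 6) == k_to_k
```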

This dataset came from a unit test, so it's pretty quiet (there's no chain activity going on).

The next step is to adapt this code into slog-to-otel.js and build spans out of all these little chunks of activity.

warner commented Jun 19, 2022

The timestamps we're recording look like this:

[diagram: slog timestamps]

The red dashed boxes are potential sources of the unexpected delays we've observed in testnet data.

@Tartuffo

@mhofman estimate please?

warner commented Jul 7, 2022

The xsnap.c and slog-recording portions have landed. Once slog-to-otel.js is converting an adequate amount of this data into spans for telemetry, we can close this ticket.

warner removed their assignment Jul 7, 2022