
fix(subscriber): record timestamps for updates last #289

Merged

merged 3 commits into main from eliza/late-nows on Feb 17, 2022
Conversation

hawkw
Member

@hawkw hawkw commented Feb 17, 2022

Motivation

Currently, when constructing an update message to send over the wire, a
timestamp is taken first, and then the protobuf data is constructed.
This can lead to issues where the "now" timestamp is actually before
timestamps present in the stats sent in that update, since the stats for
a particular task/resource/async op might be updated on another thread
after taking the update's "now" timestamp. This results in issues like
#266.

Solution

There's no actual reason to take those timestamps before we assemble
the update. This branch changes the aggregator to build all the various
data updates in an update message, and then record the update's "now"
timestamp. Any timestamps for tasks/resources/async ops that are
recorded after the update's "now" timestamp will now be included in the
next update.

Fixes #266
Depends on #290

Base automatically changed from eliza/no-system-time to main February 17, 2022 18:39
@seanmonstar
Member

> There's no actual reason to take those timestamps before we assemble the update.

I'd think the reason for taking the timestamp first is that all the other stuff can take some time to compute, and we want to record as close as possible to when the event actually happened, not when we finished understanding what the event means.

@hawkw
Member Author

hawkw commented Feb 17, 2022

> There's no actual reason to take those timestamps before we assemble the update.

> I'd think the reason for taking the timestamp first is that all the other stuff can take some time to compute, and we want to record as close as possible to when the event actually happened, not when we finished understanding what the event means.

Right, but the issue is that converting the data to the protobuf format takes time, and the data may be updated while we're doing that. If the data is updated by another thread between when we take the "now" timestamp and when we actually construct the protobuf representation of that data, it might have a timestamp that occurs after "now". #266 explains why this is a problem.

We could, alternatively, handle this kind of thing in the console UI; this felt like a better solution to me, though, since it wouldn't have to be solved again in another consumer of the wire format.

hawkw added a commit that referenced this pull request Feb 17, 2022
This branch refactors how the aggregator task builds update messages. In
particular, it factors out the shared code between building the initial
update for a subscription and building the subsequent updates in
`publish` into methods that are called with an `Include` enum to
determine whether the update should include all tasks/resources/ops, or
just the ones since the last update. This makes the code in
`add_instrument_subscription` and in `publish` much easier to read.

This is just the refactor part from #289 split out into a separate
branch; this should make no functional change. If we decide that #289 is
the correct way to fix the bug, we can rebase that branch onto this, so
that it _just_ includes the bugfix. This way, we can land the refactor
anyway, because I think it's nice.
@hawkw
Member Author

hawkw commented Feb 17, 2022

I went ahead and split out the refactoring into a separate branch (#290) so we can go ahead and merge that regardless of what we decide is the best way to fix the bug.

hawkw added a commit that referenced this pull request Feb 17, 2022
@hawkw hawkw enabled auto-merge (squash) February 17, 2022 21:40
@hawkw hawkw merged commit 703f1aa into main Feb 17, 2022
@hawkw hawkw deleted the eliza/late-nows branch February 17, 2022 21:43
@thombles
Contributor

I've been fiddling with this crash in parallel (I can trigger it pretty reliably) and was investigating the same change. I tried checking out the tip of this branch (713a7be) and I'm still getting a similar crash/backtrace sadly.

console-crash-713a7befc3dcb9b7dae5a6e5b4fbe220e47a6eb6.txt

@hawkw
Member Author

hawkw commented Feb 17, 2022

> I've been fiddling with this crash in parallel (I can trigger it pretty reliably) and was investigating the same change. I tried checking out the tip of this branch (713a7be) and I'm still getting a similar crash/backtrace sadly.
>
> console-crash-713a7befc3dcb9b7dae5a6e5b4fbe220e47a6eb6.txt

Hmm, that's a shame that this didn't fix it. Can you share how you're able to reliably trigger the panic?

@thombles
Contributor

Not really, sorry. My app under test just has 50–70 long-running tasks ticking along mostly idle, and it crashes within 10 minutes or so of looking at the tasks list view.

@thombles
Contributor

If it helps account for any particular skew, the instrumented device is an iPhone I'm connected to over WiFi, with console running on my Mac.

@hawkw
Member Author

hawkw commented Feb 17, 2022

@thombles just to confirm, the iPhone's console-subscriber dependency is from this branch on commit 713a7be? Not just the version of the console UI running on your mac?

@hawkw
Member Author

hawkw commented Feb 17, 2022

Oh, one other question: when the panic occurs, do you know if it's always in AsyncOpsState::retain_active, or does it ever panic in TasksState::retain_active or ResourcesState::retain_active?

@thombles
Contributor

> @thombles just to confirm, the iPhone's console-subscriber dependency is from this branch on commit 713a7be? Not just the version of the console UI running on your mac?

Whoops, great guess - yes, I forgot to update the subscriber end, and obviously this code is in the subscriber crate. Sorry to mislead. I've now updated both sides to main and it's rock solid. It does make me wonder whether console should defensively do a saturating duration_since just in case it gets bogus data, but I guess it's not critical?

> when the panic occurs, do you know if it's always in AsyncOpsState::retain_active, or does it ever panic in TasksState::retain_active or ResourcesState::retain_active?

So far I've only noticed it in AsyncOpsState, but I haven't analysed whether there's any special reason for that.

@hawkw
Member Author

hawkw commented Feb 18, 2022

@thombles

> Whoops, great guess - yes I forgot to update the subscriber end and obviously this code is in the subscriber crate. Sorry to mislead. I've now updated both sides to main and it's rock solid.

Ah, yup, there's the problem --- thanks for double-checking. So, this change probably does fix the issue, which is good, because I felt like I was losing my mind there for a second. :)

> It does make me wonder whether console should defensively do a saturating duration_since just in case it gets bogus data but I guess it's not critical?

Yeah, it probably should — we should give the whole console UI a pass for potential panics in the face of weird data from the instrumented application.

hawkw added a commit that referenced this pull request Feb 18, 2022
## 0.1.3  (2022-02-18)

#### Bug Fixes

*  record timestamps for updates last (#289) ([703f1aa](703f1aa),
   closes [#266](266))
*  use monotonic `Instant`s for all timestamps (#288)
   ([abc0830](abc0830), closes [#286](286))
*  bail rather than panic when encountering clock skew (#287)
   ([24db8c6](24db8c6), closes [#286](286))
*  fix compilation on targets without 64-bit atomics (#282)
   ([5590fdb](5590fdb), closes [#279](279))
hawkw added a commit that referenced this pull request Feb 18, 2022
#### Features

*  add `Builder::filter_env_var` builder parameter (#276)
   ([dbdb149](dbdb149), closes [#206](206))

#### Bug Fixes

*  record timestamps for updates last (#289) ([703f1aa](703f1aa),
   closes [#266](266))
*  use monotonic `Instant`s for all timestamps (#288)
   ([abc0830](abc0830), closes [#286](286))
*  bail rather than panic when encountering clock skew (#287)
   ([24db8c6](24db8c6), closes [#286](286))
*  fix compilation on targets without 64-bit atomics (#282)
   ([5590fdb](5590fdb), closes [#279](279))
thombles added a commit to thombles/console that referenced this pull request Feb 20, 2022
Changes duration_since and unchecked subtractions to assume
Duration::ZERO (via Default) if the alternative would be a crash.

If the subscriber reports nonsensical data this can result in negative
Durations in console's calculations, leading to panics. This has already
happened inadvertently. All changed code relates directly to received
protobuf data whose validity isn't guaranteed.

Relates to tokio-rs#266 (crash) and tokio-rs#289 (fix for root cause)
hawkw pushed a commit that referenced this pull request Mar 7, 2022
…#295)

Changes duration_since and unchecked subtractions to assume
`Duration::ZERO` (via `Default`) if the alternative would be a crash.

If the subscriber reports nonsensical data this can result in negative
`Duration`s in console's calculations, leading to panics. This has
already happened inadvertently. All changed code relates directly to
received protobuf data whose validity isn't guaranteed.

Relates to #266 (crash) and #289 (fix for root cause)

---

There are a handful of other unwraps/expects I'm not so sure about.
These ones looked like pretty clear-cut cases where panicking is
possible, undesirable, and clamping to zero is a reasonable response.
Successfully merging this pull request may close these issues.

The console randomly die with an unwrap