
fix(subscriber): record timestamps for updates last #289

Merged

merged 3 commits into main from eliza/late-nows on Feb 17, 2022
Conversation

hawkw
Member

@hawkw hawkw commented Feb 17, 2022

Motivation

Currently, when constructing an update message to send over the wire, a
timestamp is taken first, and then the protobuf data is constructed.
This can lead to issues where the "now" timestamp is actually before
timestamps present in the stats sent in that update, since the stats for
a particular task/resource/async op might be updated on another thread
after taking the update's "now" timestamp. This results in issues like
#266.

Solution

There's no actual reason to take those timestamps before we assemble
the update. This branch changes the aggregator to build all the various
data updates in an update message, and then record the update's "now"
timestamp. Any timestamps for tasks/resources/async ops that are
recorded after the update's "now" timestamp will now be included in the
next update.

Fixes #266
Depends on #290

Base automatically changed from eliza/no-system-time to main February 17, 2022 18:39
@seanmonstar
Member

> There's no actual reason to take those timestamps before we assemble the update.

I'd think the reason for taking the timestamp first is that all the other stuff can take some time to compute, and we want to record as close as possible to when the event actually happened, not when we finished understanding what the event means.

@hawkw
Member Author

hawkw commented Feb 17, 2022

> There's no actual reason to take those timestamps before we assemble the update.

> I'd think the reason for taking the timestamp first is that all the other stuff can take some time to compute, and we want to record as close as possible to when the event actually happened, not when we finished understanding what the event means.

Right, but the issue is that converting the data to the protobuf format takes time, and the data may be updated while we're doing that. If the data is updated by another thread between when we take the "now" timestamp and when we actually construct the protobuf representation of that data, it might have a timestamp that occurs after "now". #266 explains why this is a problem.

We could, alternatively, handle this kind of thing in the console UI; this felt like a better solution to me, though, since it wouldn't have to be solved again in another consumer of the wire format.

hawkw added a commit that referenced this pull request Feb 17, 2022
This branch refactors how the aggregator task builds update messages. In
particular, it factors out the shared code between building the initial
update for a subscription and building the subsequent updates in
`publish` into methods that are called with an `Include` enum to
determine whether the update should include all tasks/resources/ops, or
just the ones since the last update. This makes the code in
`add_instrument_subscription` and in `publish` much easier to read.

This is just the refactor part from #289 split out into a separate
branch; this should make no functional change. If we decide that #289 is
the correct way to fix the bug, we can rebase that branch onto this, so
that it _just_ includes the bugfix. This way, we can land the refactor
anyway, because I think it's nice.
@hawkw
Member Author

hawkw commented Feb 17, 2022

I went ahead and split out the refactoring into a separate branch (#290) so we can go ahead and merge that regardless of what we decide is the best way to fix the bug.

hawkw added a commit that referenced this pull request Feb 17, 2022
@hawkw hawkw enabled auto-merge (squash) February 17, 2022 21:40
@hawkw hawkw merged commit 703f1aa into main Feb 17, 2022
@hawkw hawkw deleted the eliza/late-nows branch February 17, 2022 21:43
@thombles
Contributor

I've been fiddling with this crash in parallel (I can trigger it pretty reliably) and was investigating the same change. I tried checking out the tip of this branch (713a7be) and I'm still getting a similar crash/backtrace sadly.

console-crash-713a7befc3dcb9b7dae5a6e5b4fbe220e47a6eb6.txt

@hawkw
Member Author

hawkw commented Feb 17, 2022

> I've been fiddling with this crash in parallel (I can trigger it pretty reliably) and was investigating the same change. I tried checking out the tip of this branch (713a7be) and I'm still getting a similar crash/backtrace sadly.
>
> console-crash-713a7befc3dcb9b7dae5a6e5b4fbe220e47a6eb6.txt

Hmm, that's a shame that this didn't fix it. Can you share how you're able to reliably trigger the panic?

@thombles
Contributor

Not really, sorry. My app under test just has 50–70 long-running tasks ticking along mostly idle, and it crashes within 10 minutes or so of looking at the tasks list view.

@thombles
Contributor

If it helps account for any particular skew, the instrumented device is an iPhone I'm connected to over WiFi, with console running on my Mac.

@hawkw
Member Author

hawkw commented Feb 17, 2022

@thombles just to confirm, the iPhone's console-subscriber dependency is from this branch on commit 713a7be? Not just the version of the console UI running on your mac?

@hawkw
Member Author

hawkw commented Feb 17, 2022

Oh, one other question: when the panic occurs, do you know if it's always in AsyncOpsState::retain_active, or does it ever panic in TasksState::retain_active or ResourcesState::retain_active?

@thombles
Contributor

> @thombles just to confirm, the iPhone's console-subscriber dependency is from this branch on commit 713a7be? Not just the version of the console UI running on your mac?

Whoops, great guess - yes, I forgot to update the subscriber end, and obviously this code is in the subscriber crate. Sorry to mislead. I've now updated both sides to main and it's rock solid. It does make me wonder whether console should defensively do a saturating duration_since just in case it gets bogus data, but I guess it's not critical?

> when the panic occurs, do you know if it's always in AsyncOpsState::retain_active, or does it ever panic in TasksState::retain_active or ResourcesState::retain_active?

So far I've only noticed it in AsyncOpsState, but I haven't analysed whether there's any special reason for that.

@hawkw
Member Author

hawkw commented Feb 18, 2022

@thombles

> Whoops, great guess - yes I forgot to update the subscriber end and obviously this code is in the subscriber crate. Sorry to mislead. I've now updated both sides to main and it's rock solid.

Ah, yup, there's the problem --- thanks for double-checking. So, this change probably does fix the issue, which is good, because I felt like I was losing my mind there for a second. :)

> It does make me wonder whether console should defensively do a saturating duration_since just in case it gets bogus data but I guess it's not critical?

Yeah, it probably should — we should give the whole console UI a pass for potential panics in the face of weird data from the instrumented application.

hawkw added a commit that referenced this pull request Feb 18, 2022
## 0.1.3  (2022-02-18)

#### Bug Fixes

*  record timestamps for updates last (#289) ([703f1aa](703f1aa),
   closes [#266](266))
*  use monotonic `Instant`s for all timestamps (#288)
   ([abc0830](abc0830), closes [#286](286))
*  bail rather than panic when encountering clock skew (#287)
   ([24db8c6](24db8c6), closes [#286](286))
*  fix compilation on targets without 64-bit atomics (#282)
   ([5590fdb](5590fdb), closes [#279](279))
hawkw added a commit that referenced this pull request Feb 18, 2022
#### Features

*  add `Builder::filter_env_var` builder parameter (#276)
   ([dbdb149](dbdb149), closes [#206](206))

#### Bug Fixes

*  record timestamps for updates last (#289) ([703f1aa](703f1aa),
   closes [#266](266))
*  use monotonic `Instant`s for all timestamps (#288)
   ([abc0830](abc0830), closes [#286](286))
*  bail rather than panic when encountering clock skew (#287)
   ([24db8c6](24db8c6), closes [#286](286))
*  fix compilation on targets without 64-bit atomics (#282)
   ([5590fdb](5590fdb), closes [#279](279))
thombles added a commit to thombles/console that referenced this pull request Feb 20, 2022
Changes duration_since and unchecked subtractions to assume
Duration::ZERO (via Default) if the alternative would be a crash.

If the subscriber reports nonsensical data this can result in negative
Durations in console's calculations, leading to panics. This has already
happened inadvertently. All changed code relates directly to received
protobuf data whose validity isn't guaranteed.

Relates to tokio-rs#266 (crash) and tokio-rs#289 (fix for root cause)
hawkw pushed a commit that referenced this pull request Mar 7, 2022
…#295)

Changes duration_since and unchecked subtractions to assume
`Duration::ZERO` (via `Default`) if the alternative would be a crash.

If the subscriber reports nonsensical data this can result in negative
`Duration`s in console's calculations, leading to panics. This has
already happened inadvertently. All changed code relates directly to
received protobuf data whose validity isn't guaranteed.

Relates to #266 (crash) and #289 (fix for root cause)

---

There are a handful of other unwraps/expects I'm not so sure about.
These ones looked like pretty clear-cut cases where panicking is
possible, undesirable, and clamping to zero is a reasonable response.
Successfully merging this pull request may close these issues.

The console randomly die with an unwrap