fix(subscriber): record timestamps for updates last #289
Conversation
I'd think the reason for taking the timestamp first is because all the other stuff can take some time to compute, and we want to record as close as possible to when the event actually happened, not when we finished understanding what the event means.

Right, but the issue is that converting the data to the protobuf format takes time, and the data may be updated while we're doing that. If the data is updated by another thread between when we take the "now" timestamp and when we actually construct the protobuf representation of that data, it might have a timestamp that occurs after "now". #266 explains why this is a problem. We could, alternatively, handle this kind of thing in the console UI; this felt like a better solution to me, though, since it wouldn't have to be solved again in another consumer of the wire format.
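To illustrate the race being described, here is a rough sketch with made-up names (not the actual aggregator code): "now" is recorded first, and the data is snapshotted afterwards, so a timestamp recorded by another thread in between can end up newer than the update's own "now".

```rust
use std::sync::{Arc, Mutex};
use std::time::SystemTime;

// Stand-in for the shared per-task/resource/op stats that instrumented
// threads keep updating while the aggregator builds an update.
type SharedStats = Arc<Mutex<Vec<SystemTime>>>;

// The ordering before this change: "now" is taken first, then the data is
// converted. Another thread may record a newer timestamp in between, so the
// update can contain timestamps that come *after* its own "now".
fn build_update(stats: &SharedStats) -> (SystemTime, Vec<SystemTime>) {
    let now = SystemTime::now();
    let snapshot = stats.lock().unwrap().clone(); // takes time; data may have changed
    (now, snapshot)
}
```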
## Motivation

Currently, when constructing an update message to send over the wire, a timestamp is taken first, and then the protobuf data is constructed. This can lead to issues where the "now" timestamp is actually _before_ timestamps present in the stats sent in that update, since the stats for a particular task/resource/async op might be updated on another thread after taking the update's "now" timestamp. This results in issues like #266.

## Solution

There's no actual reason to take those timestamps *before* we assemble the update. This branch changes the aggregator to build all the various data updates in an update message, and *then* record the update's "now" timestamp. Any timestamps for tasks/resources/async ops that are recorded after the update's "now" timestamp will now be included in the *next* update.

Fixes #266

Depends on #288
I went ahead and split out the refactoring into a separate branch (#290) so we can go ahead and merge that regardless of what we decide is the best way to fix the bug.
This branch refactors how the aggregator task builds update messages. In particular, it factors out the shared code between building the initial update for a subscription and building the subsequent updates in `publish` into methods that are called with an `Include` enum to determine whether the update should include all tasks/resources/ops, or just the ones since the last update. This makes the code in `add_instrument_subscription` and in `publish` much easier to read. This is just the refactor part from #289 split out into a separate branch; this should make no functional change. If we decide that #289 is the correct way to fix the bug, we can rebase that branch onto this, so that it _just_ includes the bugfix. This way, we can land the refactor anyway, because I think it's nice.
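For reference, the `Include` flag described above might look something like the sketch below; only the enum's purpose comes from the description, and the variant names are my own guesses.

```rust
/// Sketch of the `Include` flag: controls whether an update carries every
/// task/resource/async op, or only the ones that changed since the last
/// update. Variant names here are illustrative, not the actual ones.
#[derive(Copy, Clone, Debug, PartialEq, Eq)]
enum Include {
    /// The initial update for a new subscription: include everything.
    All,
    /// Subsequent updates from `publish`: include only entries updated since
    /// the last update was sent.
    UpdatedOnly,
}
```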
I've been fiddling with this crash in parallel (I can trigger it pretty reliably) and was investigating the same change. I tried checking out the tip of this branch (713a7be) and I'm still getting a similar crash/backtrace sadly.

Hmm, that's a shame that this didn't fix it. Can you share how you're able to reliably trigger the panic?

Not really, sorry. My app under test just has 50–70 long-running tasks ticking along mostly idle, and it crashes within 10 minutes or so of looking at the tasks list view.
If it helps account for any particular skew, the instrumented device is an iPhone I'm connected to over WiFi, with console running on my Mac. |
Oh, one other question: when the panic occurs, do you know if it's always in |
Whoops, great guess - yes I forgot to update the subscriber end and obviously this code is in the subscriber crate. Sorry to mislead. I've now updated both device to
So far I've only noticed it in the |
Ah, yup, there's the problem --- thanks for double-checking. So, this change probably does fix the issue, which is good, because I felt like I was losing my mind there for a second. :)
Yeah, it probably should — we should give the whole console UI a pass for potential panics in the face of weird data from the instrumented application.
## 0.1.3 (2022-02-18)

#### Features

* add `Builder::filter_env_var` builder parameter (#276) ([dbdb149](dbdb149), closes [#206](206))

#### Bug Fixes

* record timestamps for updates last (#289) ([703f1aa](703f1aa), closes [#266](266))
* use monotonic `Instant`s for all timestamps (#288) ([abc0830](abc0830), closes [#286](286))
* bail rather than panic when encountering clock skew (#287) ([24db8c6](24db8c6), closes [#286](286))
* fix compilation on targets without 64-bit atomics (#282) ([5590fdb](5590fdb), closes [#279](279))
…#295)

Changes `duration_since` and unchecked subtractions to assume `Duration::ZERO` (via `Default`) if the alternative would be a crash. If the subscriber reports nonsensical data, this can result in negative `Duration`s in console's calculations, leading to panics. This has already happened inadvertently.

All changed code relates directly to received protobuf data whose validity isn't guaranteed.

Relates to #266 (crash) and #289 (fix for root cause)

---

There are a handful of other unwraps/expects I'm not so sure about. These ones looked like pretty clear-cut cases where panicking is possible, undesirable, and clamping to zero is a reasonable response.
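The clamping described above boils down to something like this minimal sketch (not the actual console code):

```rust
use std::time::{Duration, SystemTime};

// If the subscriber-reported timestamp is later than `now` (clock skew, or the
// ordering bug that #289 fixes), `duration_since` returns an `Err`; falling
// back to the default (`Duration::ZERO`) avoids the panic.
fn elapsed_since(now: SystemTime, earlier: SystemTime) -> Duration {
    now.duration_since(earlier).unwrap_or_default()
}

fn main() {
    let now = SystemTime::now();
    let future = now + Duration::from_secs(1);
    // `future` is ahead of `now`, so instead of panicking we clamp to zero.
    assert_eq!(elapsed_since(now, future), Duration::ZERO);
}
```

The unchecked `Duration` subtractions would presumably get the same treatment, e.g. via `checked_sub(...).unwrap_or_default()`.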
Motivation
Currently, when constructing an update message to send over the wire, a
timestamp is taken first, and then the protobuf data is constructed.
This can lead to issues where the "now" timestamp is actually before
timestamps present in the stats sent in that update, since the stats for
a particular task/resource/async op might be updated on another thread
after taking the update's "now" timestamp. This results in issues like
#266.
Solution
There's no actual reason to take those timestamps before we assemble
the update. This branch changes the aggregator to build all the various
data updates in an update message, and then record the update's "now"
timestamp. Any timestamps for tasks/resources/async ops that are
recorded after the update's "now" timestamp will now be included in the
next update.
Fixes #266
Depends on #290
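A minimal sketch of the reordering, using the same made-up shape as the earlier sketch in this thread (not the actual aggregator code):

```rust
use std::sync::{Arc, Mutex};
use std::time::SystemTime;

// Stand-in for the shared stats updated by instrumented threads.
type SharedStats = Arc<Mutex<Vec<SystemTime>>>;

// After this change: build the data updates first, then record "now". Nothing
// in the update can be newer than `now`; any timestamp recorded after the
// snapshot simply shows up in the *next* update.
fn build_update(stats: &SharedStats) -> (SystemTime, Vec<SystemTime>) {
    let snapshot = stats.lock().unwrap().clone();
    let now = SystemTime::now();
    (now, snapshot)
}
```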