The console randomly dies with an unwrap #266
I think this is due to the fact that, before #238, the subscriber essentially acted like an actor with respect to state updates, which were all driven by messages. So at the time an update message was published on the wire, its timestamp was guaranteed to be later than any other timestamp written into the `AsyncOp` stats. Since these stats are now shared between the aggregator and the instrumented tasks, a stat timestamp can end up being recorded after the update's timestamp is taken.
## Motivation

Currently, when constructing an update message to send over the wire, a timestamp is taken first, and then the protobuf data is constructed. This can lead to issues where the "now" timestamp is actually _before_ timestamps present in the stats sent in that update, since the stats for a particular task/resource/async op might be updated on another thread after taking the update's "now" timestamp. This results in issues like #266.

## Solution

There's no actual reason to take those timestamps *before* we assemble the update. This branch changes the aggregator to build all the various data updates in an update message, and *then* record the update's "now" timestamp. Any timestamps for tasks/resources/async ops that are recorded after the update's "now" timestamp will now be included in the *next* update.

While I was making this change, I also did a little refactoring. I factored out the code that's shared between constructing the initial update and subsequent updates into methods that are called with an `Include` enum, so we can reuse them when building the first update for a subscription and when building subsequent updates in `publish`. IMO this makes the code much more readable.

Fixes #266
Depends on #288
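For illustration, here's a minimal Rust sketch of the ordering change, using stand-in types rather than the actual aggregator internals:

```rust
use std::time::SystemTime;

// Stand-in for the per-task stats that instrumented code updates
// concurrently with the aggregator.
struct TaskStats {
    last_poll: SystemTime,
}

struct Update {
    tasks: Vec<TaskStats>,
    now: SystemTime,
}

// Stand-in for draining the shared stats.
fn collect_task_stats() -> Vec<TaskStats> {
    vec![TaskStats { last_poll: SystemTime::now() }]
}

// Before: `now` was captured first, so a stat written on another
// thread while the update was being assembled could carry a
// timestamp later than `now`.
fn build_update_buggy() -> Update {
    let now = SystemTime::now();      // taken first
    let tasks = collect_task_stats(); // may observe stats newer than `now`
    Update { tasks, now }
}

// After: assemble the data first, then record `now`; every timestamp
// in this update is then <= `now`, and any stat recorded later lands
// in the *next* update.
fn build_update_fixed() -> Update {
    let tasks = collect_task_stats();
    let now = SystemTime::now();      // taken last
    Update { tasks, now }
}

fn main() {
    let update = build_update_fixed();
    assert!(update.tasks.iter().all(|t| t.last_poll <= update.now));
}
```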
* refac(subscriber): factor out update construction

  This branch refactors how the aggregator task builds update messages. In particular, it factors out the code that's shared between building the initial update for a subscription and building the subsequent updates in `publish` into methods that are called with an `Include` enum, which determines whether the update should include all tasks/resources/ops, or just the ones updated since the last update. This makes the code in `add_instrument_subscription` and in `publish` much easier to read.

  This is just the refactor part from #289, split out into a separate branch; it should make no functional change. If we decide that #289 is the correct way to fix the bug, we can rebase that branch onto this one, so that it _just_ includes the bugfix. This way, we can land the refactor anyway, because I think it's nice.

* fix(subscriber): record timestamps for updates last

  Motivation and solution are as described in #289 above.

  Fixes #266
  Depends on #288
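The shape of that refactor looks roughly like the following sketch (the types and method signatures here are simplified stand-ins, not the real aggregator API):

```rust
// Simplified stand-ins for the aggregator's internal state.
struct Task {
    id: u64,
    updated_since_last_publish: bool,
}

struct Aggregator {
    tasks: Vec<Task>,
}

/// Whether an update should contain every entity, or only the ones
/// updated since the last publish.
#[derive(Copy, Clone)]
enum Include {
    All,
    UpdatedOnly,
}

impl Aggregator {
    // Shared by both call sites below, instead of duplicating the
    // update-building logic in each.
    fn task_update(&self, include: Include) -> Vec<u64> {
        self.tasks
            .iter()
            .filter(|task| match include {
                Include::All => true,
                Include::UpdatedOnly => task.updated_since_last_publish,
            })
            .map(|task| task.id)
            .collect()
    }

    // A new subscription starts with the complete state...
    fn add_instrument_subscription(&self) -> Vec<u64> {
        self.task_update(Include::All)
    }

    // ...while periodic publishes send only the delta.
    fn publish(&self) -> Vec<u64> {
        self.task_update(Include::UpdatedOnly)
    }
}

fn main() {
    let agg = Aggregator {
        tasks: vec![
            Task { id: 1, updated_since_last_publish: false },
            Task { id: 2, updated_since_last_publish: true },
        ],
    };
    assert_eq!(agg.add_instrument_subscription(), vec![1, 2]);
    assert_eq!(agg.publish(), vec![2]);
}
```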
## 0.1.3 (2022-02-18)

#### Features

* add `Builder::filter_env_var` builder parameter (#276) ([dbdb149](dbdb149), closes [#206](206))

#### Bug Fixes

* record timestamps for updates last (#289) ([703f1aa](703f1aa), closes [#266](266))
* use monotonic `Instant`s for all timestamps (#288) ([abc0830](abc0830), closes [#286](286))
* bail rather than panic when encountering clock skew (#287) ([24db8c6](24db8c6), closes [#286](286))
* fix compilation on targets without 64-bit atomics (#282) ([5590fdb](5590fdb), closes [#279](279))
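Of those, the monotonic-`Instant` change (#288) deserves a note: the idea is to keep internal timestamps monotonic and convert to wall-clock time only at the serialization boundary, using a base pair captured once at startup. A rough sketch of that approach (the anchor type and names here are illustrative):

```rust
use std::time::{Instant, SystemTime};

// Captured once, at startup: pairs a monotonic reference point with
// the corresponding wall-clock time.
struct TimeAnchor {
    mono: Instant,
    sys: SystemTime,
}

impl TimeAnchor {
    fn new() -> Self {
        Self {
            mono: Instant::now(),
            sys: SystemTime::now(),
        }
    }

    // `Instant`s never go backwards, so unlike raw `SystemTime`
    // arithmetic, this conversion can't underflow on clock skew.
    fn to_system_time(&self, t: Instant) -> SystemTime {
        self.sys + t.duration_since(self.mono)
    }
}

fn main() {
    let anchor = TimeAnchor::new();
    let event = Instant::now(); // an internal, monotonic timestamp
    println!("{:?}", anchor.to_system_time(event)); // wall-clock, for the wire
}
```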
…#295)

Changes `duration_since` and unchecked subtractions to assume `Duration::ZERO` (via `Default`) if the alternative would be a crash. If the subscriber reports nonsensical data, this can result in negative `Duration`s in console's calculations, leading to panics. This has already happened inadvertently. All changed code relates directly to received protobuf data whose validity isn't guaranteed.

Relates to #266 (crash) and #289 (fix for root cause)

---

There are a handful of other unwraps/expects I'm not so sure about. These ones looked like pretty clear-cut cases where panicking is possible, undesirable, and clamping to zero is a reasonable response.
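As an illustration of that clamping pattern (the function and parameter names below are hypothetical, not the actual console internals):

```rust
use std::time::{Duration, SystemTime};

// Timestamps decoded from the wire; a buggy subscriber may report
// `last_poll_ended` earlier than `created_at`.
fn poll_duration(created_at: SystemTime, last_poll_ended: SystemTime) -> Duration {
    // `duration_since` returns `Err` if the argument is later than
    // `self`; clamp to `Duration::ZERO` instead of unwrapping.
    last_poll_ended.duration_since(created_at).unwrap_or_default()
}

// Likewise for subtractions that could underflow.
fn busy_minus_idle(busy: Duration, idle: Duration) -> Duration {
    busy.checked_sub(idle).unwrap_or_default()
}

fn main() {
    let now = SystemTime::now();
    let earlier = now - Duration::from_secs(1);
    // Nonsensical input (end before start) yields zero, not a panic:
    assert_eq!(poll_duration(now, earlier), Duration::ZERO);
    assert_eq!(
        busy_minus_idle(Duration::from_secs(1), Duration::from_secs(2)),
        Duration::ZERO
    );
}
```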
The console randomly dies after some time. Running with `RUST_LOG=debug` and `RUST_BACKTRACE=full` produces this log file: console.txt (I suggest printing the log file in a terminal so you get the colors and other formatting).