
The console randomly dies with an unwrap #266

Closed
dzamlo opened this issue Jan 24, 2022 · 1 comment · Fixed by #289
Labels
C-console Crate: console. S-bug Severity: bug

Comments

dzamlo commented Jan 24, 2022

The console randomly dies after some time. Running with RUST_LOG=debug and RUST_BACKTRACE=full produces this log file: console.txt (I suggest printing the log file in a terminal so you get the colors and other formatting.)

Error

called `Result::unwrap()` on an `Err` value: SystemTimeError(836.154µs)

Metadata

key      value
version  0.1.1
location tokio-console/src/state/async_ops.rs:214:61
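For context on the panic itself: `SystemTime::duration_since` returns `Err(SystemTimeError)` whenever the argument is *later* than the receiver, and the error carries the (negative) skew as a `Duration`, which matches the `SystemTimeError(836.154µs)` in the message above. A minimal standalone reproduction (not tokio-console code, just the std behavior):

```rust
use std::time::{Duration, SystemTime};

fn main() {
    let earlier = SystemTime::now();
    let later = earlier + Duration::from_micros(836);

    // duration_since succeeds when the argument precedes the receiver...
    assert!(later.duration_since(earlier).is_ok());

    // ...but returns Err(SystemTimeError) when timestamps arrive out of
    // order. Calling .unwrap() on that Err is exactly the reported panic.
    let err = earlier.duration_since(later).unwrap_err();

    // The error exposes the amount of skew as a Duration, which is what
    // gets printed as `SystemTimeError(836.154µs)`.
    assert_eq!(err.duration(), Duration::from_micros(836));
}
```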
zaharidichev (Collaborator) commented:
I think this is because, before #238, the subscriber essentially acted as an actor with respect to state updates: they were all driven by messages. So at the time an update message was published on the wire, its timestamp was guaranteed to come after any other timestamp written into the AsyncOp stats. Since these stats are now shared between the aggregator and the ConsoleLayer and guarded by a lock, we can actually observe an Update.now timestamp that is before a Stats.dropped_at timestamp. We should fix that. Two possible ways of doing that:

  1. Make sure that when publishing an update we take the now timestamp after everything else has been converted to proto repr.
  2. Handle the case in the console where a timestamp can be after the Update.now one.
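Option 1 can be sketched as an ordering constraint in the publish path. The `Update` struct and `build_update` function below are hypothetical stand-ins for illustration, not the actual console-subscriber types; the only point is *when* `now` is taken:

```rust
use std::time::SystemTime;

// Hypothetical stand-ins for the real aggregator types; only the
// ordering of operations matters here, not the actual wire format.
struct Update {
    now: SystemTime,
    stats: Vec<SystemTime>, // e.g. dropped_at timestamps from shared state
}

fn build_update(shared_stats: &[SystemTime]) -> Update {
    // 1. Snapshot the shared stats into the wire representation first...
    let stats = shared_stats.to_vec();
    // 2. ...and only *then* take the update's `now` timestamp, so every
    //    timestamp already captured in the update is <= `now`.
    let now = SystemTime::now();
    Update { now, stats }
}

fn main() {
    let update = build_update(&[SystemTime::now()]);
    // Every stat timestamp precedes `now`, so the console's
    // `now.duration_since(stat)` cannot fail for these stats.
    assert!(update.stats.iter().all(|s| *s <= update.now));
}
```

Any timestamp recorded after `now` is taken simply lands in the *next* update, which is the approach the eventual fix (#289) describes.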

@hawkw hawkw added C-console Crate: console. S-bug Severity: bug labels Jan 24, 2022
hawkw added a commit that referenced this issue Feb 17, 2022
## Motivation

Currently, when constructing an update message to send over the wire, a
timestamp is taken first, and then the protobuf data is constructed.
This can lead to issues where the "now" timestamp is actually _before_
timestamps present in the stats sent in that update, since the stats for
a particular task/resource/async op might be updated on another thread
after taking the update's "now" timestamp. This results in issues like
#266.

## Solution

There's no actual reason to take those timestamps *before* we assemble
the update. This branch changes the aggregator to build all the various
data updates in an update message, and *then* record the update's "now"
timestamp. Any timestamps for tasks/resources/async ops that are
recorded after the update's "now" timestamp will now be included in the
*next* update.

While I was making this change, I also did a little refactoring. I
factored out the code that's shared between constructing the initial
update and subsequent updates into methods that are called with an
`Include` enum, so we can reuse them when building the first update for
a subscription and when building subsequent updates in `publish`. IMO
this makes the code much more readable.

Fixes #266
Depends on #288
hawkw added a commit that referenced this issue Feb 17, 2022
* refac(subscriber): factor out update construction

This branch refactors how the aggregator task builds update messages. In
particular, it factors out the shared code between building the initial
update for a subscription and building the subsequent updates in
`publish` into methods that are called with an `Include` enum to
determine whether the update should include all tasks/resources/ops, or
just the ones since the last update. This makes the code in
`add_instrument_subscription` and in `publish` much easier to read.

This is just the refactor part from #289 split out into a separate
branch; this should make no functional change. If we decide that #289 is
the correct way to fix the bug, we can rebase that branch onto this, so
that it _just_ includes the bugfix. This way, we can land the refactor
anyway, because I think it's nice.

* fix(subscriber): record timestamps for updates last

## Motivation

Currently, when constructing an update message to send over the wire, a
timestamp is taken first, and then the protobuf data is constructed.
This can lead to issues where the "now" timestamp is actually _before_
timestamps present in the stats sent in that update, since the stats for
a particular task/resource/async op might be updated on another thread
after taking the update's "now" timestamp. This results in issues like
#266.

## Solution

There's no actual reason to take those timestamps *before* we assemble
the update. This branch changes the aggregator to build all the various
data updates in an update message, and *then* record the update's "now"
timestamp. Any timestamps for tasks/resources/async ops that are
recorded after the update's "now" timestamp will now be included in the
*next* update.

Fixes #266
Depends on #288
hawkw added a commit that referenced this issue Feb 18, 2022
## 0.1.3  (2022-02-18)

#### Bug Fixes

*  record timestamps for updates last (#289) (703f1aa, closes #266)
*  use monotonic `Instant`s for all timestamps (#288) (abc0830, closes #286)
*  bail rather than panic when encountering clock skew (#287) (24db8c6, closes #286)
*  fix compilation on targets without 64-bit atomics (#282) (5590fdb, closes #279)
hawkw added a commit that referenced this issue Feb 18, 2022
#### Features

*  add `Builder::filter_env_var` builder parameter (#276) (dbdb149, closes #206)

#### Bug Fixes

*  record timestamps for updates last (#289) (703f1aa, closes #266)
*  use monotonic `Instant`s for all timestamps (#288) (abc0830, closes #286)
*  bail rather than panic when encountering clock skew (#287) (24db8c6, closes #286)
*  fix compilation on targets without 64-bit atomics (#282) (5590fdb, closes #279)
thombles added a commit to thombles/console that referenced this issue Feb 20, 2022
Changes duration_since and unchecked subtractions to assume
Duration::ZERO (via Default) if the alternative would be a crash.

If the subscriber reports nonsensical data this can result in negative
Durations in console's calculations, leading to panics. This has already
happened inadvertently. All changed code relates directly to received
protobuf data whose validity isn't guaranteed.

Relates to tokio-rs#266 (crash) and tokio-rs#289 (fix for root cause)
hawkw pushed a commit that referenced this issue Mar 7, 2022
…#295)

Changes duration_since and unchecked subtractions to assume
`Duration::ZERO` (via `Default`) if the alternative would be a crash.

If the subscriber reports nonsensical data this can result in negative
`Duration`s in console's calculations, leading to panics. This has
already happened inadvertently. All changed code relates directly to
received protobuf data whose validity isn't guaranteed.

Relates to #266 (crash) and #289 (fix for root cause)

---

There are a handful of other unwraps/expects I'm not so sure about.
These ones looked like pretty clear-cut cases where panicking is
possible, undesirable, and clamping to zero is a reasonable response.
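The clamping approach described in this commit amounts to replacing `unwrap()` with `unwrap_or_default()` (since `Duration::default()` is `Duration::ZERO`). A minimal sketch with a hypothetical helper name, `since_or_zero`:

```rust
use std::time::{Duration, SystemTime};

// A saturating variant of `duration_since`: returns Duration::ZERO
// instead of an error when `earlier` is actually later than `now`.
fn since_or_zero(now: SystemTime, earlier: SystemTime) -> Duration {
    now.duration_since(earlier).unwrap_or_default()
}

fn main() {
    let now = SystemTime::now();
    // A timestamp from the subscriber that, due to skew, is *after* `now`.
    let dropped_at = now + Duration::from_micros(836);

    // Before the fix, `now.duration_since(dropped_at).unwrap()` would
    // panic here; clamping to zero keeps the console alive on bad input.
    assert_eq!(since_or_zero(now, dropped_at), Duration::ZERO);

    // Well-ordered timestamps still produce a normal elapsed duration.
    assert_eq!(since_or_zero(dropped_at, now), Duration::from_micros(836));
}
```

Clamping to zero is a reasonable choice here precisely because the console cannot assume the received protobuf timestamps are well-ordered.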