
The console randomly dies with an unwrap #266

Closed
dzamlo opened this issue Jan 24, 2022 · 1 comment · Fixed by #289
Labels
C-console Crate: console. S-bug Severity: bug

Comments

dzamlo commented Jan 24, 2022

The console randomly dies after some time. Running with RUST_LOG=debug and RUST_BACKTRACE=full produces this log file: console.txt (I suggest printing the log file in a terminal so you get the colors and other formatting.)

Error

called `Result::unwrap()` on an `Err` value: SystemTimeError(836.154µs)

Metadata

key      value
version  0.1.1
location tokio-console/src/state/async_ops.rs:214:61
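For context on the panic itself: `SystemTime::duration_since` returns `Err(SystemTimeError)` whenever the argument is *later* than the receiver, and the error carries the (negative) skew as a `Duration`, which matches the `SystemTimeError(836.154µs)` in the message above. A minimal standalone reproduction (not tokio-console code, just the std behavior):

```rust
use std::time::{Duration, SystemTime};

fn main() {
    let earlier = SystemTime::now();
    let later = earlier + Duration::from_micros(836);

    // duration_since succeeds when the argument precedes the receiver...
    assert!(later.duration_since(earlier).is_ok());

    // ...but returns Err(SystemTimeError) when timestamps arrive out of
    // order. Calling .unwrap() on that Err is exactly the reported panic.
    let err = earlier.duration_since(later).unwrap_err();

    // The error exposes the amount of skew as a Duration, which is what
    // gets printed as `SystemTimeError(836.154µs)`.
    assert_eq!(err.duration(), Duration::from_micros(836));
}
```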
zaharidichev (Collaborator) commented:
I think this is because, before #238, the subscriber essentially acted as an actor with respect to state updates: they were all driven by messages. So at the time an update message was published on the wire, its timestamp was guaranteed to come after any other timestamp written into the AsyncOp stats. Since these stats are now shared between the aggregator and the ConsoleLayer and guarded by a lock, we can actually observe an Update.now timestamp that is before a Stats.dropped_at timestamp. We should fix that. Two possible ways of doing that:

  1. Make sure that when publishing an update we take the now timestamp after everything else has been converted to proto repr.
  2. Handle the case in the console where a timestamp can be after the Update.now one.
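Option 1 can be sketched as an ordering constraint in the publish path. The `Update` struct and `build_update` function below are hypothetical stand-ins for illustration, not the actual console-subscriber types; the only point is *when* `now` is taken:

```rust
use std::time::SystemTime;

// Hypothetical stand-ins for the real aggregator types; only the
// ordering of operations matters here, not the actual wire format.
struct Update {
    now: SystemTime,
    stats: Vec<SystemTime>, // e.g. dropped_at timestamps from shared state
}

fn build_update(shared_stats: &[SystemTime]) -> Update {
    // 1. Snapshot the shared stats into the wire representation first...
    let stats = shared_stats.to_vec();
    // 2. ...and only *then* take the update's `now` timestamp, so every
    //    timestamp already captured in the update is <= `now`.
    let now = SystemTime::now();
    Update { now, stats }
}

fn main() {
    let update = build_update(&[SystemTime::now()]);
    // Every stat timestamp precedes `now`, so the console's
    // `now.duration_since(stat)` cannot fail for these stats.
    assert!(update.stats.iter().all(|s| *s <= update.now));
}
```

Any timestamp recorded after `now` is taken simply lands in the *next* update, which is the approach the eventual fix (#289) describes.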

@hawkw hawkw added C-console Crate: console. S-bug Severity: bug labels Jan 24, 2022
hawkw added a commit that referenced this issue Feb 17, 2022
## Motivation

Currently, when constructing an update message to send over the wire, a
timestamp is taken first, and then the protobuf data is constructed.
This can lead to issues where the "now" timestamp is actually _before_
timestamps present in the stats sent in that update, since the stats for
a particular task/resource/async op might be updated on another thread
after taking the update's "now" timestamp. This results in issues like
#266.

## Solution

There's no actual reason to take those timestamps *before* we assemble
the update. This branch changes the aggregator to build all the various
data updates in an update message, and *then* record the update's "now"
timestamp. Any timestamps for tasks/resources/async ops that are
recorded after the update's "now" timestamp will now be included in the
*next* update.

While I was making this change, I also did a little refactoring. I
factored out the code that's shared between constructing the initial
update and subsequent updates into methods that are called with an
`Include` enum, so we can reuse them when building the first update for
a subscription and when building subsequent updates in `publish`. IMO
this makes the code much more readable.

Fixes #266
Depends on #288
hawkw added a commit that referenced this issue Feb 17, 2022
* refac(subscriber): factor out update construction

This branch refactors how the aggregator task builds update messages. In
particular, it factors out the shared code between building the initial
update for a subscription and building the subsequent updates in
`publish` into methods that are called with an `Include` enum to
determine whether the update should include all tasks/resources/ops, or
just the ones since the last update. This makes the code in
`add_instrument_subscription` and in `publish` much easier to read.

This is just the refactor part from #289 split out into a separate
branch; this should make no functional change. If we decide that #289 is
the correct way to fix the bug, we can rebase that branch onto this, so
that it _just_ includes the bugfix. This way, we can land the refactor
anyway, because I think it's nice.

* fix(subscriber): record timestamps for updates last

## Motivation

Currently, when constructing an update message to send over the wire, a
timestamp is taken first, and then the protobuf data is constructed.
This can lead to issues where the "now" timestamp is actually _before_
timestamps present in the stats sent in that update, since the stats for
a particular task/resource/async op might be updated on another thread
after taking the update's "now" timestamp. This results in issues like
#266.

## Solution

There's no actual reason to take those timestamps *before* we assemble
the update. This branch changes the aggregator to build all the various
data updates in an update message, and *then* record the update's "now"
timestamp. Any timestamps for tasks/resources/async ops that are
recorded after the update's "now" timestamp will now be included in the
*next* update.

Fixes #266
Depends on #288
hawkw added a commit that referenced this issue Feb 18, 2022
## 0.1.3  (2022-02-18)

#### Bug Fixes

*  record timestamps for updates last (#289) (703f1aa, closes #266)
*  use monotonic `Instant`s for all timestamps (#288) (abc0830, closes #286)
*  bail rather than panic when encountering clock skew (#287) (24db8c6, closes #286)
*  fix compilation on targets without 64-bit atomics (#282) (5590fdb, closes #279)
hawkw added a commit that referenced this issue Feb 18, 2022
#### Features

*  add `Builder::filter_env_var` builder parameter (#276) (dbdb149, closes #206)

#### Bug Fixes

*  record timestamps for updates last (#289) (703f1aa, closes #266)
*  use monotonic `Instant`s for all timestamps (#288) (abc0830, closes #286)
*  bail rather than panic when encountering clock skew (#287) (24db8c6, closes #286)
*  fix compilation on targets without 64-bit atomics (#282) (5590fdb, closes #279)
thombles added a commit to thombles/console that referenced this issue Feb 20, 2022
Changes duration_since and unchecked subtractions to assume
Duration::ZERO (via Default) if the alternative would be a crash.

If the subscriber reports nonsensical data this can result in negative
Durations in console's calculations, leading to panics. This has already
happened inadvertently. All changed code relates directly to received
protobuf data whose validity isn't guaranteed.

Relates to tokio-rs#266 (crash) and tokio-rs#289 (fix for root cause)
hawkw pushed a commit that referenced this issue Mar 7, 2022
…#295)

Changes duration_since and unchecked subtractions to assume
`Duration::ZERO` (via `Default`) if the alternative would be a crash.

If the subscriber reports nonsensical data this can result in negative
`Duration`s in console's calculations, leading to panics. This has
already happened inadvertently. All changed code relates directly to
received protobuf data whose validity isn't guaranteed.

Relates to #266 (crash) and #289 (fix for root cause)

---

There are a handful of other unwraps/expects I'm not so sure about.
These ones looked like pretty clear-cut cases where panicking is
possible, undesirable, and clamping to zero is a reasonable response.
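The clamping approach described in this commit amounts to replacing `unwrap()` with `unwrap_or_default()` (since `Duration::default()` is `Duration::ZERO`). A minimal sketch with a hypothetical helper name, `since_or_zero`:

```rust
use std::time::{Duration, SystemTime};

// A saturating variant of `duration_since`: returns Duration::ZERO
// instead of an error when `earlier` is actually later than `now`.
fn since_or_zero(now: SystemTime, earlier: SystemTime) -> Duration {
    now.duration_since(earlier).unwrap_or_default()
}

fn main() {
    let now = SystemTime::now();
    // A timestamp from the subscriber that, due to skew, is *after* `now`.
    let dropped_at = now + Duration::from_micros(836);

    // Before the fix, `now.duration_since(dropped_at).unwrap()` would
    // panic here; clamping to zero keeps the console alive on bad input.
    assert_eq!(since_or_zero(now, dropped_at), Duration::ZERO);

    // Well-ordered timestamps still produce a normal elapsed duration.
    assert_eq!(since_or_zero(dropped_at, now), Duration::from_micros(836));
}
```

Clamping to zero is a reasonable choice here precisely because the console cannot assume the received protobuf timestamps are well-ordered.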