
thread 'console_subscriber' panicked at 'attempt to subtract with overflow' #180

Closed
pnkfelix opened this issue Nov 4, 2021 · 6 comments
Assignees: pnkfelix
Labels: S-bug (Severity: bug)

Comments

pnkfelix (Contributor) commented Nov 4, 2021

thread 'console_subscriber' panicked at 'attempt to subtract with overflow', console/console-subscriber/src/aggregator/mod.rs:242:9

This happened while attaching the console to some web-crawler demo code I've been working on (it's deliberately naive code).

Here is that code base:

https://github.com/pnkfelix/huaa-exsets/tree/console-play/crawler

That crawler visits pages in a nondeterministic order, so the overflow panic is not always the first panic it hits, which means it does not yet reproduce this bug deterministically. But I bet I can narrow it down.

pnkfelix self-assigned this Nov 4, 2021
pnkfelix added the S-bug (Severity: bug) label Nov 4, 2021
pnkfelix (Contributor, Author) commented Nov 4, 2021

Oh actually I think I just encountered this on a simpler test case. I'll try to produce that shortly.

pnkfelix (Contributor, Author) commented Nov 4, 2021

Simpler test case (and should be deterministic now, though it does take a while for the problem to surface; about 90 seconds or so on my desktop):

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error + Send + Sync>> {
    console_subscriber::init();

    let mut handles = Vec::new();
    for i in 1..1000 {
        handles.push(tokio::task::spawn(async move {
            loop { fact_yielding(i).await; }
        }));
    }

    for h in handles {
        h.await?;
    }

    Ok(())
}

async fn fact_yielding(n: u32) -> f64 {
    let mut i = 0_u32;
    let mut accum = 1_f64;
    loop {
        if i == n { break; }
        tokio::task::yield_now().await;
        i += 1;
        accum *= i as f64;
    }
    return accum;
}

I am running this atop Eliza's branch with the fix for compute-intensive tasks, i.e. https://github.com/tokio-rs/console/tree/eliza/approaching-capacity

hawkw (Member) commented Nov 4, 2021

Hmm, the panic occurs on this line:

self.current_polls -= 1;

That code seems to assume the number of polls can never be zero when handling a poll-end event, because the counter is incremented when a poll starts, and a poll can't end without having started. This makes me wonder if the panic occurs when the event that started the poll was dropped (because the event channel was at capacity), but the poll-end event arrived once there was capacity again.

I suppose we could just use a saturating subtraction there so we don't panic...but that doesn't seem like it correctly handles the potentially lossy nature of the event channel. I wonder if we should avoid generating poll-end events when the poll-start event was not recorded.
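
For reference, a minimal sketch of the saturating variant mentioned above; the struct and method names are invented for the example, and only the `current_polls` field comes from the snippet:

struct PollStats {
    current_polls: u64,
}

impl PollStats {
    // Recorded when a poll-start event arrives.
    fn poll_started(&mut self) {
        self.current_polls += 1;
    }

    // Recorded when a poll-end event arrives. If the matching start event
    // was dropped because the event channel was at capacity, a plain `-= 1`
    // underflows and panics in debug builds; `saturating_sub` clamps the
    // counter at zero instead.
    fn poll_ended(&mut self) {
        self.current_polls = self.current_polls.saturating_sub(1);
    }
}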

sd2k added a commit to grafana/console that referenced this issue Dec 3, 2021
hawkw added a commit that referenced this issue Dec 16, 2021
Similarly, if we don't record entering a span due to event buffer
capacity, it doesn't make sense to record exiting it. This commit
changes the `ConsoleLayer` to only push a span to the current thread's
span stack if we were able to successfully send an `Enter` event to the
aggregator. This means that it won't be considered the parent span for
other events/spans. When a span is exited, we only send an `Exit` event
to the aggregator if the span *was* previously recorded as being entered.

In theory, ignoring subsequent events on spans that were dropped due to
buffer capacity technically means we are losing *more* data than we
would have if we did not ignore those spans. But, the data we are losing
here is *wrong*. For example, we cannot calculate a correct poll time
for a poll where we didn't record the beginning of the poll, and we only
recorded the poll ending. Therefore, I think it's better to ignore this
data than to make a half-assed attempt to record it even though we know
it's incorrect.

I believe this will probably also fix issue #180. That issue occurs
when we attempt to decrement the number of times a task has been
polled, and sometimes --- if an `enter` event for that task was missed
--- we may subtract more than we've added to the counter. By ignoring
exits for spans that we never recorded as entered, this panic should now
be avoided.
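
For illustration, a rough sketch of the enter/exit bookkeeping described in that commit message; the `Event` enum, the thread-local stack, and the channel type are all invented for the example and are not the actual `ConsoleLayer` internals:

use std::cell::RefCell;
use std::sync::mpsc::SyncSender;

enum Event {
    Enter(u64),
    Exit(u64),
}

thread_local! {
    // Spans this thread actually managed to report as entered.
    static ENTERED: RefCell<Vec<u64>> = RefCell::new(Vec::new());
}

fn on_enter(span_id: u64, events: &SyncSender<Event>) {
    // Only push the span onto the stack if the `Enter` event was actually
    // delivered; if the buffer is full, behave as if the enter never happened.
    if events.try_send(Event::Enter(span_id)).is_ok() {
        ENTERED.with(|stack| stack.borrow_mut().push(span_id));
    }
}

fn on_exit(span_id: u64, events: &SyncSender<Event>) {
    // Only report an exit for spans previously recorded as entered.
    let was_entered = ENTERED.with(|stack| {
        let mut stack = stack.borrow_mut();
        if stack.last() == Some(&span_id) {
            stack.pop();
            true
        } else {
            false
        }
    });
    if was_entered {
        // Even if this send is dropped, the aggregator never sees an exit
        // without a matching enter.
        let _ = events.try_send(Event::Exit(span_id));
    }
}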
hawkw (Member) commented Dec 17, 2021

I'm pretty sure #212 fixed this --- I ran the repro from #180 (comment) on the commit prior to #212, and (on my machine) it panicked after 3 minutes. On commit ad442e2, the same repro can run for >45 minutes without panicking, so I'm going to say this is fixed.

hawkw closed this as completed Dec 17, 2021
pymongo (Contributor) commented Dec 20, 2021

thread 'console_subscriber' panicked at 'attempt to subtract with overflow', /home/w/repos/clone_repos/console/console-subscriber/src/aggregator/mod.rs:1079:45
note: run with RUST_BACKTRACE=1 environment variable to display a backtrace

This panic reproduces again:

[w@ww console]$ git log -1
commit ef4150725476942784f1ad4cacfb21c3bed17c54 (HEAD -> main, origin/main, origin/HEAD)
Author: Eliza Weisman <eliza@buoyant.io>
Date:   Sat Dec 18 14:25:08 2021 -0800

hawkw reopened this Dec 20, 2021
hawkw (Member) commented Dec 20, 2021

> thread 'console_subscriber' panicked at 'attempt to subtract with overflow', /home/w/repos/clone_repos/console/console-subscriber/src/aggregator/mod.rs:1079:45
> note: run with RUST_BACKTRACE=1 environment variable to display a backtrace
>
> This panic reproduces again:
>
> [w@ww console]$ git log -1
> commit ef4150725476942784f1ad4cacfb21c3bed17c54 (HEAD -> main, origin/main, origin/HEAD)
> Author: Eliza Weisman <eliza@buoyant.io>
> Date:   Sat Dec 18 14:25:08 2021 -0800

Hmm, thanks for reporting that, but I don't think this is the same bug --- the panic in this issue occurred here:

self.current_polls -= 1;

Meanwhile, your panic occurs on a very different line, here:

Some(AttributeUpdateOp::Sub) => *v -= upd,

I think this is a related issue --- we probably missed an "add" event for a resource because the event buffer was at capacity, so when we then subtracted a value from the same resource, it overflowed. However, I think #212 did fix the bug described in this issue, but not the related one you're seeing, which we'll have to fix separately.

hawkw closed this as completed Dec 20, 2021
hawkw added a commit that referenced this issue Dec 20, 2021
Currently, an issue exists where a subtraction attribute update for a
resource or async op attribute may overflow, if a previous add update
was dropped due to event buffer capacity. This may result in the
aggregator task panicking. See
#180 (comment)
for details.

This branch changes all resource updates to use saturating arithmetic.
In practice, I think it's _much_ more likely for a subtract to overflow
due to missed adds than it is for an add to overflow due to missed subs,
but I made the addition saturating as well, just in case.

We should probably also increase the default event buffer capacity,
since it seems like lots of people are hitting problems with this...
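
As a hedged sketch of the saturating updates described above (the enum variants and the plain u64 value are simplified stand-ins for the real attribute types):

enum AttributeUpdateOp {
    Add,
    Sub,
    Override,
}

fn apply_update(v: &mut u64, op: Option<AttributeUpdateOp>, upd: u64) {
    match op {
        // If an earlier `Add` was dropped because the event buffer was at
        // capacity, a plain `*v -= upd` can underflow and panic the
        // aggregator; clamping at the numeric bounds keeps it alive at the
        // cost of a slightly inaccurate value.
        Some(AttributeUpdateOp::Add) => *v = v.saturating_add(upd),
        Some(AttributeUpdateOp::Sub) => *v = v.saturating_sub(upd),
        Some(AttributeUpdateOp::Override) => *v = upd,
        None => {}
    }
}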