
thread 'console_subscriber' panicked at 'attempt to subtract with overflow' #180

Closed
pnkfelix opened this issue Nov 4, 2021 · 6 comments
Assignees: pnkfelix
Labels: S-bug (Severity: bug)

Comments

pnkfelix (Contributor) commented Nov 4, 2021

thread 'console_subscriber' panicked at 'attempt to subtract with overflow', console/console-subscriber/src/aggregator/mod.rs:242:9

This happened while attaching the console to some web-crawler demo code I've been working on (it's deliberately naive code).

Here is that code base:

https://github.com/pnkfelix/huaa-exsets/tree/console-play/crawler

That crawler visits pages in a nondeterministic order, so the overflow panic is not always the first panic it hits, which means it does not yet reproduce this bug deterministically. But I bet I can narrow it down.

pnkfelix self-assigned this Nov 4, 2021
pnkfelix added the S-bug (Severity: bug) label Nov 4, 2021
pnkfelix (Contributor, Author) commented Nov 4, 2021

Oh actually I think I just encountered this on a simpler test case. I'll try to produce that shortly.

pnkfelix (Contributor, Author) commented Nov 4, 2021

Simpler test case (and should be deterministic now, though it does take a while for the problem to surface; about 90 seconds or so on my desktop):

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error + Send + Sync>> {
    console_subscriber::init();

    let mut handles = Vec::new();
    for i in 1..1000 {
        handles.push(tokio::task::spawn(async move {
            loop { fact_yielding(i).await; }
        }));
    }

    for h in handles {
        h.await?;
    }

    Ok(())
}

async fn fact_yielding(n: u32) -> f64 {
    let mut i = 0_u32;
    let mut accum = 1_f64;
    loop {
        if i == n { break; }
        tokio::task::yield_now().await;
        i += 1;
        accum *= i as f64;
    }
    return accum;
}

I am running this atop Eliza's branch with the fix for compute-intensive tasks, i.e. https://github.com/tokio-rs/console/tree/eliza/approaching-capacity

hawkw (Member) commented Nov 4, 2021

Hmm, the panic occurs on this line:

self.current_polls -= 1;

That code seems to assume the number of polls can never be zero when handling a poll-end event, because the counter is incremented when a poll starts, and a poll can't end without having started. This makes me wonder if the panic occurs when the event that started the poll was dropped (because the event channel was at capacity), but the poll-end event arrived once there was capacity again.

I suppose we could just use a saturating subtraction there so we don't panic...but that doesn't seem like it correctly handles the potentially lossy nature of the event channel. I wonder if we should avoid generating poll-end events when the poll-start event was not recorded.
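
For reference, a minimal sketch of the saturating variant mentioned above; the struct and method names are invented for the example, and only the `current_polls` field comes from the snippet:

struct PollStats {
    current_polls: u64,
}

impl PollStats {
    // Recorded when a poll-start event arrives.
    fn poll_started(&mut self) {
        self.current_polls += 1;
    }

    // Recorded when a poll-end event arrives. If the matching start event
    // was dropped because the event channel was at capacity, a plain `-= 1`
    // underflows and panics in debug builds; `saturating_sub` clamps the
    // counter at zero instead.
    fn poll_ended(&mut self) {
        self.current_polls = self.current_polls.saturating_sub(1);
    }
}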

sd2k added a commit to grafana/console that referenced this issue Dec 3, 2021
hawkw added a commit that referenced this issue Dec 16, 2021
Similarly, if we don't record entering a span due to event buffer
capacity, it doesn't make sense to record exiting it. This commit
changes the `ConsoleLayer` to only push a span to the current thread's
span stack if we were able to successfully send an `Enter` event to the
aggregator. This means that it won't be considered the parent span for
other events/spans. When a span is exited, we only send an `Exit` event
to the aggregator if the span *was* previously recorded as being entered.

In theory, ignoring subsequent events on spans that were dropped due to
buffer capacity technically means we are losing *more* data than we
would have if we did not ignore those spans. But, the data we are losing
here is *wrong*. For example, we cannot calculate a correct poll time
for a poll where we didn't record the beginning of the poll, and we only
recorded the poll ending. Therefore, I think it's better to ignore this
data than to make a half-assed attempt to record it even though we know
it's incorrect.

I believe this will probably also fix issue #180. That issue occurs
when we attempt to decrement the number of times a task has been
polled, and sometimes --- if an `enter` event for that task was missed
--- we may subtract more than we've added to the counter. By ignoring
exits for spans that we never recorded as entered, this panic should now
be avoided.
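
For illustration, a rough sketch of the enter/exit bookkeeping described in that commit message; the `Event` enum, the thread-local stack, and the channel type are all invented for the example and are not the actual `ConsoleLayer` internals:

use std::cell::RefCell;
use std::sync::mpsc::SyncSender;

enum Event {
    Enter(u64),
    Exit(u64),
}

thread_local! {
    // Spans this thread actually managed to report as entered.
    static ENTERED: RefCell<Vec<u64>> = RefCell::new(Vec::new());
}

fn on_enter(span_id: u64, events: &SyncSender<Event>) {
    // Only push the span onto the stack if the `Enter` event was actually
    // delivered; if the buffer is full, behave as if the enter never happened.
    if events.try_send(Event::Enter(span_id)).is_ok() {
        ENTERED.with(|stack| stack.borrow_mut().push(span_id));
    }
}

fn on_exit(span_id: u64, events: &SyncSender<Event>) {
    // Only report an exit for spans previously recorded as entered.
    let was_entered = ENTERED.with(|stack| {
        let mut stack = stack.borrow_mut();
        if stack.last() == Some(&span_id) {
            stack.pop();
            true
        } else {
            false
        }
    });
    if was_entered {
        // Even if this send is dropped, the aggregator never sees an exit
        // without a matching enter.
        let _ = events.try_send(Event::Exit(span_id));
    }
}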
hawkw (Member) commented Dec 17, 2021

I'm pretty sure #212 fixed this --- I ran the repro from #180 (comment) on the commit prior to #212, and (on my machine) it panicked after 3 minutes. On commit ad442e2, the same repro can run for >45 minutes without panicking, so I'm going to say this is fixed.

hawkw closed this as completed Dec 17, 2021
pymongo (Contributor) commented Dec 20, 2021

thread 'console_subscriber' panicked at 'attempt to subtract with overflow', /home/w/repos/clone_repos/console/console-subscriber/src/aggregator/mod.rs:1079:45
note: run with RUST_BACKTRACE=1 environment variable to display a backtrace

This panic reproduces again:

[w@ww console]$ git log -1
commit ef4150725476942784f1ad4cacfb21c3bed17c54 (HEAD -> main, origin/main, origin/HEAD)
Author: Eliza Weisman <eliza@buoyant.io>
Date:   Sat Dec 18 14:25:08 2021 -0800

hawkw reopened this Dec 20, 2021
hawkw (Member) commented Dec 20, 2021

> thread 'console_subscriber' panicked at 'attempt to subtract with overflow', /home/w/repos/clone_repos/console/console-subscriber/src/aggregator/mod.rs:1079:45
> note: run with RUST_BACKTRACE=1 environment variable to display a backtrace
>
> This panic reproduces again:
>
> [w@ww console]$ git log -1
> commit ef4150725476942784f1ad4cacfb21c3bed17c54 (HEAD -> main, origin/main, origin/HEAD)
> Author: Eliza Weisman <eliza@buoyant.io>
> Date:   Sat Dec 18 14:25:08 2021 -0800

Hmm, thanks for reporting that, but I don't think this is the same bug --- the panic in this issue occurred here:

self.current_polls -= 1;

Meanwhile, your panic occurs on a very different line, here:

Some(AttributeUpdateOp::Sub) => *v -= upd,

I think this is a related issue --- we probably missed an "add" event for a resource because the event buffer was at capacity, so when we then subtracted a value from the same resource, it overflowed. However, I think #212 did fix the bug described in this issue, but not the related one you're seeing, which we'll have to fix separately.

hawkw closed this as completed Dec 20, 2021
hawkw added a commit that referenced this issue Dec 20, 2021
Currently, an issue exists where a subtraction attribute update for a
resource or async op attribute may overflow, if a previous add update
was dropped due to event buffer capacity. This may result in the
aggregator task panicking. See
#180 (comment)
for details.

This branch changes all resource updates to use saturating arithmetic.
In practice, I think it's _much_ more likely for a subtract to overflow
due to missed adds than it is for an add to overflow due to missed subs,
but I made the addition saturating as well, just in case.

We should probably also increase the default event buffer capacity,
since it seems like lots of people are hitting problems with this...
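
As a hedged sketch of the saturating updates described above (the enum variants and the plain u64 value are simplified stand-ins for the real attribute types):

enum AttributeUpdateOp {
    Add,
    Sub,
    Override,
}

fn apply_update(v: &mut u64, op: Option<AttributeUpdateOp>, upd: u64) {
    match op {
        // If an earlier `Add` was dropped because the event buffer was at
        // capacity, a plain `*v -= upd` can underflow and panic the
        // aggregator; clamping at the numeric bounds keeps it alive at the
        // cost of a slightly inaccurate value.
        Some(AttributeUpdateOp::Add) => *v = v.saturating_add(upd),
        Some(AttributeUpdateOp::Sub) => *v = v.saturating_sub(upd),
        Some(AttributeUpdateOp::Override) => *v = upd,
        None => {}
    }
}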