thread 'console_subscriber' panicked at 'attempt to subtract with overflow' #180
Oh, actually, I think I just encountered this on a simpler test case. I'll try to produce that shortly.
Simpler test case (and should be deterministic now, though it does take a while for the problem to surface; about 90 seconds or so on my desktop):

```rust
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error + Send + Sync>> {
    console_subscriber::init();

    let mut handles = Vec::new();
    for i in 1..1000 {
        handles.push(tokio::task::spawn(async move {
            loop {
                fact_yielding(i).await;
            }
        }));
    }
    for h in handles {
        h.await?;
    }
    Ok(())
}

async fn fact_yielding(n: u32) -> f64 {
    let mut i = 0_u32;
    let mut accum = 1_f64;
    loop {
        if i == n {
            break;
        }
        tokio::task::yield_now().await;
        i += 1;
        accum *= i as f64;
    }
    accum
}
```

I am running this atop Eliza's branch with the fix for compute-intensive task(s), i.e. branch https://github.com/tokio-rs/console/tree/eliza/approaching-capacity
Hmm, the panic occurs on this line:
That code seems to assume the number of polls will never be zero when handling a poll-end event, because it's incremented when starting a poll, and a poll can't end without having been started. This makes me wonder if the panic occurs because the event that started the poll was dropped when the event channel was at capacity, but the event for ending the poll arrived when there was capacity for it. I suppose we could just use a saturating subtraction there so we don't panic...but that doesn't seem like correctly handling the potentially lossy nature of the event channel. I wonder if we want to avoid generating poll-end events if the poll-start event was not recorded.
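For illustration only, here is a rough sketch of why the subtraction overflows and what the two options above would look like. `PollStats` and its methods are hypothetical stand-ins, not the actual `console-subscriber` types:

```rust
#[derive(Default)]
struct PollStats {
    current_polls: u64,
}

impl PollStats {
    fn start_poll(&mut self) {
        self.current_polls += 1;
    }

    // If the poll-start event was dropped because the event channel was at
    // capacity, `current_polls` may still be 0 here, and `current_polls - 1`
    // panics in debug builds with "attempt to subtract with overflow".
    fn end_poll_naive(&mut self) {
        self.current_polls -= 1;
    }

    // Option 1: a saturating subtraction never panics, but silently keeps a
    // wrong count when the matching poll-start was lost.
    fn end_poll_saturating(&mut self) {
        self.current_polls = self.current_polls.saturating_sub(1);
    }

    // Option 2: only handle the poll-end if the matching poll-start was
    // actually recorded, so the counter stays internally consistent.
    fn end_poll_if_started(&mut self, start_was_recorded: bool) {
        if start_was_recorded {
            self.current_polls -= 1;
        }
    }
}

fn main() {
    let mut stats = PollStats::default();
    // Simulate a dropped poll-start followed by a recorded poll-end:
    stats.end_poll_saturating();      // stays at 0 instead of panicking
    stats.end_poll_if_started(false); // ignored, counter untouched
    assert_eq!(stats.current_polls, 0);
}
```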
Temporary workaround for tokio-rs#180
Similarly, if we don't record entering a span due to event buffer capacity, it doesn't make sense to record exiting it. This commit changes the `ConsoleLayer` to only push a span to the current thread's span stack if we were able to successfully send an `Enter` event to the aggregator. This means that it won't be considered the parent span for other events/spans. When a span is exited, we only send an `Exit` event to the aggregator if the span *was* previously recorded as being entered.

In theory, ignoring subsequent events on spans that were dropped due to buffer capacity technically means we are losing *more* data than we would have if we did not ignore those spans. But, the data we are losing here is *wrong*. For example, we cannot calculate a correct poll time for a poll where we didn't record the beginning of the poll, and we only recorded the poll ending. Therefore, I think it's better to ignore this data than to make a half-assed attempt to record it even though we know it's incorrect.

I believe this will probably also fix issue #180. That issue occurs when we attempt to decrement the number of times a task has been polled, and sometimes --- if an `enter` event for that task was missed --- we may subtract more than we've added to the counter. By ignoring exits for spans that we never recorded as entered, this panic should now be avoided.
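A minimal sketch of the approach this commit message describes, using hypothetical stand-ins (`EventSender`, `Event`, and a thread-local span stack) rather than the real `ConsoleLayer` internals:

```rust
use std::cell::RefCell;

enum Event {
    Enter(u64),
    Exit(u64),
}

// Stand-in for the bounded channel to the aggregator task.
struct EventSender;

impl EventSender {
    /// Returns `false` when the event buffer is at capacity and the event
    /// had to be dropped.
    fn try_send(&self, _event: Event) -> bool {
        true // placeholder
    }
}

thread_local! {
    // Per-thread stack of spans we successfully recorded as entered.
    static SPAN_STACK: RefCell<Vec<u64>> = RefCell::new(Vec::new());
}

fn on_enter(sender: &EventSender, span_id: u64) {
    // Only push the span onto the stack if the `Enter` event was actually
    // delivered to the aggregator.
    if sender.try_send(Event::Enter(span_id)) {
        SPAN_STACK.with(|s| s.borrow_mut().push(span_id));
    }
}

fn on_exit(sender: &EventSender, span_id: u64) {
    // Only send an `Exit` if this span was previously recorded as entered;
    // otherwise the aggregator would see an exit without a matching enter.
    let was_entered = SPAN_STACK.with(|s| {
        let mut stack = s.borrow_mut();
        if stack.last() == Some(&span_id) {
            stack.pop();
            true
        } else {
            false
        }
    });
    if was_entered {
        sender.try_send(Event::Exit(span_id));
    }
}

fn main() {
    let sender = EventSender;
    on_enter(&sender, 1);
    on_exit(&sender, 1); // Exit is sent: span 1 was recorded as entered
    on_exit(&sender, 2); // ignored: span 2 was never recorded as entered
}
```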
I'm pretty sure #212 fixed this --- I ran the repro from #180 (comment) on the commit prior to #212, and (on my machine) it panicked after 3 minutes. On commit ad442e2, the same repro can run for >45 minutes without panicking, so I'm going to say this is fixed.
thread 'console_subscriber' panicked at 'attempt to subtract with overflow', /home/w/repos/clone_repos/console/console-subscriber/src/aggregator/mod.rs:1079:45

This panic is reproducing again.
Hmm, thanks for reporting that, but I don't think this is the same bug --- the panic in this issue occurred here:
Meanwhile, your panic occurs on a very different line, here: console/console-subscriber/src/aggregator/mod.rs, line 1079 (at commit ef41507).
I think this is a related issue --- we probably missed an "add" event for a resource because the event buffer was at capacity, so when we then subtracted a value from the same resource, it overflowed. However, I think #212 did fix the bug described in this issue, but not the related one you're seeing, which we'll have to fix separately.
Currently, an issue exists where a subtraction attribute update for a resource or async op attribute may overflow, if a previous add update was dropped due to event buffer capacity. This may result in the aggregator task panicking. See #180 (comment) for details.

This branch changes all resource updates to use saturating arithmetic. In practice, I think it's _much_ more likely for a subtract to overflow due to missed adds than it is for an add to overflow due to missed subs, but I made the addition saturating as well, just in case.

We should probably also increase the default event buffer capacity, since it seems like lots of people are hitting problems with this...
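Roughly what the saturating-arithmetic change amounts to, as a standalone sketch. The `UpdateOp` enum and `apply_update` function here are hypothetical, not the actual aggregator code:

```rust
enum UpdateOp {
    Add(u64),
    Sub(u64),
}

fn apply_update(current: u64, op: UpdateOp) -> u64 {
    match op {
        // A plain `current + delta` / `current - delta` can overflow in
        // debug builds if the matching earlier update was dropped because
        // the event buffer was at capacity.
        UpdateOp::Add(delta) => current.saturating_add(delta),
        UpdateOp::Sub(delta) => current.saturating_sub(delta),
    }
}

fn main() {
    // If an earlier `Add(10)` update was dropped, a later `Sub(10)`
    // saturates to 0 instead of panicking with
    // "attempt to subtract with overflow".
    assert_eq!(apply_update(0, UpdateOp::Sub(10)), 0);
    assert_eq!(apply_update(5, UpdateOp::Add(3)), 8);
}
```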
thread 'console_subscriber' panicked at 'attempt to subtract with overflow', console/console-subscriber/src/aggregator/mod.rs:242:9
This is while attaching the console to some web-crawler demo code I've been working on. (It's deliberately naive code.)
Here is that code base:
https://github.com/pnkfelix/huaa-exsets/tree/console-play/crawler
The nature of the current web crawler is that it visits pages in a nondeterministic order, which means that the overflow panic is not always the first panic it encounters. So it currently does not deterministically reproduce the bug here. But I bet I can narrow it down.