-
-
Notifications
You must be signed in to change notification settings - Fork 147
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(subscriber): ignore subsequent data for spans that weren't recorded #212
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
hawkw
requested review from
carllerche,
seanmonstar,
pnkfelix and
zaharidichev
December 16, 2021 21:08
carllerche
approved these changes
Dec 16, 2021
Currently, a strange behavior exists in the `console-subscriber` crate. If a span for a task, async op, or resource is created, and the event buffer is full, the aggregator task will not be informed of that span's creation. But, if the event buffer then empties out, we might send the aggregator task enter/exit/close events for that span. This results in the aggregator receiving events for an unknown span, which might result in panics or subtly wrong data. This branch fixes this by changing the `ConsoleLayer` to track whether it was able to successfully send an event for the creation of a span, and to only care about subsequent events on that span if it successfully recorded the span's creation. This is done by inserting a zero-sized marker type into the span's extensions map, and ignoring spans that don't have this marker.
Similarly, if we don't record entering a span due to event buffer capacity, it doesn't make sense to record exiting it. This commit changes the `ConsoleLayer` to only push a span to the current thread's span stack if we were able to successfully send an `Enter` event to the aggregator. This means that it won't be considered the parent span for other events/spans. When a span is exited, we only send an `Exit` event to the aggregator if the span *was* previously recorded as being entered. In theory, ignoring subsequent events on spans that were dropped due to buffer capacity technically means we are losing *more* data than we would have if we did not ignore those spans. But, the data we are losing here is *wrong*. For example, we cannot calculate a correct poll time for a poll where we didn't record the beginning of the poll, and we only recorded the poll ending. Therefore, I think it's better to ignore this data than to make a half-assed attempt to record it even though we know it's incorrect. I believe this will probably also fix issue #180. That issue occurs when we attempt to decrement the number of times a task has been polled, and sometimes --- if an `enter` event for that task was missed --- we may subtract more than we've added to the counter. By ignoring exits for spans that we never recorded as entered, this panic should now be avoided.
hawkw
force-pushed
the
eliza/ignore-failed-sends
branch
from
December 16, 2021 21:35
c919e63
to
dd562d7
Compare
This was referenced Dec 17, 2021
hawkw
added a commit
that referenced
this pull request
Jan 12, 2022
…hannel (#238) ## Motivation Currently, there are some rather bad issues that occur when the event buffer is at capacity and events are dropped. Completely *losing* data due to buffer capacity is relatively okay: if we set a bound on how much memory the console can use, and we don't record new things that occur when we've reached that limit, this is correct and acceptable behavior. However, the current design can result in *incorrect* data when events are lost due to the buffer being at capacity. This is because we currently record things like starting to poll a task/resource, ending a poll, dropping a task/resource/async op, and waker operations, as individual events in the buffer. This means that we can have a situation where the creation of a task was recorded, because there was buffer capacity at the time, but then when the task ended, the buffer was full, so we never recorded its termination. That results in the tasks appearing to run forever and never terminate --- see issue #230. Similarly, if we record the beginning of a poll, but drop the end-of-poll event because we're at capacity, this means we will (incorrectly) record a poll that goes on forever, which is obviously incorrect. I think this may also be the cause of the false positives with the lost waker lint (#149), if we record a waker drop but missed a waker clone that previously occurred. The change in #212 fixed one category of issue that occurs due to event buffer capacity --- when a task, resource, or async op's _creation_ event is dropped due to buffer capacity, we skip any subsequent events related to that task/resource/op. However, this doesn't fix issues where the subsequent events are the ones that are dropped. ## Solution This branch proposes a solution to this whole category of event buffer capacity related issues. Unfortunately, this requires rewriting a *lot* of `console-subscriber`'s internals. In the new approach, we now _only_ send events over the channel when creating a new task, resource, or async op. Those events now contain an `Arc` holding the stats for that entity. Another clone of the `Arc` is stored in the `tracing_subscriber::Registry`'s [span extensions] for the span corresponding to that entity. When the `ConsoleLayer` records subsequent events for a particular entity, such as starting/ending a poll, it looks up the span by ID, and updates the stats type stored in its extensions. The aggregator stores its clone of the `Arc` in a map of entities, just like it does currently, but no longer handles actually updating the stats; just building wire format updates from any tracked entities whose data was updated by the layer. This should fix all issues where dropping something due to event buffer capacity results in incorrect data. Once we have successfully recorded the *creation* of a task, resource, or async op, any subsequent updates to its stats are *guaranteed* to be reliable. If the channel is at capacity and we fail to record a new resource/task/op, we never create a stats extension for it, and we won't record anything for it at all. Otherwise, it will always have correct data recorded. When possible, the stats in the `Arc`ed stats are updated atomically. In some cases, this isn't easily possible, and some fields of the stats types are stored in a mutex. In particualr, this is required for storing timestamps. I don't really love that, but these mutices should be contented very infrequently. Stats aren't marked as having unset updates until after the stats inside the mutices have been updated, so the aggregator will not try to lock the mutex if the layer is currently updating it; instead, it will simply be included in the next update once the layer is no longer touching it. Mutices here will only be contended when multiple threads are updating a task's stats at the same time, which should occur very rarely...and in most cases, they *still* won't have to contend a mutex, since access to most of the mutices are guarded by an atomic variable for e.g. determining which thread actually was the last to complete a concurrent poll. The biggest performance downside of the mutices is probably not actually contention, but the additional heap allocation required when using `std::sync::Mutex`. However, since we have conditional `parking_lot` support, parking_lot can be used to avoid requiring additional allocations. In the future, it's probably possible to make more of this atomic by converting timestamps into integers and storing them in atomic variables. I haven't done this yet because both the protobuf timestamps and `std::time` timestamps are larger than a single 64-bit number and it might take a little extra work to ensure we can nicely fit them in an `AtomicUsize`...but we can probably do that later. [span extensions]: https://docs.rs/tracing-subscriber/latest/tracing_subscriber/registry/struct.SpanRef.html#method.extensions Signed-off-by: Eliza Weisman <eliza@buoyant.io>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Depends on #211 (because I didn't want to figure out the potential merge
conflicts).
Two separate (but closely related) commits:
fix(subscriber): ignore spans that weren't initially recorded
Currently, a strange behavior exists in the
console-subscriber
crate. If a span for a task, async op, or resource is created, and the
event buffer is full, the aggregator task will not be informed of that
span's creation. But, if the event buffer then empties out, we might
send the aggregator task enter/exit/close events for that span. This
results in the aggregator receiving events for an unknown span, which
might result in panics or subtly wrong data.
This branch fixes this by changing the
ConsoleLayer
to track whetherit was able to successfully send an event for the creation of a span,
and to only care about subsequent events on that span if it
successfully recorded the span's creation. This is done by inserting a
zero-sized marker type into the span's extensions map, and ignoring
spans that don't have this marker.
fix(subscriber): ignore exiting spans that were never entered
Similarly, if we don't record entering a span due to event buffer
capacity, it doesn't make sense to record exiting it. This commit
changes the
ConsoleLayer
to only push a span to the current thread'sspan stack if we were able to successfully send an
Enter
event tothe aggregator. This means that it won't be considered the parent span
for other events/spans. When a span is exited, we only send an
Exit
event to the aggregator if the span was previously recorded as being
entered.
In theory, ignoring subsequent events on spans that were dropped due
to buffer capacity technically means we are losing more data than we
would have if we did not ignore those spans. But, the data we are
losing here is wrong. For example, we cannot calculate a correct
poll time for a poll where we didn't record the beginning of the poll,
and we only recorded the poll ending. Therefore, I think it's better
to ignore this data than to make a half-assed attempt to record it
even though we know it's incorrect.
I believe this will probably also fix issue thread 'console_subscriber' panicked at 'attempt to subtract with overflow' #180. That issue occurs
when we attempt to decrement the number of times a task has been
polled, and sometimes --- if an
enter
event for that task was missed--- we may subtract more than we've added to the counter. By ignoring
exits for spans that we never recorded as entered, this panic should
now be avoided.