This repository has been archived by the owner on Jan 23, 2023. It is now read-only.
Fix a potential race in iterating counters #28112
Merged
Summary
There was a mistake that was made in backporting dotnet/runtime#40259 to #28089 where _counters was used to iterate instead of the snapshotted value of counters. _counters is a list that needs to be lock-protected, but this access was happening outside of the lock, which is what the snapshot is for. This caused a race condition on the read/write on this list, causing a crash. The problem does not exist in the 5.0 fix, only for the 3.1 backport.
Customer Impact
Medium/High. A customer who hits this issue may experience a crash. It can occur when they turn on the runtime counters (System.Runtime) or any other set of counter providers, and then modify the list of counters while a counter callback is running, by either creating or disposing instances of EventCounters. For customers that don't consume/create their own EventCounters, this shouldn't be a problem. For those who do, it's fairly easy to hit, and the likelihood increases as the counter polling interval gets shorter.
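The scenario above can be illustrated with a minimal sketch. It is written in Python rather than C#, and every name in it (CounterGroup, Counter, poll_unsafe, poll_snapshot) is hypothetical; it is not the runtime's actual code. To stay deterministic, the "other thread" that disposes a counter is simulated by a counter removing itself during its own callback. In Python the symptom of iterating the shared list directly is a silently skipped counter; in .NET, a List<T> modified during enumeration typically throws instead, which is consistent with the crash described here.

```python
import threading


class Counter:
    """Hypothetical counter; optionally disposes itself while being polled."""

    def __init__(self, name, dispose_self=False):
        self.name = name
        self._dispose_self = dispose_self

    def on_poll(self, group):
        # Stands in for another thread disposing a counter mid-callback.
        if self._dispose_self:
            self._dispose_self = False
            group.remove(self)


class CounterGroup:
    """Hypothetical stand-in for the owner of the lock-protected list."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counters = []  # shared state: only touch while holding _lock

    def add(self, counter):
        with self._lock:
            self._counters.append(counter)

    def remove(self, counter):
        with self._lock:
            self._counters.remove(counter)

    def poll_unsafe(self):
        """Buggy shape: iterates the shared list itself, outside the lock."""
        visited = []
        for counter in self._counters:
            visited.append(counter.name)
            counter.on_poll(self)
        return visited

    def poll_snapshot(self):
        """Fixed shape: copy under the lock, then iterate only the copy."""
        with self._lock:
            counters = list(self._counters)  # the snapshot
        visited = []
        for counter in counters:
            visited.append(counter.name)
            counter.on_poll(self)
        return visited


def build_group():
    """Build a group of three counters where the first disposes itself."""
    group = CounterGroup()
    for counter in (Counter("a", dispose_self=True), Counter("b"), Counter("c")):
        group.add(counter)
    return group
```

With this setup, build_group().poll_unsafe() visits only "a" and "c" because the list shifts under the iterator, while build_group().poll_snapshot() visits "a", "b" and "c". The callbacks run outside the lock in both versions, so a callback that adds or removes counters does not deadlock; the snapshot only guarantees that the loop never reads the shared list after the lock is released.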
Regression
Yes. From 3.1.8 -> 3.1.9 servicing release.
Testing
We do not have a local repro of this failure, but we tested the fix by providing a custom build of the runtime with this fix to the internal partner whose test initially reported this problem. They ran it in the production environment that had hit this issue and reported that the fix addresses it.
Risk
Relatively low. The root cause and the fix are well understood, and the fix was verified through partner testing.