Fix a potential race in iterating counters #28112

sywhang · 2020-10-21T07:40:07Z

Summary

There was a mistake that was made in backporting dotnet/runtime#40259 to #28089 where _counters was used to iterate instead of the snapshotted value of counters. _counters is a list that needs to be lock-protected, but this access was happening outside of the lock, which is what the snapshot is for. This caused a race-condition on the read/write on this list, causing a crash. The problem does not exist in the 5.0 fix, only for the 3.1 backport.

Customer Impact

Medium/High. When hit with this issue, a customer may experience a crash. A customer may run into this issue when they turn on the runtime counters (System.Runtime) or any other set of counter providers, and then modify the list of counters while a counter callback is happening by either creating or disposing instances of EventCounters. For customers that don't consume/create their own EventCounters, this shouldn't be a problem. For those who do, it's pretty easy to hit with the likelihood increasing the shorter they set the counter polling interval.

Regression

Yes. From 3.1.8 -> 3.1.9 servicing release.

Testing

We do not have a local repro of this failure, but we tested the fix by providing a custom build of the runtime with this fix to the internal partner test that initially reported this problem. They ran it in production that hit this issue and reported that this addresses the issue.

Risk

The root cause/fix is well-understood and the fix was verified through partner testing - I would say it is relatively low.

noahfalk · 2020-10-21T08:56:52Z

Customer Impact

@sywhang - its best if the text here describes what the user needs to do to hit the issue and some estimate of how likely the situation is to occur. The resulting behavior when the bug is triggered is important as well (crashing as you mentioned). The total impact is then a combination of both: likelihood_to_hit * effect_when_hit.

noahfalk

LGTM

noahfalk · 2020-10-21T09:00:17Z

cc @davmason for another pair of eyes (despite being simple I was also the one who missed catching it during the port so I'd appreciate getting a fresh eyes on it)

jeffschwMSFT

Approved. Can you elaborate on the customer impact and how often it will occur? We will consider this for 3.1.

seif · 2020-11-05T15:32:36Z

I am affected by this, trying to add Counters to the EventSource dynamically, the process crashes because of this

System.InvalidOperationException: Collection was modified; enumeration operation may not execute.
at System.Collections.Generic.List1.Enumerator.MoveNextRare() at System.Collections.Generic.List1.Enumerator.MoveNext()
at System.Diagnostics.Tracing.CounterGroup.OnTimer()
at System.Diagnostics.Tracing.CounterGroup.<>c__DisplayClass21_0.b__0()
at System.Diagnostics.Tracing.CounterGroup.PollForValues()
at System.Threading.ThreadHelper.ThreadStart()

Any workaround you suggest until 3.1.11 is out?

Fix a potential race in iterating counters

7357ff6

sywhang added the area-Tracing label Oct 21, 2020

sywhang requested a review from noahfalk October 21, 2020 07:40

sywhang added Servicing-consider Issue for next servicing release review area-Tracing and removed area-Tracing Servicing-consider Issue for next servicing release review labels Oct 21, 2020

noahfalk approved these changes Oct 21, 2020

View reviewed changes

jeffschwMSFT approved these changes Oct 21, 2020

View reviewed changes

jeffschwMSFT added the Servicing-consider Issue for next servicing release review label Oct 21, 2020

jeffschwMSFT added this to the 3.1.x milestone Oct 21, 2020

davmason approved these changes Oct 21, 2020

View reviewed changes

leecow added Servicing-approved Approved for servicing release and removed Servicing-consider Issue for next servicing release review labels Oct 22, 2020

leecow modified the milestones: 3.1.x, 3.1.11 Oct 22, 2020

wtgodbe merged commit 9fb9458 into release/3.1 Nov 16, 2020

kevingosse mentioned this pull request Dec 28, 2020

Fix crash caused by EventCounters test DataDog/dd-trace-dotnet#1113

Merged

stephentoub deleted the dev/suwhang/fix-counter-race-condition branch January 5, 2021 02:43

sdmaclea pushed a commit to sdmaclea/coreclr that referenced this pull request Jan 12, 2021

Fix a potential race in iterating counters (dotnet#28112)

b30331f

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix a potential race in iterating counters #28112

Fix a potential race in iterating counters #28112

sywhang commented Oct 21, 2020 •

edited

Loading

noahfalk commented Oct 21, 2020

noahfalk left a comment

noahfalk commented Oct 21, 2020

jeffschwMSFT left a comment

seif commented Nov 5, 2020

Fix a potential race in iterating counters #28112

Fix a potential race in iterating counters #28112

Conversation

sywhang commented Oct 21, 2020 • edited Loading

Summary

Customer Impact

Regression

Testing

Risk

noahfalk commented Oct 21, 2020

noahfalk left a comment

Choose a reason for hiding this comment

noahfalk commented Oct 21, 2020

jeffschwMSFT left a comment

Choose a reason for hiding this comment

seif commented Nov 5, 2020

sywhang commented Oct 21, 2020 •

edited

Loading