add concurrency tests with Coyote to CI #4879

Yun-Ting · 2023-09-22T23:58:53Z

Changes

Coyote is a cross-platform library and tool for testing concurrent C# code and deterministically reproducing bugs. Deck made by @pdeligia from MSR with excellent visual aids:
https://microsoft-my.sharepoint.com/:p:/p/pdeligia/EW96pNu-66dKurXENmz4du0BtMWCNKiT1jprciqc9P5r1w?e=5Gk9BQ

The methodology has been validated in reproducing a known concurrency bug in OTel 1.3.0 release.

Details of the bug

OTel 1.3.0 release is buggy: Fix Histogram synchronization issue #3534 because of different locking mechanism used between Update and TakeSnapshot for Histogram.
Update: https://github.com/open-telemetry/opentelemetry-dotnet/blob/c9052ef7655b8c9b210500f48ac2a521acf63d66/src/OpenTelemetry/Metrics/MetricPoint.cs#LL325-L357
TakeSnapshot: https://github.com/open-telemetry/opentelemetry-dotnet/blob/c9052ef7655b8c9b210500f48ac2a521acf63d66/src/OpenTelemetry/Metrics/MetricPoint.cs#LL496C1-L521C22
Unit test that can capture the synchronization issue added a test for Histogram Update and TakeSnapshot methods #4340.
What this test is asserting on:

Used Coyote to capture and reproduce the above mentioning synchronization bug in OTel 1.3.0 release

Demo using Coyote to catch the bug and reproduce the steps with the Coyote replay feature

Demo-20231004_162907-Meeting.Recording.mp4

Test code that reproduces this bug:
https://github.com/open-telemetry/opentelemetry-dotnet/compare/main-1.3.0...Yun-Ting:opentelemetry-dotnet:yunl/1.3.0-UpdateAndSnapshotWithCoyote?expand=1
Test output when running the test with Coyote:
repro_multithreaded_bug_on_1.3.0_with_coyote.txt

What's Next

This PR is the first step to integrate Coyote into the CI pipeline.
When Coyote detects concurrency bug(s) during CI, developers can use the replay feature in Coyote to replay and debug a trace file locally. Check the above demo for the E2E debugging flow for OTel developers.

Follow-up PRs can be made to use Coyote to run other existing concurrency test cases.

For example:

We can also add/improve our concurrency tests coverage with the infrastructure introduced by this PR.

Merge requirement checklist

CONTRIBUTING guidelines followed (nullable enabled, static analysis, etc.)
Unit tests added/updated
Appropriate CHANGELOG.md files updated for non-trivial changes
Changes in public API reviewed (if applicable)

Directory.Packages.props

OpenTelemetry.sln

codecov · 2023-09-27T22:15:03Z

Codecov Report

Merging #4879 (7dca448) into main (e904994) will decrease coverage by 0.02%.
Report is 2 commits behind head on main.
The diff coverage is 100.00%.

❗ Current head 7dca448 differs from pull request most recent head 583b7d6. Consider uploading reports for the commit 583b7d6 to get more accurate results

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #4879      +/-   ##
==========================================
- Coverage   83.54%   83.52%   -0.02%     
==========================================
  Files         296      296              
  Lines       12481    12487       +6     
==========================================
+ Hits        10427    10430       +3     
- Misses       2054     2057       +3

Flag	Coverage Δ
unittests	`83.52% <100.00%> (-0.02%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files	Coverage Δ
src/OpenTelemetry.Api/Trace/TracerProvider.cs	`93.93% <100.00%> (ø)`
src/OpenTelemetry/Metrics/AggregatorStore.cs	`78.53% <100.00%> (-1.33%)`	⬇️
src/OpenTelemetry/Metrics/MeterProviderSdk.cs	`89.97% <100.00%> (+0.02%)`	⬆️
src/OpenTelemetry/Metrics/Metric.cs	`96.49% <100.00%> (+0.03%)`	⬆️
src/OpenTelemetry/Metrics/MetricPoint.cs	`68.47% <100.00%> (ø)`
src/OpenTelemetry/Metrics/MetricReaderExt.cs	`87.06% <100.00%> (+0.22%)`	⬆️

... and 6 files with indirect coverage changes

…telemetry-dotnet into yunl/addCoyoteCI_3

build/test-threadSafety.ps1

.github/workflows/ci-concurrency.yml

test/Directory.Packages.props

utpilla · 2023-10-26T06:27:13Z

I was hoping for more to be done for me.

I had the same expectations @CodeBlanch.

I was expecting Coyote to do more than just detecting the presence of a bug. I expected it to provide further insights into how we can fix the issue by pointing out the buggy critical sections/ shared data access or by some other means.

I believe only one of the following is true:

This is an actual limitation of Coyote (that it would only detect the presence of a bug and not provide more code level insights)
We are not using Coyote to its full potential and are not aware of how to do that (We probably don't know how to get hold of the code level insights from the generated artifacts)

@pdeligia Would you be able to help us here?

pdeligia · 2023-10-26T18:42:42Z

@CodeBlanch @utpilla -- the tool is not "magic" for sure :) Your understanding is correct, you need to identify areas you care about testing and have concerns over concurrency, and then write concurrent unit tests as entry points to explore coverage. Coyote will then drive coverage of them.

The good thing is that with a "few" concurrency tests (controlled by Coyote) you can drive coverage across tons of different execution paths that your program might take non-deterministically. These can then be part of your regression suite to catch concurrency bugs as the code evolves over time. (Test generation to reduce the test-writing effort is an interesting problem for sure, and a lot of research is going into this, perhaps LLMs can help soon with all the advancements happening recently, but it's an orthogonal problem to what Coyote focuses on.)

The "giant leap" arguably comes in two directions: (1) find bugs in (very) rare interleavings in a very small setting i.e. few numbers of tasks needed to run your logic (which regular unit tests might never discover or only once in a while, or stress testing which might need to create thousands of tasks, ending up with tons of logs to repro which are hard to make sense of), and (2) deterministically replay such bugs on the debugger (even if the test itself is by itself concurrent), which previously you relied on timing to replay (the infamous Heizenbugs :))

These are the two things Coyote gives you. It won't sadly tell you how to fix the bug as that is hard to generalize to all kinds of bugs and applications! (Although would love for that to be the case! :)) The found bug then needs to be debugged using the coyote -replay process, which (in our experience with many other teams using it internally) is what makes this a significantly better dev experience than say stress testing. Especially for these kinds of super complex concurrency issues with tons of tasks involved that you can't make sense of easily. Honestly, this is when this kind of tooling gets appreciated more. For shallow bugs, the delta value you get is indeed smaller, as these could be found and debugged already relatively easily, without tools like Coyote.

(BTW Coyote also has more advanced capabilities -- that you can't easily get with unit/stress testing -- for example detecting liveness issues, i.e. something that should eventually happen, does not happen, but these can be more complex to encode as tests & debug and might not matter to OT at all.)

Hope you guys get some good value out of this! (And happy -- for MSFT employees -- to connect you to internal teams to learn from their own experience, if interested.) Thanks so much @Yun-Ting for driving all this integration effort :)

Yun-Ting · 2023-10-31T22:25:27Z

Thank you @pdeligia for your detailed explanation about the capabilities of Coyote!

The good thing is that with a "few" concurrency tests (controlled by Coyote) you can drive coverage across tons of different execution paths that your program might take non-deterministically. These can then be part of your regression suite to catch concurrency bugs as the code evolves over time. (Test generation to reduce the test-writing effort is an interesting problem for sure, and a lot of research is going into this, perhaps LLMs can help soon with all the advancements happening recently, but it's an orthogonal problem to what Coyote focuses on.)

The "giant leap" arguably comes in two directions: (1) find bugs in (very) rare interleavings in a very small setting i.e. few numbers of tasks needed to run your logic (which regular unit tests might never discover or only once in a while, or stress testing which might need to create thousands of tasks, ending up with tons of logs to repro which are hard to make sense of), and (2) deterministically replay such bugs on the debugger (even if the test itself is by itself concurrent), which previously you relied on timing to replay (the infamous Heizenbugs :))

Definitely, I think Coyote is a powerful tool that could easily stress test all kinds of interleaving in the focused area. And I'm looking forward to seeing the LLM projects MSR are focusing on lately to further advance developers life ✨. Please let us know if you have interesting LLM projects that you think OTel could try out!

CodeBlanch · 2023-11-01T19:22:46Z

Just FYI we (maintainers) discussed this yesterday and we are going to move forward with Coyote! @utpilla is going to review and merge when he has a cycle or two

CodeBlanch · 2023-11-01T19:36:53Z

@pdeligia

Here's some random food for thought.

You familiar at all with SpecFlow? It allows you to author (essentially) test cases using gherkin.

What I was hoping Coyote would have is something kind of similar to help author/model the test cases.

Imagine something like:

Scenario: TracerProvider.GetTracer concurrency test
Given A {TracerProvider} instance
And Many threads calling {TracerProvider.GetTracer}
When A single thread calls {TracerProvider.Dispose}
Then {TracerProvider.tracers} remains empty

Now that would be magic 😄

Essentially some way to describe how the threads are expected (single/many readers, single/many writes, etc.) and then the tool builds the tests.

pdeligia · 2023-11-02T16:31:01Z

@CodeBlanch great thoughts here! I have heard about SpecFlow, but have not used it myself. However, I am in general familiar with test-generation (even internally, there is a lot of very interest research happening in MSR from some colleagues of mine on test-generation from "high-level specs" of a system, and nowadays even AI can be a great accelerator here!).

Coyote, however, is in some sense orthogonal to test generation: you can think of Coyote as the concurrency exploration engine, which you use once you have tests in place. How these tests are authored or generated is completely up to developers, as we do not want to be opinionated, there are so many different ways to do this including manually authoring them (so this is out of scope for Coyote, by design). Once you have tests in place (manually or by using an automated tool), Coyote will then take control of the inherent non-determinism in them and explore hundreds/thousands of concurrency execution paths for you (out of each test) to find race conditions, etc!

I believe you could use any tool that works with C# (like SpecFlow, but not sure if it supports C#?) to generate your concurrent tests, and then feed them to Coyote to explore task/thread interleavings for bugs :) Are there any such test generation tools for C# you have used before, might be worth trying them?

github-actions · 2023-11-10T03:16:44Z

This PR was marked stale due to lack of activity and will be closed in 7 days. Commenting or Pushing will instruct the bot to automatically remove the label. This bot runs once per day.

… Microsoft.Coyote.Test 1.7.10 -> System.Text.Json (>= 7.0.1)

examples/Directory.Packages.props

utpilla

@Yun-Ting Apologies for the delay! Need to address this https://github.com/open-telemetry/opentelemetry-dotnet/pull/4879/files#r1396782338, otherwise it looks good to be merged.

…telemetry-dotnet into yunl/addCoyoteCI_3

skeleton

3d25657

Yun-Ting changed the title ~~add concurrency test with Coyote to CI~~ add concurrency tests with Coyote to CI Sep 22, 2023

reyang reviewed Sep 26, 2023

View reviewed changes

Directory.Packages.props Outdated Show resolved Hide resolved

reyang reviewed Sep 26, 2023

View reviewed changes

OpenTelemetry.sln Outdated Show resolved Hide resolved

Yun-Ting added 2 commits September 27, 2023 14:58

test

b776160

Merge branch 'main' into yunl/addCoyoteCI_3

9d2982a

Yun-Ting added 17 commits September 27, 2023 15:27

update

a7ff1ac

split tests and cleanup

0a95eb7

yikes

22eb5dd

Merge branch 'main' into yunl/addCoyoteCI_3

b9b4db7

space

1da2417

Merge branch 'main' into yunl/addCoyoteCI_3

a18602b

cleanups and net8

5b96500

Merge branch 'main' into yunl/addCoyoteCI_3

c2f8650

comment

cc152e7

Merge branch 'yunl/addCoyoteCI_3' of https://github.com/Yun-Ting/open…

7ce1dc4

…telemetry-dotnet into yunl/addCoyoteCI_3

test

2bea31c

Merge branch 'main' into yunl/addCoyoteCI_3

3a69758

Merge branch 'main' into yunl/addCoyoteCI_3

30e582f

use 7 for now

67d51d0

cleanups

f11877e

update shell

c48df7f

cleanups

f47ed19

Yun-Ting marked this pull request as ready for review October 6, 2023 22:29

Yun-Ting requested a review from a team October 6, 2023 22:29

Yun-Ting commented Oct 6, 2023

View reviewed changes

build/test-threadSafety.ps1 Show resolved Hide resolved

,

fe21daf

reyang reviewed Oct 6, 2023

View reviewed changes

.github/workflows/ci-concurrency.yml Show resolved Hide resolved

utpilla reviewed Oct 10, 2023

View reviewed changes

test/Directory.Packages.props Show resolved Hide resolved

github-actions bot removed the Stale Issues and pull requests which have been flagged for closing due to inactivity label Oct 21, 2023

Yun-Ting added 6 commits October 31, 2023 13:43

Merge branch 'main' into yunl/addCoyoteCI_3

3309c8e

md workflow

80f2fc9

composition

57d1b95

sanity and sln

dfc8f80

fix

1601fa2

comment

f24b08e

Merge branch 'main' into yunl/addCoyoteCI_3

ace5edf

github-actions bot added the Stale Issues and pull requests which have been flagged for closing due to inactivity label Nov 10, 2023

utpilla removed the Stale Issues and pull requests which have been flagged for closing due to inactivity label Nov 13, 2023

Yun-Ting added 3 commits November 15, 2023 15:46

merge main

b1015ce

fixed error NU1605: OpenTelemetry.Tests -> Microsoft.Coyote 1.7.10 ->…

f7afd78

… Microsoft.Coyote.Test 1.7.10 -> System.Text.Json (>= 7.0.1)

Merge branch 'main' into yunl/addCoyoteCI_3

9d98916

utpilla reviewed Nov 17, 2023

View reviewed changes

examples/Directory.Packages.props Outdated Show resolved Hide resolved

utpilla reviewed Nov 17, 2023

View reviewed changes

Yun-Ting added 4 commits November 17, 2023 09:58

Merge branch 'main' into yunl/addCoyoteCI_3

af6e0fe

comment

0388f74

Merge branch 'yunl/addCoyoteCI_3' of https://github.com/Yun-Ting/open…

7dca448

…telemetry-dotnet into yunl/addCoyoteCI_3

ci

583b7d6

utpilla approved these changes Nov 17, 2023

View reviewed changes

utpilla merged commit d346196 into open-telemetry:main Nov 17, 2023
40 checks passed

utpilla mentioned this pull request Nov 17, 2023

[repo] CI improvements #5023

Merged

Yun-Ting deleted the yunl/addCoyoteCI_3 branch November 17, 2023 19:19

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

add concurrency tests with Coyote to CI #4879

add concurrency tests with Coyote to CI #4879

Yun-Ting commented Sep 22, 2023 •

edited

Loading

codecov bot commented Sep 27, 2023 •

edited

Loading

utpilla commented Oct 26, 2023

pdeligia commented Oct 26, 2023

Yun-Ting commented Oct 31, 2023

CodeBlanch commented Nov 1, 2023

CodeBlanch commented Nov 1, 2023

pdeligia commented Nov 2, 2023 •

edited

Loading

github-actions bot commented Nov 10, 2023

utpilla left a comment

add concurrency tests with Coyote to CI #4879

add concurrency tests with Coyote to CI #4879

Conversation

Yun-Ting commented Sep 22, 2023 • edited Loading

Changes

Merge requirement checklist

codecov bot commented Sep 27, 2023 • edited Loading

Codecov Report

utpilla commented Oct 26, 2023

pdeligia commented Oct 26, 2023

Yun-Ting commented Oct 31, 2023

CodeBlanch commented Nov 1, 2023

CodeBlanch commented Nov 1, 2023

pdeligia commented Nov 2, 2023 • edited Loading

github-actions bot commented Nov 10, 2023

utpilla left a comment

Choose a reason for hiding this comment

Yun-Ting commented Sep 22, 2023 •

edited

Loading

codecov bot commented Sep 27, 2023 •

edited

Loading

pdeligia commented Nov 2, 2023 •

edited

Loading