Fix flaky tests in test_otel.py::TestOtelIntegration #52936

xBis7 · 2025-07-06T12:35:36Z

Related issue: #52906

Although the test always passes locally, it has a 20-30% failure rate on the remote CI.

To investigate, I created a custom CI workflow that runs the test class on repeat.

I was able to determine that test_scheduler_change_after_the_first_task_finishes was the flaky test method. Once it failed, it caused every test running after it to fail as well. After commenting it out, the class passed 40/40 runs, whereas before it passed only 25/40 runs.

https://github.com/xBis7/airflow/actions/runs/16096527407

https://github.com/xBis7/airflow/actions/runs/16096660040

https://github.com/xBis7/airflow/actions/runs/16096742809

https://github.com/xBis7/airflow/actions/runs/16096823166
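As a rough sanity check on the 40/40 result (not part of this PR), the sketch below estimates how likely 40 consecutive passes would be if the class were still flaky. It assumes every run fails independently with a fixed probability, which simplifies real CI behavior, and the function name is hypothetical.

```python
def prob_all_pass(failure_rate: float, runs: int) -> float:
    """Probability that `runs` independent runs all pass.

    Assumes a fixed, independent per-run failure rate (a simplification).
    """
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    return (1.0 - failure_rate) ** runs


# The 20-30% range quoted above, plus the 15/40 failures seen before the change.
for rate in (0.20, 0.30, 15 / 40):
    print(f"failure rate {rate:.3f}: P(40/40 pass) = {prob_all_pass(rate, 40):.2e}")
```

Under that assumption, even the lowest of these failure rates makes a clean 40/40 run very unlikely to be a coincidence.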

The actual issue causing the flakiness is the value of the scheduler_health_check_threshold flag.

The test

- uses 2 schedulers
- scheduler1 processes the dag_run
- in the middle of the dag_run, scheduler1 becomes idle
  - This tries to simulate the scenario where there is a very long-running dag_run, one scheduler stops processing it, and another with more resources picks it up
  - In that case, scheduler2 finishes it and realizes that another scheduler has started the dr spans.
  - Scheduler2 marks the dr spans in the DB so that the original scheduler, which holds the objects in memory, will know to end the spans

The rest of the tests need scheduler_health_check_threshold to have a low value so that scheduler2 can mark scheduler1 as unhealthy pretty quickly. But the opposite is needed for this test.

In the 20-30% of runs where the test fails, scheduler2 marks scheduler1 as unhealthy and therefore recreates the older spans because they are considered lost. The test then times out waiting for the span status to change from ENDED to SHOULD_END, which will never happen.
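The race can be illustrated with a toy model. This is a simplified sketch, not Airflow's actual health check: it assumes a scheduler counts as healthy while its last heartbeat is newer than the threshold, and that an idle scheduler1 stops updating its heartbeat. The function name and all timings are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def is_considered_healthy(
    latest_heartbeat: datetime, now: datetime, threshold_seconds: float
) -> bool:
    """Toy health check: healthy while the last heartbeat is recent enough."""
    return (now - latest_heartbeat).total_seconds() < threshold_seconds


# Hypothetical timings, chosen only to illustrate the race.
last_heartbeat = datetime(2025, 7, 6, 12, 0, 0, tzinfo=timezone.utc)
now = last_heartbeat + timedelta(seconds=20)  # scheduler1 has been idle for 20s

low_threshold = 15   # suits the other tests: an unhealthy scheduler is detected fast
high_threshold = 90  # suits this test: scheduler1 stays "healthy" while idle

print(is_considered_healthy(last_heartbeat, now, low_threshold))   # False
print(is_considered_healthy(last_heartbeat, now, high_threshold))  # True
```

With the low value, whether scheduler1 is still inside the threshold when scheduler2 checks depends on timing, which would match an intermittent failure.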

After increasing the flag just for this test, the flakiness is gone. The test passed in 99 of 100 runs. The remaining run was canceled because the workflow didn't have enough resources and the test exceeded the 30-minute threshold.

https://github.com/xBis7/airflow/actions/runs/16098256324
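One possible way to scope a setting to a single test is to build a separate environment for the scheduler processes that test starts. The sketch below is an assumption about how that could look, not the code from this PR. It relies only on Airflow's `AIRFLOW__{SECTION}__{KEY}` environment variable convention; the helper name and the numeric values are hypothetical.

```python
import os
from collections.abc import Mapping

# Environment variable form of [scheduler] scheduler_health_check_threshold.
THRESHOLD_VAR = "AIRFLOW__SCHEDULER__SCHEDULER_HEALTH_CHECK_THRESHOLD"


def scheduler_env(base_env: Mapping[str, str], health_check_threshold: int) -> dict[str, str]:
    """Return a copy of `base_env` with the health check threshold overridden.

    Hypothetical helper: the input mapping is never modified.
    """
    env = dict(base_env)
    env[THRESHOLD_VAR] = str(health_check_threshold)
    return env


# Hypothetical values: low for most tests, high for the scheduler-change test.
class_env = scheduler_env(os.environ, 15)
handoff_test_env = scheduler_env(class_env, 90)

print(class_env[THRESHOLD_VAR], handoff_test_env[THRESHOLD_VAR])
```

The resulting dict could be passed as `env=` to `subprocess.Popen` when starting a scheduler, so the override never leaks into the other tests in the class.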


xBis7 · 2025-07-06T12:36:20Z

@potiuk @amoghrajesh Can you take a look please?

potiuk

Nice! Let's merge and see!

potiuk · 2025-07-06T13:34:50Z

You never know with flaky tests :D.. That's their nature...

amoghrajesh

Nice, fixing a flaky test isn't easy!

xBis7 · 2025-07-06T14:00:37Z

Thank you for the quick merge! Let's hope there are no more random failures.

fix flakiness

c61f8f8

potiuk approved these changes Jul 6, 2025


potiuk merged commit a338157 into apache:main Jul 6, 2025
56 checks passed

amoghrajesh reviewed Jul 6, 2025


xBis7 deleted the flaky-otel-tests-52906 branch July 6, 2025 14:12

HsiuChuanHsu pushed a commit to HsiuChuanHsu/airflow that referenced this pull request Jul 10, 2025

fix flakiness (apache#52936)

7401f3f

stephen-bracken pushed a commit to stephen-bracken/airflow that referenced this pull request Jul 15, 2025

fix flakiness (apache#52936)

eeaa01c
