
Quarantine test_backfill_integration in dask executor #32780

Merged
potiuk merged 1 commit into apache:main from quarantine-test-backfill-job on Jul 23, 2023

Conversation

potiuk (Member) commented Jul 23, 2023

The test has recently been failing with a deadlock (see #32778) and needs a thorough look if we want to find the root cause and a remedy.

In the meantime it looks like a niche case connected with the Dask Executor that is rather obscure, and we have no expertise in diagnosing and solving problems with it, so it might take a long time until the problem is diagnosed (and maybe we will even decide not to care about it and let the Dask community take a look and either fix or ignore it).

We aim to have a very low number of quarantined tests (currently we have one, and we have not run it for a while, as it was a MySQL test run on Postgres), but we now have the opportunity to also improve the quarantined tests framework.

This test will be run together with the other (one) quarantined test and (see the selection sketch after the list below):

  • they will not be run in our regular tests
  • they are run sequentially, not in parallel with all other tests
  • they are run for all 4 backends, but only for the default versions of those backends
  • failure of the quarantined tests will not cause the whole job to fail, nor will it block constraints from being generated and updated
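As a reference, here is a minimal, hypothetical sketch (not the actual Airflow CI wiring; the file name, marker description and commands are assumptions) of how a "quarantined" pytest marker lets regular runs deselect a test while a dedicated job still runs it:

```python
# conftest.py - hypothetical sketch, not Airflow's actual test configuration.


def pytest_configure(config):
    # Register the marker so "-m" expressions can filter on it without warnings.
    config.addinivalue_line(
        "markers", "quarantined: flaky tests that run separately and sequentially"
    )


# Regular CI runs would then deselect quarantined tests, for example:
#     pytest -m "not quarantined" tests/
# while the dedicated quarantine job runs only them, one at a time:
#     pytest -m "quarantined" tests/
```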

^ Add meaningful description above

Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

amoghrajesh (Contributor) left a comment


Hmm, interesting issue.

We might not need to fix it from the Airflow side at all! I don't understand a lot of the background, but the idea of the fix looks good in general +1

potiuk (Member, Author) commented Jul 23, 2023

Hey there. This nasty "dask executor" test has been extremely flaky recently with a database deadlock.

#32778

It will need some detailed investigation. It would be great if someone took a deeeeeeeep look at the problem, as this one is pretty nasty and next to impossible to reproduce locally (I have not managed to do it yet).

Next week I am travelling around Slovakia with my wife (not yet a vacation, but as close to it as it can get - that one is coming in Halifax in September/October, after the Summit and before the Community over Code conference). So I will not have time to take a look at this next week. That is also a note for those who might be counting on my usual responsiveness in the coming week - "DON'T :)".

Next week my top priority will be to make sure that the provider separation/configuration stuff does not delay Airflow 2.7. I think I already have all the PRs needed for that up and "green-ish" (with the exception that almost every time I run them with "full tests needed", one of the Postgres tests fails with this nasty deadlock).

So instead of investigating and trying to fix that one, I decided to isolate the test and make it less disruptive. In order to "isolate" the test I revived our old "quarantine" mechanism that we used to isolate those kinds of tests until they get proper treatment - it was useful when we had quite a number of them ~2 years ago, but then we had a streak where we investigated and solved almost all of the "regular" flaky tests, so the mechanism was somewhat forgotten. Now is the right time to bring it back and polish it a bit.

I refreshed it a bit and made it more complete - the "quarantined" tests now run sequentially for all 4 backends we have and only on self-hosted runners - so they will only be visible for committers and in main builds. @o-nikolas, we have recently been discussing the tests that rely on "timing" - this is exactly the mechanism we can use if we see any of our timing tests become flaky. We mark the test with @pytest.mark.quarantined (a sketch follows the list below) and it will run in a perfect environment:

  • on self-hosted runners
  • in isolation (everything runs sequentially)
  • even if it fails, you will see it in the logs, but it will not fail the whole build, will not block constraints from updating, and generally will not disrupt the "green" status of builds (which is good if we want to learn to rely on "green" being the indicator of "can be merged")
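A minimal sketch of what that marking looks like - the test name mirrors the one quarantined in this PR, but the body is a placeholder rather than the real test code:

```python
# Illustrative placeholder only - the real test lives in the Airflow test suite.
import pytest


@pytest.mark.quarantined
def test_backfill_integration():
    """Flaky Dask-executor backfill test, now run only by the quarantine job."""
    ...
```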

Most likely, when the tests run in isolation they will not fail - usually those flaky tests are caused by some side effects or by running in a "busy" environment in parallel. We might not even have to fix it eventually - maybe just running in isolation is the final solution - but I think this one might be an indication of some REAL issue we have with deadlocks when backfilling, so it might be worth investigating and fixing eventually.

This one is difficult to reproduce locally, but relatively "easy" to reproduce by a committer whose tests are run on self-hosted runners:

  • remove the quarantined marker
  • make a PR with "full tests needed"
  • likely one of the jobs will fail - if not, re-run all the jobs - it will fail rather quickly

So if you think you have found a fix, just closing/reopening the PR with the marker removed and "full tests needed" should be a good way to confirm whether things are fixed.

potiuk force-pushed the quarantine-test-backfill-job branch 3 times, most recently from c64bd81 to 3c19298 on July 23, 2023 07:56
potiuk force-pushed the quarantine-test-backfill-job branch from 3c19298 to a4fdf72 on July 23, 2023 08:41
potiuk (Member, Author) commented Jul 23, 2023

OK. Some "better" news - it seems that the test fails often enough and does not require parallelism to trigger this problem: see https://github.com/apache/airflow/actions/runs/5635480217/job/15266574775?pr=32780#step:5:538

potiuk merged commit 685328e into apache:main on Jul 23, 2023
potiuk deleted the quarantine-test-backfill-job branch on July 23, 2023 19:56
Labels: area:dev-tools, area:Scheduler including HA (high availability) scheduler