Flaky dask backfill test in quarantine #32778
Comments
The test has recently been failing with a deadlock (see apache#32778) and needs a thorough look if we want to find the root cause and a remedy. In the meantime it looks like a niche case connected with the Dask executor that is rather obscure and that we have no expertise in diagnosing or solving, so it might be a long time until the problem is understood (and maybe we even decide not to care about it and let the Dask community either fix or ignore it). We aim to keep the number of quarantined tests very low (currently we have one, and we have not run it for a while because it was a MySQL test run on Postgres), but we now have the opportunity to also improve the quarantined-tests framework. This test will be run together with the other (1) quarantined test, and:

* they will not be run in our regular tests
* they are run sequentially, not in parallel with all other tests
* they are run for all 4 backends, but only for the default versions of those backends
* failure of the quarantined tests will not cause failure of the whole job or prevent constraints from being generated and updated
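For illustration, here is a minimal, hypothetical sketch of how such a test can be tagged for the quarantined job. The marker name mirrors the `quarantined` pytest marker used in the Airflow test suite, but the test body and the backend marker usage are placeholders, not the real test code.

```python
# Hypothetical sketch: mark a flaky test so it is collected only by the
# sequential "Quarantined" job and skipped in the regular parallel test runs.
import pytest


@pytest.mark.quarantined  # excluded from regular runs, executed in the quarantined job
@pytest.mark.backend("postgres")  # illustrative: run only against the Postgres backend
def test_backfill_integration():
    # The real test exercises a backfill through the Dask executor; body omitted here.
    ...
```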
More information: after separating it out into a separate job that runs sequentially, it seems that it fails much more often, even in isolation.
cc: @ashb @uranusjr @hussein-awala @dstandish @o-nikolas @vandonr-amz @vincbeck - I have reason to believe this has been caused by some of the recent changes in setup/teardown or some recent optimisations in the core of scheduling (this is a bit of a wild guess of course, but I think it only started to appear relatively recently, about a week ago). I seriously doubt it has anything to do with Dask itself; it is merely triggered by Dask because of its distributed nature, which can run tasks in a much more parallel and distributed fashion. It occurs exclusively on Postgres, and I think I ONLY saw it with Postgres 11. It looks very much like a REAL problem to diagnose and solve before we hit 2.7.0 - if we hit it that often in tests with a single backfill, then in production it might happen very, very quickly. As written in #32780 I will be less available the coming week, so it would be great if someone could take a look. In the meantime, merging #32780 should mitigate the failures we have in main and isolate that test until we fix it.
The fact that this deadlock happens on Postgres indicates that it is a "real" issue. Unlike MySQL, Postgres is rather good at detecting real deadlocks and does a great job of avoiding them, so this is most likely a genuine one. I will also modify our CI build to always upload the logs from the containers after the quarantined job. That might help with diagnosing the issue, as we will see more detail from the Postgres server on where the deadlock happened.
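As a complement to the container logs, a small, hypothetical diagnostic like the sketch below could be run against the test database while the backfill executes to see which sessions are blocked and by whom. The connection DSN is a placeholder, and `pg_blocking_pids` requires Postgres 9.6 or newer.

```python
# Illustrative helper (not part of Airflow): list Postgres sessions that are
# currently waiting on locks, together with the backends blocking them.
import psycopg2  # assumes the psycopg2 driver is installed

BLOCKED_SESSIONS_SQL = """
SELECT pid, pg_blocking_pids(pid) AS blocked_by, state, query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;
"""

# Placeholder DSN - point it at the database the test run is using.
with psycopg2.connect("dbname=airflow user=postgres host=localhost") as conn:
    with conn.cursor() as cur:
        cur.execute(BLOCKED_SESSIONS_SQL)
        for pid, blocked_by, state, query in cur.fetchall():
            print(f"pid={pid} blocked_by={blocked_by} state={state} query={query!r}")
```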
…#32775)

* Quarantine test_backfill_integration in dask executor - same rationale as the quarantine note above (the test has recently been failing with a deadlock, see #32778).
* Add pre-Airflow-2-7 hardcoded defaults for config for older providers. During thorough testing and review of moving configuration to providers, I realised there was a case that was not handled properly. Some providers and DAGs could rely on certain default values being available, but when we move them out of core and an older version of a provider is used, those defaults are not available: they were removed as defaults in core, and the old providers do not have a "config" section to contribute them. This would be a breaking change and old providers (Celery, K8s) could fail, as happened in some tests. This PR implements a solution that also allows removing some manual fallbacks in the Celery and Kubernetes executor code: add a hard-coded "pre-2.7" configuration which only contains the provider-related pre-2.7 defaults and make it the fallback when a value is neither set nor contributed as a default by a provider. We do not have to maintain those values - they are effectively "frozen" at what was available just before 2.7. A nice side effect is that a number of fallbacks can be removed, because this hard-coded configuration becomes the fallback automatically. That entirely solves the case where you want to install older providers on 2.7, where config.yml does not contain those provider values.
* Update airflow/configuration.py

Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>
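To make the described fallback chain concrete, here is a small illustrative sketch. It is not the actual airflow.configuration implementation, and the option names and frozen values in it are placeholders rather than verified defaults.

```python
# Illustrative sketch of the layered lookup described above: an explicitly set
# value wins, then defaults contributed by installed providers, then the frozen
# "pre-2.7" hard-coded defaults. Names and values here are placeholders.
from typing import Dict, Optional, Tuple

ConfigKey = Tuple[str, str]  # (section, option)

# Hypothetical frozen snapshot of defaults that lived in core before 2.7.
PRE_2_7_DEFAULTS: Dict[ConfigKey, str] = {
    ("celery", "worker_concurrency"): "16",
    ("kubernetes_executor", "namespace"): "default",
}


def get_config_value(
    section: str,
    option: str,
    configured: Dict[ConfigKey, str],
    provider_defaults: Dict[ConfigKey, str],
) -> Optional[str]:
    """Resolve a value through the fallback layers; return None if nowhere defined."""
    for layer in (configured, provider_defaults, PRE_2_7_DEFAULTS):
        if (section, option) in layer:
            return layer[(section, option)]
    return None
```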
I have very little free work time this week; I can try taking a closer look next week if there is no movement by then.
The dask_executor backfill tests have recently started to fail more often with a backfill exception. The likely cause is that execution is now better parallelised, so triggering the deadlock caused by contention between the dag_run state update and the task state update has become much easier. While this PR does not fix the underlying issue, it catches the operational error when the deadlock occurs during the backfill and rolls back the operation. This **should** be safe: backfill has a built-in mechanism to loop and retry failed tasks, and the test passed multiple times, completing the backfill, after this fix was applied. It was not easy to reproduce locally, but it failed roughly once every 20-30 runs. When extra logging was added, the failure was always connected to an OperationalError raised (and caught) right after _per_task_process. The same exception was observed a few times after the rollback was added, and despite it the backfill job retried and completed the process successfully every time. We also keep the logs with the exceptions and add reassuring messages that should make it clear that, if the backfill completes, the exceptions can be ignored, as the updates will be retried by the backfill job. Fixes: apache#32778
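For clarity, a hedged sketch of the mitigation described above is shown below. The names `_per_task_process`, the session object, and the log message mirror the description in this thread; this is not the exact code merged into Airflow.

```python
# Sketch: tolerate a DB deadlock (surfaced by SQLAlchemy as OperationalError)
# around the per-task processing step of the backfill loop, roll the session
# back, and rely on the backfill's own retry loop to pick the task up again.
import logging

from sqlalchemy.exc import OperationalError

log = logging.getLogger(__name__)


def run_per_task_process_with_rollback(per_task_process, ti, session):
    try:
        per_task_process(ti, session=session)
    except OperationalError:
        # Most likely a deadlock between the dag_run state update and the
        # task instance state update; roll back and let backfill retry.
        log.exception(
            "Transient OperationalError (likely a deadlock) while processing %s; "
            "rolling back. If the backfill eventually completes, this error can "
            "be safely ignored - the update will be retried.",
            ti,
        )
        session.rollback()
```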
The same fix as above, cherry picked from commit f616ee8. Fixes: #32778
The test is still flaky - reopening it and marking as quarantined for now.
Since we removed the Daskexecutor provider, this can be closed.
Body
We have recently started to observe a very flaky
tests/executors/test_dask_executor.py::TestDaskExecutor::test_backfill_integration
test - especially the Python 3.8 + Postgres 11 combination seems to trigger it easily, but not always. Example of failure here:
https://github.com/apache/airflow/actions/runs/5632434844/job/15260418883?pr=32776
Example errors (collapsed log details omitted): the job eventually fails.
Would be great to track it down.
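For anyone trying to reproduce this locally, a hypothetical helper such as the one below can run the single test repeatedly until it fails (one of the comments above suggests it failed roughly once every 20-30 runs). It assumes a working Airflow development environment with the Dask executor extras installed and a Postgres backend configured.

```python
# Hypothetical repro loop: run the flaky test until the first failure or until
# the attempt budget is exhausted. Adjust MAX_ATTEMPTS as needed.
import sys

import pytest

TEST_ID = (
    "tests/executors/test_dask_executor.py"
    "::TestDaskExecutor::test_backfill_integration"
)
MAX_ATTEMPTS = 50

for attempt in range(1, MAX_ATTEMPTS + 1):
    exit_code = pytest.main(["-x", TEST_ID])
    if exit_code != 0:
        print(f"Test failed on attempt {attempt} (pytest exit code {exit_code})")
        sys.exit(int(exit_code))

print(f"No failure in {MAX_ATTEMPTS} attempts")
```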