Workaround failing deadlock when running backfill #32991
Conversation
Force-pushed f1ac81f to f7e9f89
Force-pushed 7ddd389 to c9c37e1
Hey everyone - I am not very proud of this PR, but I am cautiously optimistic that it works around the cursed deadlock that started to show up in test_backfill_job of DaskExecutor. I believe it was a real issue, not a test problem, and I think the workaround I came up with should also handle similar "production" issues when they happen. It is not dissimilar to other fixes for the backfill job causing deadlocks (you can see similar workarounds in other places). Generally speaking, I was able to reproduce it locally (it happens once every 30-50 runs or so), so it is definitely not a "side-effect" of another test - I ran this one individually with the --with-db-init flag and still (with difficulty) was able to produce the deadlock. When I added some debugging, it turned out that the deadlock in "commit()" in the outer loop was actually caused by a deadlock exception raised in the inner loop - right after _per_task_process.
So we had not seen the original place where the deadlock occurred - because we did not log that OperationalError in the inner loop, and the deadlock was reported again when the outer process (not even the outer loop) tried to run the commit for the already-deadlocked session, at the end of the backfill job. It looks like the deadlock is caused by the lack of "for update" on the DagRun in the backfill job, while the backfill job does run "for update" on the task instance. So if the scheduler holds the dag_run "for update" and wants to update the task_instance, while the backfill job holds the task_instance "for update" and tries to update the dag_run, the two deadlock. BTW, possibly this is an inherent problem with the backfill job. Maybe this is really expected, and maybe the workaround is the best we can do? I gathered server-side logging where I could see both sides of the deadlock:
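In other words, the two sides look roughly like this - a minimal sketch with assumed table and column names and an assumed connection string, not the actual Airflow queries or the server log:

```python
# Sketch of the conflicting lock order described above (assumptions: table and
# column names, local Postgres DSN). Run the two functions concurrently on the
# same rows and Postgres will detect a deadlock on one of them.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://airflow:airflow@localhost/airflow")  # assumed DSN

def scheduler_side(run_id: str, task_id: str) -> None:
    with engine.begin() as conn:
        # the scheduler holds the dag_run row "for update" ...
        conn.execute(text("SELECT id FROM dag_run WHERE run_id = :r FOR UPDATE"), {"r": run_id})
        # ... and then wants to update the task instance
        conn.execute(
            text("UPDATE task_instance SET state = 'scheduled' WHERE task_id = :t"), {"t": task_id}
        )

def backfill_side(run_id: str, task_id: str) -> None:
    with engine.begin() as conn:
        # the backfill job holds the task_instance row "for update" ...
        conn.execute(
            text("SELECT task_id FROM task_instance WHERE task_id = :t FOR UPDATE"), {"t": task_id}
        )
        # ... but never locked the dag_run, which it now tries to update
        conn.execute(text("UPDATE dag_run SET state = 'running' WHERE run_id = :r"), {"r": run_id})
```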
I have a few more examples of server logs from reproducing it locally: https://gist.github.com/potiuk/6bd2d855a01ed3e8877a30cefcec01b3 I added a bit more logging and I specifically call rollback() when the OperationalError is caught. It was far more often reproducible on CI (often 1 or 2 Postgres jobs failed with it), so I will now close/reopen this PR a few times tomorrow to see if I can make the deadlock happen in CI with this fix. I'd love to hear more from those who are better experts than me on this one. Maybe collectively we can solve the puzzle better than just via a workaround.
Force-pushed bb0f94a to e0b1c35
So far I have run it ~7 times with "full tests needed", and not a single backfill_job deadlock occurred.
Force-pushed e0b1c35 to aa8eab7
Force-pushed aa8eab7 to f63131f
Happened once: https://github.com/apache/airflow/actions/runs/5776982503/job/15656723055?pr=32991 - 3 repeats might not be enough?
Force-pushed f63131f to a32376f
Force-pushed 8f8803b to e9d85c7
How about we merge that one?
Force-pushed a31a13c to de5c147
The dask_executor backfill tests started to fail more often recently due to a backfill deadlock. The likely cause is that execution is now better parallelised, so triggering the deadlock caused by contention between dag_run state updates and task state updates has become much easier. While this PR does not fix the underlying issue, it catches the operational error raised when the deadlock occurs during the backfill and rolls back the operation. This **should** be safe: backfill has a built-in mechanism to loop and retry failed tasks, and the test passed multiple times, completing the backfill, after this fix was applied. It was not easy to reproduce locally, but it failed roughly once every 20-30 runs. With extra logging added, it was always connected to an OperationalError raised (and caught) right after _per_task_process. The same exception was observed a few times after the rollback was added, and despite that the backfill job retried and completed the process successfully every time. We also keep the logs with the exceptions and add reassuring messages that make it clear that, if the backfill completes, the exceptions can be ignored, as the updates will be retried by the backfill job. Fixes: apache#32778
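Schematically, the workaround amounts to something like this - a sketch only, assuming a `session` and the existing `_per_task_process` helper (with a simplified signature); the real change in the backfill job differs in detail:

```python
import logging

from sqlalchemy.exc import OperationalError

log = logging.getLogger(__name__)

def _process_one_task(ti, session):
    """Run a single backfill step, tolerating DB deadlocks (sketch, not the real code)."""
    try:
        _per_task_process(ti, session=session)  # existing inner per-task step (assumed signature)
        session.commit()
    except OperationalError:
        # Most likely the dag_run/task_instance deadlock described above.
        log.exception(
            "Error while processing %s. If the backfill completes successfully, this "
            "exception can be ignored - the update will be retried by the backfill loop.",
            ti,
        )
        # Roll back instead of re-raising; the outer backfill loop retries the task.
        session.rollback()
```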
Force-pushed de5c147 to 048c9b9
Any particular reason this doesn’t use …?
The only reason is that I added those retries incrementally, trying different things. But yeah, good idea - I can actually extract the inner code into a method and decorate it; that would be way cleaner. Let me do it.
I looked at it - it would make it awfully inconsistent. The backfill job is in VERY BAD need of refactoring, and it already uses the try/except/loop pattern; introducing the decorator would make it look even worse than what we have now (we would have to add yet another internal method inside a 5-level-indented, 800-line method). That would be a bit risky to do now, as we want to cherry-pick this one to 2.7, and it would be difficult to solve any problems it might cause if we made a mistake of some sort. I think I will leave refactoring the backfill job as a "post-2.7.0" task. I created an issue for it: #33249, and maybe we can engage a few people to collaborate on that. I personally think the backfill job is one of the most neglected parts of Airflow, and maybe for 2.8 we could put it back in order: easier to understand and reason about.
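For reference, the decorator-based variant discussed here would look roughly like the sketch below; `retry_on_deadlock` is a hypothetical helper written for illustration, not an existing Airflow utility, and it is not what the PR ended up doing:

```python
import functools
import logging
import time

from sqlalchemy.exc import OperationalError

log = logging.getLogger(__name__)

def retry_on_deadlock(retries: int = 3, wait_seconds: float = 1.0):
    """Retry a function taking a `session` kwarg when the DB reports a deadlock (hypothetical)."""
    def decorator(func):
        @functools.wraps(func)
        def wrapped(*args, session, **kwargs):
            for attempt in range(1, retries + 1):
                try:
                    return func(*args, session=session, **kwargs)
                except OperationalError:
                    # Roll back the failed transaction before retrying or giving up.
                    session.rollback()
                    log.warning("Deadlock in %s (attempt %d/%d)", func.__name__, attempt, retries)
                    if attempt == retries:
                        raise
                    time.sleep(wait_seconds)
        return wrapped
    return decorator
```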
(cherry picked from commit f616ee8)
The dask_executor backfill tests started to fail more often recently
due to a backfill deadlock, and the likely cause is that execution is
now better parallelised, so triggering the deadlock caused by
contention between dag_run state updates and task state updates has
become much easier.
While this PR does not fix the underlying issue, it catches the
operational error raised when the deadlock occurs during the backfill
and rolls back the operation.
This should be safe: backfill has a built-in mechanism to loop and
retry failed tasks, and the test passed multiple times, completing the
backfill, after this fix was applied. It was not easy to reproduce
locally, but it failed roughly once every 20-30 runs. With extra
logging added, it was always connected to an OperationalError raised
(and caught) right after _per_task_process. The same exception was
observed a few times after the rollback was added, and despite that
the backfill job retried and completed the process successfully every
time. We also keep the logs with the exceptions and add reassuring
messages that make it clear that, if the backfill completes, the
exceptions can be ignored, as the updates will be retried by the
backfill job.
Fixes: #32778