
Workaround failing deadlock when running backfill #32991

Merged
merged 1 commit into apache:main from chase-flaky-quarantine-test on Aug 9, 2023

Conversation

potiuk
Member

@potiuk potiuk commented Aug 1, 2023

The dask_executor backfill tests recently started to fail more often with a backfill exception. The likely cause is that execution is now parallelised better, so triggering deadlocks caused by contention between the dag_run state update and the task state update has become much easier.

While this PR does not fix the underlying issue, it catches the operational error raised when the deadlock occurs during the backfill and rolls back the operation.

This should be safe: backfill has a built-in mechanism to loop and retry failed tasks, and after this fix was applied the test passed multiple times, completing the backfill. It was not easy to reproduce locally, but it failed roughly once every 20-30 runs. When extra logging was added, the failure was always connected to an OperationalError raised (and caught) right after _per_task_process. The same exception was observed a few times after the rollback was added, and despite it the backfill job retried and completed the process successfully every time. We also keep the logs with the exceptions and add reassuring messages that should make it clear that, if the backfill completes, the exceptions can be ignored, as the updates will be retried by the backfill job.

Fixes: #32778
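
A minimal sketch of the shape of the workaround (the names other than OperationalError are hypothetical stand-ins for the backfill job's inner task loop):

    # Sketch only: on a deadlock, roll back and let the backfill loop
    # pick the task up again on its next iteration.
    from sqlalchemy.exc import OperationalError

    def process_backfill_task(per_task_process, ti, session, log):
        try:
            per_task_process(ti, session)  # stand-in for the real inner-loop call
            session.commit()
        except OperationalError:
            log.exception(
                "Deadlock while processing %s; rolling back. If the backfill "
                "completes, this error can be ignored - the update will be "
                "retried by the backfill job.",
                ti,
            )
            session.rollback()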



@potiuk potiuk force-pushed the chase-flaky-quarantine-test branch 5 times, most recently from f1ac81f to f7e9f89 Compare August 3, 2023 16:33
@potiuk potiuk changed the title from "Attempt to get logs from failed flaky bakfill test" to "Attempt to get logs from failed flaky backfill test" Aug 3, 2023
@potiuk potiuk added the "full tests needed" label (We need to run full set of tests for this PR to merge) Aug 3, 2023
@potiuk potiuk closed this Aug 3, 2023
@potiuk potiuk reopened this Aug 3, 2023
@potiuk potiuk force-pushed the chase-flaky-quarantine-test branch 4 times, most recently from 7ddd389 to c9c37e1 Compare August 6, 2023 00:41
@potiuk potiuk changed the title from "Attempt to get logs from failed flaky backfill test" to "Workaround failing dedlock when running backfill" Aug 6, 2023
@potiuk potiuk marked this pull request as ready for review August 6, 2023 00:43
@potiuk
Member Author

potiuk commented Aug 6, 2023

Hey everyone - I am not very proud of this PR, but I am cautiously optimistic that it is likely working around the cursed deadlock that started to show up with test_backfill_job of DaskExecutor.

I believe it was a real issue, not a test problem. And I think the workaround I came up with should also handle similar "production" issues when they happen.

It is similar to other fixes for deadlocks caused by the backfill job (you can see comparable workarounds in other places).

Generally speaking, I was able to reproduce it locally (it happens once every 30-50 runs or so), so it is definitely not a "side-effect" of another test. I ran this one individually with the --with-db-init flag and still (with difficulty) was able to produce the deadlock. When I added some debugging, it turned out that the deadlock in commit() in the outer loop was actually caused by a deadlock exception raised in the inner loop, right after _per_task_process; the exception was propagated up, and then the finally clause around line ~960 attempted to commit again:

        finally:
            session.commit()
            executor.end()

So we had not seen the original place where the deadlock occurred, because we did not log that OperationalError in the inner loop, and the deadlock was reported again when the outer process (not even the loop) tried to run the commit for the already deadlocked session, at the end of the backfill job.

It looks like the deadlock is caused by the lack of "for update" on the DagRun in the backfill job, while the backfill job does run "for update" on the task instance. So if the scheduler happens to be updating task_instance while keeping the dag_run locked "for update", and the backfill job tries to update the dag_run while holding the task instance lock, the two deadlock.
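
For illustration, taking such a lock would look roughly like this (just a sketch with made-up names, not the actual backfill code):

    # Hypothetical sketch: lock the dag_run row before touching task instances,
    # so both sides acquire locks in the same order instead of deadlocking.
    from airflow.models import DagRun

    def lock_dag_run(session, dag_id, run_id):
        # SELECT ... FOR UPDATE blocks until a concurrent holder (e.g. the
        # scheduler) releases the row, rather than failing later with a deadlock.
        return (
            session.query(DagRun)
            .filter(DagRun.dag_id == dag_id, DagRun.run_id == run_id)
            .with_for_update()
            .one_or_none()
        )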

BTW, possibly this is an inherent problem with the backfill job. Maybe this is really expected, and maybe the workaround is the best we can do?

I gathered server-side logging where I could see both sides of the deadlock:

	Process 11417 waits for ShareLock on transaction 70151; blocked by process 11337.

        # !!!! This is what backfill job does !!!

	Process 11337: UPDATE dag_run SET last_scheduling_decision=NULL, updated_at='2023-08-05T23:07:39.040007+00:00'::timestamptz WHERE dag_run.id = 1

        # !!!!! This is what I presume Scheduler does !!!! 

	Process 11417: UPDATE task_instance SET start_date='2023-08-05T23:07:42.249709+00:00'::timestamptz, end_date='2023-08-05T23:07:42.249709+00:00'::timestamptz, duration=0.0, state='skipped', updated_at='2023-08-05T23:07:42.253235+00:00'::timestamptz WHERE task_instance.dag_id = 'example_bash_operator' AND task_instance.task_id = 'run_this_last' AND task_instance.run_id = 'backfill__2017-01-01T00:00:00+00:00' AND task_instance.map_index =  -1

I have a few more examples of server logs where I was able to reproduce it locally: https://gist.github.com/potiuk/6bd2d855a01ed3e8877a30cefcec01b3

I added a bit more logging, and I specifically call rollback() in the inner loop, without re-raising the error. This should not be a problem because the backfill job runs in a loop and will pick up the task again next time. So the most harm that might happen is that the task that caused the deadlock will get slightly delayed.

It was far more often reproducible on CI (often 1 or 2 postgres jobs failed with it), so I will now close/reopen this PR a few times tomorrow to see if I can make the deadlock happen in CI with this fix.

I'd love to hear more from those who are better experts than me on this one. Maybe collectively we can solve the puzzle better than just via a workaround.

@potiuk potiuk added this to the Airflow 2.7.0 milestone Aug 6, 2023
@potiuk potiuk force-pushed the chase-flaky-quarantine-test branch from bb0f94a to e0b1c35 Compare August 6, 2023 12:04
@potiuk
Member Author

potiuk commented Aug 6, 2023

So far I have run it ~7 times with full tests needed, and not a single backfill_job deadlock has occurred.

@potiuk potiuk force-pushed the chase-flaky-quarantine-test branch from e0b1c35 to aa8eab7 Compare August 6, 2023 12:46
@potiuk potiuk changed the title from "Workaround failing dedlock when running backfill" to "Workaround failing deadlock when running backfill" Aug 6, 2023
@potiuk potiuk force-pushed the chase-flaky-quarantine-test branch from aa8eab7 to f63131f Compare August 6, 2023 13:51
@potiuk
Member Author

potiuk commented Aug 6, 2023

@potiuk potiuk force-pushed the chase-flaky-quarantine-test branch from f63131f to a32376f Compare August 6, 2023 14:31
@potiuk potiuk closed this Aug 6, 2023
@potiuk potiuk reopened this Aug 6, 2023
@potiuk potiuk force-pushed the chase-flaky-quarantine-test branch 3 times, most recently from 8f8803b to e9d85c7 Compare August 8, 2023 08:33
@potiuk
Member Author

potiuk commented Aug 8, 2023

How about we merge this one?

  • it looks stable (increased max_attempts to 5)
  • it unquarantines the failing backfill job
  • it gives a better user experience than "failing with deadlock" when the deadlock is apparently recoverable

@potiuk potiuk force-pushed the chase-flaky-quarantine-test branch 2 times, most recently from a31a13c to de5c147 Compare August 8, 2023 23:55
@potiuk potiuk force-pushed the chase-flaky-quarantine-test branch from de5c147 to 048c9b9 Compare August 8, 2023 23:56
@uranusjr
Member

uranusjr commented Aug 9, 2023

Any particular reason this doesn’t use @retry_db_transaction?

@potiuk
Member Author

potiuk commented Aug 9, 2023

Any particular reason this doesn’t use @retry_db_transaction?

The only reason is that I added those retries incrementally, trying different things. But yeah, good idea. I can actually extract the inner code into a method and decorate it; that would be way cleaner. Let me do it.
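
Roughly, the idea would be something like this (a sketch only; retry_db_transaction is assumed to be the existing decorator from airflow.utils.retries, and the other names are hypothetical):

    # Sketch of the decorator approach: extract the inner-loop body into a
    # helper and let retry_db_transaction retry it on OperationalError.
    from airflow.utils.retries import retry_db_transaction

    @retry_db_transaction
    def commit_task_state_change(per_task_process, ti, *, session):
        per_task_process(ti, session)
        session.commit()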

@potiuk
Member Author

potiuk commented Aug 9, 2023

I looked at it; it would make things awfully inconsistent. The backfill job is in VERY BAD need of refactoring, and it already uses the try/except/loop pattern. If I introduce the decorator, it would look even worse than what we have now (we would have to add yet another internal method to a 5-level-indented, 800-line method).

That would be a bit risky to do now, as we want to cherry-pick this one to 2.7, and it would be difficult to solve any problems it might cause if we make a mistake of some sort.

I think I will leave refactoring the backfill job as "post-2.7.0". I created a task to do so: #33249, and maybe we can engage a few people to collaborate on that. I personally think the backfill job is one of the most neglected parts of Airflow, and maybe for 2.8 we could have it back in order, easier to understand and reason about.

@potiuk potiuk merged commit f616ee8 into apache:main Aug 9, 2023
61 checks passed
@potiuk potiuk deleted the chase-flaky-quarantine-test branch August 9, 2023 11:00
@ephraimbuddy ephraimbuddy added the "type:improvement" label (Changelog: Improvements) Aug 9, 2023
ephraimbuddy pushed a commit that referenced this pull request Aug 9, 2023
(cherry picked from commit f616ee8)
Successfully merging this pull request may close these issues.

Flaky dask backfill test in quarantine
4 participants