Fix DAG processor crash on MySQL connection failure during import error recording #59167
base: main
Conversation
…or recording
The DAG processor was crashing when MySQL connection failures occurred while recording DAG import errors to the database. The root cause was missing session.rollback() calls after caught exceptions, leaving the SQLAlchemy session in an invalid state. When session.flush() was subsequently called, it would raise a new exception that wasn't caught, causing the DAG processor to crash and enter restart loops.
This issue was observed in production environments where the DAG processor would restart 1,259 times in 4 days (~13 restarts/hour), leading to:
- Connection pool exhaustion
- Cascading failures across Airflow components
- Import errors not being recorded in the UI
- System instability
Changes:
- Add session.rollback() after caught exceptions in _update_import_errors()
- Add session.rollback() after caught exceptions in _update_dag_warnings()
- Wrap session.flush() in try-except with session.rollback() on failure
- Add comprehensive unit tests for all failure scenarios
- Update comments to clarify error handling behavior
The fix ensures the DAG processor gracefully handles database connection failures and continues processing other DAGs instead of crashing.
Thanks. Nice one.
From https://docs.sqlalchemy.org/en/20/orm/session_basics.html#flushing
Maybe the root cause of the MySQL connection failure is #56879
Totally agree @wjddn279
Yeah. Worth fixing it with gc freezing I think.
Do we know if the
Fix DAG processor crash on MySQL connection failure during import error recording
Fixes #59166
The DAG processor was crashing when MySQL connection failures occurred while
recording DAG import errors to the database. The root cause was missing
session.rollback() calls after caught exceptions, leaving the SQLAlchemy
session in an invalid state. When session.flush() was subsequently called,
it would raise a new exception that wasn't caught, causing the DAG processor
to crash and enter restart loops.
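The failure mode described above can be illustrated with a minimal sketch. It does not use SQLAlchemy or the actual Airflow code: `FakeSession` and `record_import_errors` are hypothetical stand-ins, and the fake session only imitates the behaviour of refusing further work until `rollback()` is called.

```python
import logging

log = logging.getLogger(__name__)


class FakeSession:
    """Hypothetical stand-in for a database session (not the SQLAlchemy class).

    After a failed operation it refuses further work until rollback() is
    called, imitating the "invalid state" described in the PR.
    """

    def __init__(self):
        self.needs_rollback = False
        self.rollback_calls = 0

    def execute(self, fail=False):
        if self.needs_rollback:
            raise RuntimeError("session is in an invalid state; rollback required")
        if fail:
            self.needs_rollback = True
            raise ConnectionError("simulated lost MySQL connection")

    def flush(self):
        if self.needs_rollback:
            raise RuntimeError("session is in an invalid state; rollback required")

    def rollback(self):
        self.needs_rollback = False
        self.rollback_calls += 1


def record_import_errors(session, fail=False):
    """Hypothetical helper showing the pattern: roll back after every caught error."""
    try:
        session.execute(fail=fail)
    except Exception:
        log.exception("Error recording import errors")
        session.rollback()  # without this, the flush below would raise as well

    try:
        session.flush()
    except Exception:
        log.exception("Error flushing session")
        session.rollback()
```

Calling `record_import_errors(FakeSession(), fail=True)` returns normally in this sketch. Removing the first `session.rollback()` makes the subsequent `flush()` raise, which is the uncaught second exception the description refers to.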
This issue was observed in production environments where the DAG processor
would restart 1,259 times in 4 days (~13 restarts/hour), leading to:
- Connection pool exhaustion
- Cascading failures across Airflow components
- Import errors not being recorded in the UI
- System instability
Changes
- Add session.rollback() after caught exceptions in _update_import_errors()
- Add session.rollback() after caught exceptions in _update_dag_warnings()
- Wrap session.flush() in try-except with session.rollback() on failure
Testing
Added 5 new unit tests in TestDagProcessorCrashFix class:
- test_update_dag_parsing_results_handles_db_failure_gracefully
- test_update_dag_parsing_results_handles_dag_warnings_db_failure_gracefully
- test_update_dag_parsing_results_handles_session_flush_failure_gracefully
- test_session_rollback_called_on_import_errors_failure
- test_session_rollback_called_on_dag_warnings_failure
All tests pass and verify that:
- session.rollback() is called correctly on failures
Impact
The fix ensures the DAG processor gracefully handles database connection
failures and continues processing other DAGs instead of crashing, preventing
production outages from restart loops.
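The rollback checks listed under Testing could look roughly like the sketch below. It is not the PR's test code: `update_import_errors` is a hypothetical simplified stand-in for the patched helper, and the session is a `unittest.mock.MagicMock` rather than a real SQLAlchemy session.

```python
from unittest import mock


def update_import_errors(session):
    """Hypothetical simplified stand-in for the patched helper."""
    try:
        session.execute("SELECT 1")
    except Exception:
        # Restore the session to a usable state before continuing.
        session.rollback()


def test_session_rollback_called_on_failure():
    session = mock.MagicMock()
    session.execute.side_effect = ConnectionError("simulated connection failure")

    update_import_errors(session)  # must not raise

    session.rollback.assert_called_once()


def test_no_rollback_on_success():
    session = mock.MagicMock()

    update_import_errors(session)

    session.rollback.assert_not_called()
```

Mocking the session keeps such tests independent of a running MySQL instance, at the cost of not exercising real driver or connection pool behaviour.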