Fix scheduler heartbeat timeout failures with DetachedInstanceError
#53838
Merged
Resolves `DetachedInstanceError` when the scheduler processes task instances that have timed out during heartbeat detection. The error occurred when Pydantic validation of `TIRunContext` attempted to access the `consumed_asset_events` relationship on `DagRun` objects that had been detached from the SQLAlchemy session.

Root cause: The main scheduler loop calls `session.expunge_all()`, which detaches all objects from the session. Later, when processing heartbeat timeouts, the scheduler creates `TIRunContext` objects that trigger Pydantic validation of `dag_run.consumed_asset_events`, causing `DetachedInstanceError` on the lazy-loaded relationship.

Solution: Add `selectinload(DagRun.consumed_asset_events)` to the heartbeat timeout query to eagerly load the relationship before objects are detached. This minimal fix loads only the required relationship without over-eager loading of nested fields that aren't accessed during heartbeat processing.

The fix affects all DAG types since `consumed_asset_events` is initialized as an empty list on all `DagRun` objects, not just asset-triggered DAGs.

Longer term, using `back_populates` (with `lazy="selectin"`) might be better so we don't need to remember this:
https://docs.sqlalchemy.org/en/20/orm/queryguide/relationships.html
https://docs.sqlalchemy.org/en/20/orm/relationship_api.html#sqlalchemy.orm.relationship.params.back_populates
uranusjr approved these changes on Jul 28, 2025
ashb approved these changes on Jul 29, 2025
RoyLee1224 pushed a commit to RoyLee1224/airflow that referenced this pull request on Jul 31, 2025
ferruzzi pushed a commit to aws-mwaa/upstream-to-airflow that referenced this pull request on Aug 7, 2025
kaxil added a commit to astronomer/airflow that referenced this pull request on Aug 11, 2025
Similar to apache#53838 but prevents it for all queries needing `consumed_asset_events`. Instead of adding `.selectinload(DR.consumed_asset_events)` wherever needed, I am eagerly loading them now.

Changes:
- Add `lazy='selectin'` to `DagRun.consumed_asset_events` relationship for always-eager loading
- Changed `backref` to `back_populates` in `AssetEvent.created_dagruns` to enable explicit control

Why This Fix Works:
- Eliminates lazy loading entirely by pre-loading the relationship at the model level
- Prevents dependency on consistent session state in concurrent scheduler operations

Closes apache#54306
kaxil added a commit to astronomer/airflow that referenced this pull request on Aug 11, 2025
Similar to apache#53838 and an alternative to apache#54331. This is a more localized change and only eagerly loads for this specific instance.

Closes apache#54306
fweilun pushed a commit to fweilun/airflow that referenced this pull request on Aug 11, 2025
Resolves intermittent `DetachedInstanceError` when the scheduler processes task instances that have timed out during heartbeat detection. The error occurred when Pydantic validation of `TIRunContext` attempted to access the `consumed_asset_events` relationship on `DagRun` objects that had been detached from the SQLAlchemy session.

The Problem:
- The heartbeat timeout query loads task instances with `selectinload(TI.dag_run)` (missing `consumed_asset_events`)
- `session.expunge_all()` is called, detaching all objects from the session
- `TIRunContext` creation triggers Pydantic validation that accesses `dag_run.consumed_asset_events`
- The lazy load on the detached object fails with `DetachedInstanceError`

Key Evidence:
- Accessing `consumed_asset_events` on detached `DagRun` objects reliably reproduces the error

Why It's Intermittent:
- The failure depends on the timing between `session.expunge_all()` and subsequent object access

Solution
Add minimal eager loading with `selectinload(DagRun.consumed_asset_events)` to the heartbeat timeout query. This ensures the relationship is loaded before objects can be detached, eliminating the need for lazy loading.

Why This Fix Works:
- […] `TIRunContext`

Verification Steps for Reviewers
To verify the root cause and validate the fix, run these tests in an IPython shell:
Test 1: Verify DetachedInstanceError on expunged objects
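As an illustration of what Test 1 checks, here is a minimal, self-contained sketch rather than the Airflow-specific snippet. It assumes an in-memory SQLite database and two hypothetical stand-in models, `Run` (for `DagRun`) and `Event` (for `AssetEvent`), joined by a simple one-to-many relationship with the default lazy loading.

```python
from sqlalchemy import Column, ForeignKey, Integer, create_engine, select
from sqlalchemy.orm import Session, declarative_base, relationship
from sqlalchemy.orm.exc import DetachedInstanceError

Base = declarative_base()


class Run(Base):  # hypothetical stand-in for DagRun
    __tablename__ = "run"
    id = Column(Integer, primary_key=True)
    # Default lazy="select": the collection is only loaded on first access.
    events = relationship("Event")


class Event(Base):  # hypothetical stand-in for AssetEvent
    __tablename__ = "event"
    id = Column(Integer, primary_key=True)
    run_id = Column(Integer, ForeignKey("run.id"))


engine = create_engine("sqlite://")  # in-memory database, for illustration only
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add(Run(id=1))
    session.commit()

with Session(engine) as session:
    run = session.execute(select(Run)).scalar_one()
    session.expunge_all()  # detach every object, like the scheduler loop does
    try:
        run.events  # lazy load needs a session, but the object is detached
        outcome = "loaded"
    except DetachedInstanceError:
        outcome = "DetachedInstanceError"

print(outcome)
```

The relationship was never touched before `expunge_all()`, so the first access has to go back to the database and cannot.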
Expected Result: Should show `DetachedInstanceError` when accessing `consumed_asset_events` on the detached object.

Test 2: Verify the fix prevents the error
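A sketch of the same idea with the eager load in place, again using hypothetical stand-in models (`Task` for `TaskInstance`, `Run` for `DagRun`, `Event` for `AssetEvent`) and an in-memory SQLite database. The chained `selectinload` mirrors the shape of the fix described above; the real query in the scheduler is not reproduced here.

```python
from sqlalchemy import Column, ForeignKey, Integer, create_engine, select
from sqlalchemy.orm import Session, declarative_base, relationship, selectinload

Base = declarative_base()


class Run(Base):  # hypothetical stand-in for DagRun
    __tablename__ = "run"
    id = Column(Integer, primary_key=True)
    events = relationship("Event")


class Event(Base):  # hypothetical stand-in for AssetEvent
    __tablename__ = "event"
    id = Column(Integer, primary_key=True)
    run_id = Column(Integer, ForeignKey("run.id"))


class Task(Base):  # hypothetical stand-in for TaskInstance
    __tablename__ = "task"
    id = Column(Integer, primary_key=True)
    run_id = Column(Integer, ForeignKey("run.id"))
    run = relationship("Run")


engine = create_engine("sqlite://")  # in-memory database, for illustration only
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add(Task(id=1, run=Run(id=1)))
    session.commit()

with Session(engine) as session:
    # Eagerly load task -> run -> events while the session is still attached.
    stmt = select(Task).options(selectinload(Task.run).selectinload(Run.events))
    task = session.execute(stmt).scalar_one()
    session.expunge_all()
    events = task.run.events  # already populated, so no lazy load is attempted
    outcome = "SUCCESS"

print(outcome, events)
```

The run has no events in this sketch, which matches the note that the relationship matters even for runs that are not asset-triggered: the empty collection still has to be loaded before detachment.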
Expected Result: Should show `SUCCESS` because the relationship was eagerly loaded before detachment.

Test 3: Verify scoped session reuse (explains contamination mechanism)
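The session-reuse behaviour can be sketched with SQLAlchemy's `scoped_session` directly. This is an assumption-laden illustration: it builds its own registry over an in-memory SQLite engine rather than using Airflow's session setup, and the name `ScopedSession` is hypothetical.

```python
import threading

from sqlalchemy import create_engine
from sqlalchemy.orm import scoped_session, sessionmaker

engine = create_engine("sqlite://")  # in-memory database, for illustration only
ScopedSession = scoped_session(sessionmaker(bind=engine))

# Two calls from the same thread return the very same Session object.
first = ScopedSession()
second = ScopedSession()
same_in_thread = first is second

# A different thread gets its own Session from the registry.
from_other_thread = []
worker = threading.Thread(target=lambda: from_other_thread.append(ScopedSession()))
worker.start()
worker.join()
same_across_threads = from_other_thread[0] is first

print(same_in_thread, same_across_threads)
ScopedSession.remove()  # discard the current thread's session
```

Because every caller in a thread shares one session, an `expunge_all()` issued in one code path also detaches objects that another code path in the same thread is still holding.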
Expected Result: Should show `True` for session reuse, confirming thread-local scoping that enables object contamination.

Testing Strategy
Why no new automated test added:
- […] `test_scheduler_passes_context_from_server_on_heartbeat_timeout`
- […] `session.expunge_all()` and object access across concurrent scheduler operations

Future Considerations
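Long-term architectural improvement: Migrate to `back_populates` with `lazy="selectin"` to eliminate this class of issues entirely. This would prevent similar `DetachedInstanceError` issues across the codebase by making the relationship always eagerly loaded.

References:
- https://docs.sqlalchemy.org/en/20/orm/queryguide/relationships.html
- https://docs.sqlalchemy.org/en/20/orm/relationship_api.html#sqlalchemy.orm.relationship.params.back_populates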
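The model-level approach might look roughly like the following sketch. The `Run` and `Event` classes are hypothetical stand-ins with a plain one-to-many relationship, not the actual `DagRun` and `AssetEvent` definitions; the point is only that `lazy="selectin"` on the relationship makes every query eager without per-query options.

```python
from sqlalchemy import Column, ForeignKey, Integer, create_engine, select
from sqlalchemy.orm import Session, declarative_base, relationship

Base = declarative_base()


class Run(Base):  # hypothetical stand-in for DagRun
    __tablename__ = "run"
    id = Column(Integer, primary_key=True)
    # Always loaded with a follow-up SELECT ... IN query whenever a Run is loaded.
    events = relationship("Event", back_populates="run", lazy="selectin")


class Event(Base):  # hypothetical stand-in for AssetEvent
    __tablename__ = "event"
    id = Column(Integer, primary_key=True)
    run_id = Column(Integer, ForeignKey("run.id"))
    # Explicit reverse side, declared with back_populates instead of backref.
    run = relationship("Run", back_populates="events")


engine = create_engine("sqlite://")  # in-memory database, for illustration only
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add(Run(id=1, events=[Event(id=1)]))
    session.commit()

with Session(engine) as session:
    # No loader options on the query: eager loading comes from the model itself.
    run = session.execute(select(Run)).scalar_one()
    session.expunge_all()
    loaded_event_ids = [event.id for event in run.events]

print(loaded_event_ids)
```

The trade-off is that every query loading `Run` now pays for the extra SELECT, including queries that never read the collection.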
Additional Context
This affects all DAG types (not just asset-triggered) since `consumed_asset_events` is initialized as an empty list on all `DagRun` objects during creation in `_create_orm_dagrun()`.

The fix uses `selectinload` (vs `joinedload`) because the heartbeat query can return multiple TaskInstances, making `selectinload` more efficient for bulk operations.