Skip to content

Conversation

@ashb
Copy link
Member

@ashb ashb commented Mar 3, 2025

It's about time we delivered on one of the key points of AIP-72: DB isolation from workers.
(To be honest, it's probably past time, but now is the second best time)

All of these use the Workload supervisor from the TaskSDK and the main paths
XCom, Variables and Secrets) have all been ported to use the Execution API,
so it's about time we disabled DB access.

Note: this will almost certainly break a few things, like Skip mixin based tasks in particular - that is WIP in 46584

Also closes #47232 as that was failing if configure_orm was never called.


^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

@boring-cyborg boring-cyborg bot added area:API Airflow's REST/HTTP API area:logging area:providers area:task-sdk provider:amazon AWS/Amazon - related issues provider:edge Edge Executor / Worker (AIP-69) / edge3 provider:fab labels Mar 3, 2025
@ashb ashb force-pushed the disable-db-access-tasks branch from 18faa85 to eb14f67 Compare March 3, 2025 23:24
@ashb ashb changed the title disable db access tasks Disable ORM access from Tasks, DAG processing and Triggers Mar 3, 2025
@ashb ashb force-pushed the disable-db-access-tasks branch from eb14f67 to 2a77c31 Compare March 4, 2025 11:21
@ashb ashb requested a review from hussein-awala as a code owner March 4, 2025 11:21
@ashb ashb requested a review from jedcunningham as a code owner March 4, 2025 14:01
@ashb ashb force-pushed the disable-db-access-tasks branch from e55d781 to 1176f81 Compare March 4, 2025 18:23
@Lee-W Lee-W self-requested a review March 5, 2025 02:55
@ashb ashb force-pushed the disable-db-access-tasks branch from 7d81e91 to 0f19bf7 Compare March 5, 2025 12:56
@ashb ashb added the full tests needed We need to run full set of tests for this PR to merge label Mar 5, 2025
@ashb ashb force-pushed the disable-db-access-tasks branch 3 times, most recently from 1a9f977 to 93aacd2 Compare March 5, 2025 17:06
@jedcunningham jedcunningham force-pushed the disable-db-access-tasks branch from 93aacd2 to 63cb614 Compare March 5, 2025 19:20
Copy link
Contributor

@jscheffl jscheffl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@ashb ashb force-pushed the disable-db-access-tasks branch 2 times, most recently from 5c5e1ea to c127964 Compare March 6, 2025 11:24
ashb added 2 commits March 6, 2025 12:27
All of these use the Workload supervisor from the TaskSDK and the main paths
(XCom, Variables and Secrets) have all been ported to use the Execution API,
so it's about time we disabled DB access.
@ashb ashb force-pushed the disable-db-access-tasks branch from c127964 to 34b7152 Compare March 6, 2025 12:28
@ashb ashb merged commit c440959 into main Mar 6, 2025
89 checks passed
@ashb ashb deleted the disable-db-access-tasks branch March 6, 2025 13:29
@ashb
Copy link
Member Author

ashb commented Mar 6, 2025

Oof that was a bit of a mission to land.

@jscheffl
Copy link
Contributor

jscheffl commented Mar 6, 2025

Oof that was a bit of a mission to land.

Wohooo!

@jedcunningham
Copy link
Member

#protm

@potiuk
Copy link
Member

potiuk commented Mar 6, 2025

Oof that was a bit of a mission to land.

Indeed

ashb added a commit that referenced this pull request Mar 12, 2025
This change seems innocuous, and possibly even wrong, but it is the correct
behaviour since #47320 landed. We _do not_ want to call dispose_orm, as that
ends up reconnecting, and sometimes this results in the wrong connection
being shared between the parent and the child. I don't love the "sometimes"
nature of this bug, but the fix seems sound.

Prior to this running one or two runs concurrently would result in the
scheduler handing (stuck in SQLA code trying to roll back) or an error from
psycopg about "error with status PGRES_TUPLES_OK and no message from the libpq".

With this change we were able to repeatedly run 10 runs concurrently.

The reason we don't want this is that we registered an at_fork handler already
that closes/discards the socket object (without closing the DB level session)
so calling dispose can, perversely, resurrect that object and try reusing it!

Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>
Co-authored-by: Kaxil Naik <kaxilnaik@apache.org>
ashb added a commit that referenced this pull request Mar 12, 2025
This change seems innocuous, and possibly even wrong, but it is the correct
behaviour since #47320 landed. We _do not_ want to call dispose_orm, as that
ends up reconnecting, and sometimes this results in the wrong connection
being shared between the parent and the child. I don't love the "sometimes"
nature of this bug, but the fix seems sound.

Prior to this running one or two runs concurrently would result in the
scheduler handing (stuck in SQLA code trying to roll back) or an error from
psycopg about "error with status PGRES_TUPLES_OK and no message from the libpq".

With this change we were able to repeatedly run 10 runs concurrently.

The reason we don't want this is that we registered an at_fork handler already
that closes/discards the socket object (without closing the DB level session)
so calling dispose can, perversely, resurrect that object and try reusing it!

Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>
Co-authored-by: Kaxil Naik <kaxilnaik@apache.org>
ashb added a commit that referenced this pull request Mar 12, 2025
This change seems innocuous, and possibly even wrong, but it is the correct
behaviour since #47320 landed. We _do not_ want to call dispose_orm, as that
ends up reconnecting, and sometimes this results in the wrong connection
being shared between the parent and the child. I don't love the "sometimes"
nature of this bug, but the fix seems sound.

Prior to this running one or two runs concurrently would result in the
scheduler handing (stuck in SQLA code trying to roll back) or an error from
psycopg about "error with status PGRES_TUPLES_OK and no message from the libpq".

With this change we were able to repeatedly run 10 runs concurrently.

The reason we don't want this is that we registered an at_fork handler already
that closes/discards the socket object (without closing the DB level session)
so calling dispose can, perversely, resurrect that object and try reusing it!

Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>
Co-authored-by: Kaxil Naik <kaxilnaik@apache.org>
jedcunningham added a commit that referenced this pull request Mar 12, 2025
This change seems innocuous, and possibly even wrong, but it is the correct
behaviour since #47320 landed. We _do not_ want to call dispose_orm, as that
ends up reconnecting, and sometimes this results in the wrong connection
being shared between the parent and the child. I don't love the "sometimes"
nature of this bug, but the fix seems sound.

Prior to this running one or two runs concurrently would result in the
scheduler handing (stuck in SQLA code trying to roll back) or an error from
psycopg about "error with status PGRES_TUPLES_OK and no message from the libpq".

With this change we were able to repeatedly run 10 runs concurrently.

The reason we don't want this is that we registered an at_fork handler already
that closes/discards the socket object (without closing the DB level session)
so calling dispose can, perversely, resurrect that object and try reusing it!

Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>
Co-authored-by: Kaxil Naik <kaxilnaik@apache.org>
nailo2c pushed a commit to nailo2c/airflow that referenced this pull request Apr 4, 2025
)

All of these use the Workload supervisor from the TaskSDK and the main paths
(XCom, Variables and Secrets) have all been ported to use the Execution API,
so it's about time we disabled DB access.
nailo2c pushed a commit to nailo2c/airflow that referenced this pull request Apr 4, 2025
This change seems innocuous, and possibly even wrong, but it is the correct
behaviour since apache#47320 landed. We _do not_ want to call dispose_orm, as that
ends up reconnecting, and sometimes this results in the wrong connection
being shared between the parent and the child. I don't love the "sometimes"
nature of this bug, but the fix seems sound.

Prior to this running one or two runs concurrently would result in the
scheduler handing (stuck in SQLA code trying to roll back) or an error from
psycopg about "error with status PGRES_TUPLES_OK and no message from the libpq".

With this change we were able to repeatedly run 10 runs concurrently.

The reason we don't want this is that we registered an at_fork handler already
that closes/discards the socket object (without closing the DB level session)
so calling dispose can, perversely, resurrect that object and try reusing it!

Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>
Co-authored-by: Kaxil Naik <kaxilnaik@apache.org>
potiuk added a commit to jason810496/airflow that referenced this pull request Apr 12, 2025
Some recent changes in main and documentation published in the
inventories, made the 2.10 doc building fail as references to
non-existing docs in the new inventories were still used in the
documentation for 2.10

This PR fixes it by changing the docs to not refer to those
changed docs any more.

The PRs that removed the links: apache#47320 and apache#47399

Co-authored-by: LIU ZHE YOU <zhu424.dev@gmail.com>
Co-authored-by: Jarek Potiuk <jarek@potiuk.com>
potiuk added a commit to jason810496/airflow that referenced this pull request Apr 12, 2025
Some recent changes in main and documentation published in the
inventories, made the 2.10 doc building fail as references to
non-existing docs in the new inventories were still used in the
documentation for 2.10

This PR fixes it by changing the docs to not refer to those
changed docs any more.

The PRs that removed the links: apache#47320 and apache#47399

Co-authored-by: LIU ZHE YOU <zhu424.dev@gmail.com>
Co-authored-by: Jarek Potiuk <jarek@potiuk.com>
potiuk added a commit to jason810496/airflow that referenced this pull request Apr 12, 2025
Some recent changes in main and documentation published in the
inventories, made the 2.10 doc building fail as references to
non-existing docs in the new inventories were still used in the
documentation for 2.10

This PR fixes it by changing the docs to not refer to those
changed docs any more.

The PRs that removed the links: apache#47320 and apache#47399

Co-authored-by: LIU ZHE YOU <zhu424.dev@gmail.com>
Co-authored-by: Jarek Potiuk <jarek@potiuk.com>
potiuk added a commit that referenced this pull request Apr 12, 2025
Some recent changes in main and documentation published in the
inventories, made the 2.10 doc building fail as references to
non-existing docs in the new inventories were still used in the
documentation for 2.10

This PR fixes it by changing the docs to not refer to those
changed docs any more.

The PRs that removed the links: #47320 and #47399

Co-authored-by: Jarek Potiuk <jarek@potiuk.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:API Airflow's REST/HTTP API area:logging area:providers area:task-sdk full tests needed We need to run full set of tests for this PR to merge provider:amazon AWS/Amazon - related issues provider:edge Edge Executor / Worker (AIP-69) / edge3 provider:fab

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Error in testing re logging re task sdk

5 participants