Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Scheduler to handle incrementing of try_number #39336

Merged
merged 53 commits into from
May 9, 2024

Conversation

dstandish
Copy link
Contributor

@dstandish dstandish commented Apr 30, 2024

Previously, there was a lot of bad stuff happening around try_number.

We incremented it when task started running. And because of that, we had this logic to return "_try_number + 1" when task not running. But this gave the "right" try number before it ran, and the wrong number after it ran. And, since it was naively incremented when task starts running -- i.e. without regard to why it is running -- we decremented it when deferring or exiting on a reschedule.

What I do here is try to remove all of that stuff:

  • no more private _try_number attr
  • no more getter logic
  • no more decrementing
  • no more incrementing as part of task execution

Now what we do is increment only when the task is set to scheduled and only when it's not coming out of deferral or "up_for_reschedule". So the try_number will be more stable. It will not change throughout the course of task execution. The only time it will be incremented is when there's legitimately a new try.

One consequence of this is that try number will no longer be incremented if you run either airlfow tasks run or ti.run() in isolation. But because airflow assumes that all tasks runs are scheduled by the scheduler, I do not regard this to be a breaking change.

If user code or provider code has implemented hacks to get the "right" try_number when looking at it at the wrong time (because previously it gave the wrong answer), unfortunately that code will just have to be patched. There are only two cases I know of in the providers codebase -- openlineage listener, and dbt openlineage.

As a courtesy for backcompat we also add property _try_number which is just a proxy for try_number, so you'll still be able to access this attr. But, it will not behave the same as it did before.

@boring-cyborg boring-cyborg bot added area:API Airflow's REST/HTTP API area:logging area:plugins area:providers area:Scheduler including HA (high availability) scheduler area:serialization area:webserver Webserver related Issues provider:amazon-aws AWS/Amazon - related issues provider:elasticsearch labels Apr 30, 2024
@dstandish dstandish force-pushed the remove-try-number-shenanigans branch 2 times, most recently from 1d54320 to 310f923 Compare May 1, 2024 20:15
Copy link
Member

@jedcunningham jedcunningham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good!

@SamWheating, you might be a good reviewer on this one as well.

@dstandish dstandish force-pushed the remove-try-number-shenanigans branch 2 times, most recently from 36fcb5a to ed839d4 Compare May 6, 2024 19:28
@utkarsharma2 utkarsharma2 added type:improvement Changelog: Improvements type:misc/internal Changelog: Misc changes that should appear in change log and removed type:improvement Changelog: Improvements type:misc/internal Changelog: Misc changes that should appear in change log labels Jun 3, 2024
@utkarsharma2 utkarsharma2 added this to the Airflow 2.10.0 milestone Jun 4, 2024
romsharon98 pushed a commit to romsharon98/airflow that referenced this pull request Jul 26, 2024
Previously, there was a lot of bad stuff happening around try_number.

We incremented it when task started running. And because of that, we had this logic to return "_try_number + 1" when task not running. But this gave the "right" try number before it ran, and the wrong number after it ran. And, since it was naively incremented when task starts running -- i.e. without regard to why it is running -- we decremented it when deferring or exiting on a reschedule.

What I do here is try to remove all of that stuff:

no more private _try_number attr
no more getter logic
no more decrementing
no more incrementing as part of task execution
Now what we do is increment only when the task is set to scheduled and only when it's not coming out of deferral or "up_for_reschedule". So the try_number will be more stable. It will not change throughout the course of task execution. The only time it will be incremented is when there's legitimately a new try.

One consequence of this is that try number will no longer be incremented if you run either airlfow tasks run or ti.run() in isolation. But because airflow assumes that all tasks runs are scheduled by the scheduler, I do not regard this to be a breaking change.

If user code or provider code has implemented hacks to get the "right" try_number when looking at it at the wrong time (because previously it gave the wrong answer), unfortunately that code will just have to be patched. There are only two cases I know of in the providers codebase -- openlineage listener, and dbt openlineage.

As a courtesy for backcompat we also add property _try_number which is just a proxy for try_number, so you'll still be able to access this attr. But, it will not behave the same as it did before.

---------

Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>
romsharon98 pushed a commit to romsharon98/airflow that referenced this pull request Jul 26, 2024
Previously we had code to compensate for the fact that we were decrementing try_number when deferring or rescheduling.  We can remove this code now.  Just missed this in apache#39336.
howardyoo added a commit to howardyoo/airflow that referenced this pull request Aug 15, 2024
potiuk pushed a commit that referenced this pull request Aug 16, 2024
…irflow (issue #41501) (#41502)

* Fix for issue #39336

* removed unnecessary import
potiuk pushed a commit to potiuk/airflow that referenced this pull request Aug 16, 2024
…irflow (issue apache#41501) (apache#41502)

* Fix for issue apache#39336

* removed unnecessary import

(cherry picked from commit dd3c3a7)
potiuk added a commit that referenced this pull request Aug 16, 2024
…irflow (issue #41501) (#41502) (#41535)

* Fix for issue #39336

* removed unnecessary import

(cherry picked from commit dd3c3a7)

Co-authored-by: Howard Yoo <32691630+howardyoo@users.noreply.github.com>
Artuz37 pushed a commit to Artuz37/airflow that referenced this pull request Aug 19, 2024
romsharon98 pushed a commit to romsharon98/airflow that referenced this pull request Aug 20, 2024
utkarsharma2 added a commit that referenced this pull request Aug 22, 2024
…41610)

* Enable pull requests to be run from v*test branches (#41474) (#41476)

Since we switch from direct push of cherry-picking to open PRs
against v*test branch, we should enable PRs to run for the target
branch.

(cherry picked from commit a9363e6)

* Prevent provider lowest-dependency tests to run in non-main branch (#41478) (#41481)

When running tests in v2-10-test branch, lowest depenency tests
are run for providers - because when calculating separate tests,
the "skip_provider_tests" has not been used to filter them out.

This PR fixes it.

(cherry picked from commit 75da507)

* Make PROD image building works in non-main PRs (#41480) (#41484)

The PROD image building fails currently in non-main because it
attempts to build source provider packages rather than use them from
PyPi when PR is run against "v-test" branch.

This PR fixes it:

* PROD images in non-main-targetted build will pull providers from
  PyPI rather than build them
* they use PyPI constraints to install the providers
* they use UV - which should speed up building of the images

(cherry picked from commit 4d5f1c4)

* Add WebEncoder for trigger page rendering to avoid render failure (#41350) (#41485)

Co-authored-by: M. Olcay Tercanlı <muhammed_tercanli@epam.com>

* Incorrect try number subtraction producing invalid span id for OTEL airflow (issue #41501) (#41502) (#41535)

* Fix for issue #39336

* removed unnecessary import

(cherry picked from commit dd3c3a7)

Co-authored-by: Howard Yoo <32691630+howardyoo@users.noreply.github.com>

* Fix failing pydantic v1 tests (#41534) (#41541)

We need to exclude some versions of Pydantic v1 because it conflicts
with aws provider.

(cherry picked from commit a033c5f)

* Fix Non-DB test calculation for main builds (#41499) (#41543)

Pytest has a weird behaviour that it will not collect tests
from parent folder when subfolder of it is specified after the
parent folder. This caused some non-db tests from providers folder
have been skipped during main build.

The issue in Pytest 8.2 (used to work before) is tracked at
pytest-dev/pytest#12605

(cherry picked from commit d489826)

* Add changelog for airflow python client 2.10.0 (#41583) (#41584)

* Add changelog for airflow python client 2.10.0

* Update client version

(cherry picked from commit 317a28e)

* Make all test pass in Database Isolation mode (#41567)

This adds dedicated "DatabaseIsolation" test to airflow v2-10-test
branch..

The DatabaseIsolation test will run all "db-tests" with enabled
DB isolation mode and running `internal-api` component - groups
of tests marked with "skip-if-database-isolation" will be skipped.

* Upgrade build and chart dependencies (#41570) (#41588)

(cherry picked from commit c88192c)

Co-authored-by: Jarek Potiuk <jarek@potiuk.com>

* Limit watchtower as depenendcy as 3.3.0 breaks moin. (#41612)

(cherry picked from commit 1b602d5)

* Enable running Pull Requests against v2-10-stable branch (#41624)

(cherry picked from commit e306e7f)

* Fix tests/models/test_variable.py for database isolation mode (#41414)

* Fix tests/models/test_variable.py for database isolation mode

* Review feedback

(cherry picked from commit 736ebfe)

* Make latest botocore tests green (#41626)

The latest botocore tests are conflicting with a few requirements
and until apache-beam upcoming version is released we need to do
some manual exclusions. Those exclusions should make latest botocore
test green again.

(cherry picked from commit a13ccbb)

* Simpler task retrieval for taskinstance test (#41389)

The test has been updated for DB isolation but the retrieval of
task was not intuitive and it could lead to flaky tests possibly

(cherry picked from commit f25adf1)

* Skip  database isolation case for task mapping taskinstance tests (#41471)

Related: #41067
(cherry picked from commit 7718bd7)

* Skipping tests for db isolation because similar tests were skipped (#41450)

(cherry picked from commit e94b508)

---------

Co-authored-by: Jarek Potiuk <jarek@potiuk.com>
Co-authored-by: Brent Bovenzi <brent@astronomer.io>
Co-authored-by: M. Olcay Tercanlı <muhammed_tercanli@epam.com>
Co-authored-by: Howard Yoo <32691630+howardyoo@users.noreply.github.com>
Co-authored-by: Jens Scheffler <95105677+jscheffl@users.noreply.github.com>
Co-authored-by: Bugra Ozturk <bugraoz93@users.noreply.github.com>
utkarsharma2 added a commit that referenced this pull request Aug 30, 2024
…41610)

* Enable pull requests to be run from v*test branches (#41474) (#41476)

Since we switch from direct push of cherry-picking to open PRs
against v*test branch, we should enable PRs to run for the target
branch.

(cherry picked from commit a9363e6)

* Prevent provider lowest-dependency tests to run in non-main branch (#41478) (#41481)

When running tests in v2-10-test branch, lowest depenency tests
are run for providers - because when calculating separate tests,
the "skip_provider_tests" has not been used to filter them out.

This PR fixes it.

(cherry picked from commit 75da507)

* Make PROD image building works in non-main PRs (#41480) (#41484)

The PROD image building fails currently in non-main because it
attempts to build source provider packages rather than use them from
PyPi when PR is run against "v-test" branch.

This PR fixes it:

* PROD images in non-main-targetted build will pull providers from
  PyPI rather than build them
* they use PyPI constraints to install the providers
* they use UV - which should speed up building of the images

(cherry picked from commit 4d5f1c4)

* Add WebEncoder for trigger page rendering to avoid render failure (#41350) (#41485)

Co-authored-by: M. Olcay Tercanlı <muhammed_tercanli@epam.com>

* Incorrect try number subtraction producing invalid span id for OTEL airflow (issue #41501) (#41502) (#41535)

* Fix for issue #39336

* removed unnecessary import

(cherry picked from commit dd3c3a7)

Co-authored-by: Howard Yoo <32691630+howardyoo@users.noreply.github.com>

* Fix failing pydantic v1 tests (#41534) (#41541)

We need to exclude some versions of Pydantic v1 because it conflicts
with aws provider.

(cherry picked from commit a033c5f)

* Fix Non-DB test calculation for main builds (#41499) (#41543)

Pytest has a weird behaviour that it will not collect tests
from parent folder when subfolder of it is specified after the
parent folder. This caused some non-db tests from providers folder
have been skipped during main build.

The issue in Pytest 8.2 (used to work before) is tracked at
pytest-dev/pytest#12605

(cherry picked from commit d489826)

* Add changelog for airflow python client 2.10.0 (#41583) (#41584)

* Add changelog for airflow python client 2.10.0

* Update client version

(cherry picked from commit 317a28e)

* Make all test pass in Database Isolation mode (#41567)

This adds dedicated "DatabaseIsolation" test to airflow v2-10-test
branch..

The DatabaseIsolation test will run all "db-tests" with enabled
DB isolation mode and running `internal-api` component - groups
of tests marked with "skip-if-database-isolation" will be skipped.

* Upgrade build and chart dependencies (#41570) (#41588)

(cherry picked from commit c88192c)

Co-authored-by: Jarek Potiuk <jarek@potiuk.com>

* Limit watchtower as depenendcy as 3.3.0 breaks moin. (#41612)

(cherry picked from commit 1b602d5)

* Enable running Pull Requests against v2-10-stable branch (#41624)

(cherry picked from commit e306e7f)

* Fix tests/models/test_variable.py for database isolation mode (#41414)

* Fix tests/models/test_variable.py for database isolation mode

* Review feedback

(cherry picked from commit 736ebfe)

* Make latest botocore tests green (#41626)

The latest botocore tests are conflicting with a few requirements
and until apache-beam upcoming version is released we need to do
some manual exclusions. Those exclusions should make latest botocore
test green again.

(cherry picked from commit a13ccbb)

* Simpler task retrieval for taskinstance test (#41389)

The test has been updated for DB isolation but the retrieval of
task was not intuitive and it could lead to flaky tests possibly

(cherry picked from commit f25adf1)

* Skip  database isolation case for task mapping taskinstance tests (#41471)

Related: #41067
(cherry picked from commit 7718bd7)

* Skipping tests for db isolation because similar tests were skipped (#41450)

(cherry picked from commit e94b508)

---------

Co-authored-by: Jarek Potiuk <jarek@potiuk.com>
Co-authored-by: Brent Bovenzi <brent@astronomer.io>
Co-authored-by: M. Olcay Tercanlı <muhammed_tercanli@epam.com>
Co-authored-by: Howard Yoo <32691630+howardyoo@users.noreply.github.com>
Co-authored-by: Jens Scheffler <95105677+jscheffl@users.noreply.github.com>
Co-authored-by: Bugra Ozturk <bugraoz93@users.noreply.github.com>
utkarsharma2 added a commit that referenced this pull request Aug 30, 2024
…41610)

* Enable pull requests to be run from v*test branches (#41474) (#41476)

Since we switch from direct push of cherry-picking to open PRs
against v*test branch, we should enable PRs to run for the target
branch.

(cherry picked from commit a9363e6)

* Prevent provider lowest-dependency tests to run in non-main branch (#41478) (#41481)

When running tests in v2-10-test branch, lowest depenency tests
are run for providers - because when calculating separate tests,
the "skip_provider_tests" has not been used to filter them out.

This PR fixes it.

(cherry picked from commit 75da507)

* Make PROD image building works in non-main PRs (#41480) (#41484)

The PROD image building fails currently in non-main because it
attempts to build source provider packages rather than use them from
PyPi when PR is run against "v-test" branch.

This PR fixes it:

* PROD images in non-main-targetted build will pull providers from
  PyPI rather than build them
* they use PyPI constraints to install the providers
* they use UV - which should speed up building of the images

(cherry picked from commit 4d5f1c4)

* Add WebEncoder for trigger page rendering to avoid render failure (#41350) (#41485)

Co-authored-by: M. Olcay Tercanlı <muhammed_tercanli@epam.com>

* Incorrect try number subtraction producing invalid span id for OTEL airflow (issue #41501) (#41502) (#41535)

* Fix for issue #39336

* removed unnecessary import

(cherry picked from commit dd3c3a7)

Co-authored-by: Howard Yoo <32691630+howardyoo@users.noreply.github.com>

* Fix failing pydantic v1 tests (#41534) (#41541)

We need to exclude some versions of Pydantic v1 because it conflicts
with aws provider.

(cherry picked from commit a033c5f)

* Fix Non-DB test calculation for main builds (#41499) (#41543)

Pytest has a weird behaviour that it will not collect tests
from parent folder when subfolder of it is specified after the
parent folder. This caused some non-db tests from providers folder
have been skipped during main build.

The issue in Pytest 8.2 (used to work before) is tracked at
pytest-dev/pytest#12605

(cherry picked from commit d489826)

* Add changelog for airflow python client 2.10.0 (#41583) (#41584)

* Add changelog for airflow python client 2.10.0

* Update client version

(cherry picked from commit 317a28e)

* Make all test pass in Database Isolation mode (#41567)

This adds dedicated "DatabaseIsolation" test to airflow v2-10-test
branch..

The DatabaseIsolation test will run all "db-tests" with enabled
DB isolation mode and running `internal-api` component - groups
of tests marked with "skip-if-database-isolation" will be skipped.

* Upgrade build and chart dependencies (#41570) (#41588)

(cherry picked from commit c88192c)

Co-authored-by: Jarek Potiuk <jarek@potiuk.com>

* Limit watchtower as depenendcy as 3.3.0 breaks moin. (#41612)

(cherry picked from commit 1b602d5)

* Enable running Pull Requests against v2-10-stable branch (#41624)

(cherry picked from commit e306e7f)

* Fix tests/models/test_variable.py for database isolation mode (#41414)

* Fix tests/models/test_variable.py for database isolation mode

* Review feedback

(cherry picked from commit 736ebfe)

* Make latest botocore tests green (#41626)

The latest botocore tests are conflicting with a few requirements
and until apache-beam upcoming version is released we need to do
some manual exclusions. Those exclusions should make latest botocore
test green again.

(cherry picked from commit a13ccbb)

* Simpler task retrieval for taskinstance test (#41389)

The test has been updated for DB isolation but the retrieval of
task was not intuitive and it could lead to flaky tests possibly

(cherry picked from commit f25adf1)

* Skip  database isolation case for task mapping taskinstance tests (#41471)

Related: #41067
(cherry picked from commit 7718bd7)

* Skipping tests for db isolation because similar tests were skipped (#41450)

(cherry picked from commit e94b508)

---------

Co-authored-by: Jarek Potiuk <jarek@potiuk.com>
Co-authored-by: Brent Bovenzi <brent@astronomer.io>
Co-authored-by: M. Olcay Tercanlı <muhammed_tercanli@epam.com>
Co-authored-by: Howard Yoo <32691630+howardyoo@users.noreply.github.com>
Co-authored-by: Jens Scheffler <95105677+jscheffl@users.noreply.github.com>
Co-authored-by: Bugra Ozturk <bugraoz93@users.noreply.github.com>
kosteev pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this pull request Nov 9, 2024
Changes:
- provide custom GCS task handler
- write task logs to stdout for fluentd to expose in Cloud Logging
- write DAG processor manager logs to stdout for fluentd to expose in Cloud Logging
- write custom Composer metrics
- implement custom Composer log filter
- use same log format for Celery logs as for all other logs
- set sqlfluff logging level to WARNING to avoid polluting parsing logs
- modify write_metrics() to not decrement `try_number` value after changes from apache/airflow#39336

Change-Id: Ie6ca8e8a544dbd661bc74db38b2cc419144bb9a2
GitOrigin-RevId: 54ef1a1bcb67b4f4855ccef4ced98e0a4ad280bc
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:API Airflow's REST/HTTP API area:logging area:plugins area:providers area:Scheduler including HA (high availability) scheduler area:serialization area:webserver Webserver related Issues full tests needed We need to run full set of tests for this PR to merge provider:amazon-aws AWS/Amazon - related issues provider:elasticsearch type:improvement Changelog: Improvements
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants