Scheduler to handle incrementing of try_number #39336

dstandish · 2024-04-30T21:21:39Z

Previously, there was a lot of bad stuff happening around try_number.

We incremented it when the task started running. Because of that, we had logic to return "_try_number + 1" when the task was not running. That gave the "right" try number before the task ran, and the wrong number after it ran. And since it was naively incremented whenever the task started running, i.e. without regard to why it was running, we had to decrement it when deferring or exiting on a reschedule.
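To make the old behavior concrete, here is a toy model of it. This is only a sketch: `LegacyToyTI` and its methods are made-up names for illustration and are not the real `TaskInstance` code, and states are plain strings.

```python
class LegacyToyTI:
    """Toy stand-in for the old behavior; not Airflow's real TaskInstance."""

    def __init__(self):
        self._try_number = 0
        self.state = None

    @property
    def try_number(self):
        # Old getter logic: report the stored value only while running,
        # otherwise report the stored value plus one.
        if self.state == "running":
            return self._try_number
        return self._try_number + 1

    def start_running(self):
        # Incremented on every start, regardless of why the task is starting.
        self._try_number += 1
        self.state = "running"

    def defer(self):
        # Compensating decrement so that resuming does not count as a new try.
        self._try_number -= 1
        self.state = "deferred"

    def finish(self):
        self.state = "success"


ti = LegacyToyTI()
before = ti.try_number   # looks right before the run
ti.start_running()
during = ti.try_number
ti.finish()
after = ti.try_number    # off by one once the run is over
print(before, during, after)  # 1 1 2
```

The same first try is reported as 1 before and during the run but as 2 afterwards, which is the inconsistency described above.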

What I do here is try to remove all of that stuff:

no more private _try_number attr
no more getter logic
no more decrementing
no more incrementing as part of task execution

Now what we do is increment only when the task is set to scheduled and only when it's not coming out of deferral or "up_for_reschedule". So the try_number will be more stable. It will not change throughout the course of task execution. The only time it will be incremented is when there's legitimately a new try.
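A minimal sketch of the new rule, under the assumption that it can be reduced to a single check at scheduling time. `ToyTI`, `set_scheduled` and `RESUMING_STATES` are hypothetical names for this example, not the scheduler's actual code.

```python
from dataclasses import dataclass
from typing import Optional

# States from which a task resumes the same try rather than starting a new one.
RESUMING_STATES = {"deferred", "up_for_reschedule"}


@dataclass
class ToyTI:
    """Toy task instance; try_number is a plain attribute with no getter logic."""
    try_number: int = 0
    state: Optional[str] = None


def set_scheduled(ti: ToyTI) -> None:
    # Increment only when this is legitimately a new try.
    if ti.state not in RESUMING_STATES:
        ti.try_number += 1
    ti.state = "scheduled"


ti = ToyTI()
history = []

set_scheduled(ti)               # first try
history.append(ti.try_number)

ti.state = "running"
ti.state = "deferred"
set_scheduled(ti)               # coming out of deferral: same try
history.append(ti.try_number)

ti.state = "running"
ti.state = "up_for_retry"
set_scheduled(ti)               # retry: new try
history.append(ti.try_number)

print(history)  # [1, 1, 2]
```

In this model nothing touches `try_number` while the task is running, so the value read at any point during a try is the same.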

One consequence of this is that try number will no longer be incremented if you run either airflow tasks run or ti.run() in isolation. But because airflow assumes that all task runs are scheduled by the scheduler, I do not regard this to be a breaking change.

If user code or provider code has implemented hacks to get the "right" try_number when looking at it at the wrong time (because previously it gave the wrong answer), unfortunately that code will just have to be patched. There are only two cases I know of in the providers codebase -- openlineage listener, and dbt openlineage.
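For illustration, this is the kind of compensation that would need patching. The helpers are hypothetical and are not taken from the openlineage code; `SimpleNamespace` stands in for a finished task instance whose first try has just completed under the new semantics.

```python
from types import SimpleNamespace


def finished_try_legacy(ti):
    # Old-style hack: after a run the getter reported the *next* try,
    # so callers subtracted one to get the try that had just finished.
    return ti.try_number - 1


def finished_try(ti):
    # With a stable try_number the attribute can be used as-is.
    return ti.try_number


# Stand-in for a task instance after its first try (new semantics).
ti = SimpleNamespace(try_number=1, state="success")

print(finished_try_legacy(ti))  # 0 -> the old hack is now off by one
print(finished_try(ti))         # 1
```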

As a courtesy for backcompat we also add property _try_number which is just a proxy for try_number, so you'll still be able to access this attr. But, it will not behave the same as it did before.
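A rough sketch of what such a proxy can look like. `ToyTI` is a made-up class, and the setter is an assumption of this example; the description above only says the property proxies `try_number`.

```python
class ToyTI:
    """Toy class showing a backcompat alias; not the real TaskInstance."""

    def __init__(self):
        self.try_number = 0

    @property
    def _try_number(self):
        # Reads pass straight through to the public attribute.
        return self.try_number

    @_try_number.setter
    def _try_number(self, value):
        # Assumed for this sketch: writes pass through as well.
        self.try_number = value


ti = ToyTI()
ti.try_number = 2
print(ti._try_number)  # 2

ti._try_number = 5
print(ti.try_number)   # 5
```

Old code that reads `_try_number` keeps working, but it sees the same value as `try_number`, with none of the earlier "+ 1" or decrement behavior.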

airflow/models/taskinstance.py

tests/sensors/test_base.py

jedcunningham

Overall looks good!

@SamWheating, you might be a good reviewer on this one as well.

Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>

Previously we had code to compensate for the fact that we were decrementing try_number when deferring or rescheduling. We can remove this code now. Just missed this in apache#39336.

…irflow (issue #41501) (#41502) * Fix for issue #39336 * removed unnecessary import

…irflow (issue apache#41501) (apache#41502) * Fix for issue apache#39336 * removed unnecessary import (cherry picked from commit dd3c3a7)

…irflow (issue #41501) (#41502) (#41535) * Fix for issue #39336 * removed unnecessary import (cherry picked from commit dd3c3a7) Co-authored-by: Howard Yoo <32691630+howardyoo@users.noreply.github.com>

…41610)

* Enable pull requests to be run from v*test branches (#41474) (#41476)
  Since we switch from direct push of cherry-picking to open PRs against v*test branch, we should enable PRs to run for the target branch.
  (cherry picked from commit a9363e6)
* Prevent provider lowest-dependency tests to run in non-main branch (#41478) (#41481)
  When running tests in v2-10-test branch, lowest-dependency tests are run for providers, because when calculating separate tests, the "skip_provider_tests" has not been used to filter them out. This PR fixes it.
  (cherry picked from commit 75da507)
* Make PROD image building works in non-main PRs (#41480) (#41484)
  The PROD image building fails currently in non-main because it attempts to build source provider packages rather than use them from PyPI when PR is run against "v-test" branch. This PR fixes it:
  * PROD images in non-main-targeted build will pull providers from PyPI rather than build them
  * they use PyPI constraints to install the providers
  * they use UV, which should speed up building of the images
  (cherry picked from commit 4d5f1c4)
* Add WebEncoder for trigger page rendering to avoid render failure (#41350) (#41485)
  Co-authored-by: M. Olcay Tercanlı <muhammed_tercanli@epam.com>
* Incorrect try number subtraction producing invalid span id for OTEL airflow (issue #41501) (#41502) (#41535)
  * Fix for issue #39336
  * removed unnecessary import
  (cherry picked from commit dd3c3a7)
  Co-authored-by: Howard Yoo <32691630+howardyoo@users.noreply.github.com>
* Fix failing pydantic v1 tests (#41534) (#41541)
  We need to exclude some versions of Pydantic v1 because it conflicts with aws provider.
  (cherry picked from commit a033c5f)
* Fix Non-DB test calculation for main builds (#41499) (#41543)
  Pytest has a weird behaviour that it will not collect tests from parent folder when subfolder of it is specified after the parent folder. This caused some non-db tests from the providers folder to be skipped during main build. The issue in Pytest 8.2 (used to work before) is tracked at pytest-dev/pytest#12605
  (cherry picked from commit d489826)
* Add changelog for airflow python client 2.10.0 (#41583) (#41584)
  * Add changelog for airflow python client 2.10.0
  * Update client version
  (cherry picked from commit 317a28e)
* Make all test pass in Database Isolation mode (#41567)
  This adds dedicated "DatabaseIsolation" test to airflow v2-10-test branch. The DatabaseIsolation test will run all "db-tests" with enabled DB isolation mode and running `internal-api` component. Groups of tests marked with "skip-if-database-isolation" will be skipped.
* Upgrade build and chart dependencies (#41570) (#41588)
  (cherry picked from commit c88192c)
  Co-authored-by: Jarek Potiuk <jarek@potiuk.com>
* Limit watchtower as dependency as 3.3.0 breaks main. (#41612)
  (cherry picked from commit 1b602d5)
* Enable running Pull Requests against v2-10-stable branch (#41624)
  (cherry picked from commit e306e7f)
* Fix tests/models/test_variable.py for database isolation mode (#41414)
  * Fix tests/models/test_variable.py for database isolation mode
  * Review feedback
  (cherry picked from commit 736ebfe)
* Make latest botocore tests green (#41626)
  The latest botocore tests are conflicting with a few requirements and until apache-beam upcoming version is released we need to do some manual exclusions. Those exclusions should make latest botocore test green again.
  (cherry picked from commit a13ccbb)
* Simpler task retrieval for taskinstance test (#41389)
  The test has been updated for DB isolation but the retrieval of task was not intuitive and it could possibly lead to flaky tests
  (cherry picked from commit f25adf1)
* Skip database isolation case for task mapping taskinstance tests (#41471)
  Related: #41067
  (cherry picked from commit 7718bd7)
* Skipping tests for db isolation because similar tests were skipped (#41450)
  (cherry picked from commit e94b508)

---------

Co-authored-by: Jarek Potiuk <jarek@potiuk.com>
Co-authored-by: Brent Bovenzi <brent@astronomer.io>
Co-authored-by: M. Olcay Tercanlı <muhammed_tercanli@epam.com>
Co-authored-by: Howard Yoo <32691630+howardyoo@users.noreply.github.com>
Co-authored-by: Jens Scheffler <95105677+jscheffl@users.noreply.github.com>
Co-authored-by: Bugra Ozturk <bugraoz93@users.noreply.github.com>

Changes:
- provide custom GCS task handler
- write task logs to stdout for fluentd to expose in Cloud Logging
- write DAG processor manager logs to stdout for fluentd to expose in Cloud Logging
- write custom Composer metrics
- implement custom Composer log filter
- use same log format for Celery logs as for all other logs
- set sqlfluff logging level to WARNING to avoid polluting parsing logs
- modify write_metrics() to not decrement `try_number` value after changes from apache/airflow#39336

Change-Id: Ie6ca8e8a544dbd661bc74db38b2cc419144bb9a2
GitOrigin-RevId: 54ef1a1bcb67b4f4855ccef4ced98e0a4ad280bc

dstandish force-pushed the remove-try-number-shenanigans branch 2 times, most recently from 1d54320 to 310f923 May 1, 2024 20:15

dstandish commented May 1, 2024

airflow/models/taskinstance.py

dstandish commented May 1, 2024

tests/sensors/test_base.py

dstandish marked this pull request as ready for review May 1, 2024 22:05

dstandish requested review from eladkal, o-nikolas, ryanahamilton, ashb, bbovenzi, pierrejeambrun, ephraimbuddy, kaxil and XD-DENG as code owners May 1, 2024 22:05

dstandish requested review from uranusjr and Lee-W May 1, 2024 22:33

jedcunningham reviewed May 2, 2024

dstandish mentioned this pull request May 2, 2024

Increment try_number while clearing deferred tasks. #38984

Closed

dstandish force-pushed the remove-try-number-shenanigans branch 2 times, most recently from 36fcb5a to ed839d4 May 6, 2024 19:28

dstandish requested a review from mobuchowski as a code owner May 6, 2024 19:28

eladkal mentioned this pull request May 26, 2024

Status of testing Providers that were prepared on May 26, 2024 #39842

Closed

utkarsharma2 added type:improvement Changelog: Improvements type:misc/internal Changelog: Misc changes that should appear in change log and removed type:improvement Changelog: Improvements type:misc/internal Changelog: Misc changes that should appear in change log labels Jun 3, 2024

utkarsharma2 added this to the Airflow 2.10.0 milestone Jun 4, 2024

dstandish mentioned this pull request Jun 25, 2024

Ensure try_number incremented for empty operator #40426

Merged

jscheffl mentioned this pull request Aug 14, 2024

Task Try History in UI Wrong Color in Status Badges before Task Run #41462

Closed

2 tasks

howardyoo mentioned this pull request Aug 15, 2024

Incorrect try number producing invalid span id for OTEL airflow #41501

Closed

2 tasks

howardyoo added a commit to howardyoo/airflow that referenced this pull request Aug 15, 2024

Fix for issue apache#39336

0b6eb8f

howardyoo mentioned this pull request Aug 15, 2024

Incorrect try number subtraction producing invalid span id for OTEL airflow (issue #41501) #41502

Merged

potiuk pushed a commit that referenced this pull request Aug 16, 2024

Incorrect try number subtraction producing invalid span id for OTEL a…

dd3c3a7

…irflow (issue #41501) (#41502) * Fix for issue #39336 * removed unnecessary import

potiuk mentioned this pull request Aug 16, 2024

Incorrect try number subtraction producing invalid span id for OTEL a… #41535

Merged

Artuz37 pushed a commit to Artuz37/airflow that referenced this pull request Aug 19, 2024

Incorrect try number subtraction producing invalid span id for OTEL a…

9203881

…irflow (issue apache#41501) (apache#41502) * Fix for issue apache#39336 * removed unnecessary import

romsharon98 pushed a commit to romsharon98/airflow that referenced this pull request Aug 20, 2024

Incorrect try number subtraction producing invalid span id for OTEL a…

c48892e

…irflow (issue apache#41501) (apache#41502) * Fix for issue apache#39336 * removed unnecessary import

jscheffl mentioned this pull request Oct 1, 2024

Try number inconsistency between webserver and the actual log generated #42549

Closed

2 tasks

dheerajturaga mentioned this pull request Oct 1, 2024

Spot clean try_number references to use ti.try_number #42633

Closed
