Skip to content

Conversation

@oboki
Copy link
Contributor

@oboki oboki commented May 4, 2025

In environments with multiple workers (e.g., CeleryExecutor), task logs for previous tries (i.e., not the latest try_number) may fail to load with the following error:

Could not read served logs: 404 Client Error: NOT FOUND for url: http://worker-2:8793/log/dag_id=tutorial_dag/run_id=manual__2025-05-04T12:54:25.273420+00:00/task_id=extract/attempt=5.log

As shown in the screenshots below, attempt=5 was actually executed on worker-1, but the Web UI incorrectly tries to fetch the logs from worker-2:

image

image

This happens because _get_log_retrieval_url generates the log URL based on TaskInstance.hostname, which only stores the hostname of the latest execution attempt. It does not keep track of the history of previous tries.

To fix this, I updated the logic to use the TaskInstanceHistory model, just as the "Details" tab does, so the correct hostname is used for each specific try_number.

With this change, logs for previous attempts load correctly as expected.

image

image

image

Copy link
Member

@jason810496 jason810496 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch! Thanks for the PR, but I think we can fix on API side instead of file_task_handler side.

@oboki oboki force-pushed the fix-could-not-read-served-logs-from branch from c0dc269 to 2d209f9 Compare May 7, 2025 15:16
@eladkal eladkal added this to the Airflow 3.0.2 milestone May 7, 2025
@eladkal eladkal added the type:bug-fix Changelog: Bug Fixes label May 7, 2025
Copy link
Contributor

@bugraoz93 bugraoz93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for the PR! Looks good. It would be great to include a test case in the unit tests where ti is not None but try_number is different than the ti.try_number to test the additional condition

Copy link
Member

@jason810496 jason810496 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Looks good to me, but it seems like there is side effect for mocking session in the test, which leads to dead lock for SQLite.

Copy link
Member

@pierrejeambrun pierrejeambrun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice looks good overall just a few tests to fix.

@oboki oboki force-pushed the fix-could-not-read-served-logs-from branch 3 times, most recently from 458c374 to c325ee1 Compare May 14, 2025 16:20
Copy link
Contributor

@bugraoz93 bugraoz93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some of the tests are still failing. Could you please check them?

@bugraoz93
Copy link
Contributor

A more general thing about routes and services: There are similar cases where we have methods inside route files. We already have some level of distinction between them. How about moving them to services? What do you think?
cc: @pierrejeambrun @rawwar @jason810496 @shubhamraj-git @ephraimbuddy

@jason810496
Copy link
Member

A more general thing about routes and services: There are similar cases where we have methods inside route files. We already have some level of distinction between them. How about moving them to services? What do you think? cc: @pierrejeambrun @rawwar @jason810496 @shubhamraj-git @ephraimbuddy

Looks good to me to move the query helper function to the airflow-core/src/airflow/api_fastapi/core_api/services module level. (I think airflow-core/src/airflow/api_fastapi/common/db would also be a suitable location.)

Just to double-check, do you mean to move the _find_task_instance_for_try_number function directly into the services module and not introduce a new service class, right?

Since this helper simply wraps a query as a function, and the existing service classes (e.g., for connections, pools, variables) are designed for bulk operations, creating a new class might not be necessary here.

Copy link
Member

@pierrejeambrun pierrejeambrun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nits regarding discussion, but LGTM.

@bugraoz93
Copy link
Contributor

A more general thing about routes and services: There are similar cases where we have methods inside route files. We already have some level of distinction between them. How about moving them to services? What do you think? cc: @pierrejeambrun @rawwar @jason810496 @shubhamraj-git @ephraimbuddy

Looks good to me to move the query helper function to the airflow-core/src/airflow/api_fastapi/core_api/services module level. (I think airflow-core/src/airflow/api_fastapi/common/db would also be a suitable location.)

Just to double-check, do you mean to move the _find_task_instance_for_try_number function directly into the services module and not introduce a new service class, right?

Since this helper simply wraps a query as a function, and the existing service classes (e.g., for connections, pools, variables) are designed for bulk operations, creating a new class might not be necessary here.

Yes, I agree on the second look, airflow-core/src/airflow/api_fastapi/common/db is more suitable. I intend to decouple these methods from routes. Even with collapse all in some routes, there are a lot of endpoints and making the code less readable.

Yes, it can be on the root level. For sure, it depends on the use case if we look at it in general. We aren't doing full object-oriented programming in general, it is mixed with Functional mostly due to the nature of Python. Unless we use service classes as a singleton across all routes, it can only add overhead of creating the object rather than calling the method. I don't have a strong opinion on that since the entire stack is using a mixed approach and makes sense in most cases :)

@bugraoz93
Copy link
Contributor

I will take a look moving other methods from routes to proper places soon

@oboki oboki force-pushed the fix-could-not-read-served-logs-from branch from 4dbeea1 to dbdeda7 Compare May 18, 2025 14:15
@oboki oboki closed this May 18, 2025
@oboki oboki force-pushed the fix-could-not-read-served-logs-from branch 2 times, most recently from 508d0ea to 75f2296 Compare May 19, 2025 22:51
@oboki oboki force-pushed the fix-could-not-read-served-logs-from branch from 75f2296 to 9d338aa Compare May 20, 2025 02:01
@jason810496
Copy link
Member

The test seems flaky, update with latest main.

@oboki
Copy link
Contributor Author

oboki commented May 20, 2025

The test seems flaky, update with latest main.

Finally, all tests have passed! Thanks @jason810496. I really appreciate your support.

@jason810496 jason810496 merged commit 0b0ff5d into apache:main May 20, 2025
95 checks passed
@github-actions
Copy link

Backport failed to create: v3-0-test. View the failure log Run details

Status Branch Result
v3-0-test Commit Link

You can attempt to backport this manually by running:

cherry_picker 0b0ff5d v3-0-test

This should apply the commit to the v3-0-test branch and leave the commit in conflict state marking
the files that need manual conflict resolution.

After you have resolved the conflicts, you can continue the backport process by running:

cherry_picker --continue

jason810496 added a commit to jason810496/airflow that referenced this pull request May 20, 2025
…rker environments (apache#50175)

* fix: resolve 404 log error for non-latest task tries in multi-host worker environments

* refactor: extract TaskInstance and TaskInstanceHistory query logic from `get_log` endpoint function

* test: add unit test for `get_task_instance_or_history_for_try_number` function

* fix: resolve sqlite lock error apache#50763

Co-authored-by: LIU ZHE YOU <68415893+jason810496@users.noreply.github.com>

---------

Co-authored-by: LIU ZHE YOU <68415893+jason810496@users.noreply.github.com>
(cherry picked from commit 0b0ff5d)
@jason810496
Copy link
Member

jason810496 commented May 20, 2025

Manual packport to v3-10-test due to conflicts: #50833

potiuk pushed a commit to jason810496/airflow that referenced this pull request May 21, 2025
…rker environments (apache#50175)

* fix: resolve 404 log error for non-latest task tries in multi-host worker environments

* refactor: extract TaskInstance and TaskInstanceHistory query logic from `get_log` endpoint function

* test: add unit test for `get_task_instance_or_history_for_try_number` function

* fix: resolve sqlite lock error apache#50763

Co-authored-by: LIU ZHE YOU <68415893+jason810496@users.noreply.github.com>

---------

Co-authored-by: LIU ZHE YOU <68415893+jason810496@users.noreply.github.com>
(cherry picked from commit 0b0ff5d)
potiuk pushed a commit that referenced this pull request May 21, 2025
…rker environments (#50175) (#50833)

* fix: resolve 404 log error for non-latest task tries in multi-host worker environments

* refactor: extract TaskInstance and TaskInstanceHistory query logic from `get_log` endpoint function

* test: add unit test for `get_task_instance_or_history_for_try_number` function

* fix: resolve sqlite lock error #50763



---------


(cherry picked from commit 0b0ff5d)

Co-authored-by: oboki <oboki@kakao.com>
dadonnelly316 pushed a commit to dadonnelly316/airflow that referenced this pull request May 26, 2025
…rker environments (apache#50175)

* fix: resolve 404 log error for non-latest task tries in multi-host worker environments

* refactor: extract TaskInstance and TaskInstanceHistory query logic from `get_log` endpoint function

* test: add unit test for `get_task_instance_or_history_for_try_number` function

* fix: resolve sqlite lock error apache#50763

Co-authored-by: LIU ZHE YOU <68415893+jason810496@users.noreply.github.com>

---------

Co-authored-by: LIU ZHE YOU <68415893+jason810496@users.noreply.github.com>
@dstandish
Copy link
Contributor

This PR has some kind of bug.

When you clear a TI you see two log files with try == 1

image

dstandish added a commit to astronomer/airflow that referenced this pull request May 27, 2025
@dstandish
Copy link
Contributor

Reverting here #51135

dstandish added a commit that referenced this pull request May 27, 2025
kaxil pushed a commit that referenced this pull request Jun 3, 2025
…rker environments (#50175) (#50833)

* fix: resolve 404 log error for non-latest task tries in multi-host worker environments

* refactor: extract TaskInstance and TaskInstanceHistory query logic from `get_log` endpoint function

* test: add unit test for `get_task_instance_or_history_for_try_number` function

* fix: resolve sqlite lock error #50763



---------


(cherry picked from commit 0b0ff5d)

Co-authored-by: oboki <oboki@kakao.com>
sanederchik pushed a commit to sanederchik/airflow that referenced this pull request Jun 7, 2025
…rker environments (apache#50175)

* fix: resolve 404 log error for non-latest task tries in multi-host worker environments

* refactor: extract TaskInstance and TaskInstanceHistory query logic from `get_log` endpoint function

* test: add unit test for `get_task_instance_or_history_for_try_number` function

* fix: resolve sqlite lock error apache#50763

Co-authored-by: LIU ZHE YOU <68415893+jason810496@users.noreply.github.com>

---------

Co-authored-by: LIU ZHE YOU <68415893+jason810496@users.noreply.github.com>
sanederchik pushed a commit to sanederchik/airflow that referenced this pull request Jun 7, 2025
jose-lehmkuhl pushed a commit to jose-lehmkuhl/airflow that referenced this pull request Jul 11, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:logging backport-to-v3-1-test Mark PR with this label to backport to v3-1-test branch type:bug-fix Changelog: Bug Fixes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants