Try number inconsistency between webserver and the actual log generated #42549

Closed · 1 of 2 tasks

dheerajturaga opened this issue Sep 27, 2024 · 17 comments

Labels: area:logging, area:UI, kind:bug, needs-triage, pending-response, provider:celery

Comments

@dheerajturaga

Apache Airflow version

Other Airflow 2 version (please specify below)

If "Other Airflow 2 version" selected, which one?

2.10.2

What happened?

Some tasks in Airflow 2.10.2 are being launched with try number 0, while the webserver looks for logs starting at try number 1. Because of this, we have cases where tasks run and pass, yet the webserver throws a "cannot read served logs" error.

Executor=CeleryExecutor

[Screenshot: try number issue]

What you think should happen instead?

Airflow should start the attempt with try number 1 instead.

How to reproduce

This is happening intermittently; not all tasks start with try number 0. I have checked to make sure both the webserver and the Celery worker are on the same version, 2.10.2.

Operating System

SUSE Linux Enterprise Server 12 SP5

Versions of Apache Airflow Providers

apache-airflow-providers-celery==3.8.1
celery==5.4.0

Deployment

Docker-Compose

Deployment details

No response

Anything else?

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

  • I agree to follow this project's Code of Conduct

dheerajturaga added the area:core, kind:bug, and needs-triage labels on Sep 27, 2024

boring-cyborg bot commented Sep 27, 2024

Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval.

dosubot added the area:webserver and provider:celery labels on Sep 27, 2024
@jscheffl
Contributor

I am not sure whether this is really a bug; it might rather be a semantic gap, or something that could be (visually) improved by simply hiding the logs when none exist.

In Airflow 2.10 the internal semantics were changed to increment the try number of a task only at the time the task is actually scheduled/started. As long as it only exists as an instance (not scheduled), try=0 is the sign that it has never been tried. In this case the visual improvement might be to hide the log panel, since no logs exist.

Would you like to submit a small improvement?
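
As a rough illustration of that semantic (a sketch, not code from jscheffl or Airflow's UI; it only relies on the public TaskInstance model and session helper): under Airflow >= 2.10, a task instance that has never started keeps try_number == 0, so a log view could simply skip rendering for those instances.

from airflow.models import TaskInstance
from airflow.utils.session import provide_session


@provide_session
def never_attempted(dag_id: str, run_id: str, session=None):
    """Return task instances in a run that have never started (try_number == 0)."""
    return (
        session.query(TaskInstance)
        .filter(
            TaskInstance.dag_id == dag_id,
            TaskInstance.run_id == run_id,
            TaskInstance.try_number == 0,
        )
        .all()
    )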

jscheffl added the good first issue and area:UI labels and removed the area:webserver, area:core, and provider:celery labels on Sep 30, 2024
@dheerajturaga
Author

@jscheffl, that's not really the case. Here the task has started and completed, and the log is captured under attempt-0.log. However, the webserver is trying to look for attempt-1.log.
It appears that there are cases where the try number isn't incremented on the log side even when the task has started.

du -sh attempt-0.log
64K     attempt-0.log

[2024-09-25T14:24:04.772+0000] {taskinstance.py:1402} DEBUG - previous_execution_date was called
[2024-09-25T14:24:04.822+0000] {taskinstance.py:1402} DEBUG - previous_execution_date was called
[2024-09-25T14:24:04.838+0000] {local_task_job_runner.py:123} INFO - ::group::Pre task execution logs
[2024-09-25T14:24:04.854+0000] {taskinstance.py:2612} INFO - Dependencies all met for dep_context=non-requeueable deps ti=<TaskInstance: [queued]>
[2024-09-25T14:24:04.867+0000] {taskinstance.py:2612} INFO - Dependencies all met for dep_context=requeueable deps ti=<TaskInstance: scheduled__2024-09-25T05:00:00+00:00 [queued]>
[2024-09-25T14:24:04.868+0000] {taskinstance.py:2865} INFO - Starting attempt 0 of 1

dheerajturaga changed the title from "Try number inconsistency between webserver" to "Try number inconsistency between webserver and the actual log generated" on Sep 30, 2024
@jscheffl
Contributor

Thanks for the response. You say this happens intermittently; do you have any means to reproduce it?

@dheerajturaga
Author

Let me see if I can isolate this.

@jscheffl
Contributor

jscheffl commented Oct 1, 2024

Might this be a side effect of #39336, @dstandish?

@dstandish
Contributor

How was this task launched / created / scheduled?

@dheerajturaga
Author

@dstandish, this is essentially a BashOperator. The task was triggered on schedule and is part of a task_group. I suspect what's happening here is that the log_filename_template is using attempt={{ try_number }}.log as its default, and {try_number} and {ti.try_number} are possibly diverging. Please review PR #42633.
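
For context, a small illustration (values here are made up, not from this environment) of why the two can differ: {{ try_number }} comes from the render context that is passed in, while {{ ti.try_number }} is read off the task instance object, so the rendered log path depends on which one the template uses.

from types import SimpleNamespace

import jinja2

# Illustrative only: {{ try_number }} and {{ ti.try_number }} are separate values
# in the render context, so they can diverge if they are populated differently.
template = (
    "dag_id={{ ti.dag_id }}/run_id={{ ti.run_id }}/task_id={{ ti.task_id }}"
    "/attempt={{ try_number }}.log"
)
ti = SimpleNamespace(
    dag_id="debug_me", run_id="manual__2024-09-25", task_id="try_number_check", try_number=2
)
print(jinja2.Template(template).render(ti=ti, try_number=1))
# dag_id=debug_me/run_id=manual__2024-09-25/task_id=try_number_check/attempt=1.log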

@dheerajturaga
Author

@dstandish, @jscheffl looks like my PR is exposing a bug with #39336

Take a look at https://github.com/apache/airflow/actions/runs/11133289143/job/30939468337?pr=42633

@dstandish
Contributor

dstandish commented Oct 1, 2024

> @dstandish, @jscheffl looks like my PR is exposing a bug with #39336
>
> Take a look at https://github.com/apache/airflow/actions/runs/11133289143/job/30939468337?pr=42633

It's certainly possible, but I have not seen this. So can you provide some repro steps? Like, give us the simplest DAG possible which also produces this behavior.

@dheerajturaga
Author

dheerajturaga commented Oct 2, 2024

@dstandish, I'm trying hard to reproduce the conditions under which this happens. It happens intermittently. Take a look at this screenshot: in this scenario, {try_number} and {ti.try_number} are different.

I added both of these templates to the log_filename_template:

log_filename_template = dag-{{ ti.dag_id }}/run-{{ ts_nodash }}/task-{{ ti.task_id }}/{%% if ti.map_index >= 0 %%}mapid-{{ ti.map_index }}/{%% endif %%}attempt-{{ try_number }}-{{ ti.try_number }}.log

[Screenshot]

Also, there are many cases where the first task of the DAG starts with try_number 2.

@dheerajturaga
Author

@dstandish, @jscheffl here's how you reproduce this. Just run the simple DAG below. On completion, clear the first task and let it rerun. You will notice {try_number} and {ti.try_number} diverge: {try_number} accurately changes between 1 and 2, but {ti.try_number} is stuck at 2.

Make sure you set this cfg option:
log_filename_template = dag-{{ ti.dag_id }}/run-{{ ts_nodash }}/task-{{ ti.task_id }}/{%% if ti.map_index >= 0 %%}mapid-{{ ti.map_index }}/{%% endif %%}attempt-{{ try_number }}-{{ ti.try_number }}.log

from __future__ import annotations

import pendulum

from airflow.decorators import dag
from airflow.operators.bash import BashOperator

@dag(
    dag_id="debug_me",
    start_date=pendulum.today('America/Chicago'),
    catchup=False,
    tags=["debug"]
)
def single_hello():
    hw = BashOperator(
            task_id="try_number_check",
            bash_command="""
            # echo try_number:
            echo ti.try_number: {{ti.try_number}}
            """
        )
    hw1 = BashOperator(
            task_id="try_number_check_1",
            bash_command="""
            # echo try_number:
            echo ti.try_number: {{ti.try_number}}
            """
        )

    hw >> hw1

single_hello()

@dstandish
Contributor

dstandish commented Oct 2, 2024

I tried reproducing this on main by clearing the task, but could not.

[Video: clearing-try-number.mov]

Here's my code

import pendulum
from datetime import timedelta
from random import random
from time import sleep

from airflow import DAG
from airflow.decorators import task
from airflow.models.baseoperator import chain_linear

with DAG(dag_id="dag1", schedule="@daily", catchup=False, start_date=pendulum.now()) as dag1:

    @task(retries=2, retry_delay=timedelta(seconds=10))
    def the_task(try_number=None, ti=None):
        # sleep for a random 0-10 seconds so attempts take varying time
        sleep(random() * 10)

    tasks = []
    for num in range(3):
        tasks.append(the_task.override(task_id=f"task_{num + 1}")())

    chain_linear(*tasks)

Can you try it and see if you get a different result?

@dheerajturaga
Author

@dstandish, you are missing changing attempt-{{ try_number }}.log to attempt-{{ ti.try_number }}.log in your airflow.cfg.

Once you do that, try clicking between the logs of try 1 and try 2.

Even better would be to set the following in your Airflow config:

log_filename_template = dag-{{ ti.dag_id }}/run-{{ ts_nodash }}/task-{{ ti.task_id }}/{%% if ti.map_index >= 0 %%}mapid-{{ ti.map_index }}/{%% endif %%}attempt-{{ try_number }}-{{ ti.try_number }}.log

@dheerajturaga
Author

@dstandish, the issue is that it happens at random. I have a cluster of Celery workers which keep listening for jobs from the scheduler. There may be a case where the try number isn't incremented even when the job is running on the worker. I can't isolate this with airflow standalone, and it seems to be affecting all tasks (there doesn't seem to be a pattern here).

[Screenshot]

Clearing this task increments the try_number to 2

[Screenshot]

A workaround here would be to find all attempt-0.log files and softlink them to attempt-1.log to fix the GUI. I have a DAG doing this every 20 minutes, and there are many tasks scattered all over that are randomly impacted by this.
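
A rough sketch of that workaround (the base log folder and filenames are assumptions; adjust to your own log_filename_template):

from pathlib import Path

# Assumed base log folder; adjust to your deployment.
BASE_LOG_FOLDER = Path("/opt/airflow/logs")

# For every attempt-0.log that has no sibling attempt-1.log, create a relative
# symlink so the webserver finds a log for try number 1.
for zero_log in BASE_LOG_FOLDER.rglob("attempt-0.log"):
    one_log = zero_log.with_name("attempt-1.log")
    if not one_log.exists():
        one_log.symlink_to(zero_log.name)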

@dheerajturaga
Author

@dstandish, I figured out what's happening. I had a few workers which were still on Airflow 2.7.2 and never got recycled. The issue went away when I upgraded them to 2.10.2.

Thanks for your help on triaging this!
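
For anyone hitting the same symptom, a trivial guard (hypothetical, not part of this deployment) that a worker image could run at startup to catch this kind of version drift:

import airflow

EXPECTED_VERSION = "2.10.2"  # assumed target version for the whole cluster

# Refuse to start a stale worker so it never picks up tasks with old semantics.
if airflow.__version__ != EXPECTED_VERSION:
    raise SystemExit(
        f"Worker has Airflow {airflow.__version__}, expected {EXPECTED_VERSION}"
    )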

@dstandish
Contributor

> @dstandish, I figured out what's happening. I had a few workers which were still on Airflow 2.7.2 and never got recycled. The issue went away when I upgraded them to 2.10.2.
>
> Thanks for your help on triaging this!

Awesome, great result
