
task failed with a null hostname #13692

Closed
doowhtron opened this issue Jan 15, 2021 · 21 comments
Labels
affected_version:2.0 Issues Reported for 2.0 area:Scheduler including HA (high availability) scheduler kind:bug This is a clearly a bug pending-response stale Stale PRs per the .github/workflows/stale.yml policy file

Comments

@doowhtron

Apache Airflow version: 2.0.0

Kubernetes version (if you are using kubernetes) (use kubectl version):

Environment:

  • Cloud provider or hardware configuration: tencent cloud
  • OS (e.g. from /etc/os-release): centos7
  • Kernel (e.g. uname -a): 3.10
  • Install tools:
  • Others: Server version: 8.0.22 MySQL Community Server - GPL

What happened:

The task is in the failed state, but I found its log file on one of the worker nodes and the task actually succeeded. In the task instance details tab, the hostname field is null.

And the logs are as follows:

*** Log file does not exist: /data/app/epic-airflow/logs/tiny_demo80677608236/task_0/2021-01-15T09:04:00+00:00/1.log
*** Fetching from: http://:8793/log/tiny_demo80677608236/task_0/2021-01-15T09:04:00+00:00/1.log
*** Failed to fetch log file from worker. Invalid URL 'http://:8793/log/tiny_demo80677608236/task_0/2021-01-15T09:04:00+00:00/1.log': No host supplied
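The "No host supplied" error above follows directly from an empty hostname being interpolated into the log URL. A minimal stdlib sketch of the failing URL construction (illustrative only; path shortened):

```python
from urllib.parse import urlsplit

# Illustrative sketch: the webserver builds the worker log URL from the
# hostname stored on the task instance. If that value is empty, the URL
# has no host component and the fetch fails with "No host supplied".
hostname = ""  # value read from the task_instance row in this failure mode
log_url = f"http://{hostname}:8793/log/tiny_demo80677608236/task_0/1.log"

host = urlsplit(log_url).hostname
print(log_url, "-> host:", repr(host))  # host is empty, so no request can be made
```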

What you expected to happen:

The task should be marked as successful.

How to reproduce it:

I don't know how to reproduce it reliably; it only happens occasionally.

Anything else we need to know:

No

@doowhtron doowhtron added the kind:bug This is a clearly a bug label Jan 15, 2021
@vikramkoka vikramkoka added the affected_version:2.0 Issues Reported for 2.0 label Jan 15, 2021
@potiuk potiuk added this to the Airflow 2.0.1 milestone Jan 17, 2021
@kaxil kaxil removed this from the Airflow 2.0.1 milestone Jan 19, 2021
@kaxil
Member

kaxil commented Jan 19, 2021

Need more information about the setup of your Airflow installation and, if possible, steps to reproduce.

@doowhtron
Author

doowhtron commented Mar 16, 2021

> Need more information about the setup of your Airflow installation and if possible steps to reproduce

It still happens after upgrading Airflow to 2.0.1.

Could it be caused by a single task being scheduled more than once? I have four schedulers running on four hosts, and two of them scheduled the same DAG at nearly the same time. Perhaps that triggered a bug?

(WeCom screenshot attached)

@doowhtron
Author

Is it possible to have a mechanism that prevents a task from being scheduled twice?
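For context: Airflow 2's HA scheduler design claims task instances under a row-level database lock (SELECT ... FOR UPDATE SKIP LOCKED) so that only one scheduler can queue a given task. A toy in-process analogue of that claim step (names are illustrative, not Airflow's actual code):

```python
import threading

# Toy analogue of the scheduler's "claim before queue" step: four
# concurrent schedulers race to queue the same task instance, but the
# claim set (guarded by a lock, standing in for the database row lock)
# ensures only one of them succeeds.
claimed = set()
claim_lock = threading.Lock()
queued = []

def try_queue(scheduler_id, task_key):
    with claim_lock:  # analogue of SELECT ... FOR UPDATE SKIP LOCKED
        if task_key in claimed:
            return False  # another scheduler already claimed this task
        claimed.add(task_key)
    queued.append((scheduler_id, task_key))
    return True

threads = [
    threading.Thread(target=try_queue, args=(i, "dag.task.2021-01-15"))
    for i in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(queued))  # 1 -- exactly one scheduler queued the task
```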

@emukans

emukans commented Mar 23, 2021

I noticed the same issue on Airflow 2.0.1 when I used cron notation for specifying schedule_interval. Here is my DAG:

from datetime import datetime

from airflow import DAG

dag = DAG(
    "id",
    default_args=default_args,  # defined elsewhere
    description="description",
    schedule_interval="0 14 * * *",
    start_date=datetime(2021, 3, 23),
)

I wanted to run at a specific time of day. I worked around the issue by switching to schedule_interval="@daily" and waiting for the target time with a DateTimeSensor or TimeSensor.

@vikramkoka vikramkoka added the area:Scheduler including HA (high availability) scheduler label Mar 25, 2021
@eladkal eladkal modified the milestone: Airflow 2.1.1 May 14, 2021
@hafid-d

hafid-d commented May 20, 2021

Getting the same issue as well on Airflow 2.0.2. Has anyone solved it?

@pparthesh

Getting the same issue.

@victorouttes

Same issue here. Airflow 1.10.15 on Kubernetes.

@vazmeee

vazmeee commented Jun 13, 2021

Facing a similar issue with a Celery worker as well, on Airflow 2.0.0.

@uranusjr
Member

Everyone facing the issue, please provide some more information about the setup of your Airflow installation, and if possible, steps to reproduce.

@huozhanfeng
Contributor

huozhanfeng commented Jun 30, 2021

> Everyone facing the issue, please provide some more information about the setup of your Airflow installation, and if possible, steps to reproduce.

To reproduce the bug, just run a DAG and let one of its tasks fail. More details in #16729. It's a scheduler bug; I have fixed it in my local environment and will open a PR later.
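The failure mode described in the linked commits (the scheduler blanking the hostname while handling a failed task) can be sketched as a toy reproduction (illustrative names and structures only, not Airflow's actual code):

```python
# Toy reproduction of the described failure mode: the failure-callback
# path persists a freshly constructed task instance whose hostname
# defaults to "", clobbering the value the worker had recorded.

task_instance = {"hostname": "worker-1", "try_number": 28}

def handle_failure_callback(ti):
    # Bug analogue: a new record is built without copying the hostname,
    # then written back over the existing row.
    refreshed = {"hostname": "", "try_number": ti["try_number"]}
    ti.clear()
    ti.update(refreshed)

handle_failure_callback(task_instance)

# The webserver later builds the log-fetch URL from the stored hostname:
url = (
    f"http://{task_instance['hostname']}:8793"
    f"/log/dag_id/task_id/{task_instance['try_number']}.log"
)
print(url)  # http://:8793/log/dag_id/task_id/28.log -- no host supplied
```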

kaxil pushed a commit that referenced this issue Jul 22, 2021
The log can't be shown normally when a task fails. Users only get useless logs like the following. #13692

<pre>
*** Log file does not exist: /home/airflow/airflow/logs/dag_id/task_id/2021-06-28T00:00:00+08:00/28.log
*** Fetching from: http://:8793/log/dag_id/task_id/2021-06-28T00:00:00+08:00/28.log
*** Failed to fetch log file from worker. Unsupported URL protocol 
</pre>

The root cause is that the scheduler overwrites the hostname in the task_instance table with a blank string during `_execute_task_callbacks` when a task fails. The webserver then cannot determine the right host for the task because the hostname in the task_instance table has been lost in the process.

Co-authored-by: huozhanfeng <huozhanfeng@vipkid.cn>
jhtimmins pushed a commit that referenced this issue Aug 12, 2021
The log can't be shown normally when a task fails. Users only get useless logs like the following. #13692

<pre>
*** Log file does not exist: /home/airflow/airflow/logs/dag_id/task_id/2021-06-28T00:00:00+08:00/28.log
*** Fetching from: http://:8793/log/dag_id/task_id/2021-06-28T00:00:00+08:00/28.log
*** Failed to fetch log file from worker. Unsupported URL protocol
</pre>

The root cause is that the scheduler overwrites the hostname in the task_instance table with a blank string during `_execute_task_callbacks` when a task fails. The webserver then cannot determine the right host for the task because the hostname in the task_instance table has been lost in the process.

Co-authored-by: huozhanfeng <huozhanfeng@vipkid.cn>
(cherry picked from commit 34478c2)
@eladkal
Contributor

eladkal commented Jan 16, 2022

Is anyone experiencing this issue on Airflow 2.2.3? If so, please share steps to reproduce.

@github-actions

This issue has been automatically marked as stale because it has been open for 30 days with no response from the author. It will be closed in next 7 days if no further activity occurs from the issue author.

@github-actions github-actions bot added the stale Stale PRs per the .github/workflows/stale.yml policy file label Feb 16, 2022
@changxiaoju

Happens on Airflow 2.2.4rc1.

@potiuk
Member

potiuk commented Feb 20, 2022

> Happens on Airflow 2.2.4rc1

It is almost certainly an environmental/deployment issue, but I would love to get to the bottom of it, so I will need your help @changxiaoju.

Does it happen all the time or only sporadically? How often? Any details on the conditions under which it happens?

Can you check and possibly re-configure your hostname callable, @changxiaoju? https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#hostname-callable

These issues can happen if your host cannot retrieve its hostname quickly enough, for example when your DNS experiences slow responses, delays, latency, etc. Could you please double-check that your hostname_callable responds quickly (for example by calling it repeatedly and making sure you get the right response), and report your findings to us?

Also, stating what kind of deployment and configuration you have would be very helpful, specifically how your DNS works, whether it is a cloud deployment, etc.

You might want to use another method if your DNS is slow to respond and has lags.
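The default hostname_callable resolves the machine's name via DNS, which is exactly where slow or flaky resolvers bite. A rough stdlib approximation of the two strategies discussed here (these are sketches; the real implementations live in airflow.utils.net):

```python
import socket

# Rough stand-ins for the two hostname strategies discussed in this
# thread; the real functions live in airflow.utils.net. The FQDN variant
# depends on (reverse) DNS and can stall or misbehave when DNS is slow.
def hostname_via_fqdn():
    return socket.getfqdn()

def hostname_via_ip():
    # Spirit of airflow.utils.net.get_host_ip_address: resolve our own
    # hostname to an IP address instead of relying on a DNS name.
    return socket.gethostbyname(socket.gethostname())
```

Switching strategies is a one-line airflow.cfg change under [core], as confirmed later in this thread: `hostname_callable = airflow.utils.net.get_host_ip_address`.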

@changxiaoju

Thank you @potiuk, and #18239 @dimon222.
It is now OK after setting hostname_callable = airflow.utils.net.get_host_ip_address.

I really want to offer you more information, but I am not very familiar with what you described.
It happened all the time; I use it on a remote server, on computation clusters, and I only changed the PostgreSQL-related settings in airflow.cfg.

@github-actions github-actions bot removed the stale Stale PRs per the .github/workflows/stale.yml policy file label Feb 21, 2022
@changxiaoju

Well, it worked several times and then it happened again. It shut down for the same reason; it is really hard to understand.
(screenshot attached)

@potiuk
Member

potiuk commented Feb 26, 2022

> I really want to offer you more information, but i am not very familiar with what you said. It happened all the time, i use it on remote server, the calculation clusters. And i only changed the postgresql related settings in airflow.cfg.

I think you need to look at the environmental side; we cannot help solve these issues if we see no logs pointing to them. If you manage software on k8s, you should really be able to look at the k8s logs, not only the application logs. I think the reason is there, but it's a bit beyond the scope of Airflow. You just need to review your environment's logs to look for anomalies.

@github-actions

This issue has been automatically marked as stale because it has been open for 30 days with no response from the author. It will be closed in next 7 days if no further activity occurs from the issue author.

@github-actions github-actions bot added the stale Stale PRs per the .github/workflows/stale.yml policy file label Mar 29, 2022
@github-actions

github-actions bot commented Apr 7, 2022

This issue has been closed because it has not received response from the issue author.

@mfridrikhson-tp

mfridrikhson-tp commented May 26, 2022

We had the same issue with retrieving task logs, but from what I've been able to find out, it does not look related to any connection issues with the actual log provider.

Our specific case was the following: we got a task failure, and the task instance was marked failed in the DAG view. When you went to the task view, you could see two log tabs: the first with the Failed to fetch log file message, and the second with a successful run and all the logs.
I went to check the log folder, and what I found is that there were only logs for the second tab (task try):
(screenshot attached)
(Ignore the 3rd log file; it's from a later rerun after the incident described here.)

The actual reason the UI renders such an error (at least in our case) is that it renders max_try_number tabs and queries the contents of each one by its index. It happened that there was no file for tab #1, and thus it showed this error message.

So my next guess was that, for some reason, the task instance's try_number got increased one extra time. That's why I find @doowhtron's comment important here: maybe some kind of race condition causes the task to execute twice.

Some notes about our case and environment:

  • The issue happens intermittently (usually everything works fine)
  • It is not tied to a specific task or operator (both sensors and regular tasks had this issue)
  • It does not happen specifically when the cluster is under high load (or the opposite), so I don't think it is performance-related
  • We use Airflow v2.2.4
  • We store logs on S3
  • We run a single scheduler

@grihabor
Contributor

If you're struggling with this issue locally with Docker Compose, check the health of your worker. My deployment was missing the Redis dependency, which caused the worker to throw an exception. But Redis was not required to run the webserver, so I could still see the UI, and it showed exactly this error.
