
Airflow Celery Worker logs inaccessible #18239

Closed
2 tasks done
datsabk opened this issue Sep 14, 2021 · 13 comments
Labels
area:core kind:bug This is a clearly a bug pending-response stale Stale PRs per the .github/workflows/stale.yml policy file

Comments

@datsabk

datsabk commented Sep 14, 2021

Apache Airflow version

2.1.3 (latest released)

Operating System

Python 3.6 Apache Airflow Docker

Versions of Apache Airflow Providers

2.1.3

Deployment

Other Docker-based deployment

Deployment details

Kubernetes-based deployment - workers and master run in Kubernetes as pods; logs are accessed via a NodePort Service.

What happened

*** Log file does not exist: /xxx/airflow/home/logs/xxxx/2021-09-14T12:13:32.383510+00:00/1.log
*** Fetching from: http://tsc-aflow-orca:8793/log/xxxx/2021-09-14T12:13:32.383510+00:00/1.log
*** Failed to fetch log file from worker. 503 Server Error: Service Unavailable for url: http://tsc-aflow-orca:8793/log/xxxx/2021-09-14T12:13:32.383510+00:00/1.log
For more information check: https://httpstatuses.com/503

Checked the Worker pod - Logs exist. However, it seems like the Worker web service is unable to access the logs.

What you expected to happen

Worker logs worked fine with the same setup on the older version, v1.10.12.

How to reproduce

Run a sample DAG with the Webserver, Scheduler and one Worker running in Kubernetes - no custom setup.

Anything else

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@datsabk datsabk added area:core kind:bug This is a clearly a bug labels Sep 14, 2021
@boring-cyborg

boring-cyborg bot commented Sep 14, 2021

Thanks for opening your first issue here! Be sure to follow the issue template!

@dimon222
Contributor

dimon222 commented Sep 17, 2021

In airflow.cfg, set this:

hostname_callable = airflow.utils.get_host_ip_address

If it still doesn't work, ensure the worker containers expose port 8793 in the Kubernetes template.

The log-existence check failed because the webserver tries to find the logs locally, but they are stored on the worker. It then falls back to retrieving those logs from the worker containers over REST.
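For context, a hostname_callable is just a zero-argument function importable from the Airflow environment. A minimal sketch of a callable that resolves the host's IP address (the module name and fallback behaviour here are my assumptions for illustration, not Airflow's shipped implementation):

```python
# my_airflow_utils.py -- hypothetical module; would be referenced in airflow.cfg as
#   hostname_callable = my_airflow_utils.get_host_ip_address
import socket


def get_host_ip_address() -> str:
    """Return this host's IP address so other components can reach its log server.

    Falls back to the loopback address if the hostname cannot be resolved
    (e.g. in minimal containers without a matching /etc/hosts entry).
    """
    try:
        return socket.gethostbyname(socket.gethostname())
    except OSError:
        return "127.0.0.1"
```

With something like this in place, the webserver records an IP (rather than a possibly unresolvable pod hostname) when it builds the log-retrieval URL for a task instance.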

@datsabk
Author

datsabk commented Sep 20, 2021

Hello @dimon222 - the workers sit behind a Kubernetes Service, so the logs need to be accessible via the service name. To support this, I used a different hostname callable and gave it the DNS name of the service.

I do not see how that could be a problem, though. You can picture the situation like this:

master.abc.com -> Airflow master
worker.abc.com -> Airflow workers (multiple behind a load balancer, sharing log storage)

I need the logs to be accessible behind worker.abc.com instead of via an IP address.

@pvanliefland
Contributor

I'm using the Helm chart and experiencing something similar. With KubernetesExecutor and logs.persistence.existingClaim, it works fine. As soon as I switch to CeleryExecutor (without touching logs.persistence.existingClaim), I also get "Log file does not exist".
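For reference, the relevant chart values in the two setups described above look roughly like this (a sketch of a values.yaml fragment; the claim name is a placeholder, not from the report):

```yaml
# values.yaml fragment (sketch; "airflow-logs" is a placeholder claim name)
executor: CeleryExecutor      # works with KubernetesExecutor, fails here per the report
logs:
  persistence:
    enabled: true
    existingClaim: airflow-logs   # shared PVC mounted by webserver and workers
```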

@SamWheating
Contributor

worker.abc.com -> Airflow workers (multiple behind a load balancer sharing logs storage)

I don't think this is going to work, since each Airflow worker runs a background Flask application which serves the logs of tasks run on that worker at the specified logging port (in this case 8793):

@flask_app.route('/log/<path:filename>')
def serve_logs_view(filename):
    return send_from_directory(log_directory, filename,
                               mimetype="application/json",
                               as_attachment=False)

By running the workers behind a load balancer, you're removing the webserver's ability to specify which server the logs are stored on. I suspect that if you reload the page enough times, it may eventually work when your request happens to be routed to the correct worker by the load balancer.
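To illustrate why the target worker matters: the webserver assembles the fetch URL from the hostname recorded on the task instance, roughly as below (the function and variable names are illustrative, not Airflow's actual code):

```python
def build_worker_log_url(ti_hostname: str, log_relative_path: str,
                         worker_log_server_port: int = 8793) -> str:
    """Sketch of how a per-worker log URL is assembled.

    `ti_hostname` is whatever hostname the task instance recorded when it
    ran. If a load-balancer DNS name is recorded instead of a specific
    worker, the request may land on a worker that never ran the task and
    therefore has no such log file on local disk.
    """
    return (f"http://{ti_hostname}:{worker_log_server_port}"
            f"/log/{log_relative_path}")


# Produces a URL of the same shape as the one in the error message above.
url = build_worker_log_url("tsc-aflow-orca",
                           "xxxx/2021-09-14T12:13:32.383510+00:00/1.log")
```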

@dimon222
Contributor

dimon222 commented Oct 16, 2021

By running the workers behind a load balancer, you're removing the webserver's ability to specify which server the logs are stored on.

I would assume "shared logs" implies that the storage itself is shared (a mounted volume, a PVC in Kubernetes, etc.). Unless something on the worker itself restricts it to serving only the logs that were allocated to that specific worker?

@changxiaoju

In airflow.cfg, set this:

hostname_callable = airflow.utils.get_host_ip_address

If it still doesn't work, ensure the worker containers expose port 8793 in the Kubernetes template.

The log-existence check failed because the webserver tries to find the logs locally, but they are stored on the worker. It then falls back to retrieving those logs from the worker containers over REST.

Hi dimon222, I set hostname_callable = airflow.utils.get_host_ip_address and it worked a few times, then stopped again. What can I do next? And by the way, how do I "ensure worker containers are exposing port 8793 in kubernetes template"? I really appreciate your help.

@dimon222
Contributor

@changxiaoju
Set containerPort on the Airflow worker's container, in the ports property:
https://kubernetes.io/docs/concepts/services-networking/connect-applications-service/
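Concretely, exposing the log-server port on the worker container looks roughly like this in a Deployment/pod template (a sketch; the names and image tag are placeholders):

```yaml
# Fragment of a worker pod spec (sketch; names and image tag are placeholders)
containers:
  - name: airflow-worker
    image: apache/airflow:2.1.3
    ports:
      - name: worker-logs
        containerPort: 8793
        protocol: TCP
```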

@changxiaoju

@changxiaoju containerPort on the Airflow worker's container in the ports property https://kubernetes.io/docs/concepts/services-networking/connect-applications-service/

Thank you, but I am not using CeleryExecutor; I use LocalExecutor instead. What may cause the error then?

@dimon222
Contributor

@changxiaoju containerPort on the Airflow worker's container in the ports property https://kubernetes.io/docs/concepts/services-networking/connect-applications-service/

Thank you, but I am not using CeleryExecutor; I use LocalExecutor instead. What may cause the error then?

If the error still includes the port in the same way as in your first message, you could try exposing that port on your scheduler as well. If not, I suspect you have some unrelated exception and should probably open a separate ticket for it.

@eladkal
Contributor

eladkal commented Sep 3, 2022

Please check whether the issue still happens on the latest Airflow version (there has been some work related to this).

@github-actions

github-actions bot commented Oct 4, 2022

This issue has been automatically marked as stale because it has been open for 30 days with no response from the author. It will be closed in next 7 days if no further activity occurs from the issue author.

@github-actions github-actions bot added the stale Stale PRs per the .github/workflows/stale.yml policy file label Oct 4, 2022
@github-actions

This issue has been closed because it has not received a response from the issue author.
