This issue was moved to a discussion.
You can continue the conversation there. Go to discussion →
The scheduler does not appear to be running. Last heartbeat was received X minutes ago. #19192
Comments
It looks like you are actually using SequentialExecutor. That would perfectly explain the behaviour. Are you sure you are using the local executor and that the scheduler is running? Can you run |
In this screenshot the scheduler is running 4 of the same process / task, because As stated above, the issue is airflow will not run other dags and the scheduler is not responding. (Strangely, the scheduler is apparently also quite happy to run 1 task from 1 dag in 4 parallel processes.) I suspect some value in the configuration or not enough database connections. |
|
It does not look like an Airflow problem, to be honest. It certainly looks like your long-running task blocks some resource, which somehow blocks the scheduler (but there is no indication of how). There must be something in your DAGs or tasks that simply causes the Airflow scheduler to lock up. This is almost certainly something specific to your deployment (others do not experience it). But what it is, is hard to say from the information provided. My best guess is that you have some lock on the database and it makes the scheduler wait while running some query. Is it possible to dump the state of the scheduler, and more generally the state of your machine (resources, sockets, DB locks), while it happens? (Dumping the scheduler state should be possible with py-spy, for example.) Also, getting all the scheduler logs and seeing whether it actually "does" something might help. Unfortunately the information we have now is not enough to deduce the reason. Any insight into WHERE the scheduler is locked might help with investigating it. |
BTW. Yeah, checking the limit on connections opened in your DB might be a good idea. Especially if you are using variables in your DAGs at top level, it MIGHT lead to a significant number of open connections, which might eventually cause the scheduler to try to open a new connection and patiently wait until the DB server has a free one. It might simply be that your long-running tasks are written in such a way that they (accidentally) open a lot of those connections and do not close them until the task completes. I think PGBouncer might help with that, but if too many connections are opened by a single long-running task and they are not closed, that might also not help too much. |
I think you are misusing Airflow. Airflow by definition can run multiple tasks even with a single local executor, so you are likely misunderstanding how Airflow operators are run, and are running your work as top-level code rather than as a task. Can you please copy your DAG here? You are not supposed to run long-running operations when the DAG file is parsed. Parsing should be quick, and from what it looks like you execute a long-running process while parsing happens rather than when tasks are executed: https://airflow.apache.org/docs/apache-airflow/stable/best-practices.html#top-level-python-code |
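To illustrate the top-level-code point in the comment above, here is a minimal sketch in plain Python (deliberately without an Airflow import, so it runs anywhere). The names `collect_history` and `build_task_registry` are hypothetical and not part of Airflow; in a real DAG file the callable would be handed to an operator instead of being stored in a dict.

```python
import time


def collect_history(days: int) -> str:
    """Hypothetical stand-in for the long-running job (hours in real life)."""
    time.sleep(0.01 * days)  # pretend each day of data takes a while
    return f"collected {days} days"


# Anti-pattern: calling the job at module level. DAG files are re-parsed
# regularly, so a call here would run (and block parsing) on every parse.
# RESULT = collect_history(365)


def build_task_registry() -> dict:
    """What parse time should do: only describe the work, never run it."""
    # Assumption: in a real DAG file an operator would wrap the callable here.
    return {"collect_history": (collect_history, {"days": 3})}


if __name__ == "__main__":
    registry = build_task_registry()  # cheap: returns immediately
    func, kwargs = registry["collect_history"]
    print(func(**kwargs))  # the work happens only at execution time
```

The point of the sketch is only the separation: module import stays cheap, and the expensive call happens when a task is executed.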
|
I'm not sure how airflow is intended to be used, but sometimes people find use cases for a tool that its designers didn't intend. We run a task that can take a few hours to collect all the historical data and process it. And then we want the task to run once per day. It appears, from my side, that the airflow server UI can't contact the scheduler while the long task is running, and other DAGs can't be run. Perhaps the scheduler wants my code to yield control back to it frequently (once per day of data, for example), but I prefer to let my own code manage the date ranges, because that's where the unit tests are, and all the heavy lifting is in rust anyway. |
What I do find use for in airflow
|
This is what Airflow is designed for. I think you are just using it wrongly (or have misconfigured it). It is supposed to handle that case perfectly (and it works this way for thousands of users). So it's your configuration, setup, or way of using it that is wrong.
No. This is not the case (unless you use the Sequential Executor, which is only supposed to be used for debugging). Airflow is designed to run multiple parallel tasks at a time. You likely have some problem in your Airflow installation/configuration. Questions:
I just ran it in 2.2.3 and I was able to successfully start even 5 parallel runs, with no problems with the scheduler. |
|
I'm sorry, I didn't intend this to turn into an unpaid debugging session, and it looks like a cosmetic problem more than anything. So I'm comfortable closing this thread if you prefer. |
Could it be the scheduler settings?:
|
I still think that when you run the Airflow scheduler via systemd you run it with the "sequential" executor. The problem is that when you run Airflow under systemd, you very likely use default environment variables, not the ones you have in your .bashrc or wherever you keep them. In this case the Airflow scheduler falls back to default settings. So when you run Please check in the logs of your scheduler: it should print the executor used when it starts. |
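The fallback described in the comment above can be sketched as a small stdlib-only helper. It relies on Airflow's `AIRFLOW__{SECTION}__{KEY}` environment-variable convention and on `SequentialExecutor` being the 2.x default; the function name `effective_executor` is hypothetical, and the precedence shown is a simplification of what Airflow actually does (command and secret variants are ignored).

```python
import configparser
import os
from typing import Mapping, Optional

DEFAULT_EXECUTOR = "SequentialExecutor"  # Airflow 2.x default when nothing is set


def effective_executor(env: Mapping[str, str], cfg_text: Optional[str] = None) -> str:
    """Return the executor a process with this environment would pick up.

    Simplified precedence (an assumption of this sketch): the
    AIRFLOW__CORE__EXECUTOR variable, then the [core] executor entry of
    airflow.cfg, then the built-in default.
    """
    if "AIRFLOW__CORE__EXECUTOR" in env:
        return env["AIRFLOW__CORE__EXECUTOR"]
    if cfg_text:
        # interpolation=None so literal % characters in the file are accepted
        parser = configparser.ConfigParser(interpolation=None)
        parser.read_string(cfg_text)
        if parser.has_option("core", "executor"):
            return parser.get("core", "executor")
    return DEFAULT_EXECUTOR


if __name__ == "__main__":
    # What the current process (for example an interactive shell) would see.
    print(effective_executor(os.environ))
```

Running the same check under the service's environment rather than a login shell is what matters here. On Airflow 2.x, `airflow config get-value core executor`, executed with the same environment as the systemd unit, should report the value Airflow itself resolved.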
|
It says LocalExecutor in the logs^^. Plus the env vars are in the system-wide /etc/environment, and also in the systemd service file. I think it may be the scheduler-specific settings in the airflow.cfg. |
I propose you regenerate the default config and only apply changes you really want to make. You seem to have some very old, deprecated options there. |
I have the same problem, and I don't know how to solve it. The errors reported in my airflow-scheduler.err log are: |
|
It's likely a different problem. Make sure you do not use SQLite and the Sequential executor. Recreate the installation/database if needed. There is not enough information here to help you @LIKEL123. If you still have the problem, open a new discussion and describe in detail what your configuration/deployment is, with all the logs you can get. Maybe someone will be able to help you.
Since the assumption seems to be that this is an isolated issue, I just want to report that we are seeing it, too. After successfully migrating to v2 several months ago, we started seeing this message 3 days ago. It's getting worse every day (it started by reporting "heartbeat not detected in 2 minutes" before resolving; now it's up to 5 minutes). |
As @bondgeek we have also started seeing it within the last few days, after having run the same configuration with the same dags for more than a week. Our setup is running on a Kubernetes cluster, and I tried to redeploy everything (airflow-webserver, airflow-scheduler, and the PostgreSQL db) today. We have not seen the problem since, but I expect it to return after some time. |
@khintz, @bondgeek Did you find any solution? We are also running Airflow on a Kubernetes cluster and facing this issue time and time again. For us, it happens when we run a long-running process with KubernetesPodOperator. It's not reproducible every time, but shows up after a few days of running. Not sure if it's useful, but this is the last error we received. |
@bh-ashishsharma We ended up scaling up the task instance for the airflow scheduler (we are running a dockerized version on AWS ECS), and have not seen the error since. It's still not clear what the scheduler is doing that is so resource intensive. |
No, we've never found a solution. One hypothesis was that the contention for resources comes from the long running task writing data to the same hard disk where the postgres database is. |
We are seeing the same issue running v2.2.4 on K8S
We first increased the setting "scheduler_health_check_threshold" from the default 30 to 180 (seconds). That at least prevented Airflow from killing tasks because of the timeout (somehow related?). I suspected too much logging being written, but there are only a few GB of logs, and with log-groomer running (enabled by default in the chart) retention should be fine. I also couldn't find any DB issues; the DB is rather small, only a few MB. For now the issue is resolved by a simple redeploy, reusing the DB. I would like to know how to debug this issue further: can Airflow be profiled? What debug logging should be enabled to get better insight? Paul |
It might be related to the PR #23944 - are you using Python 3.10? If so, switching to Python 3.9 might help, and we might fix it in one of the future versions. |
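As a rough illustration of what the threshold in the comment above controls, the sketch below compares the age of the last heartbeat with a threshold in seconds. The function name `scheduler_looks_alive` is hypothetical and this is not Airflow's own code; only the default of 30 and the raised value of 180 come from the thread.

```python
from datetime import datetime, timedelta, timezone


def scheduler_looks_alive(
    last_heartbeat: datetime,
    now: datetime,
    threshold_seconds: int = 30,  # default quoted in the thread
) -> bool:
    """True if the last heartbeat is more recent than the threshold."""
    return (now - last_heartbeat) < timedelta(seconds=threshold_seconds)


if __name__ == "__main__":
    now = datetime(2022, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
    two_minutes_ago = now - timedelta(seconds=120)
    print(scheduler_looks_alive(two_minutes_ago, now))       # False
    print(scheduler_looks_alive(two_minutes_ago, now, 180))  # True
```

Under this reading, raising the threshold only changes when the scheduler is reported as unhealthy. A loop that takes two minutes per iteration still takes two minutes.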
@bh-ashishsharma :We "fixed it" by allocating more memory to our running containers on Kubernetes. Since then we have not seen the issue. |
Well. Increasing resources when you do not have enough of them is a really good fix. Unfortunately only your deployment can monitor and indicate whether you lack memory or not, so this was indeed a fix to a deployment issue. Good to know: we would likely have a good case to explain to others to pay attention to. @khintz - maybe you would like to make a PR to our docs describing what you observed and the remediation that you applied? It's super easy: just use the "Suggest change on this page" link on the page and it will open a GitHub UI where you will be able to make changes to the docs directly, without any need for a local environment. That would definitely help others who might have a similar issue. |
A lot of the answers in this thread are upgrade your python, check your deployment, update your airflow, only you know your deployment and so on. I'm sure my server has plenty unused resources. The only question would be if the config can make airflow self-limit the amount of resources it uses, and the other possibility is the logic of the scheduler loop has some function call that's blocking. I understand the responses are because development time is valuable. I suspect it's a multithreading/async problem. So, would you consider linking me to the part of the code where the schedule loop is and maybe I can help figure out what's blocking? |
Just hopping in here, we've also been running into this issue after upgrading from 1.10.15 -> 2.2.3. As others in the thread have described, the problem is getting progressively worse and there are no obvious issues in the deployment (CPU utilization for our scheduler is low, etc). It's interesting hearing that, in spite of this, increasing resources seemed to help others. We'll give that a shot and see if it helps. |
@t4n1o - (but also others in this thread) a lot of the issues are because we have difficulties seeing a clear reproduction of the problem, and we have to rely on users like you to spend their time analyzing their settings, configuration and deployment, and to gather enough clues that those who know the code best can make intelligent guesses about what is wrong even if there are no "full and clear reproduction steps". It's not that development time is valuable. Airflow is developed by > 2100 contributors, often people like you. The source code is available and anyone can take a look. While some people know more about some parts of the code, you get the software for free, and you cannot "expect" those people to spend a lot of time trying to figure out what's wrong if they have no clear reproduction steps and not enough clues. Analysing and collecting enough evidence of what you observe is the best you can do to pay back for the free software, and possibly give those looking here enough clues to fix the problem or direct you to a solution. So absolutely, if you feel like looking at the code and analysing it is something you can offer the community as your "pay back", this is fantastic. The scheduler Job is here: https://github.com/apache/airflow/blob/main/airflow/jobs/scheduler_job.py. But we can give you more than that: https://www.youtube.com/watch?v=DYC4-xElccE - this is a video from Airflow Summit 2021 where Ash explains how the scheduler works, i.e. what the design assumptions were. It can guide you in understanding what the Scheduler Job does. Also, before you dive deep, it might well be that your set of DAGs and the way you structure them is the problem, and you can simply follow our guidelines on Fine tuning your scheduler performance. So if you want to help: absolutely, do more analysis, look at our guidelines, and if you feel like it, dive deep into how the scheduler works and look at the code.
All that might be a great way to get more clues and evidence, and even if you won't be able to fix it in a PR, you can give others enough clues that they can find the root cause and implement solutions. |
I am also facing the same problem; the details are below
Searching the scheduler's log, we found that the scheduling loop time keeps increasing |
This is an interesting finding @ashb @uranusjr @ephraimbuddy. This is not a blocker for 2.3.4, but it is interesting to see such high variance in the scheduling loop time. Maybe we could come up with some hypotheses why this is happening. |
I think the max_tis_per_query is quite high, but even with it, it is suspicious to see that 512 tis are processed in 594 seconds. Some of the queries must simply run for a very long time. Is it possible to get some stats from Postgres on what the longest-running queries are, @deepchand? There are a number of guides on the internet that show how to do it, for example this one: https://www.shanelynn.ie/postgresql-find-slow-long-running-and-blocked-queries/ |
I wonder if this is related to the locking issue we just resolved recently. |
Do you remember which one @uranusjr ? |
This one ? #25532 |
I have tried to debug it more and found its |
Yes that one |
@potiuk I have debugged this issue further and found that updating the task state in the scheduler loop significantly increases the total time of the self.executor.heartbeat() function, which is causing this problem: airflow/airflow/executors/celery_executor.py Line 312 in 9ac7428
"The scheduler does not appear to be running. Last heartbeat was received X minutes ago" Please have a look and let me know if any other info is needed |
I do not know that well, but to me it looks like you have some bottlenecks with your Celery queue. You have not specified what kind of queue you use, but I think you should look at your Redis or RabbitMQ and see if there are any problems there. It might also be that your Redis or RabbitMQ is simply badly configured or overloaded and is somehow blocking the state update. Can you please set the log level to debug and see if more information is printed about what's going on in the update_state method? (You will find how to do it in the Airflow configuration docs.) |
@potiuk We are using Redis as the queue. I have cross-verified that Redis is not the bottleneck; messages are consumed from Redis as soon as they land in the queue. We have enough resources available on the Redis side as well. |
Any chance for debug logs ? |
Debug logs of scheduler ? |
Yep:
|
@potiuk I have added some more debug logs to find the total time taken while fetching state of a single result and for total results in a single loop, please find below |
There were too many debug logs, so I tried to keep only the important ones; let me know if more info is needed |
I think your problem is simply an extremely slow connection to your database. 5 seconds to run a single query indicates a HUGE problem with your database. It should take single milliseconds. THIS IS your problem. Not Airflow. You should fix your DB/connectivity and debug why your database is 1000x slower than it should be. |
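One way to check the latency claim in the comment above is to time a trivial query from the machine where the scheduler runs. The sketch below is an assumption-laden illustration: `median_query_ms` is a hypothetical helper, and the in-memory `sqlite3` connection is only a stand-in so the example runs without a server. Against the real metadata database you would pass a DB-API connection from your Postgres driver instead.

```python
import sqlite3
import statistics
import time


def median_query_ms(conn, sql: str = "SELECT 1", runs: int = 20) -> float:
    """Median wall-clock time of a trivial query, in milliseconds."""
    samples = []
    for _ in range(runs):
        cur = conn.cursor()
        start = time.perf_counter()
        cur.execute(sql)
        cur.fetchall()  # include the round trip for the result
        samples.append((time.perf_counter() - start) * 1000.0)
        cur.close()
    return statistics.median(samples)


if __name__ == "__main__":
    # Stand-in connection; replace with a connection to the metadata DB.
    conn = sqlite3.connect(":memory:")
    print(f"median round trip: {median_query_ms(conn):.3f} ms")
    conn.close()
```

If a query this trivial takes seconds rather than milliseconds, the delay is in the network or the database server, not in the query itself.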
Apache Airflow version
2.1.4
Operating System
Linux / Ubuntu Server
Versions of Apache Airflow Providers
apache-airflow-providers-ftp==2.0.1
apache-airflow-providers-http==2.0.1
apache-airflow-providers-imap==2.0.1
apache-airflow-providers-postgres==2.3.0
Deployment
Virtualenv installation
Deployment details
Airflow v2.1.4
Postgres 14
LocalExecutor
Installed with Virtualenv / ansible - https://github.com/idealista/airflow-role
What happened
I run a single BashOperator (for a long-running task: we have to download data for 8+ hours initially from the rate-limited data source API, then download more each day in small increments).
We're only using 3% CPU and 2 GB of memory (out of 64 GB) but the scheduler is unable to run any other simple task at the same time.
Currently only the long task is running, everything else is queued, even though we have more resources:
What you expected to happen
I expect my long running BashOperator task to run, but for airflow to have the resources to run other tasks without getting blocked like this.
How to reproduce
I run a command with BashOperator (I use it because I have Python, C, and Rust programs being scheduled by Airflow).
bash_command='umask 002 && cd /opt/my_code/ && /opt/my_code/venv/bin/python -m path.to.my.python.namespace'
Configuration:
Anything else
This occurs every time consistently, also on 2.1.2
The other tasks have this state:
When the long-running task finishes, the other tasks resume normally. But I expect to be able to do some parallel execution with LocalExecutor.
I haven't tried using pgbouncer.
Are you willing to submit PR?
Code of Conduct