Airflow scheduler does not schedule any tasks when >max running tasks are queued with a non-existent pool #20788
I think that this is mostly functioning as intended, but I'm wondering if we can improve the behaviour around non-existent pools 🤔 I think that this is a somewhat common issue, and it can lead to pretty unclear behaviour if a user makes a mistake in the name of a pool. Maybe we should be failing tasks immediately if they're assigned to a pool which doesn't exist? I'll have a look into whether this is possible, but would definitely appreciate any other suggestions here.
OK, so looking into this a bit: the scheduler will log a warning if a task is unschedulable due to a non-existent pool (airflow/jobs/scheduler_job.py, lines 335 to 339 at 905baf9).
This warning is also visible in the Task Instance Details UI, but the task will then just remain in that state indefinitely. It would be really simple to mark these tasks as failed right after logging that warning. With this in mind, does anyone have another idea for how to prevent these tasks from clogging the scheduler, or should we just consider this to be intended behaviour?
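For illustration, a minimal sketch of that fail-fast idea. This is not the actual scheduler code; the function, variable names, and log message are all approximations of the internals in scheduler_job.py:

```python
import logging

from airflow.utils.state import State

log = logging.getLogger(__name__)


def fail_tasks_with_missing_pools(task_instances, existing_pool_names, session):
    """Mark task instances failed when their pool doesn't exist, instead of
    re-skipping them on every scheduler loop. Names here are illustrative,
    not real scheduler internals."""
    for ti in task_instances:
        if ti.pool not in existing_pool_names:
            log.warning(
                "Task %s uses non-existent pool '%s'; failing it so it no "
                "longer occupies a slot in the scheduling window",
                ti.task_id,
                ti.pool,
            )
            ti.set_state(State.FAILED, session=session)
```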
I mean, I just don't think the whole scheduler should stop scheduling tasks because one DAG is misconfigured. This caused an entire cross-team Airflow installation to stop working because one team made a mistake on one DAG, and gods help us if that had been prod. It seems like if a task is misconfigured in some way that prevents it from running, it shouldn't be considered to be in the queue. Maybe it could at least be shoved to the back of the queue so that other tasks can try to run?
Agreed - although in this case I think that setting limits like dagrun_timeout, max_active_runs, and dag_concurrency would have contained the impact to the one misconfigured DAG.
Agreed, I'll have a look through the scheduler logic to see how viable this is.
That's an interesting point though, because we do have most of those set. We can't use dagrun_timeout because it is a 15-day-long DAG, but max_active_runs is 1 and dag_concurrency is 16, while scheduler parallelism is 64, so it sounds like something is amiss there too.
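For reference, a sketch of where each of the limits mentioned above lives. The DAG id, dates, and schedule are placeholders, and max_active_tasks is the Airflow 2.2 name for what older configs call dag_concurrency:

```python
from datetime import datetime

from airflow import DAG

with DAG(
    dag_id="fifteen_day_dag",          # placeholder
    start_date=datetime(2022, 1, 1),   # placeholder
    schedule_interval=None,
    max_active_runs=1,     # at most one active run of this DAG
    max_active_tasks=16,   # per-DAG task concurrency ("dag_concurrency")
    # dagrun_timeout is deliberately unset: the DAG legitimately runs ~15 days
) as dag:
    pass

# The 64-task ceiling is installation-wide and lives in airflow.cfg,
# not on the DAG:
# [core]
# parallelism = 64
```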
@SamWheating @fjmacagno let me know what you think about my PR
Your PR looks good! I think that #19747 also fixes this issue, but I like your approach more, as it will prevent this unrunnable DAG from ever making it to the scheduler.
This issue has been automatically marked as stale because it has been open for 365 days without any activity. There have been several Airflow releases since the last activity on this issue. Kindly recheck the report against the latest Airflow version and let us know if the issue is still reproducible. The issue will be closed in the next 30 days if no further activity occurs from the issue author.
This issue has been closed because it has not received a response from the issue author.
Apache Airflow version
2.2.3 (latest released)
What happened
Our Airflow instance was not scheduling any tasks, even simple ones using the default pool. The log showed that it was attempting to run 64 tasks, every one of which was trying to use a pool that didn't exist. When I created the missing pool, the scheduler started the tasks and began clearing the queue.
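For anyone hitting the same symptom, a hedged sketch of creating the missing pool programmatically; the same fix can be applied from the UI or with the `airflow pools set` CLI command. The pool name and slot count below are placeholders:

```python
from airflow.models import Pool
from airflow.utils.session import create_session

with create_session() as session:
    # "missing_pool" stands in for whatever pool name the DAG referenced.
    if session.query(Pool).filter(Pool.pool == "missing_pool").one_or_none() is None:
        session.add(
            Pool(pool="missing_pool", slots=16,
                 description="created to unblock the scheduler")
        )
    # create_session commits on exiting the context manager
```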
What you expected to happen
The scheduler should continue running correctly configured tasks, ignoring the misconfigured ones, rather than blocking entirely.
How to reproduce
Create a DAG with 64 concurrent tasks assigned to a pool that doesn't exist. Create a second DAG with a single task using the default pool. Trigger the first, then the second (a sketch of both DAGs follows).
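A minimal sketch of the two DAGs described above, assuming BashOperator and placeholder names:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# dag_a: 64 tasks all bound to a pool that was never created.
with DAG("dag_a", start_date=datetime(2022, 1, 1), schedule_interval=None) as dag_a:
    for i in range(64):
        BashOperator(
            task_id=f"stuck_{i}",
            bash_command="sleep 60",
            pool="pool_that_does_not_exist",  # the misconfiguration
        )

# dag_b: one well-configured task on the default pool.
with DAG("dag_b", start_date=datetime(2022, 1, 1), schedule_interval=None) as dag_b:
    BashOperator(task_id="simple", bash_command="echo hello")

# Trigger dag_a, then dag_b: dag_b's task never starts because the 64
# unschedulable tasks fill the scheduler's examination window.
```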
Operating System
Ubuntu
Versions of Apache Airflow Providers
No response
Deployment
Other Docker-based deployment
Deployment details
Using KubernetesExecutor connected to EKS.
Anything else
Unfortunately I don't have access to the logs anymore.
Are you willing to submit a PR?
Code of Conduct