
Airflow Scheduler does not schedule any tasks when >max running tasks queued with non-existent pool #20788

Closed
fjmacagno opened this issue Jan 10, 2022 · 9 comments

@fjmacagno

Apache Airflow version

2.2.3 (latest released)

What happened

Our Airflow instance was not scheduling any tasks, even simple ones using the default pool. The log showed that it was attempting to run 64 tasks, and that every one of them was trying to use a pool that didn't exist. When I created the missing pool, the scheduler started the tasks and began clearing the queue.

What you expected to happen

I expected the scheduler to continue running correctly configured tasks, ignoring the incorrectly configured ones, rather than blocking.

How to reproduce

Create a DAG with 64 concurrent tasks and set a pool that doesn't exist. Create a second DAG using the default pool for a single task. Trigger the first, then the second.

Operating System

Ubuntu

Versions of Apache Airflow Providers

No response

Deployment

Other Docker-based deployment

Deployment details

Using KubernetesExecutor connected to EKS.

Anything else

Unfortunately I don't have access to the logs anymore.

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@fjmacagno added the area:core and kind:bug labels Jan 10, 2022
@SamWheating
Contributor

I think that this is mostly functioning as intended, but I'm wondering if we can improve the behaviour around nonexistent pools 🤔 This is a somewhat common issue, and it can lead to pretty unclear behaviour if a user makes a mistake in the name of a pool.

Maybe we should be failing tasks immediately if they're assigned to a pool which doesn't exist?

I'll have a look into whether this is possible, but would definitely appreciate any other suggestions here.

@SamWheating
Contributor

OK, so looking into this a bit, the scheduler will log a warning if a task is unschedulable due to a non-existent pool:

```python
for pool, task_instances in pool_to_task_instances.items():
    pool_name = pool
    if pool not in pools:
        self.log.warning("Tasks using non-existent pool '%s' will not be scheduled", pool)
        continue
```

This warning is also visible in the TaskInstance Details UI:

And then the task will remain in the queued state indefinitely (or until it times out, I suppose).

It would be really simple to just mark the tasks as failed after logging something like `Tasks using non-existent pool '%s' being marked as Failed`, but this might be a worse user experience, as it leaves no logs or visible warnings about why the task failed (other than the scheduler logs, which are not easily accessible to most users).

With this in mind, does anyone have another idea for how to prevent these tasks from clogging the scheduler, or should we just consider this to be intended behaviour?

@fjmacagno
Author

I mean, I just don't think the whole scheduler should stop scheduling tasks because one DAG is misconfigured. This caused an entire cross-team Airflow installation to stop working because one team made a mistake on one DAG, and gods help us if that had been prod.

It seems like if a task is misconfigured in some way that prevents it from running, it shouldn't be considered to be in the queue. Maybe it could at least be shoved to the back of the queue so that other tasks can try to run?

@SamWheating
Contributor

I just don't think the whole scheduler should stop scheduling tasks because one DAG is misconfigured.

Agreed - although in this case I think that setting dagrun_timeout, max_active_runs, or max_active_tasks on your DAG can reduce the ability of a single DAG to use too many resources and thus limit the blast radius of such an incident.

Maybe it could at least be shoved to the back of the queue so that other tasks can try to run?

Agreed, I'll have a look through the scheduler logic to see how viable this is.

@fjmacagno
Author

That's an interesting point though, because we do have most of those set. We can't use dagrun_timeout because it is a 15-day-long DAG, but max_active_runs is 1 and dag_concurrency is 16, while scheduler parallelism is 64, so it sounds like something is amiss there too.

@ephraimbuddy
Contributor

@SamWheating @fjmacagno let me know what you think about my PR

@SamWheating
Contributor

Your PR looks good!

I think that #19747 also fixes this issue, but I like your approach more as it will prevent this un-runnable DAG from ever making it to the scheduler.

@github-actions

This issue has been automatically marked as stale because it has been open for 365 days without any activity. There have been several Airflow releases since the last activity on this issue. Kindly recheck the report against the latest Airflow version and let us know if the issue is reproducible. The issue will be closed in the next 30 days if no further activity occurs from the issue author.

@github-actions

This issue has been closed because it has not received a response from the issue author.

github-actions bot closed this as not planned Mar 28, 2023