
Airflow Scheduler does not schedule any tasks when >max running tasks queued with non-existent pool #20788

Closed
fjmacagno opened this issue Jan 10, 2022 · 9 comments

@fjmacagno

Apache Airflow version

2.2.3 (latest released)

What happened

Our Airflow instance was not scheduling any tasks, even simple ones using the default pool. The log showed that it was attempting to run 64 tasks, and that every one of them was trying to use a pool that didn't exist. When I created the missing pool, the scheduler started the tasks and began clearing the queue.

What you expected to happen

I expected the scheduler to continue running correctly configured tasks, ignoring the incorrectly configured ones, rather than blocking.

How to reproduce

Create a DAG with 64 concurrent tasks and set a pool that doesn't exist. Create a second DAG using the default pool for a single task. Trigger the first, then the second.

Operating System

Ubuntu

Versions of Apache Airflow Providers

No response

Deployment

Other Docker-based deployment

Deployment details

Using KubernetesExecutor connected to EKS.

Anything else

Unfortunately I don't have access to the logs anymore.

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@fjmacagno added the area:core and kind:bug labels Jan 10, 2022
@SamWheating
Contributor

I think that this is mostly functioning as intended, but I'm wondering if we can improve the behaviour around nonexistent pools 🤔 This is a somewhat common issue, and it can lead to pretty unclear behaviour if a user makes a mistake in the name of a pool.

Maybe we should be failing tasks immediately if they're assigned to a pool which doesn't exist?

I'll have a look into whether this is possible, but would definitely appreciate any other suggestions here.

@SamWheating
Contributor

OK, so looking into this a bit, the scheduler will log a warning if a task is unschedulable due to a non-existent pool:

```python
for pool, task_instances in pool_to_task_instances.items():
    pool_name = pool
    if pool not in pools:
        self.log.warning("Tasks using non-existent pool '%s' will not be scheduled", pool)
        continue
```

This warning is also visible in the TaskInstance Details UI:

And then the task will remain in the queued state indefinitely (or until it times out, I suppose).

It would be really simple to just mark the tasks as failed after logging something like `Tasks using non-existent pool '%s' being marked as Failed`, but this might be a worse user experience, as it leaves no logs or visible warnings about why the task failed (other than the scheduler logs, which are not easily accessible to most users).

With this in mind, does anyone have another idea for how to prevent these tasks from clogging the scheduler, or should we just consider this to be intended behaviour?

@fjmacagno
Author

I mean, I just don't think the whole scheduler should stop scheduling tasks because one DAG is misconfigured. This caused an entire cross-team Airflow installation to stop working because one team made a mistake on one DAG, and gods help us if that had been prod.

It seems like if a task is misconfigured in some way that prevents it from running, it shouldn't be considered to be in the queue. Maybe it could at least be shoved to the back of the queue so that other tasks can try to run?

@SamWheating
Contributor

I just don't think the whole scheduler should stop scheduling tasks because one DAG is misconfigured.

Agreed - although in this case I think that setting dagrun_timeout, max_active_runs, or max_active_tasks on your DAG can reduce the ability of a single DAG to use too many resources and thus limit the blast radius of such an incident.

Maybe it could at least be shoved to the back of the queue so that other tasks can try to run?

Agreed, I'll have a look through the scheduler logic to see how viable this is.

@fjmacagno
Author

That's an interesting point though, because we do have most of those set. We can't use dagrun_timeout because it is a 15-day-long DAG, but max_active_runs is 1 and dag_concurrency is 16, while scheduler parallelism is 64, so it sounds like something is amiss there too.

@ephraimbuddy
Contributor

@SamWheating @fjmacagno let me know what you think about my PR

@SamWheating
Contributor

Your PR looks good!

I think that #19747 also fixes this issue, but I like your approach more as it will prevent this un-runnable DAG from ever making it to the scheduler.

@github-actions

This issue has been automatically marked as stale because it has been open for 365 days without any activity. There have been several Airflow releases since the last activity on this issue. Kindly recheck the report against the latest Airflow version and let us know if the issue is reproducible. The issue will be closed in the next 30 days if no further activity occurs from the issue author.

@github-actions

This issue has been closed because it has not received a response from the issue author.

github-actions bot closed this as not planned Mar 28, 2023