Remove global from task_runner supervisor-comms #59876

jscheffl · 2025-12-28T22:28:07Z

Another small (in this case rather medium complex) increment to remove global statements for PR #58116

This removes 2 global statements from task_runner.py, where a global variable was intentionally used as the shared SUPERVISOR_COMMS. I propose replacing it with a static class and accessor methods to avoid the use of global variables.

global is evil.

To merge this PR I am seeking explicit approval from one of the Task SDK creators: @ashb, @kaxil and/or @amoghrajesh

kaxil

imo globals can be bad, but this specific use case is one of the legitimate ones, in fact better than the approach without globals.

def supervisor_comms() -> CommsDecoder[ToTask, ToSupervisor]:
    return _SupervisorCommsHolder.comms

But then the PR bypasses this accessor in supervisor.py:1669:

task_runner._SupervisorCommsHolder.comms = temp_comms  #

and every call site becomes verbose:

Before:

SUPERVISOR_COMMS.send(msg)

After (unnecessary function call)

supervisor_comms().send(msg)

Every access now requires a function call + None check instead of direct variable access.

My 2 cents :)

kaxil · 2025-12-29T17:09:33Z

airflow-core/src/airflow/models/connection.py

        # If this is set it means are in some kind of execution context (Task, Dag Parse or Triggerer perhaps)
        # and should use the Task SDK API server path
-        if hasattr(sys.modules.get("airflow.sdk.execution_time.task_runner"), "SUPERVISOR_COMMS"):
+        from airflow.sdk.execution_time.task_runner import is_supervisor_comms_initialized


this used hasattr on sys so we don't need to have airflow.sdk installed on the server components

Yeah. I think we should solve it differently, I would say here just checking if "task-sdk" is installed should be a better check?

Or smth else - but this should be revised after we complete the isolation.

Okay, have reverted/adjusted to previous state - as it was repeated Copy&Paste have moved it into a utility in https://github.com/apache/airflow/pull/59876/changes#diff-7694d13e2f87c84d20b0b8b44797bf96d754ae270204217e518082decc74649bR104

Would leave other improvements of detection to another PR.

airflow-core/src/airflow/models/variable.py

task-sdk/src/airflow/sdk/execution_time/task_runner.py

potiuk · 2025-12-29T17:16:53Z

Every access now requires a function call + None check instead of direct variable access.

I think if we add @cache to def supervisor_comms(), this will be faster and will only require a hash lookup - and then, I think performance is not a concern any more.

I like this pattern that Jens introduced a lot more to be honest, it is more unit-test friendly IMHO.

That also removes the None check for every call. When None is detected in the first call, a RuntimeError is raised and the interpreter will exit - so None will be checked only at first use, when the cache is set.

task-sdk/src/airflow/sdk/execution_time/task_runner.py

amoghrajesh

I would be OK with the current pattern we have for the sake of ease of understanding but if we have to change it, I do not think there is a better solution than this one. Would require some getting used to to use the new pattern

jscheffl · 2025-12-30T09:47:50Z

and every call site becomes verbose:

Before:
SUPERVISOR_COMMS.send(msg)
After (unnecessary function call)
supervisor_comms().send(msg)
Every access now requires a function call + None check instead of direct variable access.

@kaxil @amoghrajesh fair points and I am also not 100% convinced the solution is perfect. The usage of SUPERVISOR_COMMS still seems to be a case where we mess a lot with a global variable, so encapsulating the state is still beneficial.
I slept on this again and understand that supervisor_comms().send() is also not cool, how about if we change it to:

supervisor_send(data)

...as a shortcut? Would it make it better to handle the sending in the function and have better readable code w/o global?

By the way, with @cache I assume execution is slower as the hash needs to be built and then a lookup must be made, which is more expensive than a function pointer lookup and jump. Therefore I would not like a @cache solution.

potiuk · 2025-12-30T10:12:32Z

By the way, with @cache I assume execution is slower as the hash needs to be built and then a lookup must be made, which is more expensive than a function pointer lookup and jump. Therefore I would not like a @cache solution.

I don't think so. Calling a function in Python is generally a very slow operation. It's not a classic pointer jump. Function calls in Python are done by the interpreter; they do not use jumps as C programs do. The interpreter has to look up the method to call, create a new frame, push it on the "Python stack" and clean up the frame after it returns. This is all done "in the interpreter" - it's not even using the processor stack. The "Python stack" for method frames is actually stored in heap memory, not on the processor stack - so any stack manipulation (calling and returning from a function) is kinda slow.

I did some basic micro-benchmarks:

import time
from functools import lru_cache

class MethodBenchmark:
    def __init__(self):
        self.call_count = 0

    def empty_method(self):
        """Empty method without caching"""
        self.call_count += 1
        return None

    @lru_cache(maxsize=128)
    def cached_empty_method(self):
        """Empty method with caching"""
        return None


def benchmark():
    obj = MethodBenchmark()
    iterations = 1_000_000

    # Benchmark non-cached method
    start = time.perf_counter()
    for _ in range(iterations):
        obj.empty_method()
    non_cached_time = time.perf_counter() - start

    # Benchmark cached method
    start = time.perf_counter()
    for _ in range(iterations):
        obj.cached_empty_method()
    cached_time = time.perf_counter() - start

    # Results
    print(f"Non-cached method: {non_cached_time:.6f} seconds")
    print(f"Cached method: {cached_time:.6f} seconds")
    print(f"Speedup: {non_cached_time / cached_time:.2f}x")
    print(f"Non-cached call count: {obj.call_count}")


if __name__ == "__main__":
    benchmark()

Result with Python 3.10

/Users/jarekpotiuk/code/airflow/.venv/bin/python /Users/jarekpotiuk/Library/Application Support/JetBrains/IntelliJIdea2025.3/scratches/scratch_5.py 
Non-cached method: 0.026673 seconds
Cached method: 0.026702 seconds
Speedup: 1.00x
Non-cached call count: 1000000

Process finished with exit code 0

This means that when you put @cache on it, in this benchmark it does not run slower than a plain method call, and additionally you save on all the executed code inside.

I modified the code of both methods to do a single if:

        if self.call_count == 0:
            self.call_count += 1
        else:
            self.call_count += 1

And these are the results:

Non-cached method: 0.032580 seconds
Cached method: 0.026062 seconds
Speedup: 1.25x
Non-cached call count: 1000001

Process finished with exit code 0

(1000001 is because the cached call increased it by 1)

There are plenty of optimisations in Python 3.11 - 3.14 that might skew this simple example (specializing adaptive interpreter changes and JIT) - so I ran it with Python 3.10

Similar discussion: https://stackoverflow.com/questions/14648374/python-function-calls-are-really-slow
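For anyone who wants to repeat the comparison, here is a hedged sketch that uses `timeit.repeat` and takes the minimum of several runs, which tends to be less noisy than a single `time.perf_counter` loop. The names `_COMMS`, `plain_accessor`, `cached_accessor` and `measure` are made up for illustration; no expected timings are claimed, since they depend on the Python version and machine.

```python
import timeit
from functools import cache

_COMMS = object()  # stand-in for a module-level SUPERVISOR_COMMS


def plain_accessor():
    """Accessor function without caching."""
    return _COMMS


@cache
def cached_accessor():
    """Accessor function wrapped in functools.cache."""
    return _COMMS


def measure(number=200_000, repeat=5):
    """Return the best (minimum) time in seconds per access style."""
    statements = {
        "global": "_COMMS",
        "function": "plain_accessor()",
        "cached": "cached_accessor()",
    }
    namespace = {
        "_COMMS": _COMMS,
        "plain_accessor": plain_accessor,
        "cached_accessor": cached_accessor,
    }
    return {
        label: min(timeit.repeat(stmt, globals=namespace, number=number, repeat=repeat))
        for label, stmt in statements.items()
    }


if __name__ == "__main__":
    for label, seconds in measure().items():
        print(f"{label:>8}: {seconds:.6f} s")
```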

potiuk · 2025-12-30T10:17:17Z

supervisor_send(data)

I think that's a good idea, it's likely better to expose "comms actions" than "comm" itself.

potiuk · 2025-12-30T10:23:55Z

BTW. To be perfectly honest, in this case I think performance is not as important (at least until we start doing the communication very frequently - for example following @dabla's optimisation / async tasks reporting the status of individual async coroutines back to the scheduler). The comms overhead for inter-process communication and serialization of the data involved (every such call needs to serialize data sent across the wire - even if shared memory is used to communicate between processes) is already likely an order of magnitude slower than a single method call in Python, so this should not be too much of a concern.

In this case I think we should optimise for readability, I also do not like the extra () needed in the original proposal - that's why supervisor_send(data) is probably the best approach.

amoghrajesh · 2025-12-31T06:01:22Z

In this case I think we should optimise for readability, I also do not like the extra () needed in the original proposal - that's why supervisor_send(data) is probably the best approach.

I agree with this. If we have to go down the way of removing globals here, I would prioritise using supervisor_send or supervisor_comms_send

jscheffl · 2025-12-31T15:51:20Z

Thanks for the feedback, I have now adjusted it to use supervisor_send(), which is a bit shorter.

Re-review appreciated :-D

(Am surprised actually that I wanted to remove two global statements but actually am slimming down the codebase by 40 LoC now :-D)

potiuk · 2026-01-04T01:24:51Z

LVGTM (V=Very)

Most of the slimming comes from consolidating copy&pasted comments into a single place - but this is one of the best kinds of trimming you can do, as DRY in this case is important :D

dabla · 2026-01-05T06:08:34Z

Nice work Jens!

ashb

Right now I'm -1 to most of this change (the has_execution_context() fn part is good):

Moving from a global variable SUPERVISOR_COMMS to a class containing a class variable is changing from a global variable to a global variable by a different name. It's a change for no benefit. Globals aren't inherently bad; they are tools that have their place.

ashb · 2026-01-07T10:42:47Z

airflow-core/src/airflow/models/base.py

    map_index: Mapped[int] = mapped_column(Integer, nullable=False, server_default=text("-1"))
+
+
+def has_execution_context() -> bool:


This is incorrectly named -- _SupervisorCommsHolder.comms would be set in parsing, and parsing is not an "execution" context.

@kaxil Didn't you recently add some other context (server vs something else)? What was that?

The naming was derived from the previous code comment "If this is set it means are in some kind of execution context" - but other proposals welcome.

Naming things is hard!

Yeah _AIRFLOW_PROCESS_CONTEXT that accepts client / server

airflow/task-sdk/src/airflow/sdk/execution_time/supervisor.py

Lines 1939 to 1942 in e36e6ca

# 2. Check for explicit server context
if os.environ.get("_AIRFLOW_PROCESS_CONTEXT") == "server":
    # Server context: API server, scheduler
    # uses the default server list

airflow/airflow-core/src/airflow/dag_processing/processor.py

Lines 181 to 183 in e36e6ca

# Mark as client-side (runs user DAG code)
# Prevents inheriting server context from parent DagProcessorManager
os.environ["_AIRFLOW_PROCESS_CONTEXT"] = "client"

Naming is hard. I have now renamed it to is_client_process_context() and, while I had my hands on it, the decision is now also based on the environment variable.
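A rough sketch of what an environment-based check could look like. The function name and the `_AIRFLOW_PROCESS_CONTEXT` variable with its "client" / "server" values come from this thread; the body and the optional `environ` parameter are assumptions for illustration, not the code merged in the PR.

```python
import os

# Environment variable name quoted in the review thread.
_PROCESS_CONTEXT_ENV = "_AIRFLOW_PROCESS_CONTEXT"


def is_client_process_context(environ=None) -> bool:
    """Return True when the process is explicitly marked as client-side.

    `environ` is a hypothetical parameter that lets callers and tests pass
    a plain dict instead of mutating os.environ.
    """
    env = os.environ if environ is None else environ
    return env.get(_PROCESS_CONTEXT_ENV) == "client"
```

With this shape, an unset variable is treated as "not client", which is a design choice that would need to match how the supervisor and Dag processor set the variable.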

ashb · 2026-01-07T10:44:49Z

airflow-core/src/airflow/models/base.py

+
+    # If this is set it means are in some kind of execution context (Task, Dag Parse or Triggerer perhaps)
+    # and should use the Task SDK API server path
+    return hasattr(sys.modules.get("airflow.sdk.execution_time.task_runner"), "_SupervisorCommsHolder.comms")


I'm 90% sure that this will always return false. hasattr doesn't look "deeply".

In [2]: airflow.sdk.execution_time.task_runner.AirflowException.__mro__
Out[2]: (airflow.sdk.exceptions.AirflowException, Exception, BaseException, object)

In [3]: hasattr(airflow.sdk.execution_time.task_runner, "AirflowException.__mro__")
Out[3]: False

Yeah. And this means our unit tests do not exercise this path correctly.

In [1]: import airflow.sdk.execution_time.task_runner
2026-01-07T10:53:37.236044Z [warning ] Skipping masking for a secret as it's too short (<5 chars) [airflow._shared.secrets_masker.secrets_masker] loc=secrets_masker.py:551
2026-01-07T10:53:37.236167Z [warning ] Skipping masking for a secret as it's too short (<5 chars) [airflow.sdk._shared.secrets_masker.secrets_masker] loc=secrets_masker.py:551

In [2]: import sys

In [3]: airflow.sdk.execution_time.task_runner._SupervisorCommsHolder.comms = 1

In [4]: hasattr(sys.modules.get("airflow.sdk.execution_time.task_runner"), "_SupervisorCommsHolder.comms")
Out[4]: False

Very good catch! Thanks! Have reworked the logic now.
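The behaviour described in this thread can be reproduced without Airflow. `hasattr` treats its second argument as a single attribute name, so a dotted string never matches; `operator.attrgetter` does resolve dotted names. The sketch below uses `types.SimpleNamespace` as a stand-in for the module, and `has_nested_attr` is a hypothetical helper, not what the PR ended up using.

```python
import operator
import types

# Stand-in for a module that holds a class-like object with a `comms` attribute.
module = types.SimpleNamespace(Holder=types.SimpleNamespace(comms=1))


def has_nested_attr(obj, dotted_name: str) -> bool:
    """Return True if every attribute in a dotted path can be resolved on obj."""
    try:
        operator.attrgetter(dotted_name)(obj)
    except AttributeError:
        # Also covers obj being None, e.g. when sys.modules.get() finds nothing.
        return False
    return True


print(hasattr(module, "Holder.comms"))          # dotted name is not traversed
print(has_nested_attr(module, "Holder.comms"))  # each path segment is resolved
```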

ashb · 2026-01-07T11:03:11Z

task-sdk/src/airflow/sdk/execution_time/task_runner.py

+def supervisor_send(msg: ToSupervisor) -> ToTask | None:
+    """Send a message to the supervisor as convenience for get_supervisor_comms().send()."""
+    if _SupervisorCommsHolder.comms is None:
+        raise RuntimeError("Supervisor comms not initialized yet. Call set_supervisor_comms() instead.")
+    return _SupervisorCommsHolder.comms.send(msg)


If we want to keep this sort of pattern, I think it might be better to do something similar to what we do for metrics/Stats

class UnsetComms(CommsDecoder[ToTask, ToSupervisor]):
    def send(self, msg):
        raise RuntimeError("Supervisor comms not initialized yet. Call set_supervisor_comms() instead.")

    ...

    initalized: Final[bool] = False

And then SUPERVISOR_COMMS/_SupervisorCommsHolder.comms can be initialized to an instance of this class.

In the past I created some kind of Singleton metaclass; you only needed to use it for your class and it would behave like a singleton: even if you called the constructor multiple times, you would always end up with the same instance. In Java this was simple to achieve by defining your class as an enum, but not in Python.

So I ended up creating a Singleton class for Python:

class Singleton(type):
    _instances: dict = {}

    def __call__(cls, *args, **kwargs):
        if cls not in cls._instances:
            cls._instances[cls] = super(Singleton, cls).__call__(*args, **kwargs)
        return cls._instances[cls]

I did the same for the Flyweight pattern:

import inspect
from itertools import count


class Flyweight(type):
    _instances: dict = {}
    _ignore_args: dict = {}

    @classmethod
    def __prepare__(mcs, name, bases, **kwargs):
        return super().__prepare__(name, bases, **kwargs)

    def __new__(mcs, name, bases, namespace, **kwargs):
        return super().__new__(mcs, name, bases, namespace)

    def __init__(cls, name, bases, namespace, **kwargs):
        super().__init__(name, bases, namespace, **kwargs)
        if kwargs.get("ignore_args"):
            cls._ignore_args[name] = kwargs.get("ignore_args")

    def __filter_ignored_arguments(cls, *args, **kwargs):
        parameters = list(inspect.signature(cls.__init__).parameters.values())
        parameters.pop(0)
        if len(kwargs) > 0:
            constructor_args = [
                (index, name) for index, name in enumerate(kwargs.keys())
            ]
            args = tuple(kwargs.values())
        else:
            constructor_args = [
                (index, parameter.name) for index, parameter in enumerate(parameters)
            ]
        ignored_indices = list(
            map(
                lambda arg: arg[0],
                filter(
                    lambda arg: arg[1] in cls._ignore_args.get(cls.__name__, []),
                    constructor_args,
                ),
            )
        )
        index = count(start=0, step=1)
        return [value for value in args if next(index) not in ignored_indices]

    def __call__(cls, *args, **kwargs):
        key = "{}-{}".format(
            cls.__name__, cls.__filter_ignored_arguments(*args, **kwargs).__str__()
        )
        if key not in cls._instances:
            cls._instances[key] = super(Flyweight, cls).__call__(*args, **kwargs)
        return cls._instances[key]

Dunno if this could be a nice addition to solve those generic problems?

I have reworked the SupervisorComms now to be a Singleton.
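As a hedged illustration of how a Singleton metaclass behaves when applied to a comms class: the `SupervisorComms` body below is invented for the example and does not reflect the PR's actual implementation; only the metaclass follows the shape posted above.

```python
class Singleton(type):
    """Metaclass that returns the same instance for every constructor call."""

    _instances: dict = {}

    def __call__(cls, *args, **kwargs):
        if cls not in cls._instances:
            cls._instances[cls] = super().__call__(*args, **kwargs)
        return cls._instances[cls]


class SupervisorComms(metaclass=Singleton):
    """Hypothetical comms class; it records messages instead of sending them."""

    def __init__(self):
        self.sent = []

    def send(self, msg):
        self.sent.append(msg)
        return len(self.sent)


first = SupervisorComms()
second = SupervisorComms()
first.send("heartbeat")
print(first is second)  # both names refer to the one shared instance
```

The state is still process-wide, so tests that need a fresh instance would have to remove the class from `Singleton._instances` between runs.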

jscheffl · 2026-01-11T19:38:11Z

Reworked PR with the review feedback, especially changed to a Singleton implementation to prevent a static class with static member. Better like this?

jscheffl · 2026-01-15T22:45:13Z

Up by another 109 commits - can I have another round of feedback @ashb / @amoghrajesh / @kaxil ?

jscheffl · 2026-01-18T10:41:38Z

Up by another 59 commits

amoghrajesh · 2026-01-19T07:06:42Z

I'd leave @ashb / @kaxil to be a judge of this one

dabla · 2026-01-19T20:30:00Z

Reworked PR with the review feedback, especially changed to a Singleton implementation to prevent a static class with static member. Better like this?

I like that refactor.

boring-cyborg bot added area:DAG-processing area:task-sdk area:Triggerer labels Dec 28, 2025

jscheffl force-pushed the bugfix/remove-global-from-task-runner-supervisor-comms branch from d1fdaaa to 4bb5081 Compare December 29, 2025 08:44

jscheffl added full tests needed, all versions labels Dec 29, 2025

jscheffl force-pushed the bugfix/remove-global-from-task-runner-supervisor-comms branch from 4bb5081 to 780ce42 Compare December 29, 2025 12:58

jscheffl marked this pull request as ready for review December 29, 2025 12:58

jscheffl requested review from XD-DENG, amoghrajesh, ashb, dstandish, ephraimbuddy, hussein-awala, jedcunningham and kaxil as code owners December 29, 2025 12:59

kaxil reviewed Dec 29, 2025


potiuk reviewed Dec 29, 2025


amoghrajesh reviewed Dec 30, 2025


amoghrajesh reviewed Dec 30, 2025


jscheffl force-pushed the bugfix/remove-global-from-task-runner-supervisor-comms branch from 780ce42 to df71bea Compare December 31, 2025 15:48

jscheffl force-pushed the bugfix/remove-global-from-task-runner-supervisor-comms branch from 69efc3d to 3ec9115 Compare December 31, 2025 17:06

jscheffl requested a review from amoghrajesh January 2, 2026 09:45

jscheffl force-pushed the bugfix/remove-global-from-task-runner-supervisor-comms branch from 3ec9115 to 8acbbc4 Compare January 3, 2026 12:18

jscheffl requested a review from kaxil January 6, 2026 21:02

jscheffl force-pushed the bugfix/remove-global-from-task-runner-supervisor-comms branch from 8acbbc4 to b45f6cd Compare January 6, 2026 21:03

ashb requested changes Jan 7, 2026


ashb reviewed Jan 7, 2026


jscheffl force-pushed the bugfix/remove-global-from-task-runner-supervisor-comms branch from b45f6cd to 626b383 Compare January 11, 2026 11:11

jscheffl requested review from ashb and dabla January 11, 2026 19:38

jscheffl force-pushed the bugfix/remove-global-from-task-runner-supervisor-comms branch from 2df8bfe to 18f93b3 Compare January 15, 2026 22:44

jscheffl and others added 10 commits January 18, 2026 11:41

Remove global from task_runner supervisor-comms

654bf75

Fix pytests

665d640

Rework supervisor_comms().send() to supervisor_send()

ea79dbd

Review feedback Kaxil and Amogh

14056f2

Harden check for execution context

c37728b

Rework supervisor comms to a singleton pattern

172be4e

Rework has_execution_context() to use env _AIRFLOW_PROCESS_CONTEXT

21d73e2

Fix code comments

6522fb7

Fix references in Airflow core

4e940be

Fix pytest

c103ddf

jscheffl force-pushed the bugfix/remove-global-from-task-runner-supervisor-comms branch from 18f93b3 to c103ddf Compare January 18, 2026 10:41
