-
Notifications
You must be signed in to change notification settings - Fork 16.3k
Remove global from task_runner supervisor-comms #59876
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Remove global from task_runner supervisor-comms #59876
Conversation
d1fdaaa to
4bb5081
Compare
4bb5081 to
780ce42
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
imo globals can be bad, but this specific use case is one of the legitimate ones, in fact better than the approach without globals.
def supervisor_comms() -> CommsDecoder[ToTask, ToSupervisor]:
return _SupervisorCommsHolder.commsBut then violates it in supervisor.py:1669:
task_runner._SupervisorCommsHolder.comms = temp_comms # and every call site becomes verbose:
Before:
SUPERVISOR_COMMS.send(msg)After (unnecessary function call)
supervisor_comms().send(msg)Every access now requires a function call + None check instead of direct variable access.
My 2 cents :)
| # If this is set it means are in some kind of execution context (Task, Dag Parse or Triggerer perhaps) | ||
| # and should use the Task SDK API server path | ||
| if hasattr(sys.modules.get("airflow.sdk.execution_time.task_runner"), "SUPERVISOR_COMMS"): | ||
| from airflow.sdk.execution_time.task_runner import is_supervisor_comms_initialized |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this used hasattr on sys so we don't need to have airflow.sdk installed on the server components
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah. I think we should solve it differently, I would say here just checking if "task-sdk" is installed should be a better check?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or smth else - but this should be revised after we complete the isolation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, have reverted/adjusted to previous state - as it was repeated Copy&Paste have moved it into a utility in https://github.com/apache/airflow/pull/59876/changes#diff-7694d13e2f87c84d20b0b8b44797bf96d754ae270204217e518082decc74649bR104
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would leave other improvements of detection to another PR.
I think if we add I like this pattarn that Jens introduced a lot more to be honest, it is more unit-test friendly IMHO. Also that removes the None-check for every call. When None is detected in a first call - RuntimeException is thrown and interpreter will exit - so None will be checked only at first use when cache is set. |
amoghrajesh
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would be OK with the current pattern we have for the sake of ease of understanding but if we have to change it, I do not think there is a better solution than this one. Would require some getting used to to use the new pattern
@kaxil @amoghrajesh fair points and I am also not 100% convinced the solution is perfect. Te usage of the supervisor_send(data)...as a shortcut? Would it make it better to handle the sending in the function and have better readable code w/o global? By the way with |
I don't think so. Calling a function in Python is generally very slow operation. It's not a classic pointer jump. Functions calls in Python are done by interpreter, they are not using jumps as C programs do. Interpreter has to look-up the method to call, create a new frame, push it on "Python stack" and clean the frame after it returns. This is all done "in the interpreter" - it's not even using the processor stack. "Python stack" for method frames is actually stored in heap memory, not in processor stack - so any stack manipulation (calling and returning from function) is kinda slow. I did some basic micro-benchmarks: import time
from functools import lru_cache
class MethodBenchmark:
def __init__(self):
self.call_count = 0
def empty_method(self):
"""Empty method without caching"""
self.call_count += 1
return None
@lru_cache(maxsize=128)
def cached_empty_method(self):
"""Empty method with caching"""
return None
def benchmark():
obj = MethodBenchmark()
iterations = 1_000_000
# Benchmark non-cached method
start = time.perf_counter()
for _ in range(iterations):
obj.empty_method()
non_cached_time = time.perf_counter() - start
# Benchmark cached method
start = time.perf_counter()
for _ in range(iterations):
obj.cached_empty_method()
cached_time = time.perf_counter() - start
# Results
print(f"Non-cached method: {non_cached_time:.6f} seconds")
print(f"Cached method: {cached_time:.6f} seconds")
print(f"Speedup: {non_cached_time / cached_time:.2f}x")
print(f"Non-cached call count: {obj.call_count}")
if __name__ == "__main__":
benchmark()Result with Python 3.10 This means that when you put I modified the code of both methods to do single if: And there are the results: (100001 is because cached call increased it by 1) There are plenty of optimisations in Python 3.11 - 3.14 that might skew this simple example (specializing adaptive interpreter changes and JIT) - so I run it with Python 3.10 Similar discussion: https://stackoverflow.com/questions/14648374/python-function-calls-are-really-slow |
I think that's a good idea, it's likely better to expose "comms actions" than "comm" itself. |
|
BTW. To be perfectly honest, In this case I think performance is not as important (at least until we will start doing the communication very frequently - for example following @dabla optimisation / async task reporting back status of individual async coroutines back to scheduler). The comms overhead for inter-process communication and serialization of data involved (every such call needs to serialize data sent across the wire - even if shared memory is used to communicate between processes) is already likely order of magnitude slower than single method call in Python, so this should not be too much of a concern. In this case I think we should optimise for readability, I also do not like the extra |
I agree with this. If we have to go down the way of removing globals here, I would prioritise using |
780ce42 to
df71bea
Compare
|
Thanks for the feedback, have it now adjusted to use Re-review appreciated :-D (Am surprised actually that I wanted to remove two global statements but actually am slimming down the codebase by 40 LoC now :-D) |
69efc3d to
3ec9115
Compare
3ec9115 to
8acbbc4
Compare
|
LVGTM (V=Very) Most of the slimming goes from consolidation of copy&pasted comments to a single place - but this is one of the best trimming you can do as DRY in this case is important :D |
|
Nice work Jens! |
8acbbc4 to
b45f6cd
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right now I'm -1 to most of this change (the has_execution_context() fn part is good):
Moving from a global variable of SUPERVISOR_COMMS to a class containing a class variable is changing form a global variable, to a global variable by a different name. It's a change for no benefit. Globals aren't inherrently bad, they are tool that have their place.
| map_index: Mapped[int] = mapped_column(Integer, nullable=False, server_default=text("-1")) | ||
|
|
||
|
|
||
| def has_execution_context() -> bool: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is incorrectly named. -- _SupervisorCommsHolder.comms") would be set in parsing, and parsing is not an "execution" context.
@kaxil Didn't you recently add some other context (server vs something else)? What was that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The naming was made from the previous code comment If this is set it means are in some kind of execution context - but other proposals welcome.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Naming things is hard!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah _AIRFLOW_PROCESS_CONTEXT that accepts client / server
airflow/task-sdk/src/airflow/sdk/execution_time/supervisor.py
Lines 1939 to 1942 in e36e6ca
| # 2. Check for explicit server context | |
| if os.environ.get("_AIRFLOW_PROCESS_CONTEXT") == "server": | |
| # Server context: API server, scheduler | |
| # uses the default server list |
airflow/airflow-core/src/airflow/dag_processing/processor.py
Lines 181 to 183 in e36e6ca
| # Mark as client-side (runs user DAG code) | |
| # Prevents inheriting server context from parent DagProcessorManager | |
| os.environ["_AIRFLOW_PROCESS_CONTEXT"] = "client" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Naming is hard. Have renamed now to is_client_process_context() and to improve it when I had my hands on it, the decision is also now based on the ENV.
|
|
||
| # If this is set it means are in some kind of execution context (Task, Dag Parse or Triggerer perhaps) | ||
| # and should use the Task SDK API server path | ||
| return hasattr(sys.modules.get("airflow.sdk.execution_time.task_runner"), "_SupervisorCommsHolder.comms") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm 90% sure that this will always return false. hasattr doesn't look "deeply".
In [2]: airflow.sdk.execution_time.task_runner.AirflowException.__mro__
Out[2]: (airflow.sdk.exceptions.AirflowException, Exception, BaseException, object)
In [3]: hasattr(airflow.sdk.execution_time.task_runner, "AirflowException.__mro__")
Out[3]: False
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah. And this means our unit tests do not exercise this path correctly.
In [1]: import airflow.sdk.execution_time.task_runner
2026-01-07T10:53:37.236044Z [warning ] Skipping masking for a secret as it's too short (<5 chars) [airflow._shared.secrets_masker.secrets_masker] loc=secrets_masker.py:551
2026-01-07T10:53:37.236167Z [warning ] Skipping masking for a secret as it's too short (<5 chars) [airflow.sdk._shared.secrets_masker.secrets_masker] loc=secrets_masker.py:551
In [2]: import sys
In [3]: airflow.sdk.execution_time.task_runner._SupervisorCommsHolder.comms = 1
In [4]: hasattr(sys.modules.get("airflow.sdk.execution_time.task_runner"), "_SupervisorCommsHolder.comms")
Out[4]: False
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very good catch! Thenaks! Have reworked the logic now.
| def supervisor_send(msg: ToSupervisor) -> ToTask | None: | ||
| """Send a message to the supervisor as convenience for get_supervisor_comms().send().""" | ||
| if _SupervisorCommsHolder.comms is None: | ||
| raise RuntimeError("Supervisor comms not initialized yet. Call set_supervisor_comms() instead.") | ||
| return _SupervisorCommsHolder.comms.send(msg) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we want to keep this sort of pattern, I think it might be better to do something similar to what we do for metrics/Stats
class UnsetComms(CommsDecoder[ToTask, ToSupervisor]):
def send(self, msg):
raise RuntimeError("Supervisor comms not initialized yet. Call set_supervisor_comms() instead.")
...
initalized: Final[bool] = FalseAnd then SUPERVISOR_COMMS/_SupervisorCommsHolder.comms can be initialized to an instance of this class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the past I created some kind of Singleton metadata class, then you only needed to extend your class with it and it would behave like a singleton, even if you could call the constructor multiple times, you would always end up having the same instance. In Java this was simple to achieve by defining you class as an enum, but not in Python.
So I ended up creating a Singleton class for Python:
class Singleton(type):
_instances: dict = {}
def __call__(cls, *args, **kwargs):
if cls not in cls._instances:
cls._instances[cls] = super(Singleton, cls).__call__(*args, **kwargs)
return cls._instances[cls]
I did the same for the Flyweight pattern:
import inspect
from itertools import count
class Flyweight(type):
_instances: dict = {}
_ignore_args: dict = {}
@classmethod
def __prepare__(mcs, name, bases, **kwargs):
return super().__prepare__(name, bases, **kwargs)
def __new__(mcs, name, bases, namespace, **kwargs):
return super().__new__(mcs, name, bases, namespace)
def __init__(cls, name, bases, namespace, **kwargs):
super().__init__(name, bases, namespace, **kwargs)
if kwargs.get("ignore_args"):
cls._ignore_args[name] = kwargs.get("ignore_args")
def __filter_ignored_arguments(cls, *args, **kwargs):
parameters = list(inspect.signature(cls.__init__).parameters.values())
parameters.pop(0)
if len(kwargs) > 0:
constructor_args = [
(index, name) for index, name in enumerate(kwargs.keys())
]
args = tuple(kwargs.values())
else:
constructor_args = [
(index, parameter.name) for index, parameter in enumerate(parameters)
]
ignored_indices = list(
map(
lambda arg: arg[0],
filter(
lambda arg: arg[1] in cls._ignore_args.get(cls.__name__, []),
constructor_args,
),
)
)
index = count(start=0, step=1)
return [value for value in args if next(index) not in ignored_indices]
def __call__(cls, *args, **kwargs):
key = "{}-{}".format(
cls.__name__, cls.__filter_ignored_arguments(*args, **kwargs).__str__()
)
if key not in cls._instances:
cls._instances[key] = super(Flyweight, cls).__call__(*args, **kwargs)
return cls._instances[key]
Dunno if this could be a nice addition to solve those generic problems?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have reworked the SupervisorComms now to be a Singleton.
b45f6cd to
626b383
Compare
|
Reworked PR with the review feedback, especially changed to a Singleton implementation to prevent a static class with static member. Better like this? |
2df8bfe to
18f93b3
Compare
|
Up by another 109 commits - can I have another round of feedback @ashb / @amoghrajesh / @kaxil ? |
18f93b3 to
c103ddf
Compare
|
Up by another 59 commits |
I like that refactor. |
Another small (in this case rather medium complex) increment to remove global statements for PR #58116
This removes 2 global statements from task_runner.py where explicitly a global wariable was used as shared SUPERVISOR_COMMS by intent. Proposing to change with via a static class and accessor-methods to prevent usage of global variables.
globalis evil.For this PR to merge seeking for explicit approval from one of the Task SDK creators @ashb, @kaxil and/or @amoghrajesh