[AIRFLOW-1641] Handle executor events in the scheduler #2715
Conversation
Force-pushed from cd81845 to 561cc8e
Codecov Report

@@            Coverage Diff             @@
##           master    #2715      +/-   ##
==========================================
+ Coverage   72.37%   72.46%   +0.09%
==========================================
  Files         154      154
  Lines       11815    11836      +21
==========================================
+ Hits         8551     8577      +26
+ Misses       3264     3259       -5

Continue to review full report at Codecov.
Force-pushed from 561cc8e to 9758a10
cc @andylockran @JohnCulver ready for testing
airflow/executors/base_executor.py (outdated)

return d
else:
d = {}
for key, state in list(self.event_buffer.items()):
why cast to list here? Is this some python 3 thing?
Yes (Python 2 actually, afaik), but more importantly I am changing event_buffer in this loop; copying it into a list makes sure we do not skip any events.
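A minimal, self-contained sketch of the point above (toy data, not the Airflow buffer itself): mutating a dict while iterating over its items raises a RuntimeError on Python 3, so the loop iterates over a snapshot instead.

event_buffer = {("dag_a", "task_1"): "success",
                ("dag_a", "task_2"): "failed"}

for key, state in list(event_buffer.items()):  # snapshot of the items
    del event_buffer[key]                      # safe: the real dict is mutated,
                                               # not the list being iterated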
ti.task = dag.get_task(task_id)
ti.handle_failure(msg)
except Exception:
self.log.error("Cannot load the dag bag to handle failure for %s"
Will alerts be sent? If there are no callbacks or retries, and no alerts are sent either, this is going to cause operational issues for folks, I suspect. We depend heavily on alerting messages.
Not if it cannot load the DAG. Basically, if DAG loading fails there is no way to know where to send that email; that information is unavailable.
That makes sense.
dag = dagbag.get_dag(dag_id)
ti.task = dag.get_task(task_id)
ti.handle_failure(msg)
except Exception:
Is there nothing better we can do here?
I share this concern (see note above).
Please make a suggestion. I don't like it either, but to have a task instance manage its own failure it needs the associated task. SimpleDag cannot and does not hold the operators (pickle errors), so it is required to get the task from the normal DagBag.
I am very open to suggestions, but I have already pursued different scenarios.
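A hedged sketch of the failure path under discussion (the helper name and the dag_file/log arguments are illustrative, not the PR's code): the scheduler reloads the DAG file so the task instance has its task object and can run its own failure handling; if the DAG cannot even be imported, there is no owner or email to notify, so the error can only go to the scheduler log.

from airflow import models

def fail_via_dagbag(ti, msg, dag_file, log):
    try:
        dagbag = models.DagBag(dag_file)     # reload the real DAG file
        dag = dagbag.get_dag(ti.dag_id)
        ti.task = dag.get_task(ti.task_id)   # SimpleDag cannot hold operators
        ti.handle_failure(msg)               # callbacks, retries, alert emails
    except Exception:
        # DAG import failed: no owner/email available, so log to the scheduler log
        log.error("Cannot load the dag bag to handle failure for %s", ti)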
It almost feels like the alert should be sent to the admin of airflow rather than the task owner, since the appropriate action involves a config change (or machine change) to increase capacity or timeout limit on DAG imports. Is this possible in some palatable way? I don't think we have this concept right now in Airflow.
Yeah, hence the message in the logs. I mean that is in the end what logging is for?
Yea, but where do these logs get piped to? Is it the airflow scheduler logs, or is it the DAG-specific logs? Monitoring DAG-specific logs for "error" is way too chatty since people break their DAGs all the time. If it's the scheduler, it should be fine.
scheduler.
Gotcha.
airflow/jobs.py (outdated)

continue

if ti.state == State.RUNNING or ti.state == State.QUEUED:
msg = ("Executor reports task instance {} finished ({}) "
Probably shouldn't handle it in the running case in case this failure does not correspond to this try for some reason. The state should get unset from running anyway because of heartbeats.
It is tied to the try. The executor reports on the task_instance, which does have the try associated with it. Please note that this is exactly the same code as we use for backfills. Basically you want to fail also when RUNNING, because the worker is returning and does not manage the task anymore if it is still in RUNNING.
In cases where it fails, then is run manually again, then the executor inspects, we will then call handle_failure, which is probably not ideal. If anything, in the running state, the heartbeat logic should handle the failure.
BTW @saguziel: I see what you mean. There is a small chance that the scheduler has scheduled a new task instance with the same key and that instance is running now. While I do think that goes outside the scope of this PR (or maybe not?), we could guard against this by not allowing task instances known to the executor (i.e. queued_tasks, running, and new: event_buffer) to be executed again, or by checking against the start_date (or the 'try', which I'm really not enthusiastic about). Checking against the start_date would require more in-depth changes (start_date would have to be part of the key), while maybe speeding the scheduler up a little bit (at best a heartbeat).
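An illustrative sketch of the guard described above (not part of this PR; the helper and its usage are hypothetical), skipping any task instance whose key the executor already knows about:

def known_to_executor(executor, ti_key):
    return (ti_key in executor.queued_tasks      # handed over, not yet running
            or ti_key in executor.running        # currently being executed
            or ti_key in executor.event_buffer)  # finished, event not yet processed

# hypothetical usage in the scheduling loop:
#   if known_to_executor(self.executor, ti.key):
#       continue  # never hand the same key to the executor twice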
@saguziel (async comm and no threads):
In cases where it fails, then is run manually again, then the executor inspects, we will then handle_failure which is probably not ideal. If anything, in the running state, the heartbeat logic should handle the failure
I don't follow. You mean a kind of race condition? You would need to reset the state first. The task will not start when it finds itself to be in RUNNING
re: 2nd comment:
If the TI is in the running state, the task is either running or not running.
If it is running, we should not interrupt it by calling handle failure on the scheduler.
If it is not, the scheduler heartbeat logic will take care of it.
The executor event, in the case when the TI is actually running, should be able to be discarded.
I'm not sure I agree. If the monitor (= worker/executor) reports a failure or success, why should we discard that?
I don't have the heartbeat logic entirely in my mind, but what happens if the executor reports SUCCESS and the task stays in RUNNING?
If the monitor reports a failure, it is subject to arbitrary delays. So the failure was definitely in the past. If a TI is in RUNNING, we know it is running now (or will be reaped soon), and this is a tested path of Airflow, so we're relatively certain it was the only task running at the time.
If a task stays in RUNNING but is not actually running, it will get set to failure after 2* heartbeat.
Been following this thread. I'm wondering if this logic to check for stuck queued TIs from timeout should be in the kill_zombies method instead of here; then we can let the scheduler heartbeat take care of all the stale tasks, whether they are said to be running or queued.
I'm not super familiar with the scheduler side of things, but given that all _process_executor_events did in the past was logging, it seems to suggest that the executor events are processed somewhere else. It may be clearer to rename it to _log_executor_events.
@jgao54 it can’t be in kill_zombies as there is nothing to kill. The executor actually works as a kind of monitor on the OS level.
Backfills are managing the executor state and we should have done it here as well.
LGTM but will defer to @saguziel for a final signoff
airflow/executors/base_executor.py (outdated)

d = self.event_buffer
self.event_buffer = {}
return d
if not dag_ids:
Can we make this if dag_ids is None:? The risk of someone programmatically passing in an empty list and clearing everything instead of nothing seems like something we want to avoid.
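A minimal sketch of the distinction being raised (toy values, illustrative only):

event_buffer = {("dag_a", "t1", "2017-09-01"): "success"}

dag_ids = []                # caller explicitly passes an empty list
print(not dag_ids)          # True  -> "if not dag_ids" would clear the whole buffer
print(dag_ids is None)      # False -> "if dag_ids is None" clears nothing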
Sure
airflow/executors/base_executor.py (outdated)

self.event_buffer = {}
return d
if not dag_ids:
d = self.event_buffer
While we're at it, can we use a more descriptive variable name than d? How about cleared_events = self.event_buffer?
Sure
@@ -141,13 +141,27 @@ def fail(self, key):
def success(self, key):
self.change_state(key, State.SUCCESS)

def get_event_buffer(self):
def get_event_buffer(self, dag_ids=None):
Can I suggest a slight refactor to remove some duplication and use the dict.pop return value:

cleared_events = {}
if dag_ids is None:
    cleared_events = self.event_buffer
    self.event_buffer = {}
else:
    for key in list(self.event_buffer):
        dag_id, _, _ = key
        if dag_id in dag_ids:
            cleared_events[key] = self.event_buffer.pop(key)
return cleared_events
I will, but will slightly change yours.
dag = dagbag.get_dag(dag_id)
ti.task = dag.get_task(task_id)
ti.handle_failure(msg)
except Exception:
Can we be more specific about the errors we catch here than a blanket except Exception:?
You would want to catch any exception here, to make sure we do not let any "QUEUED" task_instances linger.
airflow/jobs.py (outdated)

msg = ("Executor reports task instance {} finished ({}) "
"although the task says its {}. Was the task "
"killed externally?".format(ti, state, ti.state))
self.log.error(msg)
Can we make this self.log.error("...", ti, state, ti.state) instead of self.log.error(msg), to be consistent with the other logging and to take advantage of Python's lazy log processing?
We are using it twice (see handle_failure below), so not in this case, I guess.
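For reference, a minimal, self-contained sketch of the two logging styles being compared (placeholder values, not the PR code):

import logging

log = logging.getLogger(__name__)
ti, state, ti_state = "example_ti", "failed", "queued"  # placeholder values

# Pre-formatted message: built eagerly, but reusable afterwards
# (in the PR it is also passed on to the failure handling).
msg = ("Executor reports task instance {} finished ({}) "
       "although the task says its {}. Was the task "
       "killed externally?".format(ti, state, ti_state))
log.error(msg)

# Lazy formatting: the arguments are only interpolated if the record
# is actually emitted, consistent with the rest of the code base.
log.error("Executor reports task instance %s finished (%s) although the "
          "task says its %s. Was the task killed externally?",
          ti, state, ti_state)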
While in Backfills we do handle the executor state, we do not in the Scheduler. In case there is an unspecified error (e.g. a timeout, airflow command failure) tasks can get stuck.
Force-pushed from 9758a10 to 4ec5df4
@gwax addressed your comments. @saguziel I updated the PR to only fail QUEUED tasks. I'm not convinced we should ignore RUNNING, as it is inconsistent with Backfills, and I don't like that we are ignoring executor state here, which functions as a monitor. But if it takes that to have it pass for now, let's do that.
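A hedged sketch of what the event handling amounts to after this change (simplified; process_executor_events, ti_lookup and fail_ti are stand-ins for the scheduler's own method, task-instance query and DagBag-based failure handling, not names taken from the PR):

from airflow.utils.state import State

def process_executor_events(executor, simple_dag_bag, ti_lookup, fail_ti, log):
    # Drain the executor's event buffer for the DAGs this scheduler run knows
    # about, and fail task instances the executor reports as finished while
    # they are still QUEUED (i.e. the worker never really ran them).
    events = executor.get_event_buffer(simple_dag_bag.dag_ids)
    for key, state in list(events.items()):
        ti = ti_lookup(key)
        if ti is None:
            continue
        if state in (State.FAILED, State.SUCCESS) and ti.state == State.QUEUED:
            msg = ("Executor reports task instance {} finished ({}) "
                   "although the task says its {}. Was the task "
                   "killed externally?".format(ti, state, ti.state))
            log.error(msg)
            fail_ti(ti, msg)  # reload the DAG via DagBag, then ti.handle_failure(msg)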
lgtm
The problem is that the executor ends up not being as consistent as it should be, and we set state in other places typically.
@saguziel let's discuss that f2f soon
While in Backfills we do handle the executor state, we do not in the Scheduler. In case there is an unspecified error (e.g. a timeout, airflow command failure) tasks can get stuck. Closes #2715 from bolkedebruin/AIRFLOW-1641 (cherry picked from commit 2abead7) Signed-off-by: Bolke de Bruin <bolke@xs4all.nl>
Dear Airflow maintainers,
Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!
JIRA
Description
While in Backfills we do handle the executor state, we do not in the Scheduler. In case there is an unspecified error (e.g. a timeout, airflow command failure) tasks can get stuck.
Tests
Should be added.
Commits
cc @saguziel @aoen @criccomini