[AIRFLOW-160] Parse DAG files through child processes #1559
Conversation
Force-pushed from 3f47717 to 3d95a79
```diff
@@ -459,39 +700,24 @@ def schedule_dag(self, dag):
             session.commit()
             return next_run

-    def process_dag(self, dag, queue):
+    def create_task_instances(self, dag, queue):
```
This should probably be called process_task_instances; task instances are created eagerly, so the current name does not reflect what is happening here.
I do have some general remarks that might be smarter to discuss in person to align. My concern is that DagRuns do not seem to be really used; the logic seems to rely on TaskInstances. There is also, from a first review, some divergence with the work @jlowin and I have been doing. (I do really like spinning off the DAG handling into separate processes, don't get me wrong!)
```python
ti = TI(task, ti_key[2])
# Task starts out in the queued state. All tasks in the queued
# state will be scheduled later in the execution loop.
ti.state = State.QUEUED
```
This is a big change that alters the semantics of QUEUED. There may be places in the code that check for State.NONE (or state is None) that could break.
Indeed. I wonder what the reasoning was for starting to use QUEUED for this?
I was planning to define QUEUED as the state for when a task instance needs to be sent to the executor (or has been sent already). What are the use cases for the NONE state? I searched for references to the NONE state, but didn't have much luck.
@plypaul might have a look at https://cwiki.apache.org/confluence/display/AIRFLOW/Scheduler+Basics. @bolkedebruin took a shot at documenting the states a bit. If that doc is insufficient, we should update it.
Yeah, I'm lacking clarity around the NONE state and its purpose. One other strange thing I saw was how task instances with pools get created in the NONE state, get sent to the executor, run, and then put themselves in the QUEUED state. See:

After that, they get run again.
The pool stuff is pretty jacked. See this JIRA for a related discussion:
(A quick overview of task states. This isn't necessarily how things should be, but it is how they are :) )

Indeed -- all tasks used to start with State.NONE. @bolkedebruin's latest PR introduces State.SCHEDULED for when the TI has been created by the scheduler but not yet run. When the executor loads a TI, it eventually calls TI.run(), which does one of two things:

- If the TI can be run, sets State.RUNNING and calls the execute() method.
- If the TI has a pool, sets State.QUEUED and returns.

The scheduler has a prioritize_queued method which loads up all the queued tasks and tries to run them if there are slots available in their respective pools. That second run is the one that actually moves pooled tasks to State.RUNNING.
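To make those transitions concrete, here is a minimal sketch of the flow described above; the names mirror Airflow's State and TaskInstance, but the bodies are simplified illustrations, not the actual implementation:

```python
# Simplified illustration of the pre-PR task-state flow; not real source.

class State:
    NONE = None
    QUEUED = "queued"
    RUNNING = "running"
    SCHEDULED = "scheduled"

class TaskInstance:
    def __init__(self, task, pool=None):
        self.task = task
        self.pool = pool
        self.state = State.NONE  # all tasks used to start here

    def run(self):
        if self.pool is not None and self.state != State.QUEUED:
            # Pooled task: defer until prioritize_queued() finds a free slot.
            self.state = State.QUEUED
            return
        self.state = State.RUNNING
        self.task()  # stands in for execute()

def prioritize_queued(queued_tis, open_slots_for):
    # Second pass: run queued tasks whose pools have open slots.
    for ti in queued_tis:
        if open_slots_for(ti.pool) > 0:
            ti.run()  # this second call is what moves the TI to RUNNING
```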
SCHEDULED only gets set when the scheduler does a handover to the executor, and it only does that for TIs that were NONE before. See also the wiki.
A lot of good work here, thanks @plypaul. It does need some realignment with the current push toward DagRun centralization, but I think a lot of the core ideas are compatible.
```diff
@@ -776,10 +795,10 @@ class CLIFactory(object):
             default=False),
         # scheduler
         'dag_id_opt': Arg(("-d", "--dag_id"), help="The id of the dag to run"),
-        'num_runs': Arg(
-            ("-n", "--num_runs"),
+        'run_duration': Arg(
```
Commands that previously did not fail will fail now; we'll need to call that out in the header of the next release notes.
I could re-add this option for now, and potentially remove it in the next major version?
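As a rough sketch of what keeping both flags during a deprecation window could look like (plain argparse here rather than Airflow's CLIFactory, and the -r short flag is an assumption):

```python
import argparse
import warnings

parser = argparse.ArgumentParser(prog="airflow scheduler")
parser.add_argument("-r", "--run_duration", type=int, default=None,
                    help="Run the scheduler for this many seconds, then exit")
# Keep the old flag so existing invocations don't break; warn on use.
parser.add_argument("-n", "--num_runs", type=int, default=None,
                    help="(deprecated) Number of scheduler loops to run")

args = parser.parse_args()
if args.num_runs is not None:
    warnings.warn("--num_runs is deprecated; use --run_duration instead",
                  DeprecationWarning)
```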
Force-pushed from c15982c to 43cb91a
Does this PR have sufficient test coverage?
The tests need some work at the moment; I will update when they're in a better state.
Force-pushed from a595c10 to c6030c8
```python
class DagFileProcessor(AbstractDagFileProcessor):
    """Helps call SchedulerJob.process_file() in a separate process."""

    # Counter that increments every an instance of this class is created
```
Typo: "every an instance" should be "every time an instance".
Great work! All in all, I do prefer a DagRun-based approach, and not using the QUEUED state here but rather SCHEDULED when a handover to the executor occurs. Additionally, I do think the shallow objects need to have a real relationship with the full objects; otherwise we will end up with code maintenance issues. If we could get rid of the shallow objects, that would be even better.
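For context, the "shallow objects" are the trimmed-down stand-ins the child process sends back in place of full DAGs. A minimal sketch of what such a picklable stand-in might look like; the SimpleDag name and its attribute list are illustrative assumptions, not the PR's actual class:

```python
class SimpleDag:
    """Picklable stand-in that carries only what the scheduler needs.

    Unlike the full DAG, it holds no callables or user-module imports,
    so it can safely cross the child -> parent process boundary.
    """
    def __init__(self, dag_id, task_ids, full_filepath, concurrency):
        self.dag_id = dag_id
        self.task_ids = task_ids
        # Pointer back to the source file, so the full DAG can be
        # re-parsed whenever the real object is needed again.
        self.full_filepath = full_filepath
        self.concurrency = concurrency
```

The maintenance concern raised above is that any scheduling-relevant field added to the full DAG has to be mirrored here by hand.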
Landscape shows 58 new PEP 8 issues, and there appear to be some sketchy non-deterministic failures from Travis:
@plypaul, the error @criccomini is seeing with tests does increase my concerns with the changed semantics of "QUEUED". In addition, the tests need to be extended to test the new functionality you are creating, not just update the old tests. For instance, I would like to see a test that has a faulty DAG and shows how the scheduler recovers from it.
Yeah, testing is in progress, and I'll post a comment once the tests are updated.
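A minimal sketch of the kind of faulty-DAG recovery test being asked for, with invented helper names: parse a deliberately broken file in a child process and assert the parent survives.

```python
import multiprocessing
import tempfile
import textwrap

def _parse_dag_file(path, result_queue):
    """Child-process target: compile the file, report success or failure."""
    try:
        with open(path) as f:
            compile(f.read(), path, "exec")
        result_queue.put(("ok", path))
    except Exception as e:
        result_queue.put(("error", str(e)))

def test_scheduler_survives_faulty_dag_file():
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(textwrap.dedent("""
            def broken(   # deliberate syntax error
        """))
        path = f.name

    queue = multiprocessing.Queue()
    proc = multiprocessing.Process(target=_parse_dag_file, args=(path, queue))
    proc.start()
    proc.join(timeout=10)

    status, detail = queue.get(timeout=5)
    assert status == "error"    # the bad file is reported as an error...
    assert proc.exitcode == 0   # ...and the child still exited cleanly
```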
Force-pushed from ccb5f64 to 79a1159
Force-pushed from 5db183e to aae2ba8
```python
self.do_pickle = do_pickle
super(SchedulerJob, self).__init__(*args, **kwargs)

self.logger.error("Executor is {}".format(self.executor.__class__))
```
Use logger.info instead of logger.error here.
@plypaul No, I actually mean non-serialized. The DAG files themselves are not that big, and we are not pickling for state, right? So that raises the question: why serialize at all?
```python
dag_folder = dag_folder or DAGS_FOLDER
self.logger.error("Using dag_folder {}".format(dag_folder))
```
Use logger.info instead of logger.error here.
Force-pushed from 8808bdd to 0fc11f8
@bolkedebruin Sending around the contents of the DAG file isn't sufficient, as it could import modules that aren't available on the host.
With the zipped DAGs it actually is sufficient.
Because they can contain modules.
@bolkedebruin, isn't that the case iff the modules do not rely on external system libraries?
If you mean that the modules in the zip require native dependencies (.so) to be present on the host system, then yes. If your hosts are architecturally the same, then you should be able to package almost anything into the zipped DAG.
We could even extend that by allowing .so files in the package and using LD_PRELOAD.
(So you should be able to shoot numpy around.) Another way that comes to mind (sorry, typing on a phone) is to use the wheel format and let pip resolve dependencies for DAGs before we execute a task. Doing that in a virtualenv / Docker will not require root access.
Doing that in a virtualenv per task will add a bit of time to task startup, but will allow versioning of dependencies.
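A rough sketch of that virtualenv-per-task idea, using the standard-library venv module; the paths, function name, and wheel argument are placeholders, and none of this is in the PR:

```python
import subprocess
import venv

def run_task_in_venv(task_wheel, module_name, env_dir="/tmp/task_env"):
    """Sketch: install a task's wheel into a fresh venv, then run it."""
    venv.create(env_dir, with_pip=True)  # no root access required
    pip = "{}/bin/pip".format(env_dir)
    python = "{}/bin/python".format(env_dir)
    # pip resolves the task's dependencies from the wheel's metadata,
    # so each task can pin its own dependency versions.
    subprocess.check_call([pip, "install", task_wheel])
    subprocess.check_call([python, "-m", module_name])
```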
Force-pushed from ace7aa2 to a389019
Instead of parsing the DAG definition files in the same process as the scheduler, this change parses the files in a child process. This helps to isolate the scheduler from bad user code.
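A minimal sketch of that isolation idea, with invented helper names: the child process does the risky parsing, only picklable results cross back to the parent, and a hung or crashing DAG file cannot take the scheduler down with it.

```python
import multiprocessing
from queue import Empty

def parse_dag_file(file_path):
    """Stand-in for the real parsing logic: exec the user's file."""
    namespace = {}
    with open(file_path) as f:
        exec(compile(f.read(), file_path, "exec"), namespace)
    # Return only simple, picklable data (here, top-level names found).
    return sorted(namespace.keys())

def _process_file(file_path, result_queue):
    # Runs in the child process: bad user code that hangs, leaks
    # memory, or segfaults here cannot crash the scheduler itself.
    result_queue.put(parse_dag_file(file_path))

def parse_in_child(file_path, timeout=60):
    result_queue = multiprocessing.Queue()
    proc = multiprocessing.Process(target=_process_file,
                                   args=(file_path, result_queue))
    proc.start()
    proc.join(timeout)
    if proc.is_alive():         # runaway file: kill it and move on
        proc.terminate()
        return None
    try:
        return result_queue.get(timeout=1)
    except Empty:
        return None
```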
Force-pushed from 2e03b4f to 084196a
This preliminary PR addresses the following issue:
https://issues.apache.org/jira/browse/AIRFLOW-160

As referenced in the JIRA issue, this PR aims to parse user DAG files in child processes to make the scheduler more robust. The main change occurs in SchedulerJob._execute(). The overall flow for scheduling is:

In the main process:
- ...

In the child process:
- ... task instances are created in the QUEUED state

In the main process:
- ... task instances in the QUEUED state are scheduled for execution

Some more notes:

- The DagFileProcessor class creates the child process.
- The full DAG object isn't passed from the child process to the main process. Instead, a simplified DAG object with only the fields necessary for scheduling is passed.
- The DagBag is no longer used to sync to the DB. Instead, DAGs are synced to the DB in a separate call.

Example log output: