More efficient/sophisticated work scheduler #1057

Closed
BoPeng opened this issue Sep 21, 2018 · 3 comments

Comments


BoPeng commented Sep 21, 2018

Although sos and ipyparallel have quite different design ideas, the ipyparallel scheduler, especially its zmq connections, is very interesting to us because we are migrating from native Python multiprocessing to zmq (a minimal sketch of such a zmq worker pattern follows the list below). Since ipyparallel has done a lot of work and testing to achieve better performance, perhaps we should study (really hard) what they have done. Note, however, that:

  1. sos spawns the task engine, DAG executor, and workers locally, so there is no need for users to start them.
  2. sos does not allow remote workers (clusters etc.); these are handled through external tasks.
  3. The SoS DAG is very complex because of the flexible multi-style design, concurrent substeps, external tasks, nested workflows, etc. It is therefore very difficult to improve the performance of SoS.
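
As a concrete illustration of the zmq migration mentioned above, here is a minimal sketch of a PUSH/PULL pattern for handing work to locally spawned workers. This is purely illustrative, not SoS or ipyparallel code: the endpoint, the message format, and the placeholder step execution are all made up for the example.

```python
import multiprocessing
import zmq

ENDPOINT = "tcp://127.0.0.1:5557"  # hypothetical local endpoint

def worker():
    """Pull tasks from the scheduler until a stop message arrives."""
    ctx = zmq.Context()
    pull = ctx.socket(zmq.PULL)
    pull.connect(ENDPOINT)
    while True:
        task = pull.recv_json()                  # blocks until work is sent
        if task.get("stop"):
            break
        print("running substep", task["step"])   # placeholder for real execution

def scheduler(tasks, n_workers=4):
    ctx = zmq.Context()
    push = ctx.socket(zmq.PUSH)
    push.bind(ENDPOINT)
    procs = [multiprocessing.Process(target=worker) for _ in range(n_workers)]
    for p in procs:
        p.start()
    for task in tasks:
        push.send_json(task)  # PUSH fair-queues messages across connected workers
    for _ in procs:
        # one stop message per worker; assumes round-robin delivery
        push.send_json({"stop": True})
    for p in procs:
        p.join()

if __name__ == "__main__":
    scheduler([{"step": i} for i in range(10)])
```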

BoPeng commented Sep 21, 2018

My understanding of the "performance" of workflows can be summarized as follows:

  1. The DAG is limited in complexity, at most a few hundred nodes:
    • The performance of interpreting the DAG and finding the right steps to execute might not matter much.
    • Idle workers (idle processes during execution #1056): performance can be bad if we have idle processes that prevent the full-speed execution of steps.
    • Load balancing: ???
  2. The biggest time-consuming parts should be the steps themselves, including:
    • Large steps or tasks: the performance of the interpreter does not matter.
    • Many substeps: we are using pool + async map, which is as fast as it can be (see the sketch after this list). The performance of signature checking matters.
    • Many tasks: this can be a problem because of the overhead of saving, executing, and monitoring external tasks. However, tasks are not recommended for local execution, and users are advised to use substeps in these cases.
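
For reference, the "pool + async map" pattern mentioned above amounts to something like the following minimal sketch. The `run_substep` body and the worker count are hypothetical placeholders; the real SoS code also performs signature checking, which is not shown here.

```python
from multiprocessing import Pool

def run_substep(item):
    # hypothetical per-substep work; in SoS this would also check the
    # substep's signature to decide whether execution can be skipped
    return item * item

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        result = pool.map_async(run_substep, range(100))  # dispatch without blocking
        # ... the scheduler could do other work here ...
        print(result.get())  # block until all substeps have finished
```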

So in the end I am not quite sure how much performance improvement we can achieve by optimizing the SoS execution engine.


gaow commented Sep 21, 2018

Sorry, but I am not sure I appreciate the big picture here. The goal is to "optimize the SoS execution engine" by borrowing from ipyparallel and extending it to fit the complicated DAG pattern of SoS, right? Then my understanding is that there are major limitations in the current SoS execution engine that we'd like to overcome. Here you have named #1056, which seems to be the only problem relevant to the execution pattern itself, but that ticket already proposed a solution that does not involve ipyparallel. So I'm not sure what exactly we would like to adopt ipyparallel for.


BoPeng commented Sep 21, 2018

No. I am just checking how others are using zmq in a task-execution setting, and ipyparallel seems to be a mature tool that fits the bill and may be more robust and/or efficient; at least they said that their current model is much more efficient than the model they used before. Whether or not we are suffering from the same set of problems they had is currently uncertain.

As I said in another thread, I will not adopt ipyparallel unless something seriously wrong with the SoS DAG execution engine is found.

BoPeng closed this as completed Feb 28, 2019