Skip to content

Spend less time waiting for LocalTaskJob's subprocss process to finish#11373

Merged
ashb merged 2 commits intoapache:masterfrom
astronomer:less-time-waiting-task-runner
Oct 13, 2020
Merged

Spend less time waiting for LocalTaskJob's subprocss process to finish#11373
ashb merged 2 commits intoapache:masterfrom
astronomer:less-time-waiting-task-runner

Conversation

@ashb
Copy link
Member

@ashb ashb commented Oct 9, 2020

This is about is about a 20% speed up for short running tasks!

This change doesn't affect the "duration" reported in the TI table, but
does affect the time before the slot is freeded up from the executor -
which does affect overall task/dag throughput.

(All these tests are with the same BashOperator tasks, just running echo 1.)

Before

Task airflow.executors.celery_executor.execute_command[5e0bb50c-de6b-4c78-980d-f8d535bbd2aa] succeeded in 6.597011625010055s: None
Task airflow.executors.celery_executor.execute_command[0a39ec21-2b69-414c-a11b-05466204bcb3] succeeded in 6.604327297012787s: None

After

Task airflow.executors.celery_executor.execute_command[57077539-e7ea-452c-af03-6393278a2c34] succeeded in 1.7728257849812508s: None
Task airflow.executors.celery_executor.execute_command[9aa4a0c5-e310-49ba-a1aa-b0760adfce08] succeeded in 1.7124666879535653s: None

After, including change from #11372

Task airflow.executors.celery_executor.execute_command[35822fc6-932d-4a8a-b1d5-43a8b35c52a5] succeeded in 0.5421732050017454s: None
Task airflow.executors.celery_executor.execute_command[2ba46c47-c868-4c3a-80f8-40adaf03b720] succeeded in 0.5469810889917426s: None

^ Add meaningful description above

Read the Pull Request Guidelines for more information.
In case of fundamental code change, Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in UPDATING.md.

@boring-cyborg boring-cyborg bot added the area:Scheduler including HA (high availability) scheduler label Oct 9, 2020
@ashb ashb requested review from kaxil, mik-laj, potiuk and turbaszek October 9, 2020 10:25
@github-actions
Copy link

github-actions bot commented Oct 9, 2020

The Workflow run is cancelling this PR. It has some failed jobs matching ^Pylint$,^Static checks$,^Build docs$,^Spell check docs$,^Backport packages$,^Checks: Helm tests$,^Test OpenAPI*.

@ashb ashb force-pushed the less-time-waiting-task-runner branch from 1fa0694 to 6d5a76c Compare October 9, 2020 12:15
@ashb ashb force-pushed the less-time-waiting-task-runner branch from 6d5a76c to a585f93 Compare October 12, 2020 10:33
This is about is about a 20% speed up for short running tasks!

This change doesn't affect the "duration" reported in the TI table, but
does affect the time before the slot is freeded up from the executor -
which does affect overall task/dag throughput.

(All these tests are with the same BashOperator tasks, just running `echo 1`.)

**Before**

```
Task airflow.executors.celery_executor.execute_command[5e0bb50c-de6b-4c78-980d-f8d535bbd2aa] succeeded in 6.597011625010055s: None
Task airflow.executors.celery_executor.execute_command[0a39ec21-2b69-414c-a11b-05466204bcb3] succeeded in 6.604327297012787s: None

```

**After**

```
Task airflow.executors.celery_executor.execute_command[57077539-e7ea-452c-af03-6393278a2c34] succeeded in 1.7728257849812508s: None
Task airflow.executors.celery_executor.execute_command[9aa4a0c5-e310-49ba-a1aa-b0760adfce08] succeeded in 1.7124666879535653s: None
```

**After, including change from apache#11372**

```
Task airflow.executors.celery_executor.execute_command[35822fc6-932d-4a8a-b1d5-43a8b35c52a5] succeeded in 0.5421732050017454s: None
Task airflow.executors.celery_executor.execute_command[2ba46c47-c868-4c3a-80f8-40adaf03b720] succeeded in 0.5469810889917426s: None
```
@ashb ashb force-pushed the less-time-waiting-task-runner branch from a585f93 to c0ea217 Compare October 12, 2020 14:08
@ashb
Copy link
Member Author

ashb commented Oct 12, 2020

One heisen test failed in a slightly odd way:

self = <airflow.executors.dask_executor.DaskExecutor object at 0x7efe33cd1090>

    def end(self) -> None:
        if not self.client:
            raise AirflowException(NOT_STARTED_MESSAGE)
        if not self.futures:
>           raise AirflowException(NOT_STARTED_MESSAGE)
E           airflow.exceptions.AirflowException: The executor should be started first!

airflow/executors/dask_executor.py:113: AirflowException

Do we think it's safe to merge anyway?

@potiuk
Copy link
Member

potiuk commented Oct 12, 2020

One heisen test failed in a slightly odd way:

self = <airflow.executors.dask_executor.DaskExecutor object at 0x7efe33cd1090>

    def end(self) -> None:
        if not self.client:
            raise AirflowException(NOT_STARTED_MESSAGE)
        if not self.futures:
>           raise AirflowException(NOT_STARTED_MESSAGE)
E           airflow.exceptions.AirflowException: The executor should be started first!

airflow/executors/dask_executor.py:113: AirflowException

Do we think it's safe to merge anyway?

You can run the tests easily now for the whole group :) . LEt me do it for you

@ashb
Copy link
Member Author

ashb commented Oct 12, 2020

Oh you mean locally :) Of course.

@potiuk
Copy link
Member

potiuk commented Oct 12, 2020

From what I see it failed similarly in 3 cases. Looking at the stability of the errors in the previous builds today - It looks like it's a "real" problem,

@potiuk
Copy link
Member

potiuk commented Oct 12, 2020

And I think we have the first failure with exit 137 on the backfill jobs since the split.

@potiuk
Copy link
Member

potiuk commented Oct 12, 2020

So it looks like the backfill_job tests have some real problems - it failed after 90 minutes.

@potiuk
Copy link
Member

potiuk commented Oct 12, 2020

It failed on me with the first attempt I made. So the problem is real.

./breeze --github-image-id 302320596 --backend mysql --mysql-version 5.7 --python 3.8 --db-reset --test-type Core tests

Screenshot from 2020-10-12 20-34-17

@ashb
Copy link
Member Author

ashb commented Oct 12, 2020

I have a fix for the Dask Executor, but I'm not sure what is going really on (why it is suddenly failing.)

It has

if not self.futures:

But the problem is that self.futures is a dict, so evaluates to false when it's empty.

@ashb ashb merged commit 623d5cd into apache:master Oct 13, 2020
@ashb ashb deleted the less-time-waiting-task-runner branch October 13, 2020 09:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:performance area:Scheduler including HA (high availability) scheduler

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants