Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(dask): make dask autoscaler configurable #835

Merged
merged 1 commit into from
Nov 11, 2024

Conversation

Alputer
Copy link
Member

@Alputer Alputer commented Sep 20, 2024

closes #834

@Alputer Alputer self-assigned this Sep 20, 2024
Copy link

codecov bot commented Sep 20, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 31.08%. Comparing base (c0d2e92) to head (debb354).
Report is 1 commits behind head on master.

Additional details and impacted files

Impacted file tree graph

@@           Coverage Diff           @@
##           master     #835   +/-   ##
=======================================
  Coverage   31.08%   31.08%           
=======================================
  Files          26       26           
  Lines        2487     2487           
=======================================
  Hits          773      773           
  Misses       1714     1714           

@Alputer Alputer changed the title Dask autoscaler config feat(dask): dask autoscaler config (#835) Sep 20, 2024
@Alputer Alputer force-pushed the dask-autoscaler-config branch from e3ce610 to 128e7de Compare October 31, 2024 14:36
Alputer added a commit to Alputer/reana that referenced this pull request Oct 31, 2024
@Alputer Alputer force-pushed the dask-autoscaler-config branch from 128e7de to c91a810 Compare October 31, 2024 14:37
@Alputer Alputer changed the title feat(dask): dask autoscaler config (#835) feat(dask): make dask autoscaler configurable Oct 31, 2024
Alputer added a commit to Alputer/reana that referenced this pull request Nov 1, 2024
@Alputer Alputer force-pushed the dask-autoscaler-config branch from c91a810 to b513ddb Compare November 1, 2024 09:27
@@ -101,6 +101,8 @@ spec:
- name: DASK_ENABLED
value: {{ .Values.dask.enabled | quote }}
{{- if .Values.dask.enabled }}
- name: DASK_AUTOSCALER_ENABLED
value: {{ .Values.dask.autoscaler_enabled | quote }}
Copy link
Member

@tiborsimko tiborsimko Nov 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW when testing the autoscaler PR set with the auto scaler turned off:

  autoscaler_enabled: false
  cluster_default_number_of_workers: 4

I'm seeing a problem with stopping workflows:

$ rc run -w calver
...
$ rc stop -w calver --force
==> ERROR: Cannot stop workflow calver:
(404)
Reason: Not Found
HTTP response headers: HTTPHeaderDict({'Audit-Id': '4c08dda6-899b-4da2-bdf7-2c2c76006ce6', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': '6b060355-5047-4214-899c-623e82a0c569', 'X-Kubernetes-Pf-Prioritylevel-Uid': '498afda6-b2ba-4561-9313-9bda4f75865d', 'Date': 'Fri, 01 Nov 2024 14:13:25 GMT', 'Content-Length': '374'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"daskautoscalers.kubernetes.dask.org \"dask-autoscaler-reana-run-dask-ce1f4c15-36ff-42fc-b320-b5f277322db7\" not found","reason":"NotFound","details":{"name":"dask-autoscaler-reana-run-dask-ce1f4c15-36ff-42fc-b320-b5f277322db7","group":"kubernetes.dask.org","kind":"daskautoscalers"},"code":404}

The worklow is stopped, the Dask workers are stopped, but the job pod fails with:

$ k logs reana-run-job-42c23954-93d7-4524-8e9b-daa28e8c1fcd-5p4fg
...
  File "/usr/local/lib/python3.11/concurrent/futures/thread.py", line 167, in submit
    raise RuntimeError('cannot schedule new futures after shutdown')
RuntimeError: cannot schedule new futures after shutdown
Exception ignored in: <function Client.__del__ at 0x7ff1dc223100>
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/distributed/client.py", line 1736, in __del__
  File "/usr/local/lib/python3.11/site-packages/distributed/client.py", line 1991, in close
  File "/usr/local/lib/python3.11/site-packages/distributed/utils.py", line 434, in sync
TimeoutError: timed out after 60 s.

$ k logs reana-run-batch-ce1f4c15-36ff-42fc-b320-b5f277322db7-qhzq2
2024-11-01 14:14:31,967 | root | MainThread | INFO | Workflow ce1f4c15-36ff-42fc-b320-b5f277322db7 finished. Files available at /var/reana/users/00000000-0000-0000-0000-000000000000/workflows/ce1f4c15-36ff-42fc-b320-b5f277322db7.

P.S. When auto-scaler is enabled, then the stopping works perfectly well.

@@ -69,6 +69,7 @@ This Helm automatically prefixes all names using the release name to avoid colli
| `components.reana_workflow_engine_snakemake.image` | [REANA-Workflow-Engine-Snakemake image](https://hub.docker.com/r/reanahub/reana-workflow-engine-snakemake) to use | `docker.io/reanahub/reana-workflow-engine-snakemake:<chart-release-version>` |
| `compute_backends` | List of supported compute backends (kubernetes, htcondorcern, slurmcern) | "kubernetes" |
| `dask.enabled` | Enable support for running Dask workflows | false |
| `dask.autoscaler_enabled` | Enable Dask autoscaler | true |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is one boolean variable enough?

Looking at the Dask auto-scaler spec, some parameters are possible:

apiVersion: kubernetes.dask.org/v1
kind: DaskAutoscaler
metadata:
  name: example
spec:
  cluster: "name of DaskCluster to autoscale"
  minimum: 0  # minimum number of workers to create
  maximum: 10 # maximum number of workers to create

Perhaps instead of using only one autoscaler_enabled Helm value we should use two for min and max, such as:

dask.cluster_autoscale_workers_min: 1
dask.cluster_autoscale_workers_max: 4

And, if both min and max values would be equal to the same amount, then this would indicate that the auto-scaling would be disabled.

This would kind of "override" the meaning of cluster_default_number_of_workers so we could get rid of that one...

We can think IRL.

Copy link
Member Author

@Alputer Alputer Nov 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am a bit susceptible about these new parameters. If they are equal for example, it not only means that auto-scaling is disabled but also users are not allowed to override the number of workers for their Dask cluster.

I also think that we should be able to somehow restrict users to ask for unbounded number of workers (e.g 1000 workers with 1Mi memory regardless of if the autoscaling is enabled or disabled. We might specify the max number of workers in another way by setting a minimum memory limit for one worker. This would implicitly indicate that max_workers = max_dask_cluster_memory // min_worker_memory. I think this approach is better. We should talk about how we approach this one IRL.

And for the minimum number of workers, I do not see any case where we may want to force users to have at least X number of workers where X>1. So, maybe we can simply check if users request <1 number of workers and then reject such requests.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I see the point. We could keep the autoscale to be independent. However, allowing users to specify 1000 workers with low memory per worker would not be good, since users could generate 1000s of pods which would overload Kubernetes master and slow down everything... So we may want to keep memory independent. Let's muse IRL.

Alputer added a commit to Alputer/reana that referenced this pull request Nov 4, 2024
@Alputer Alputer force-pushed the dask-autoscaler-config branch from b513ddb to 26d5078 Compare November 4, 2024 10:08
Copy link
Member

@tiborsimko tiborsimko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commit log cosmetics: the conventional commit scope should represent which component was touched by the change, so we should use "helm" here:

feat(helm): make Dask autoscaler configurable (#835)

Also, please put Closes #NNN in the commit message body, so that we have the reference to the issue which this PR closes also locally, not only on GitHub.

Alputer added a commit to Alputer/reana that referenced this pull request Nov 11, 2024
@Alputer Alputer force-pushed the dask-autoscaler-config branch from 26d5078 to aeebee1 Compare November 11, 2024 09:03
Alputer added a commit to Alputer/reana that referenced this pull request Nov 11, 2024
@Alputer Alputer force-pushed the dask-autoscaler-config branch from aeebee1 to 2ebee54 Compare November 11, 2024 09:30
Alputer added a commit to Alputer/reana that referenced this pull request Nov 11, 2024
@Alputer Alputer force-pushed the dask-autoscaler-config branch from 2ebee54 to b90e25f Compare November 11, 2024 10:58
@Alputer Alputer force-pushed the dask-autoscaler-config branch from b90e25f to debb354 Compare November 11, 2024 11:17
@tiborsimko tiborsimko merged commit debb354 into reanahub:master Nov 11, 2024
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

feat(dask): make dask autoscaler optional
2 participants