Potential issues when using Dask-Gateway with multiple simultaneous users #1750
What does the ESIPFed cluster use for the dask-gateway pod? Is it the same as the default? Have they seen this issue before?
Thanks for the heads up on this issue @dharhas. We have not seen this issue on the ESIP Nebari deployment, but we also haven't had 50 people all try to launch a cluster at the same time. I thought I remembered that someone (the Berkeley Jupyter team?) tested with ~1000 users, all with Dask clusters on Dask Gateway, though. Perhaps I'm mistaken? Does this ring a bell, @yuvipanda?

The configuration for the Nebari deployment for ESIP is:

```yaml
node_groups:
  general:
    instance: m5.2xlarge
    min_nodes: 1
    max_nodes: 1
  user:
    instance: m5.2xlarge
    min_nodes: 1
    max_nodes: 100
  worker:
    instance: m5.2xlarge
    min_nodes: 1
    max_nodes: 450
```
I'm pretty sure Dask Gateway can handle the load. I assume we have a bad configuration, an undersized pod, or similar. @iameskild said that the pod kept rebooting. Also, as an FYI, that wasn't the only error message: at various times, retrieving options, getting a client, etc. all failed, i.e. the API was either erroring out or unresponsive as folks were hitting it.
I think we should try to recreate the issue and then see what happens when we increase the dask-gateway resource limits. The default dask-gateway pod resources are here: Lines 206 to 215 in 37226b7
Wow, that seems pretty tiny.
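To recreate the issue as suggested above, a load test along these lines might work. This is only a minimal sketch: the gateway address, authentication, and user count are placeholders that would have to match the actual deployment.

```python
# Minimal sketch for simulating many users hitting Dask-Gateway at once,
# so the gateway pod's CPU/memory and restart behavior can be watched
# while the load is applied. Address and user count are placeholders.
from concurrent.futures import ThreadPoolExecutor

from dask_gateway import Gateway

GATEWAY_ADDRESS = "https://nebari.example.com/gateway"  # placeholder URL
N_USERS = 50  # roughly the number of simultaneous users at the tutorial


def simulate_user(user_id: int) -> str:
    """Connect to the gateway, fetch cluster options, and start a cluster."""
    try:
        gateway = Gateway(address=GATEWAY_ADDRESS)
        options = gateway.cluster_options()   # this call failed for some users
        cluster = gateway.new_cluster(options)
        cluster.scale(1)
        client = cluster.get_client()         # "getting a client" also failed
        client.close()
        cluster.shutdown()
        return f"user {user_id}: ok"
    except Exception as exc:  # record the failure instead of crashing the test
        return f"user {user_id}: failed with {exc!r}"


if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=N_USERS) as pool:
        for result in pool.map(simulate_user, range(N_USERS)):
            print(result)
```

Running this while watching the gateway pod (e.g. `kubectl top pod`) should show whether the failures line up with the default CPU/memory limits.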
My initial thoughts as to why this is happening: Dask-Gateway should not require significant resources, given what it is responsible for.

Currently, the "check available conda environments" step crawls through the filesystem to find environments. We should instead use the conda-store server API to get the available environments, which would reduce the load on the server.
An example of this can be seen in the cdsdashboards code: https://github.com/nebari-dev/nebari/blob/develop/nebari/template/stages/07-kubernetes-services/modules/kubernetes/services/jupyterhub/files/jupyterhub/02-spawner.py#L55
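As a rough sketch of that approach: the conda-store endpoint path, authentication, and response shape used here are assumptions, not a confirmed API (the question about the endpoint is raised below), and would need to be checked against the conda-store server deployed by Nebari.

```python
# Sketch: ask the conda-store server for environments instead of crawling
# the filesystem. Endpoint path, auth, and response shape are assumptions.
from typing import Optional

import requests

CONDA_STORE_URL = "http://conda-store-server:5000"        # placeholder service URL
API_ENDPOINT = f"{CONDA_STORE_URL}/api/v1/environment/"   # assumed endpoint


def list_conda_environments(token: Optional[str] = None) -> list:
    """Return environment names reported by the conda-store server."""
    headers = {"Authorization": f"token {token}"} if token else {}
    response = requests.get(API_ENDPOINT, headers=headers, timeout=10)
    response.raise_for_status()
    payload = response.json()
    # Assumed response shape: {"data": [{"namespace": {"name": ...}, "name": ...}, ...]}
    return [
        f"{env['namespace']['name']}-{env['name']}"
        for env in payload.get("data", [])
    ]


if __name__ == "__main__":
    print(list_conda_environments())
```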
In that sense, we would be updating the inner logic of Dask-Gateway to retrieve the environments via requests to the conda-store API. What would be the endpoint? Is it public?
Just to note, this is a super high priority, and we need this by 10th May.
During the recent PyCon Nebari tutorial, we had 30+ people trying to connect to the Dask-Gateway cluster at the same time. Some users were able to connect and others were not. Those who were not able to connect ran into the following error message:
Based on the CPU/memory limits the Dask-Gateway pod has by default, it might be a resource limitation. Another possibility is that the Dask-Gateway API was simply overwhelmed. We need to investigate further to isolate the root cause of this issue.
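One way to start isolating this would be to inspect the gateway pod's configured limits and restart history, since repeated OOM kills would point at the memory limit rather than an overwhelmed API. A minimal sketch, assuming kubeconfig access to the cluster and guessing at the namespace and pod labels:

```python
# Diagnostic sketch: print the Dask-Gateway pod's configured requests/limits,
# restart counts, and last termination reason ("OOMKilled" would confirm a
# memory-limit problem). Namespace and label selector are guesses.
from kubernetes import client, config

NAMESPACE = "dev"                                         # assumed Nebari namespace
LABEL_SELECTOR = "app.kubernetes.io/name=dask-gateway"    # assumed pod label


def inspect_gateway_pods() -> None:
    config.load_kube_config()  # or load_incluster_config() when run inside the cluster
    core = client.CoreV1Api()
    pods = core.list_namespaced_pod(NAMESPACE, label_selector=LABEL_SELECTOR)
    for pod in pods.items:
        print(f"pod: {pod.metadata.name}")
        for container in pod.spec.containers:
            print(f"  resources: {container.resources.to_dict()}")
        for status in pod.status.container_statuses or []:
            last = status.last_state.terminated
            reason = last.reason if last else None
            print(f"  restarts: {status.restart_count}, last termination: {reason}")


if __name__ == "__main__":
    inspect_gateway_pods()
```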