
Potential issues when using Dask-Gateway with multiple simultaneous users #1750

Closed
iameskild opened this issue Apr 24, 2023 · 9 comments · Fixed by #1777
Assignees: Adam-D-Lewis
Labels: area: integration/Dask (Issues related to Dask on QHub), impact: high 🟥 (This issue affects most of the nebari users or is a critical issue), type: enhancement 💅🏼 (New feature or request)

Comments

iameskild (Member) commented Apr 24, 2023

During the recent PyCon Nebari tutorial, we had 30+ people trying to connect to the Dask-Gateway cluster at the same time. Some users were able to connect and others were not. Those who were not able to connect ran into the following error message:

ClientConnectorError: Cannot connect to host nebari-dask-gateway-gateway-api.dev:8000 ssl:default [Connect call failed ('10.35.246.207', 8000)]

Given the default CPU/memory limits on the Dask-Gateway pod, this might be a resource limitation. Another possibility is that the Dask-Gateway API was simply overwhelmed. We need to investigate further to isolate the root cause of this issue.
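
For reference, a minimal sketch of the client-side calls each user was making (assuming the standard dask-gateway Python client with the gateway address already configured, as Nebari typically sets up for user pods); each step goes through the gateway API or its proxy, which is where the error above surfaced:

from dask_gateway import Gateway

gateway = Gateway()                     # picks up the configured gateway address
options = gateway.cluster_options()     # fetch the available cluster options
cluster = gateway.new_cluster(options)  # ask the gateway to start a scheduler
client = cluster.get_client()           # connect a Dask client through the gateway proxy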

@iameskild iameskild added area: integration/Dask Issues related to Dask on QHub needs: investigation 🔍 Someone in the team needs to find the root cause and replicate this bug labels Apr 24, 2023
dharhas (Member) commented Apr 24, 2023

@iameskild @rsignell-usgs

What does the ESIPFed cluster use for the dask-gateway pod? Is it the same as default? Have they seen this issue before?

rsignell-usgs (Contributor) commented Apr 25, 2023

Thanks for the heads-up on this issue @dharhas. We have not seen this issue on the ESIP Nebari deployment, but we also haven't had 50 people all try to launch a cluster at the same time. I thought I remembered that someone (the Berkeley Jupyter team?) tested with ~1000 users, all with Dask clusters on Dask Gateway, though. Perhaps I'm mistaken?

Does this ring a bell @yuvipanda ?

The configuration for the Nebari deployment for ESIP is:

  node_groups:
    general:
      instance: m5.2xlarge
      min_nodes: 1
      max_nodes: 1
    user:
      instance: m5.2xlarge
      min_nodes: 1
      max_nodes: 100
    worker:
      instance: m5.2xlarge
      min_nodes: 1
      max_nodes: 450

dharhas (Member) commented Apr 25, 2023

I'm pretty sure Dask Gateway can handle the load. I assume we have a bad configuration/undersized pod or similar. @iameskild said that the pod kept rebooting.

Also, as an FYI, that wasn't the only error message. At various times, retrieving options, getting a client, etc. all failed, i.e. the API was either erroring out or unresponsive as folks were hitting it.

iameskild (Member, Author) commented:

The worker node group is only for the dask-worker and dask-scheduler pods, and our deployment has roughly the same instance type (though on GCP) as the one @rsignell-usgs shared above.

I think we should try to recreate the issue and then see what happens when we increase the dask-gateway resource limits. The default dask-gateway pod resources are:

resources {
  limits = {
    cpu    = "0.5"
    memory = "512Mi"
  }
  requests = {
    cpu    = "250m"
    memory = "50Mi"
  }
}
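
One illustrative way to recreate the load from a single machine is to hit the gateway with many concurrent clients. This is only a sketch, assuming gateway connectivity is already configured; the names here (simulate_user, N_USERS) are hypothetical, not anything in Nebari:

from concurrent.futures import ThreadPoolExecutor

from dask_gateway import Gateway

N_USERS = 30  # roughly the number of simultaneous tutorial attendees

def simulate_user(i):
    # Each simulated user performs the same sequence a real user would.
    gateway = Gateway()
    options = gateway.cluster_options()
    cluster = gateway.new_cluster(options)
    try:
        client = cluster.get_client()
        client.close()
        return f"user {i}: connected"
    finally:
        cluster.shutdown()

with ThreadPoolExecutor(max_workers=N_USERS) as pool:
    for result in pool.map(simulate_user, range(N_USERS)):
        print(result)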

dharhas (Member) commented Apr 25, 2023

wow that seems pretty tiny.

costrouc (Member) commented:

My initial thoughts on why this is happening: Dask-Gateway should not require significant resources. All Dask-Gateway is responsible for is:

  • authenticating the user
  • calling the Kubernetes API
  • checking the available conda environments

Currently, the "check available conda environments" step crawls through the filesystem to find the available environments. We should instead be using the conda-store server API to get them, which would reduce the load on the server.
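
A rough sketch of what that could look like, assuming a reachable conda-store server; the base URL, the /api/v1/environment/ endpoint, and the response shape shown here are assumptions about a typical conda-store deployment, not something confirmed in this thread:

import requests

# Hypothetical in-cluster address for the conda-store server (assumption).
CONDA_STORE_URL = "http://nebari-conda-store-server:5000"

resp = requests.get(f"{CONDA_STORE_URL}/api/v1/environment/", timeout=10)
resp.raise_for_status()

# Assumes the API returns a JSON payload with a "data" list of environments,
# each carrying a name and a namespace.
environments = [
    f"{env['namespace']['name']}/{env['name']}" for env in resp.json()["data"]
]
print(environments)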

viniciusdc (Contributor) commented Apr 25, 2023

An example of this can be seen in the cdsdashboards code: https://github.com/nebari-dev/nebari/blob/develop/nebari/template/stages/07-kubernetes-services/modules/kubernetes/services/jupyterhub/files/jupyterhub/02-spawner.py#L55

In that sense, we would be updating the inner logic of Dask-Gateway to retrieve the environments via requests to the conda-store API. What would the endpoint be? Is it public?

@Adam-D-Lewis Adam-D-Lewis self-assigned this May 2, 2023
pavithraes (Member) commented:

Just to note, this is a super high priority, and we need this by 10th May.

@pavithraes pavithraes added type: enhancement 💅🏼 New feature or request impact: high 🟥 This issue affects most of the nebari users or is a critical issue and removed needs: investigation 🔍 Someone in the team needs to find the root cause and replicate this bug labels May 3, 2023
@github-project-automation github-project-automation bot moved this from New 📬 to Done 💪🏾 in 🪴 Nebari Project Management May 4, 2023