
Potential issues when using Dask-Gateway with multiple simultaneous users #1750

Closed
iameskild opened this issue Apr 24, 2023 · 9 comments · Fixed by #1777
Assignees: Adam-D-Lewis
Labels: area: integration/Dask (Issues related to Dask on QHub), impact: high 🟥 (This issue affects most of the nebari users or is a critical issue), type: enhancement 💅🏼 (New feature or request)

Comments

iameskild (Member) commented Apr 24, 2023

During the recent PyCon Nebari tutorial, we had 30+ people trying to connect to the Dask-Gateway cluster at the same time. Some users were able to connect and others were not. Those who were not able to connect ran into the following error message:

ClientConnectorError: Cannot connect to host nebari-dask-gateway-gateway-api.dev:8000 ssl:default [Connect call failed ('10.35.246.207', 8000)]

Given the default CPU/memory limits on the Dask-Gateway pod, this might be a resource limitation. Another possibility is that the Dask-Gateway API was simply overwhelmed. We need to investigate further to isolate the root cause of this issue.
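
For reference, a minimal sketch of the client-side calls each user was making (assuming the standard dask-gateway Python client with the gateway address already configured, as Nebari typically sets up for user pods); each step goes through the gateway API or its proxy, which is where the error above surfaced:

from dask_gateway import Gateway

gateway = Gateway()                     # picks up the configured gateway address
options = gateway.cluster_options()     # fetch the available cluster options
cluster = gateway.new_cluster(options)  # ask the gateway to start a scheduler
client = cluster.get_client()           # connect a Dask client through the gateway proxy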

@iameskild iameskild added area: integration/Dask Issues related to Dask on QHub needs: investigation 🔍 Someone in the team needs to find the root cause and replicate this bug labels Apr 24, 2023
dharhas (Member) commented Apr 24, 2023

@iameskild @rsignell-usgs

What does the ESIPFed cluster use for the dask-gateway pod? Is it the same as default? Have they seen this issue before?

rsignell-usgs (Contributor) commented Apr 25, 2023

Thanks for the heads-up on this issue @dharhas. We have not seen this issue on the ESIP Nebari deployment, but we also haven't had 50 people all try to launch a cluster at the same time. I thought I remembered that someone (the Berkeley Jupyter team?) tested with ~1000 users, all with Dask clusters on Dask Gateway, though. Perhaps I'm mistaken?

Does this ring a bell @yuvipanda ?

The configuration for the Nebari deployment for ESIP is:

  node_groups:
    general:
      instance: m5.2xlarge
      min_nodes: 1
      max_nodes: 1
    user:
      instance: m5.2xlarge
      min_nodes: 1
      max_nodes: 100
    worker:
      instance: m5.2xlarge
      min_nodes: 1
      max_nodes: 450

dharhas (Member) commented Apr 25, 2023

I'm pretty sure Dask Gateway can handle the load. I assume we have a bad configuration/undersized pod or similar. @iameskild said that the pod kept rebooting.

Also, as an FYI, that wasn't the only error message. At various times, retrieving options, getting a client, etc. all failed, i.e. the API was either erroring out or unresponsive as folks were hitting it.

iameskild (Member, Author) commented:

The worker node group is only for the dask-worker and dask-scheduler pods, and our deployment has roughly the same instance type (though on GCP) as the one @rsignell-usgs shared above.

I think we should try to recreate the issue and then see what happens when we increase the dask-gateway resource limits. The default dask-gateway pod resources are:

resources {
  limits = {
    cpu    = "0.5"
    memory = "512Mi"
  }
  requests = {
    cpu    = "250m"
    memory = "50Mi"
  }
}
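
One illustrative way to recreate the load from a single machine is to hit the gateway with many concurrent clients. This is only a sketch, assuming gateway connectivity is already configured; the names here (simulate_user, N_USERS) are hypothetical, not anything in Nebari:

from concurrent.futures import ThreadPoolExecutor

from dask_gateway import Gateway

N_USERS = 30  # roughly the number of simultaneous tutorial attendees

def simulate_user(i):
    # Each simulated user performs the same sequence a real user would.
    gateway = Gateway()
    options = gateway.cluster_options()
    cluster = gateway.new_cluster(options)
    try:
        client = cluster.get_client()
        client.close()
        return f"user {i}: connected"
    finally:
        cluster.shutdown()

with ThreadPoolExecutor(max_workers=N_USERS) as pool:
    for result in pool.map(simulate_user, range(N_USERS)):
        print(result)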

dharhas (Member) commented Apr 25, 2023

wow that seems pretty tiny.

costrouc (Member) commented:

My initial thoughts on why this is happening: Dask-Gateway should not require significant resources. All Dask-Gateway is responsible for is:

  • authenticating the user
  • calling the Kubernetes API
  • checking the available conda environments

Currently, the "check available conda environments" step crawls through the filesystem to find the available environments. We should instead be using the conda-store server API to get them, which would reduce the load on the server.
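
A rough sketch of what that could look like, assuming a reachable conda-store server; the base URL, the /api/v1/environment/ endpoint, and the response shape shown here are assumptions about a typical conda-store deployment, not something confirmed in this thread:

import requests

# Hypothetical in-cluster address for the conda-store server (assumption).
CONDA_STORE_URL = "http://nebari-conda-store-server:5000"

resp = requests.get(f"{CONDA_STORE_URL}/api/v1/environment/", timeout=10)
resp.raise_for_status()

# Assumes the API returns a JSON payload with a "data" list of environments,
# each carrying a name and a namespace.
environments = [
    f"{env['namespace']['name']}/{env['name']}" for env in resp.json()["data"]
]
print(environments)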

viniciusdc (Contributor) commented Apr 25, 2023

An example of this can be seen in the cdsdashboards code: https://github.com/nebari-dev/nebari/blob/develop/nebari/template/stages/07-kubernetes-services/modules/kubernetes/services/jupyterhub/files/jupyterhub/02-spawner.py#L55

In that sense, we would be updating the inner logic of Dask-Gateway to retrieve the environments via requests to the conda-store API. What would the endpoint be? Is it public?

@Adam-D-Lewis Adam-D-Lewis self-assigned this May 2, 2023
pavithraes (Member) commented:

Just to note, this is a super high priority, and we need this by 10th May.

@pavithraes pavithraes added type: enhancement 💅🏼 New feature or request impact: high 🟥 This issue affects most of the nebari users or is a critical issue and removed needs: investigation 🔍 Someone in the team needs to find the root cause and replicate this bug labels May 3, 2023
@github-project-automation github-project-automation bot moved this from New 📬 to Done 💪🏾 in 🪴 Nebari Project Management May 4, 2023