
Slurm Job Fails Due to Missing SSL Certificates When Creating Cluster using dask-gateway-server #705

Open
woestler opened this issue May 29, 2023 · 3 comments



woestler commented May 29, 2023

When I created a cluster on HPC using Slurm and dask-gateway-server, I ran into a problem. My understanding of the process is as follows: when dask-gateway-server receives the new_cluster command from the client, it converts it into an sbatch command. I edited the dask_gateway_server/backends/jobqueue/slurm.py file to print the variables cmd, env, and script in get_submit_cmd_env_stdin; the output is as follows:

cmd

```
['/usr/bin/sbatch', '--parsable', '--job-name=dask-gateway', '--chdir=/home/dask/.dask-gateway/014af831909a4d8ab6b900b03fc9598d', '--output=/home/dask/.dask-gateway/014af831909a4d8ab6b900b03fc9598d/dask-scheduler-014af831909a4d8ab6b900b03fc9598d.log', '--cpus-per-task=2', '--mem=4096M', '--export=DASK_DISTRIBUTED__COMM__REQUIRE_ENCRYPTION,DASK_DISTRIBUTED__COMM__TLS__CA_FILE,DASK_DISTRIBUTED__COMM__TLS__SCHEDULER__CERT,DASK_DISTRIBUTED__COMM__TLS__SCHEDULER__KEY,DASK_GATEWAY_API_TOKEN,DASK_GATEWAY_API_URL,DASK_GATEWAY_CLUSTER_NAME']
```

env

```
{'DASK_DISTRIBUTED__COMM__REQUIRE_ENCRYPTION': 'True', 'DASK_GATEWAY_API_URL': 'http://local3:8000/api', 'DASK_GATEWAY_API_TOKEN': '3497e6f64a16424eae3b5545f151fb79', 'DASK_GATEWAY_CLUSTER_NAME': '014af831909a4d8ab6b900b03fc9598d', 'DASK_DISTRIBUTED__COMM__TLS__CA_FILE': '/home/dask/.dask-gateway/014af831909a4d8ab6b900b03fc9598d/dask.crt', 'DASK_DISTRIBUTED__COMM__TLS__SCHEDULER__KEY': '/home/dask/.dask-gateway/014af831909a4d8ab6b900b03fc9598d/dask.pem', 'DASK_DISTRIBUTED__COMM__TLS__SCHEDULER__CERT': '/home/dask/.dask-gateway/014af831909a4d8ab6b900b03fc9598d/dask.crt'}
```

script

```sh
#!/bin/sh
source /opt/dask-gateway/anaconda/bin/activate /opt/dask
dask-scheduler --protocol tls --port 0 --host 0.0.0.0 --dashboard-address 0.0.0.0:0 --preload dask_gateway.scheduler_preload --dg-api-address 0.0.0.0:0 --dg-heartbeat-period 15 --dg-adaptive-period 3.0 --dg-idle-timeout 0.0
```
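To illustrate the submission flow described above, here is a self-contained sketch (build_sbatch_cmd is a hypothetical helper for illustration only, not dask-gateway's actual code). The key detail is that `--export` lists only the variable *names*; sbatch captures the values from the submitting process's environment and re-injects them into the job's environment on whichever node runs it:

```python
def build_sbatch_cmd(job_name, workdir, env):
    """Build an sbatch command shaped like the one printed above.

    --export passes only the variable names; sbatch resolves their
    values from the environment of the submitting process.
    """
    return [
        "/usr/bin/sbatch",
        "--parsable",
        f"--job-name={job_name}",
        f"--chdir={workdir}",
        f"--export={','.join(sorted(env))}",
    ]


env = {
    "DASK_GATEWAY_API_TOKEN": "dummy-token",
    "DASK_DISTRIBUTED__COMM__TLS__CA_FILE": "/home/dask/.dask-gateway/abc/dask.crt",
}
cmd = build_sbatch_cmd("dask-gateway", "/home/dask/.dask-gateway/abc", env)
print(cmd[-1])
# --export=DASK_DISTRIBUTED__COMM__TLS__CA_FILE,DASK_GATEWAY_API_TOKEN
```

Note that the values exported this way are the *paths* to dask.crt and dask.pem, not the file contents, which is why the job depends on those files existing on the node that runs it.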

When a Slurm node other than the edge node receives this job and begins execution, dask-scheduler tries to load the dask.crt and dask.pem files referenced by the environment variables above, but these files do not exist on that node. The Slurm job fails with the following error:

```
2023-05-29 17:09:58,047 - distributed.preloading - INFO - Import preload module: dask_gateway.scheduler_preload
/opt/dask/lib/python3.10/site-packages/distributed/cli/dask_scheduler.py:140: FutureWarning: dask-scheduler is deprecated and will be removed in a future release; use `dask scheduler` instead
  warnings.warn(
2023-05-29 17:09:58,049 - distributed.scheduler - INFO - -----------------------------------------------
2023-05-29 17:09:58,050 - distributed.preloading - INFO - Creating preload: dask_gateway.scheduler_preload
2023-05-29 17:09:58,050 - distributed.preloading - INFO - Import preload module: dask_gateway.scheduler_preload
2023-05-29 17:09:58,050 - distributed.scheduler - INFO - End scheduler
Traceback (most recent call last):
  File "/opt/dask/bin/dask-scheduler", line 8, in <module>
    sys.exit(main())
  File "/opt/dask/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/opt/dask/lib/python3.10/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/opt/dask/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/dask/lib/python3.10/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/opt/dask/lib/python3.10/site-packages/distributed/cli/dask_scheduler.py", line 249, in main
    asyncio.run(run())
  File "/opt/dask/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/dask/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/opt/dask/lib/python3.10/site-packages/distributed/cli/dask_scheduler.py", line 209, in run
    scheduler = Scheduler(
  File "/opt/dask/lib/python3.10/site-packages/distributed/scheduler.py", line 3464, in __init__
    self.connection_args = self.security.get_connection_args("scheduler")
  File "/opt/dask/lib/python3.10/site-packages/distributed/security.py", line 342, in get_connection_args
    "ssl_context": self._get_tls_context(tls, ssl.Purpose.SERVER_AUTH),
  File "/opt/dask/lib/python3.10/site-packages/distributed/security.py", line 299, in _get_tls_context
    ctx = ssl.create_default_context(purpose=purpose, cafile=ca)
  File "/opt/dask/lib/python3.10/ssl.py", line 766, in create_default_context
    context.load_verify_locations(cafile, capath, cadata)
FileNotFoundError: [Errno 2] No such file or directory
```
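The crash itself is easy to reproduce in isolation: Python's ssl module raises FileNotFoundError as soon as it is asked to load a CA file that does not exist on the local node. A minimal sketch (the path below is deliberately nonexistent, standing in for a staging path that exists only on the gateway host):

```python
import ssl

# The same call distributed/security.py makes, with a CA path that
# is present on the gateway node but absent on the compute node.
try:
    ssl.create_default_context(
        purpose=ssl.Purpose.SERVER_AUTH,
        cafile="/home/dask/.dask-gateway/nonexistent/dask.crt",
    )
except FileNotFoundError as exc:
    print(f"{type(exc).__name__}: errno={exc.errno}")
# FileNotFoundError: errno=2
```

So the scheduler never even reaches the point of binding a TLS port; it dies while constructing the SSL context.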

@jcrist @consideRatio @TomAugspurger @jacobtomlinson @martindurant


selvavm commented Dec 23, 2023

Hi, I am also facing the same issue. Can someone please help me with this? My understanding is that dask-gateway sets the environment variables to point at the location of dask.crt in the staging directory, but it never copies dask.crt to that location on the node that runs the job.
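If the root cause is that the staging directory is not visible to the compute nodes, one possible workaround is to stage cluster files on a shared filesystem. This is a hedged sketch, assuming your deployment has a filesystem (here /shared/dask-gateway, an assumed path) mounted on both the gateway host and all Slurm nodes, and assuming your dask-gateway-server version exposes the staging_directory option on SlurmClusterConfig:

```python
# dask_gateway_config.py on the gateway server (illustrative sketch;
# /shared/dask-gateway is an assumption and must be mounted with the
# same path on the gateway host and every compute node).
c.SlurmClusterConfig.staging_directory = "/shared/dask-gateway"
```

With that in place, the dask.crt and dask.pem files the gateway writes per cluster should be readable at the same paths from the node that runs the scheduler job.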


selvavm commented Dec 23, 2023

@woestler - Did you resolve this?
@TomAugspurger, @jacobtomlinson, @martindurant - Any support will be much appreciated

@jlynchMicron

Could be related: dask/distributed#4617
