
Long/inconsistent requeue times with 2.9.1 + slurm #2117

Closed

keien opened this issue Oct 8, 2020 · 10 comments

Comments

@keien

keien commented Oct 8, 2020

Required Info:

  • AWS ParallelCluster version [e.g. 2.9.0]: 2.9.1
  • Full cluster configuration without any credentials or personal data
[aws]
aws_region_name = us-west-2

[aliases]
ssh = ssh {CFN_USER}@{MASTER_IP} {ARGS}

[global]
cluster_template = default
update_check = true
sanity_check = true

[cluster default]
key_name = ec2-admin
base_os = centos7
scheduler = slurm
master_instance_type = m5.large
vpc_settings = default
queue_settings = compute
tags = {"purpose": "development-ondemand"}
custom_ami = ami-0cbdde0662f037f7c
master_root_volume_size = 100
compute_root_volume_size = 100

[vpc default]
vpc_id = vpc-fa52d99e
master_subnet_id = subnet-00bbd054b223b7501
vpc_security_group_id = sg-74e42c12
use_public_ips = false

[queue compute]
enable_efa = false
compute_resource_settings = default

[compute_resource default]
instance_type = m5.xlarge
  • Cluster name: slurm-test
  • [Optional] Arn of the cluster CloudFormation main stack: arn:aws:cloudformation:us-west-2:171496337684:stack/parallelcluster-slurm-test/047735a0-09a4-11eb-8cb8-067714b800aa

Bug description and how to reproduce:
When a node with a job is terminated, the time it takes for slurm to update the job/node state is quite variable - sometimes a few minutes, but I've seen it take nearly an hour. Here is an example run:

first job running:

[ip-172-31-58-150]keien:slurm-test [14:06]
$ cat requeue.sh 
#!/bin/bash

hostname
date
sleep 600

[ip-172-31-58-150]keien:~ [14:06]
$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON) 
                 2   compute requeue.    keien  R       2:28      1 compute-dy-m5xlarge-1

[ip-172-31-58-150]keien:~ [14:06]
$ sinfo
PARTITION    AVAIL  TIMELIMIT  NODES  STATE NODELIST 
compute*        up   infinite     9  idle~ compute-dy-m5xlarge-[2-10]
compute*        up   infinite      1    mix compute-dy-m5xlarge-1

[ip-172-31-58-150]keien:slurm-test [14:07]
$ cat slurm-2.out 
compute-dy-m5xlarge-1
Thu Oct  8 14:03:45 PDT 2020

terminate node:

[compute-dy-m5xlarge-1]keien:~ [14:07]
$ sudo shutdown -h now

We trust you have received the usual lecture from the local System
Administrator. It usually boils down to these three things:

    #1) Respect the privacy of others.
    #2) Think before you type.
    #3) With great power comes great responsibility.

[sudo] password for keien:
Connection to compute-dy-m5xlarge-1 closed by remote host.
Connection to compute-dy-m5xlarge-1 closed.

right after termination:


[ip-172-31-58-150]keien:slurm-test [14:08]
$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON) 
                 2   compute requeue.    keien  R       5:49      1 compute-dy-m5xlarge-1

[ip-172-31-58-150]keien:slurm-test [14:09]
$ sinfo
PARTITION    AVAIL  TIMELIMIT  NODES  STATE NODELIST 
compute*        up   infinite     9  idle~ compute-dy-m5xlarge-[2-10]
compute*        up   infinite      1    mix compute-dy-m5xlarge-1

attempting to salloc while in this state:

[ip-172-31-58-150]keien:slurm-test [14:15]
$ salloc bash
salloc: Granted job allocation 3
keien@ip-172-31-58-150 ~/ws/slurm-test $ srun hostname
srun: error: Task launch for 3.0 failed on node compute-dy-m5xlarge-1: Communication connection failure
srun: error: Application launch failed: Communication connection failure
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.

the salloc job gets stuck in CG after the above:

[ip-172-31-58-150]keien:slurm-test [14:17]
$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON) 
                 3   compute     bash    keien CG       1:03      1 compute-dy-m5xlarge-1
                 2   compute requeue.    keien  R      14:02      1 compute-dy-m5xlarge-1

job restarted:

[ip-172-31-58-150]keien:slurm-test [15:15]
$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON) 
                 2   compute requeue.    keien  R       4:38      1 compute-dy-m5xlarge-2

[ip-172-31-58-150]keien:slurm-test [15:15]
$ sinfo
PARTITION    AVAIL  TIMELIMIT  NODES  STATE NODELIST 
compute*        up   infinite     9  idle~ compute-dy-m5xlarge-[1,3-10]
compute*        up   infinite      1    mix compute-dy-m5xlarge-2

[ip-172-31-58-150]keien:slurm-test [15:15]
$ cat slurm-2.out 
compute-dy-m5xlarge-2
Thu Oct  8 15:10:45 PDT 2020

logs:

[root@ip-172-31-58-150 log]# cat slurmctld.log 
[2020-10-08T13:33:39.952] error: chdir(/var/log): Permission denied
[2020-10-08T13:33:39.953] slurmctld version 20.02.4 started on cluster parallelcluster
[2020-10-08T13:33:42.279] No memory enforcing mechanism configured.
[2020-10-08T13:33:42.279] layouts: no layout to initialize
[2020-10-08T13:33:42.711] layouts: loading entities/relations information
[2020-10-08T13:33:42.711] error: Could not open node state file /var/spool/slurm.state/node_state: No such file or directory
[2020-10-08T13:33:42.711] error: NOTE: Trying backup state save file. Information may be lost!        
[2020-10-08T13:33:42.711] No node state file (/var/spool/slurm.state/node_state.old) to recover       
[2020-10-08T13:33:42.711] error: Could not open job state file /var/spool/slurm.state/job_state: No such file or directory
[2020-10-08T13:33:42.711] error: NOTE: Trying backup state save file. Jobs may be lost!
[2020-10-08T13:33:42.711] No job state file (/var/spool/slurm.state/job_state.old) to recover
[2020-10-08T13:33:42.712] select/cons_tres: select_p_node_init
[2020-10-08T13:33:42.712] select/cons_tres: preparing for 3 partitions
[2020-10-08T13:33:42.997] error: gres_device_major: stat(/dev/nvidia0): No such file or directory     
[2020-10-08T13:33:42.998] error: gres_device_major: stat(/dev/nvidia0): No such file or directory     
[2020-10-08T13:33:42.999] error: gres_device_major: stat(/dev/nvidia0): No such file or directory     
[2020-10-08T13:33:42.999] error: gres_device_major: stat(/dev/nvidia0): No such file or directory     
[2020-10-08T13:33:43.000] error: gres_device_major: stat(/dev/nvidia0): No such file or directory     
[2020-10-08T13:33:43.000] error: gres_device_major: stat(/dev/nvidia0): No such file or directory     
[2020-10-08T13:33:43.001] error: gres_device_major: stat(/dev/nvidia0): No such file or directory     
[2020-10-08T13:33:43.002] error: gres_device_major: stat(/dev/nvidia0): No such file or directory     
[2020-10-08T13:33:43.002] error: gres_device_major: stat(/dev/nvidia0): No such file or directory     
[2020-10-08T13:33:43.003] error: gres_device_major: stat(/dev/nvidia0): No such file or directory     
[2020-10-08T13:33:43.004] error: gres_device_major: stat(/dev/nvidia0): No such file or directory     
[2020-10-08T13:33:43.004] error: gres_device_major: stat(/dev/nvidia0): No such file or directory     
[2020-10-08T13:33:43.005] error: gres_device_major: stat(/dev/nvidia0): No such file or directory     
[2020-10-08T13:33:43.005] error: gres_device_major: stat(/dev/nvidia0): No such file or directory     
[2020-10-08T13:33:43.006] error: gres_device_major: stat(/dev/nvidia0): No such file or directory     
[2020-10-08T13:33:43.007] error: gres_device_major: stat(/dev/nvidia0): No such file or directory     
[2020-10-08T13:33:43.007] error: gres_device_major: stat(/dev/nvidia0): No such file or directory     
[2020-10-08T13:33:43.008] error: gres_device_major: stat(/dev/nvidia0): No such file or directory     
[2020-10-08T13:33:43.008] error: gres_device_major: stat(/dev/nvidia0): No such file or directory     
[2020-10-08T13:33:43.009] error: gres_device_major: stat(/dev/nvidia0): No such file or directory     
[2020-10-08T13:33:43.010] error: gres_device_major: stat(/dev/nvidia0): No such file or directory     
[2020-10-08T13:33:43.010] error: gres_device_major: stat(/dev/nvidia1): No such file or directory     
[2020-10-08T13:33:43.010] error: gres_device_major: stat(/dev/nvidia2): No such file or directory     
[2020-10-08T13:33:43.010] error: gres_device_major: stat(/dev/nvidia3): No such file or directory     
[2020-10-08T13:33:43.010] error: gres_device_major: stat(/dev/nvidia0): No such file or directory     
[2020-10-08T13:33:43.010] error: gres_device_major: stat(/dev/nvidia1): No such file or directory     
[2020-10-08T13:33:43.010] error: gres_device_major: stat(/dev/nvidia2): No such file or directory     
[2020-10-08T13:33:43.010] error: gres_device_major: stat(/dev/nvidia3): No such file or directory     
[2020-10-08T13:33:43.011] error: gres_device_major: stat(/dev/nvidia0): No such file or directory     
[2020-10-08T13:33:43.011] error: gres_device_major: stat(/dev/nvidia1): No such file or directory     
[2020-10-08T13:33:43.011] error: gres_device_major: stat(/dev/nvidia2): No such file or directory     
[2020-10-08T13:33:43.011] error: gres_device_major: stat(/dev/nvidia3): No such file or directory     
[2020-10-08T13:33:43.011] error: gres_device_major: stat(/dev/nvidia0): No such file or directory     
[2020-10-08T13:33:43.011] error: gres_device_major: stat(/dev/nvidia1): No such file or directory     
[2020-10-08T13:33:43.012] error: gres_device_major: stat(/dev/nvidia2): No such file or directory     
[2020-10-08T13:33:43.012] error: gres_device_major: stat(/dev/nvidia3): No such file or directory     
[2020-10-08T13:33:43.012] error: gres_device_major: stat(/dev/nvidia0): No such file or directory     
[2020-10-08T13:33:43.012] error: gres_device_major: stat(/dev/nvidia1): No such file or directory     
[2020-10-08T13:33:43.012] error: gres_device_major: stat(/dev/nvidia2): No such file or directory     
[2020-10-08T13:33:43.012] error: gres_device_major: stat(/dev/nvidia3): No such file or directory     
[2020-10-08T13:33:43.013] error: gres_device_major: stat(/dev/nvidia0): No such file or directory     
[2020-10-08T13:33:43.013] error: gres_device_major: stat(/dev/nvidia1): No such file or directory     
[2020-10-08T13:33:43.013] error: gres_device_major: stat(/dev/nvidia2): No such file or directory     
[2020-10-08T13:33:43.013] error: gres_device_major: stat(/dev/nvidia3): No such file or directory     
[2020-10-08T13:33:43.013] error: gres_device_major: stat(/dev/nvidia0): No such file or directory     
[2020-10-08T13:33:43.013] error: gres_device_major: stat(/dev/nvidia1): No such file or directory     
[2020-10-08T13:33:43.013] error: gres_device_major: stat(/dev/nvidia2): No such file or directory     
[2020-10-08T13:33:43.013] error: gres_device_major: stat(/dev/nvidia3): No such file or directory     
[2020-10-08T13:33:43.014] error: gres_device_major: stat(/dev/nvidia0): No such file or directory     
[2020-10-08T13:33:43.014] error: gres_device_major: stat(/dev/nvidia1): No such file or directory     
[2020-10-08T13:33:43.014] error: gres_device_major: stat(/dev/nvidia2): No such file or directory     
[2020-10-08T13:33:43.014] error: gres_device_major: stat(/dev/nvidia3): No such file or directory     
[2020-10-08T13:33:43.015] error: gres_device_major: stat(/dev/nvidia0): No such file or directory     
[2020-10-08T13:33:43.015] error: gres_device_major: stat(/dev/nvidia1): No such file or directory     
[2020-10-08T13:33:43.015] error: gres_device_major: stat(/dev/nvidia2): No such file or directory     
[2020-10-08T13:33:43.015] error: gres_device_major: stat(/dev/nvidia3): No such file or directory     
[2020-10-08T13:33:43.015] error: gres_device_major: stat(/dev/nvidia0): No such file or directory     
[2020-10-08T13:33:43.015] error: gres_device_major: stat(/dev/nvidia1): No such file or directory     
[2020-10-08T13:33:43.015] error: gres_device_major: stat(/dev/nvidia2): No such file or directory     
[2020-10-08T13:33:43.015] error: gres_device_major: stat(/dev/nvidia3): No such file or directory     
[2020-10-08T13:33:43.016] error: Could not open reservation state file /var/spool/slurm.state/resv_state: No such file or directory
[2020-10-08T13:33:43.016] error: NOTE: Trying backup state save file. Reservations may be lost        
[2020-10-08T13:33:43.016] No reservation state file (/var/spool/slurm.state/resv_state.old) to recover
[2020-10-08T13:33:43.016] error: Could not open trigger state file /var/spool/slurm.state/trigger_state: No such file or directory
[2020-10-08T13:33:43.016] error: NOTE: Trying backup state save file. Triggers may be lost!
[2020-10-08T13:33:43.016] No trigger state file (/var/spool/slurm.state/trigger_state.old) to recover 
[2020-10-08T13:33:43.016] read_slurm_conf: backup_controller not specified
[2020-10-08T13:33:43.016] Reinitializing job accounting state
[2020-10-08T13:33:43.016] select/cons_tres: reconfigure
[2020-10-08T13:33:43.016] select/cons_tres: select_p_node_init
[2020-10-08T13:33:43.016] select/cons_tres: preparing for 3 partitions
[2020-10-08T13:33:43.016] Running as primary controller
[2020-10-08T13:33:43.122] No parameter for mcs plugin, default values set
[2020-10-08T13:33:43.122] mcs: MCSParameters = (null). ondemand set.
[2020-10-08T13:34:43.393] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
[2020-10-08T13:38:43.546] error: Could not open job state file /var/spool/slurm.state/job_state: No such file or directory
[2020-10-08T13:38:43.546] error: NOTE: Trying backup state save file. Jobs may be lost!
[2020-10-08T13:38:43.546] No job state file (/var/spool/slurm.state/job_state.old) found
[2020-10-08T13:59:04.324] _slurm_rpc_submit_batch_job: JobId=2 InitPrio=4294901759 usec=1258
[2020-10-08T13:59:04.472] sched: Allocate JobId=2 NodeList=compute-dy-m5xlarge-1 #CPUs=1 Partition=compute
[2020-10-08T14:03:28.136] Node compute-dy-m5xlarge-1 now responding
[2020-10-08T14:03:43.670] job_time_limit: Configuration for JobId=2 complete
[2020-10-08T14:03:43.670] Resetting JobId=2 start time for node power up
[2020-10-08T14:13:43.077] error: Nodes compute-dy-m5xlarge-1 not responding
[2020-10-08T14:15:27.168] sched: _slurm_rpc_allocate_resources JobId=3 NodeList=compute-dy-m5xlarge-1 usec=284
[2020-10-08T14:15:27.168] prolog_running_decr: Configuration for JobId=3 is complete
[2020-10-08T14:15:38.897] job_step_signal: JobId=3 StepId=0 not found
[2020-10-08T14:16:10.000] job_step_signal: JobId=3 StepId=0 not found
[2020-10-08T14:16:30.292] _job_complete: JobId=3 WEXITSTATUS 233
[2020-10-08T14:16:30.292] _job_complete: JobId=3 done
[2020-10-08T14:18:43.275] error: Nodes compute-dy-m5xlarge-1 not responding
[2020-10-08T14:19:43.321] Resending TERMINATE_JOB request JobId=3 Nodelist=compute-dy-m5xlarge-1      
[2020-10-08T14:23:43.479] error: Nodes compute-dy-m5xlarge-1 not responding
[2020-10-08T14:28:43.768] error: Nodes compute-dy-m5xlarge-1 not responding
[2020-10-08T14:33:43.965] error: Nodes compute-dy-m5xlarge-1 not responding
[2020-10-08T14:38:43.163] error: Nodes compute-dy-m5xlarge-1 not responding
[2020-10-08T14:43:43.368] error: Nodes compute-dy-m5xlarge-1 not responding
[2020-10-08T14:48:43.566] error: Nodes compute-dy-m5xlarge-1 not responding
[2020-10-08T14:53:43.896] error: Nodes compute-dy-m5xlarge-1 not responding
[2020-10-08T14:58:43.102] error: Nodes compute-dy-m5xlarge-1 not responding
[2020-10-08T15:03:09.284] requeue job JobId=2 due to failure of node compute-dy-m5xlarge-1
[2020-10-08T15:03:09.284] Requeuing JobId=2
[2020-10-08T15:03:09.284] cleanup_completing: JobId=3 completion process took 2799 seconds
[2020-10-08T15:03:09.284] error: Nodes compute-dy-m5xlarge-1 not responding, setting DOWN
[2020-10-08T15:05:30.789] backfill: Started JobId=2 in compute on compute-dy-m5xlarge-2
[2020-10-08T15:10:33.585] Node compute-dy-m5xlarge-2 now responding
[2020-10-08T15:10:43.589] job_time_limit: Configuration for JobId=2 complete
[2020-10-08T15:10:43.589] Resetting JobId=2 start time for node power up

[root@ip-172-31-58-150 parallelcluster]# cat clustermgtd 
2020-10-08 13:33:59,248 INFO [clustermgtd:main] ClusterManager Startup
2020-10-08 13:33:59,248 INFO [clustermgtd:_get_config] Reading /etc/parallelcluster/slurm_plugin/parallelcluster_clustermgtd.conf
2020-10-08 13:33:59,251 INFO [clustermgtd:set_config] Applying new clustermgtd config: ClustermgtdConfig(_config=<configparser.ConfigParser object at 0x7fadf58e8f60>, region='us-west-2', cluster_name='slurm-test', dynamodb_table='parallelcluster-slurm-test', master_private_ip='172.31.58.150', master_hostname='ip-172-31-58-150.us-west-2.compute.internal cerebras.aws', instance_name_type_mapping={'m5xlarge': 'm5.xlarge'}, _boto3_retry=1, _boto3_config={'retries': {'max_attempts': 1, 'mode': 'standard'}}, loop_time=60, disable_all_cluster_management=False, heartbeat_file_path='/opt/slurm/etc/pcluster/.slurm_plugin/clustermgtd_heartbeat', boto3_config=<botocore.config.Config object at 0x7fadf3fd24a8>, logging_config='/opt/parallelcluster/pyenv/versions/3.6.9/envs/node_virtualenv/lib/python3.6/site-packages/slurm_plugin/logging/parallelcluster_clustermgtd_logging.conf', disable_ec2_health_check=False, disable_scheduled_event_health_check=False, health_check_timeout=180, disable_all_health_checks=False, launch_max_batch_size=500, update_node_address=True, terminate_max_batch_size=1000, node_replacement_timeout=3600, terminate_drain_nodes=True, terminate_down_nodes=True, orphaned_instance_timeout=120, hosted_zone='Z08971062IG8KMY8Z2MYN', dns_domain='slurm-test.pcluster', use_private_hostname=False)
2020-10-08 13:34:00,432 INFO [clustermgtd:_get_config] Reading /etc/parallelcluster/slurm_plugin/parallelcluster_clustermgtd.conf
2020-10-08 13:34:00,434 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Managing cluster...      
2020-10-08 13:34:00,576 - [slurm_plugin.clustermgtd:_manage_compute_fleet_status_transitions] - INFO - Current compute fleet status: RUNNING
2020-10-08 13:34:00,576 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving nodes info from the scheduler
2020-10-08 13:34:05,749 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving list of EC2 instances associated with the cluster
2020-10-08 13:34:06,580 - [slurm_plugin.clustermgtd:_perform_health_check_actions] - INFO - Performing instance health check actions
2020-10-08 13:34:07,018 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Performing node maintenance actions
2020-10-08 13:34:07,019 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Following nodes are currently in replacement: (x0) []
2020-10-08 13:34:07,019 - [slurm_plugin.clustermgtd:_terminate_orphaned_instances] - INFO - Checking for orphaned instance

[root@ip-172-31-58-150 parallelcluster]# cat slurm_resume.log 
2020-10-08 13:59:05,759 - [slurm_plugin.resume:_resume] - INFO - Launching EC2 instances for the following Slurm nodes: compute-dy-m5xlarge-1
2020-10-08 13:59:05,934 - [slurm_plugin.common:add_instances_for_nodes] - INFO - Launching instances for slurm nodes (x1) ['compute-dy-m5xlarge-1']
2020-10-08 13:59:07,620 - [slurm_plugin.common:_update_slurm_node_addrs] - INFO - Nodes are now configured with instances: (x1) ["('compute-dy-m5xlarge-1', EC2Instance(id='i-077d1784d495dd49f', private_ip='172.31.55.18', hostname='ip-172-31-55-18', launch_time=datetime.datetime(2020, 10, 8, 20, 59, 7, tzinfo=tzutc())))"]
2020-10-08 13:59:07,620 - [slurm_plugin.common:_store_assigned_hostnames] - INFO - Saving assigned hostnames in DynamoDB
2020-10-08 13:59:07,652 - [slurm_plugin.common:_store_assigned_hostnames] - INFO - Database update: COMPLETED
2020-10-08 13:59:07,652 - [slurm_plugin.common:_update_dns_hostnames] - INFO - Updating DNS records for Z08971062IG8KMY8Z2MYN - slurm-test.pcluster
2020-10-08 13:59:08,503 - [slurm_plugin.common:_update_dns_hostnames] - INFO - DNS records update: COMPLETED
2020-10-08 13:59:08,504 - [slurm_plugin.resume:_resume] - INFO - Successfully launched nodes (x1) ['compute-dy-m5xlarge-1']
2020-10-08 13:59:08,506 - [slurm_plugin.resume:main] - INFO - ResumeProgram finished.
2020-10-08 15:05:32,031 - [slurm_plugin.resume:_resume] - INFO - Launching EC2 instances for the following Slurm nodes: compute-dy-m5xlarge-2
2020-10-08 15:05:32,105 - [slurm_plugin.common:add_instances_for_nodes] - INFO - Launching instances for slurm nodes (x1) ['compute-dy-m5xlarge-2']
2020-10-08 15:05:33,728 - [slurm_plugin.common:_update_slurm_node_addrs] - INFO - Nodes are now configured with instances: (x1) ["('compute-dy-m5xlarge-2', EC2Instance(id='i-01186bf7ece5d1a85', private_ip='172.31.49.131', hostname='ip-172-31-49-131', launch_time=datetime.datetime(2020, 10, 8, 22, 5, 33, tzinfo=tzutc())))"]
2020-10-08 15:05:33,728 - [slurm_plugin.common:_store_assigned_hostnames] - INFO - Saving assigned hostnames in DynamoDB
2020-10-08 15:05:33,763 - [slurm_plugin.common:_store_assigned_hostnames] - INFO - Database update: COMPLETED
2020-10-08 15:05:33,763 - [slurm_plugin.common:_update_dns_hostnames] - INFO - Updating DNS records for Z08971062IG8KMY8Z2MYN - slurm-test.pcluster
2020-10-08 15:05:34,257 - [slurm_plugin.common:_update_dns_hostnames] - INFO - DNS records update: COMPLETED
2020-10-08 15:05:34,258 - [slurm_plugin.resume:_resume] - INFO - Successfully launched nodes (x1) ['compute-dy-m5xlarge-2']
2020-10-08 15:05:34,260 - [slurm_plugin.resume:main] - INFO - ResumeProgram finished.

[root@ip-172-31-58-150 parallelcluster]# cat slurm_suspend.log 
2020-10-08 15:03:10,025 - [slurm_plugin.suspend:main] - INFO - Suspending following nodes. Clustermgtd will cleanup orphaned instances: compute-dy-m5xlarge-1
2020-10-08 15:03:10,026 - [slurm_plugin.suspend:main] - INFO - SuspendProgram finished. Nodes will be available after SuspendTimeout

As you can see, the above node took just under an hour to bring down.

I'd really like to see slurm update the node state a lot faster, so we can avoid both the long restart time and a bunch of jobs getting stuck trying to run on a dead node.

@tilne
Contributor

tilne commented Oct 9, 2020

I believe the two components that can mark a node as unhealthy are clustermgtd and the slurm daemons.

Why doesn't clustermgtd mark the node as unhealthy?

Before I detail what I think the issue is, I want to address why clustermgtd doesn't mark this node as unhealthy. clustermgtd periodically checks for the following on each compute node in the cluster:

  • healthy instance statuses
  • healthy system statuses
  • no scheduled events

Shutting down the instance doesn't affect scheduled events. When I shut down an instance as you did in your example, I see the following statuses:

(develop) ❯ aws ec2 describe-instance-status --include-all-instances --region us-east-1 --instance-ids i-01fce2bb80df78785
{
    "InstanceStatuses": [
        {
            "AvailabilityZone": "us-east-1f",
            "InstanceId": "i-01fce2bb80df78785",
            "InstanceState": {
                "Code": 80,
                "Name": "stopped"
            },
            "InstanceStatus": {
                "Status": "not-applicable"
            },
            "SystemStatus": {
                "Status": "not-applicable"
            }
        }
    ]
}

not-applicable is not a state clustermgtd would consider unhealthy (see here). Additionally, shutting down a node like this results in EC2 considering the instance stopped, which means it will be excluded from the results returned when we call the API. Generally, the intention is to have the slurm daemons handle as much of this as possible.
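
For reference, here is a quick way to see that filtering behavior from the CLI (a diagnostic sketch reusing the instance ID and region from the example above):

# Without --include-all-instances, describe-instance-status only reports running
# instances, so a stopped instance simply drops out of the results:
aws ec2 describe-instance-status --region us-east-1 --instance-ids i-01fce2bb80df78785

# With the flag, the stopped instance is reported with "not-applicable" statuses,
# which clustermgtd does not treat as unhealthy:
aws ec2 describe-instance-status --include-all-instances --region us-east-1 --instance-ids i-01fce2bb80df78785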

Why don't the slurm daemons mark the node as down?

I believe the reason it takes so long for slurm to replace the node is the value configured for ResumeTimeout. From the slurm.conf documentation:

Maximum time permitted (in seconds) between when a node resume request is issued and when the node is actually available for use. Nodes which fail to respond in this time frame will be marked DOWN and the jobs scheduled on the node requeued. The default value is 60 seconds. Related configuration options include ResumeProgram, ResumeRate, SuspendRate, SuspendTime, SuspendTimeout, SuspendProgram, SuspendExcNodes and SuspendExcParts. More information is available at the Slurm web site ( https://slurm.schedmd.com/power_save.html ).

This value is set to 3600 seconds (one hour) by ParallelCluster. I believe that value is much larger than the default of 60 seconds to handle the case where an AMI does not have the software ParallelCluster requires.
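
If you want to experiment with a shorter value, a minimal sketch on the master node (this assumes the ParallelCluster-generated Slurm configuration lives under /opt/slurm/etc; the exact file holding ResumeTimeout may vary by version):

# Find where ResumeTimeout is set in the generated Slurm configuration
sudo grep -r "ResumeTimeout" /opt/slurm/etc/

# After editing the value in the file found above, ask slurmctld to re-read its config
sudo scontrol reconfigure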

Why doesn't the SlurmdTimeout parameter apply?

I don't have an answer for this. From the same slurm.conf documentation:

The interval, in seconds, that the Slurm controller waits for slurmd to respond before configuring that node's state to DOWN.

To me, it would seem that if a node's slurmd wasn't responding, then this timeout would take precedence over the ResumeTimeout. Perhaps slurmctld waits for the duration of the ResumeTimeout even after a node has responded once, in order to accommodate nodes that are intermittently responsive while resuming. I'll get clarification on this.
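
For anyone comparing the two timeouts on their own cluster, both values and the state slurmctld has recorded for a node can be read on the master node (a generic Slurm diagnostic, nothing ParallelCluster-specific):

# Timeouts the controller is currently using
scontrol show config | grep -E "SlurmdTimeout|ResumeTimeout"

# State and reason recorded for the affected node
scontrol show node compute-dy-m5xlarge-1 | grep -E "State|Reason"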

@keien
Author

keien commented Oct 9, 2020

Just to clarify: in the example I'm manually shutting down the instance for demonstration. What I'm actually trying to handle in practice is spot nodes being terminated by AWS, and certain jobs causing the entire node to become unresponsive.

I guess my follow-up questions would be:

  1. Should I consider the fact that stopped nodes aren't considered unhealthy a bug and expect the behavior to change in the future?
  2. I'm a little confused about ResumeTimeout - it sounds like it dictates the timeout for nodes that have been requested but fail during boot and never report back to the master, which seems different from nodes that were brought up successfully but subsequently became unresponsive
  3. Assuming ResumeTimeout is the right config, should I update it myself to something shorter? If 60 seconds is too short, what would the recommendation be?
  4. I would also like to know why SlurmdTimeout is not applying, as that seems more relevant to the situation

@rexcsn
Contributor

rexcsn commented Oct 12, 2020

Hi @keien, checking in on this issue,

Regarding your use-case question: the case of a spot node being terminated should be covered by the current logic.
Our daemon clustermgtd considers a node whose backing instance has been terminated as unhealthy and actively marks the node as down in slurm, which triggers a job requeue immediately.
To test this yourself, you can terminate any instance backing a compute node in the cluster (see the sketch below).
So if your spot interruption behavior is the default (terminate the instance on interruption), the system should be responsive in that situation. Would this satisfy your use case?
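
A quick way to run that test from the master node (a sketch; the instance ID below is the one shown for compute-dy-m5xlarge-1 in the slurm_resume.log excerpt above, so substitute the ID of whichever node you terminate):

# Terminate the backing instance directly through EC2
aws ec2 terminate-instances --region us-west-2 --instance-ids i-077d1784d495dd49f

# clustermgtd runs on a 60-second loop (loop_time=60 in the config above),
# so the node and job state should flip within a few minutes
watch -n 30 'sinfo; squeue'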

However, if your spot interruption behavior is to stop or hibernate the instance, clustermgtd currently takes no action to actively mark the node as down in slurm, and we rely on the scheduler to tell us whether the node is still considered healthy.
From what we can tell, executing sudo shutdown -h now on an instance puts it in the stopped state according to EC2. As a note, if you want to test the use case of stopping an instance, I would recommend following this guide and always stopping the instance gracefully with the AWS CLI or through the EC2 console (see the example below).
Unfortunately, in this case you will experience the issue you described above and the system will look unresponsive.
If you would like to make a feature request for clustermgtd to actively handle stopped or hibernated instances, please let us know and we can mark this as a feature request.
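
For the stop case, a minimal example of stopping the instance gracefully through the AWS CLI rather than running shutdown inside the guest (instance ID and region are placeholders from this thread):

# Ask EC2 to stop the instance
aws ec2 stop-instances --region us-west-2 --instance-ids i-077d1784d495dd49f

# Optionally block until EC2 reports it as stopped
aws ec2 wait instance-stopped --region us-west-2 --instance-ids i-077d1784d495dd49f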

On the other hand, we are still looking into the issue and we are not sure why slurm would take so long to mark nodes as down when the instance is stopped. ResumeTimeout should not be a factor in this issue.

We will try to update this issue with more details once we have more information. Thanks!

@keien
Author

keien commented Oct 12, 2020

Okay, I was testing the behavior by running shutdown on the node itself just because it was the easiest thing to do - I can check the behavior by terminating the instance from the console and see if it matches what you're describing. If it does, that's probably fine for the spot case, but I'd still like to see slurm respond faster when it's unable to communicate with slurmd on a given node, as I've definitely seen cases where we've overloaded a node and all connection attempts time out while AWS status and health checks still show green.

@keien
Author

keien commented Oct 13, 2020

Unfortunately I see the same thing when terminating the node instead of shutting it down - node stays in mix state and job stays in R state well after the node is already terminated

One possible theory: if the logic is monitoring the instance health checks, the health checks only show issues very briefly after termination before the instance disappears from the results entirely, so maybe the window is too short for the check to detect that it went down?

@rexcsn
Contributor

rexcsn commented Oct 13, 2020

Hi @keien,

The latest update from our side is that the issue seems to be OS dependent. I could not reproduce your original issue by stopping the instance on Ubuntu 18.04, while I could reproduce it on CentOS 7. Slurm behaves differently regarding offline nodes on the two OSs: on Ubuntu 18.04, nodes that are stopped are quickly put into the down state by slurm.
Please see if you can use Ubuntu 18.04 to unblock your use case for now.
(Update: the issue might not be OS dependent; details pending.)

Unfortunately I see the same thing when terminating the node instead of shutting it down - node stays in mix state and job stays in R state well after the node is already terminated

Unfortunately I cannot reproduce the issue with termination. On termination, clustermgtd should mark the node as down within a few minutes or so. Could you upload your /var/log/parallelcluster/clustermgtd.log for us to take a look? We would also need the instance ID of the instance that you terminated manually. The corresponding /var/log/slurmctld.log for the same timeframe might help as well.

We are still looking into the root cause of your original issue. Hopefully we can have an update soon.

@keien
Author

keien commented Oct 13, 2020

Let me try to reproduce it again to get the relevant section of the logs

To address the OS thing (if it happens to be the issue), no we cannot switch away from Centos 7 at this time.

@keien
Author

keien commented Oct 13, 2020

A vaguely related symptom that I just saw: a spot instance was allocated and almost immediately terminated by AWS (not uncommon with GPU instances), but slurm never updated its power state from powering up to powering down, so now I've got a node stuck in mix# and a job in CF - waiting now to see if it takes another hour before it realizes the node is down and attempts to get another one

@rexcsn
Contributor

rexcsn commented Oct 16, 2020

Hi @keien,

After communicating with SchedMD, we can confirm that there is problematic behavior in slurm: it does not mark a non-responsive node as down if the node is still within ResumeTimeout.

Details for this issue have been documented in our wiki here.

If you have comments/feedback regarding this slurm issue, please leave them in issue thread #2146 so we can track input from our users and make a case to keep pushing SchedMD for a fix.

As discussed in the wiki, you should not see a similar issue in the case of instance termination. Please let us know if you have any other open questions regarding your use case.

Thank you!

@keien
Author

keien commented Oct 16, 2020

Appreciate the follow-up - will set ResumeTimeout myself to mitigate the issue and follow #2146 for the fix

Feel free to close this
