
Add draft gpu troubles #290

Draft · mhuguesaws wants to merge 7 commits into main from feature/#289_gpu_failure
Conversation

mhuguesaws (Contributor) commented:
Issue #, if available:

Description of changes:

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

```bash
scancel [JOB_ID]
```
1. Reset the GPUs

**Contributor:** Add a link to the reset option for `nvidia-smi`.
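For context, a minimal sketch of that reset step; the GPU index and the use of `sudo` are assumptions, and the reset only succeeds when no process is still using the device:

```bash
# Reset a single GPU by index (0 is an assumed example).
# The reset fails if any process still holds the device, so cancel those jobs first.
sudo nvidia-smi -i 0 --gpu-reset
```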

```bash
sudo /opt/slurm/bin/scontrol create res starttime=now duration=infinite flags=ignore_jobs user=root nodes=[NODE_TO_TERMINATE]
```
1. Cancel

**Contributor:** Cancel the job in Slurm.
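A short sketch of how that cancel step could be done, assuming the job ID is first looked up with `squeue` (the node name below is made up for illustration):

```bash
# Find jobs running on the failing node (queue1-dy-gpu-1 is an assumed example name).
squeue --nodelist=queue1-dy-gpu-1 --format="%i %j %u %T"

# Cancel the offending job by the ID reported above.
scancel [JOB_ID]
```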

The node will have a **DRAIN** status. Then the instance will be terminated and replaced.

1. Delete the reservation

**Contributor:** what is RES_NUMBER?
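A quick, hedged way to confirm the **DRAIN** behaviour described in the quoted step, using stock Slurm commands:

```bash
# List drained/down nodes together with the reason Slurm recorded.
sinfo -R

# Show every node and its state; the affected node should report drain/drained.
sinfo -N -l
```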

```bash
scancel [JOB_ID]
```
1. Place the node in **DRAIN**.

**Contributor:** Is the node to terminate an IP or a name?
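To illustrate the question, a sketch that looks up the Slurm node name and drains it explicitly; the node name and drain reason are assumptions:

```bash
# Slurm expects the node name as reported by sinfo, not an IP address.
sinfo -N -l

# Drain the node so no new jobs land on it (name and reason are examples).
sudo /opt/slurm/bin/scontrol update nodename=queue1-dy-gpu-1 state=drain reason="GPU Xid failure"
```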


1. Create a reservation to isolate the node from being used by any jobs.
```bash
sudo /opt/slurm/bin/scontrol create res starttime=now duration=infinite flags=ignore_jobs user=root nodes=[NODE_TO_TERMINATE]
```

**Contributor:** What should NODE_TO_TERMINATE be?
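As a hedged illustration only: `nodes=` takes the Slurm node name reported by `sinfo` (hostlist ranges also work); the name below is made up:

```bash
# Reserve a specific compute node so the scheduler keeps new jobs off it
# (queue1-dy-gpu-1 is an assumed example node name).
sudo /opt/slurm/bin/scontrol create res starttime=now duration=infinite \
    flags=ignore_jobs user=root nodes=queue1-dy-gpu-1
```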

| 95 | Uncontained ECC error | Reset GPUs | [AWS ParallelCluster](#reset-gpus) |

# AWS ParallelCluster

**Contributor:** reference to ParallelCluster doc?

While running High-Performance Computing or Machine Learning workloads, GPUs may fail for various reasons, which are captured in Xid messages.
Those messages are written to `/var/log/messages` on Amazon Linux, and to `/var/log/syslog` and `/var/log/kern.log` on Ubuntu.

| Xid | Failure | Resolution | Orchestrator |

**Contributor:** Scheduler/Orchestrator
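Given the log locations quoted above, a minimal sketch for spotting Xid events; the paths come from the guide, while the `grep` pattern is an assumption about the message text:

```bash
# Amazon Linux
sudo grep -i "xid" /var/log/messages

# Ubuntu
sudo grep -i "xid" /var/log/syslog /var/log/kern.log
```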

## Reset GPUs

**Contributor:** Say what resetting does and what NODE_TO_TERMINATE represents (or how to get it).
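One hedged way to check GPU health after the reset this section describes, using only stock `nvidia-smi` queries (nothing here is prescribed by the guide):

```bash
# Confirm the GPUs enumerate cleanly after the reset.
nvidia-smi

# Inspect ECC error counters; persistent uncontained errors point to a hardware replacement.
nvidia-smi -q -d ECC
```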


1. Delete the reservation
```bash
sudo /opt/slurm/bin/scontrol delete res root_[RES_NUMBER]
```

**Contributor:** What is RES_NUMBER?
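A sketch of how the reservation name could be found, assuming it was created by root as in the earlier step (the `root_1` name is an assumed example):

```bash
# List current reservations; one created by root is typically named root_<number>.
sudo /opt/slurm/bin/scontrol show reservation

# Delete it using the name reported above (root_1 is an assumed example).
sudo /opt/slurm/bin/scontrol delete res root_1
```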


# Amazon SageMaker HyperPod

TBD

**Contributor:** TBA

@nghtm (Collaborator) commented May 3, 2024:

Looks good. Plan to create a new PR for HyperPod instructions after the ParallelCluster PR is merged.

@KeitaW force-pushed the feature/#289_gpu_failure branch 2 times, most recently from c9e0f5e to 30e6592 on June 4, 2024 02:26
@KeitaW force-pushed the main branch 3 times, most recently from 44e448e to 1209815 on June 4, 2024 02:30
@mhuguesaws force-pushed the feature/#289_gpu_failure branch 2 times, most recently from 90549b2 to 84f6f79 on June 11, 2024 21:00
awsankur and others added 7 commits June 11, 2024 16:29