Commit

catch GPU layout mismatches in the backend
conradtchan committed May 10, 2024
1 parent 10c0748 commit c53f36b
Showing 1 changed file with 9 additions and 0 deletions.
9 changes: 9 additions & 0 deletions backend/backend_ozstar.py
@@ -774,6 +774,15 @@ def job_gpu_layout(self, job_id):
        )
        layout = self.scontrol_gpu(job_id)

        # Make sure each node in the GPU layout is also in the CPU layout,
        # in case scontrol returns an incorrect value
        cpu_layout = self.job_layout(job_id)
        # Iterate over a copy of the keys (assuming layout is a dict):
        # deleting entries while iterating over the dict itself raises
        # RuntimeError
        for node in list(layout):
            if node not in cpu_layout:
                self.log.warn(
                    f"Node {node} in GPU layout for job {job_id} not in CPU layout"
                )
                del layout[node]

        # Minimise the number of scontrol calls by caching the results
        # - Assume that GPU affinity is fixed for the lifetime of the job
        # - scontrol should only be called once per job
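
The check in this hunk drops GPU-layout nodes that are missing from the job's CPU layout. Below is a minimal standalone sketch of that filtering, assuming both layouts are dicts keyed by node name. The function name `filter_gpu_layout`, the node names, and the values are hypothetical and are not taken from backend_ozstar.py. The sketch builds a new dict instead of deleting entries during iteration, because Python raises RuntimeError when a dict changes size while it is being iterated.

```python
import logging

log = logging.getLogger("gpu_layout_sketch")  # hypothetical logger name


def filter_gpu_layout(gpu_layout, cpu_layout, job_id):
    """Return a new dict with only the GPU-layout nodes present in cpu_layout.

    Both arguments are assumed to be dicts keyed by node name; the input
    dicts are left unmodified.
    """
    filtered = {}
    for node, gpus in gpu_layout.items():
        if node in cpu_layout:
            filtered[node] = gpus
        else:
            log.warning(
                "Node %s in GPU layout for job %s not in CPU layout",
                node,
                job_id,
            )
    return filtered


# Hypothetical example data: "node02" appears only in the GPU layout
gpu_layout = {"node01": [0, 1], "node02": [0]}
cpu_layout = {"node01": [0, 1, 2, 3]}
result = filter_gpu_layout(gpu_layout, cpu_layout, job_id=12345)
```

Returning a new dict also keeps the value returned by the scheduler query untouched, which may help if the raw result is cached elsewhere. Whether that matters here depends on how the backend stores its cache.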