
find out mem per node on RML #99

Closed
philipmac opened this issue Oct 5, 2022 · 3 comments

@philipmac (Member)

For Brad's work.

@philipmac (Member, Author)

BigSky currently has:

41 Nodes
1512 CPU Cores
29.5 TB RAM
36 NVIDIA Tesla GPUs
3.6PB of storage connected with 100Gbit EDR InfiniBand

We will be adding 2PB of additional storage next month. It’s on-site and ready to go, so you can add that to the numbers if you want.

Thank you @andrew-niaid for helpfully answering here.

@philipmac (Member, Author)

A single GPU node (there are 8 of these nodes):

```
NodeName=ai-rmlgpu08 Arch=x86_64 CoresPerSocket=8
CPUAlloc=0 CPUEfctv=16 CPUTot=16 CPULoad=0.44
AvailableFeatures=intel,skylake,v100,edr
ActiveFeatures=intel,skylake,v100,edr
Gres=gpu:tesla_v100-pcie-32gb:2(S:0-1),gpu_mem:32480
NodeAddr=ai-rmlgpu08 NodeHostName=ai-rmlgpu08 Version=22.05.6
OS=Linux 3.10.0-1160.80.1.el7.x86_64 #1 SMP Tue Nov 8 15:48:59 UTC 2022
RealMemory=514638 AllocMem=0 FreeMem=221585 Sockets=2 Boards=1
State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=10 Owner=N/A MCS_label=N/A
Partitions=gpu
BootTime=2022-12-16T09:24:25 SlurmdStartTime=2022-12-16T12:16:21
LastBusyTime=2023-01-05T14:32:26
CfgTRES=cpu=16,mem=514638M,billing=16,gres/gpu=2
AllocTRES=
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
```
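Since the original question is "mem per node", here is a small sketch of pulling the memory and GPU count out of saved `scontrol show node` output with `sed`. The excerpt in the heredoc is taken from the ai-rmlgpu08 output above; the variable names are ours, not Slurm's.

```shell
#!/bin/sh
# Excerpt of `scontrol show node ai-rmlgpu08` output, fed via a heredoc.
out=$(cat <<'EOF'
NodeName=ai-rmlgpu08 Arch=x86_64 CoresPerSocket=8
Gres=gpu:tesla_v100-pcie-32gb:2(S:0-1),gpu_mem:32480
RealMemory=514638 AllocMem=0 FreeMem=221585 Sockets=2 Boards=1
EOF
)

# RealMemory is reported in MB; Gres encodes gpu:<type>:<count>.
real_mem_mb=$(printf '%s\n' "$out" | sed -n 's/.*RealMemory=\([0-9]*\).*/\1/p')
gpu_count=$(printf '%s\n' "$out" | sed -n 's/.*Gres=gpu:[^:]*:\([0-9]*\).*/\1/p')

echo "RealMemory: $((real_mem_mb / 1024)) GB, GPUs: $gpu_count"
```

On a live cluster you would replace the heredoc with `scontrol show node <nodename>`.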

@andrew-niaid commented Jan 9, 2023

For completeness, these nodes are available to you (and EM):

ai-rmlcpu01-16: 16 Cores, 256GB Memory, No GPU
ai-rmlcpu17-20: 64 Cores, 512GB Memory, No GPU
ai-rmlcpu21-28: 64 Cores, 1TB Memory, 2 x NVIDIA A100 40GB
ai-rmlgpu01-04: 16 Cores, 512GB Memory, 2 x NVIDIA P100 16GB
ai-rmlgpu05-08: 16 Cores, 512GB Memory, 2 x NVIDIA V100 32GB
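As a sanity check, the node classes above can be totaled up in a few lines of shell. The per-class node counts are derived from the name ranges (e.g. ai-rmlcpu01-16 is 16 nodes); note these are only the nodes available to this group, a subset of the full BigSky totals quoted earlier.

```shell
#!/bin/sh
# Columns: nodes cores/node memGB/node gpus/node, one row per node class above.
classes="16 16 256 0
4 64 512 0
8 64 1024 2
4 16 512 2
4 16 512 2"

total_nodes=0; total_cores=0; total_mem_gb=0; total_gpus=0
while read -r n c m g; do
  total_nodes=$((total_nodes + n))
  total_cores=$((total_cores + n * c))
  total_mem_gb=$((total_mem_gb + n * m))
  total_gpus=$((total_gpus + n * g))
done <<EOF
$classes
EOF

echo "$total_nodes nodes, $total_cores cores, $((total_mem_gb / 1024)) TB RAM, $total_gpus GPUs"
```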

On BigSky, if you don't specify a partition in Slurm, you get the default partition, called 'int' (for interactive). Nodes can be available in multiple partitions.

The default 'int' partition contains nodes ai-rmlcpu01-28.
The 'gpu' partition contains nodes ai-rmlgpu01-08 and ai-rmlcpu21-28.
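A minimal batch-script sketch of targeting that partition layout (the `#SBATCH` directive names are standard Slurm; the resource values and the `nvidia-smi` payload are illustrative assumptions, not from this issue):

```shell
#!/bin/bash
# Without --partition, the job lands in the default 'int' partition;
# --partition=gpu targets ai-rmlgpu01-08 and ai-rmlcpu21-28.
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1        # request one GPU through the Gres mechanism
#SBATCH --cpus-per-task=16
#SBATCH --mem=64G           # example value; well under a node's RealMemory

nvidia-smi                  # show which GPU the job was allocated
```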

We've started putting GPUs in all new nodes, which is why there are some in nodes with an 'ai-rmlcpu##' name.

In Slurm, the most reliable way to see whether a node has a GPU is to look at the 'Gres=' line in the 'scontrol show node' output, like you have above. Gres stands for 'Generic RESource', a generic allocatable resource.
