
find out mem per node on RML #99

Closed
philipmac opened this issue Oct 5, 2022 · 3 comments

@philipmac (Member)

For Brad's work.

@philipmac (Member, Author)

BigSky currently has:

41 Nodes
1512 CPU Cores
29.5 TB RAM
36 NVIDIA Tesla GPUs
3.6PB of storage connected with 100Gbit EDR InfiniBand

We will be adding 2PB of additional storage next month. It’s on-site and ready to go, so you can add that to the numbers if you want.

Thank you @andrew-niaid for helpfully answering here.

@philipmac (Member, Author)

A single GPU node (there are 8 of these nodes):

```
NodeName=ai-rmlgpu08 Arch=x86_64 CoresPerSocket=8
CPUAlloc=0 CPUEfctv=16 CPUTot=16 CPULoad=0.44
AvailableFeatures=intel,skylake,v100,edr
ActiveFeatures=intel,skylake,v100,edr
Gres=gpu:tesla_v100-pcie-32gb:2(S:0-1),gpu_mem:32480
NodeAddr=ai-rmlgpu08 NodeHostName=ai-rmlgpu08 Version=22.05.6
OS=Linux 3.10.0-1160.80.1.el7.x86_64 #1 SMP Tue Nov 8 15:48:59 UTC 2022
RealMemory=514638 AllocMem=0 FreeMem=221585 Sockets=2 Boards=1
State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=10 Owner=N/A MCS_label=N/A
Partitions=gpu
BootTime=2022-12-16T09:24:25 SlurmdStartTime=2022-12-16T12:16:21
LastBusyTime=2023-01-05T14:32:26
CfgTRES=cpu=16,mem=514638M,billing=16,gres/gpu=2
AllocTRES=
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
```
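Since the original question is "mem per node", here is a small sketch of pulling the memory and GPU count out of saved `scontrol show node` output with `sed`. The excerpt in the heredoc is taken from the ai-rmlgpu08 output above; the variable names are ours, not Slurm's.

```shell
#!/bin/sh
# Excerpt of `scontrol show node ai-rmlgpu08` output, fed via a heredoc.
out=$(cat <<'EOF'
NodeName=ai-rmlgpu08 Arch=x86_64 CoresPerSocket=8
Gres=gpu:tesla_v100-pcie-32gb:2(S:0-1),gpu_mem:32480
RealMemory=514638 AllocMem=0 FreeMem=221585 Sockets=2 Boards=1
EOF
)

# RealMemory is reported in MB; Gres encodes gpu:<type>:<count>.
real_mem_mb=$(printf '%s\n' "$out" | sed -n 's/.*RealMemory=\([0-9]*\).*/\1/p')
gpu_count=$(printf '%s\n' "$out" | sed -n 's/.*Gres=gpu:[^:]*:\([0-9]*\).*/\1/p')

echo "RealMemory: $((real_mem_mb / 1024)) GB, GPUs: $gpu_count"
```

On a live cluster you would replace the heredoc with `scontrol show node <nodename>`.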

@andrew-niaid commented Jan 9, 2023

For completeness, these nodes are available to you (and EM):

ai-rmlcpu01-16: 16 Cores, 256GB Memory, No GPU
ai-rmlcpu17-20: 64 Cores, 512GB Memory, No GPU
ai-rmlcpu21-28: 64 Cores, 1TB Memory, 2 x NVIDIA A100 40GB
ai-rmlgpu01-04: 16 Cores, 512GB Memory, 2 x NVIDIA P100 16GB
ai-rmlgpu05-08: 16 Cores, 512GB Memory, 2 x NVIDIA V100 32GB
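As a sanity check, the node classes above can be totaled up in a few lines of shell. The per-class node counts are derived from the name ranges (e.g. ai-rmlcpu01-16 is 16 nodes); note these are only the nodes available to this group, a subset of the full BigSky totals quoted earlier.

```shell
#!/bin/sh
# Columns: nodes cores/node memGB/node gpus/node, one row per node class above.
classes="16 16 256 0
4 64 512 0
8 64 1024 2
4 16 512 2
4 16 512 2"

total_nodes=0; total_cores=0; total_mem_gb=0; total_gpus=0
while read -r n c m g; do
  total_nodes=$((total_nodes + n))
  total_cores=$((total_cores + n * c))
  total_mem_gb=$((total_mem_gb + n * m))
  total_gpus=$((total_gpus + n * g))
done <<EOF
$classes
EOF

echo "$total_nodes nodes, $total_cores cores, $((total_mem_gb / 1024)) TB RAM, $total_gpus GPUs"
```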

On BigSky, if you don't specify a partition in Slurm, you get the default partition, called 'int' (for interactive). Nodes can be available in multiple partitions.

The default 'int' partition contains nodes ai-rmlcpu01-28.
The 'gpu' partition contains nodes ai-rmlgpu01-08 and ai-rmlcpu21-28.
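A minimal batch-script sketch of targeting that partition layout (the `#SBATCH` directive names are standard Slurm; the resource values and the `nvidia-smi` payload are illustrative assumptions, not from this issue):

```shell
#!/bin/bash
# Without --partition, the job lands in the default 'int' partition;
# --partition=gpu targets ai-rmlgpu01-08 and ai-rmlcpu21-28.
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1        # request one GPU through the Gres mechanism
#SBATCH --cpus-per-task=16
#SBATCH --mem=64G           # example value; well under a node's RealMemory

nvidia-smi                  # show which GPU the job was allocated
```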

We've started putting GPUs in all new nodes, which is why there are some in nodes with an 'ai-rmlcpu##' name.

In Slurm, the most reliable way to see whether a node has a GPU is to look at the 'Gres=' line in the 'scontrol show node' output, like you have above. Gres stands for 'Generic RESource', a generic allocatable resource.
