Update the feature_wekafs branch #2

fluidnumerics-joe · 2023-05-19T17:18:25Z

No description provided.

The subnetwork was not being propagated to the TPU creation process, which made creation fail for non-default subnetworks. Also included a check that TPUs either have a public IP or sit in a subnetwork with Private Google Access, as that is needed for the TPU to start all the required services.
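The connectivity check described in this commit message could look roughly like the sketch below. It is only an illustration: `TpuNetworkConfig`, its field names, and `tpu_can_start_services` are hypothetical and are not taken from the slurm-gcp code.

```python
from dataclasses import dataclass


@dataclass
class TpuNetworkConfig:
    """Hypothetical container for the two settings the check looks at."""

    enable_public_ip: bool
    subnetwork_private_google_access: bool


def tpu_can_start_services(cfg: TpuNetworkConfig) -> bool:
    """Return True when the TPU has at least one route to Google services.

    Either a public IP on the TPU or Private Google Access on its
    subnetwork is sufficient; having neither is rejected.
    """
    return cfg.enable_public_ip or cfg.subnetwork_private_google_access


# A TPU without a public IP is acceptable only if its subnetwork
# has Private Google Access enabled.
ok = tpu_can_start_services(TpuNetworkConfig(False, True))
rejected = tpu_can_start_services(TpuNetworkConfig(False, False))
```

Running the check before TPU creation lets a misconfigured nodeset fail early instead of producing a TPU that cannot start its services.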

In suspend.py, the function to suspend TPU nodes was called in the same function as regular nodes (delete_instances), but after the check for valid nodes to suspend. That check caused an early return, so the TPU nodes were never suspended. Moved the splitting of the list into regular nodes and TPU nodes, as well as the call that deletes the TPU instances, to the calling function (suspend_nodes), so the early return no longer affects the deletion of the TPU instances.
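The control-flow change described above can be sketched as follows. The function names `suspend_nodes` and `delete_instances` come from the commit message; everything else (`is_tpu_node`, `delete_tpu_instances`, the `actions` list, and the name-based TPU test) is a hypothetical stand-in so that the example runs without any cloud API.

```python
from typing import List, Tuple

Action = Tuple[str, List[str]]


def is_tpu_node(name: str) -> bool:
    """Hypothetical classifier: treat names containing 'tpu' as TPU nodes."""
    return "tpu" in name


def delete_instances(nodes: List[str], actions: List[Action]) -> None:
    """Stand-in for regular instance deletion; records the call only."""
    if not nodes:
        # Early return when there is nothing to delete. Before the fix,
        # the TPU deletion sat below a return like this one and was skipped.
        return
    actions.append(("delete_instances", nodes))


def delete_tpu_instances(nodes: List[str], actions: List[Action]) -> None:
    """Stand-in for TPU deletion; records the call only."""
    if nodes:
        actions.append(("delete_tpu_instances", nodes))


def suspend_nodes(nodes: List[str]) -> List[Action]:
    """Split the node list in the caller, then delete each kind separately."""
    actions: List[Action] = []
    tpu_nodes = [n for n in nodes if is_tpu_node(n)]
    regular_nodes = [n for n in nodes if not is_tpu_node(n)]
    delete_instances(regular_nodes, actions)
    delete_tpu_instances(tpu_nodes, actions)
    return actions


# A request that contains only TPU nodes still reaches the TPU deletion.
only_tpu = suspend_nodes(["cluster-tpu-0"])
```

Because the split happens in the caller, an empty list of regular nodes can no longer short-circuit the TPU path.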

This separation allows building a base image first and then installing a TensorFlow version on top of the generated base image. All the TensorFlow images therefore share the earlier layers, which significantly reduces the space used in the GCP registry. Created a helper script to generate and push the images. Also disabled the Lustre install in the Packer example vars of the docker build, as compiling it inside a container was causing issues. Added a note in the TPU ansible role that explains how to get the list of .whl and .so files in the Google artifact repository for Cloud TPU.

SkylerMalinowski and others added 30 commits April 17, 2023 12:34

feat(chore): pre-commit run --all-files

fc43da9

Continuation of commit 0059b58.

feat: add partition_feature for external dynamic nodes

bc3d0fa

Issue #283

feat: add setup.py --slurmd-feature option

f695656

This allows an external node to be configured and registered in Slurm as a dynamic node, using node features to place the node into a pre-configured partition. Issue #283

Make testsuite marks for arch filtering

04a5b99

Merge branch 'marks' into 'dev'

b82f0a5

Make testsuite marks for arch filtering. See merge request SchedMD/slurm-gcp!61

docs: move modifiable options out of warning section

63f7636

Issue #286

feat: add CommitDelay to slurmdbd.conf.tpl

8fdceb5

Issue #286

fix(terraform): example missing dependencies

97088ba

Issue #286

chore: pre-commit run --all-files

5b93421

feat: add htc cloud example

b3a1f27

This example shows how to customize slurm.conf. Issue #286

docs: add HTC guide

35d9cae

Issue #286

Add nvidia install from repo for ubuntu2004

4a087af

This allows the use of shielded VMs with GPUs.

Refactor packer module for single images

c9d8dd1

Instead of supporting multiple images, the module only builds a single image. This simplifies the module, allowing easier support for custom image names and ansible vars.

testsuite: retry tests once

4c01b62

Improve packer clean-up

6f6770f

Properly removing the 'packer' user should allow this user to be reused on this image.

Revert "feat: add CommitDelay to slurmdbd.conf.tpl"

7aa6baa

This reverts commit 8fdceb5.

feat: add slurmdbd.conf.tpl to htc example

e107b8a

Issue #286

feat: add bf_continue to SchedulerParameters

8f610ab

Issue #286

feat: adjust htc example slurm.conf.tpl

6d2247e

Issue #286

chore: terraform init -upgrade

d44c3fd

docs: add example htc hardware requirements

5ef66e3

Issue #286

docs: OS customization for HTC section

0b42326

Issue #286

docs: fix rocky linux name

899025c

docs: improve image matrix

9b13df4

Issue #287

docs: reorder images by project

0c417c8

Issue #287

docs: clarify general base image support

7a27231

docs: remove schedmd-v5-slurm-22-05-8-ubuntu-2204-lts

1d49d3e

There are build issues unrelated to slurm-gcp at this time. At the time of commit, we cannot release this as a public image. This is subject to change in the future. Issue #287

docs: add link to published images

2ff1b72

Issue #287

Fix 'full' tf example for partition_feature

4a16a9b

Retune backoff_delay and allow for default count

ec37599

In practical use, the number of counts is not particularly important. The initial wait and the total timeout are more important. So I made an equation for generating a default count.
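The commit does not state its equation, so the sketch below is only one possible way to derive a default count from the initial wait and the total timeout. It assumes a fixed doubling ratio and picks the largest count whose waits still sum to no more than the timeout. The names `default_count` and `backoff_waits` are hypothetical and are not the repository's `backoff_delay` implementation.

```python
import math
from typing import List, Optional


def default_count(start: float, timeout: float, ratio: float = 2.0) -> int:
    """Largest n such that start * (ratio**n - 1) / (ratio - 1) <= timeout.

    This is the closed form of a geometric series solved for n.
    It is an assumed heuristic, not the equation from the commit.
    """
    n = math.log(timeout * (ratio - 1) / start + 1, ratio)
    return max(1, math.floor(n))


def backoff_waits(
    start: float,
    timeout: float,
    count: Optional[int] = None,
    ratio: float = 2.0,
) -> List[float]:
    """Return the list of waits; derive the count when none is given."""
    if count is None:
        count = default_count(start, timeout, ratio)
    return [start * ratio**i for i in range(count)]


# With a 1 s initial wait and a 60 s budget, five doubling waits fit.
waits = backoff_waits(1.0, 60.0)
```

The caller only has to choose the two values that matter in practice, the initial wait and the total timeout, while an explicit `count` remains available as an override.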

SkylerMalinowski and others added 30 commits July 31, 2023 14:03

chore: add new external ansible dependency

e741ef0

Warning from Packer: This template relies on the use of plugins bundled into the Packer binary. The practice of bundling external plugins into Packer will be removed in an upcoming version.

feat: add TPU nodesets to test_cluster example

78455bf

fix(terraform): allow service_account=null

c45a264

Lookup the service account like we do with instance templates.

feat(terraform): prevent partition without any nodesets

e455130

revert: Changes default variable in slurm_partition

9a4fb1e

This reverts commit ff210cd. Using 0 to denote implicit default value behavior is not the correct solution in this case. Rather null should be used.

feat: {resume|suspend}_timeout=null means use smart default

b63d74b

TPU operations, create and destroy, take longer than those of regular nodes. Hence we want good defaults, even for TPU nodes.

fix(terraform): partition precondition for tpus

8968b46

Remove combined futures function

4f3a75d

This function was not working, so we went back to the serial creation of TPU and normal nodes. The TPU creation now runs after the normal node creation, so that normal nodes are not slowed down when TPU nodes are created.

Fix softlink gsutil docker image creation

c8b4919

Add documentation for the TPU nodes

95b6161

Included a list of supported TPU types and TensorFlow versions. Included usage information covering how to make static nodes and heterogeneous jobs, as well as information regarding multi-rank TPU nodes.

Add missing tags to TPUs as well as service_account

88565ed

Add missing permission for TPU nodesets in bucket

df54357

fix: TPU not being in centos7 made util.py crash

4c08970

docs: Add tpu compatibility matrix

8d0308a

Fix CUDA install on Ubuntu 20.04

cbc242d

Just run --fix-broken after CUDA install. I don't know why dpkg is failing.

ci: add ubuntu-2004-lts to core images to build

2b59b88

CUDA on Ubuntu 20.04 is proving to be a pain point, so we need to know sooner if it is failing.

Label 6.1.0 release in changelog

67d1940

Update image references to 6.1.0

7ff2a70

fix: in commit 4c08970 some code was unintentionally removed

2e0fc74

ci: change placement policy test

48e701d

physicalHost no longer matches between nodes in a placement policy, so just check whether there is a physicalHost at all.

Update socket count for c2d

6cb0fd3

2 sockets when 112 cores, 1 for fewer.
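The rule in this commit message can be written as a one-line function. This is a minimal sketch: `c2d_socket_count` is a hypothetical name, and the message does not say whether "cores" means vCPUs or physical cores, so the parameter simply mirrors its wording.

```python
def c2d_socket_count(cores: int) -> int:
    """Return 2 sockets for the 112-core c2d shape, 1 for anything smaller."""
    return 2 if cores >= 112 else 1


sockets_large = c2d_socket_count(112)
sockets_small = c2d_socket_count(56)
```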

Change man2html to man2html-base dependency

cf8ee00

Changing this dependency to only the base (core) package means that apache no longer gets installed as a dependency, which makes the image lighter.

Add TPU job example

671e342

Changed default name of docker image

c64740f

Also removed the -except parameter from the GitLab CI packer build, as the docker build is in another directory.

Changelog for previous 4 commits

72cba5a

doc: ubuntu-2004-lts lapse in support notice

7f64cc4

Added new docker images to the list of published images

a5be802
