Update the feature_wekafs branch #2

Open
wants to merge 430 commits into
base: feature_wekafs

Conversation

fluidnumerics-joe
Member

No description provided.

SkylerMalinowski and others added 30 commits April 17, 2023 12:34
This allows an external node to be configured and registered in Slurm as a
dynamic node, using node features to place the node into a pre-configured
partition.

Issue #283
Make testsuite marks for arch filtering

See merge request SchedMD/slurm-gcp!61
This example shows how to customize slurm.conf.

Issue #286
This allows the use of shielded VMs with GPUs.
Instead of supporting multiple images, the module only builds a single image.
This simplifies the module, allowing easier support for custom image names and ansible
vars.
Properly removing the 'packer' user should allow this user to be reused on this image.
There are build issues unrelated to slurm-gcp at this time. At the time
of commit, we cannot release this as a public image. This is subject to
change in the future.

Issue #287
In practical use, the exact count is not particularly important. The initial wait
and the total timeout matter more, so I made an equation for generating a default
count.
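As an illustration only, one way a default count could be derived from an initial wait and a total timeout is sketched below. The equation used in the commit is not shown here, so the function name, the `backoff` parameter, and the exponential-backoff assumption are all hypothetical.

```python
def default_retry_count(initial_wait: float, total_timeout: float,
                        backoff: float = 2.0) -> int:
    """Return the smallest number of waits whose sum reaches total_timeout.

    Hypothetical helper: assumes the i-th wait is initial_wait * backoff**i.
    This is not the equation from the commit, only a sketch of the idea.
    """
    if initial_wait <= 0 or total_timeout <= 0:
        raise ValueError("initial_wait and total_timeout must be positive")
    if backoff < 1:
        raise ValueError("backoff must be >= 1")

    count = 0
    elapsed = 0.0
    wait = initial_wait
    # Accumulate waits until the total timeout is covered.
    while elapsed < total_timeout:
        elapsed += wait
        wait *= backoff
        count += 1
    return count
```

With this sketch the caller only chooses the initial wait and the total timeout, and the count follows from them.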
SkylerMalinowski and others added 30 commits July 31, 2023 14:03
Warning from Packer:

This template relies on the use of plugins bundled into the Packer binary.
The practice of bundling external plugins into Packer will be removed in an
upcoming version.
Look up the service account like we do with instance templates.
This reverts commit ff210cd.

Using 0 to denote implicit default value behavior is not the correct
solution in this case. Rather, null should be used.
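The distinction can be illustrated in Python, where `None` plays the role of null. This is a minimal sketch with a hypothetical `resolve_count` function and a hypothetical default value; it is not the code touched by the reverted commit.

```python
from typing import Optional

DEFAULT_COUNT = 3  # hypothetical default, for illustration only


def resolve_count(count: Optional[int] = None) -> int:
    """Treat None (null) as "use the default" and keep 0 as a real value."""
    if count is None:
        return DEFAULT_COUNT
    return count
```

Because 0 remains a legitimate explicit value, a caller can still request zero without it being silently replaced by the default.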
TPU operations (create and destroy) take longer than those for regular nodes,
so the defaults need to work well for TPU nodes too.
This function was not working, so go back to creating TPU and normal nodes
serially. TPU creation now runs after normal node creation, so that normal
nodes are not slowed down when TPU nodes are created.
The subnetwork was not being propagated to the TPU creation process, which
made it fail for non-default subnetworks. Also included a check that makes
sure TPUs either have a public IP or sit in a subnetwork with Private Google
Access enabled, as that is needed for the TPU to start all the needed
services.
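The check described above can be sketched as follows. This is an assumption-laden illustration: the function name, the `enable_public_ip` argument, and the use of a plain dict for the subnetwork are hypothetical, and the `privateIpGoogleAccess` key is assumed to mirror the subnetwork description returned by the Compute API.

```python
def validate_tpu_network(enable_public_ip: bool, subnetwork: dict) -> None:
    """Raise if a TPU would have neither a public IP nor Private Google Access.

    Hypothetical sketch: `subnetwork` is assumed to be a dict describing the
    subnetwork, with a boolean `privateIpGoogleAccess` entry.
    """
    private_access = bool(subnetwork.get("privateIpGoogleAccess", False))
    if not (enable_public_ip or private_access):
        name = subnetwork.get("name", "<unknown>")
        raise ValueError(
            f"TPU needs a public IP or Private Google Access on subnetwork {name}"
        )
```

Failing early at configuration time is the point of such a check, since a TPU without either kind of access would be created but never start its services.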
Included list of supported TPU types and tensorflow versions.
Included usage information covering how to make static nodes and
heterogeneous jobs, as well as information regarding multi-rank TPU nodes.
Just run --fix-broken after CUDA install. I don't know why dpkg is failing.
CUDA on Ubuntu 20.04 is proving to be a pain point, so we need to know sooner if it is
failing.
In suspend.py, the function that suspends TPU nodes was called from the same
function as regular nodes (delete_instances), but after the check for
whether there are any valid nodes to suspend. That check caused an early
return, so the TPU nodes were never suspended.

Moved the split of the list into regular nodes and TPU nodes, as well as the
call to delete the tpu_instances, to the calling function (suspend_nodes).
This way the early return does not affect the deletion of the TPU
instances.
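The restructured control flow can be sketched as below. Only `suspend_nodes` and `delete_instances` are names taken from the commit message; `delete_tpu_instances`, `split_nodes`, the `is_tpu` predicate, and the `log` list standing in for real API calls are hypothetical, so this is an outline of the fix rather than the code in suspend.py.

```python
from typing import Callable, List, Tuple


def split_nodes(nodes: List[str],
                is_tpu: Callable[[str], bool]) -> Tuple[List[str], List[str]]:
    """Split node names into (regular, tpu) lists."""
    regular = [n for n in nodes if not is_tpu(n)]
    tpu = [n for n in nodes if is_tpu(n)]
    return regular, tpu


def delete_instances(nodes: List[str], log: list) -> None:
    if not nodes:
        return  # early return now only skips regular-node deletion
    log.append(("delete", list(nodes)))


def delete_tpu_instances(nodes: List[str], log: list) -> None:
    if not nodes:
        return
    log.append(("delete_tpu", list(nodes)))


def suspend_nodes(nodes: List[str], is_tpu: Callable[[str], bool],
                  log: list) -> None:
    # The split and both delete calls live in the caller, so an empty
    # regular-node list cannot prevent TPU nodes from being suspended.
    regular, tpu = split_nodes(nodes, is_tpu)
    delete_instances(regular, log)
    delete_tpu_instances(tpu, log)
```

Calling `suspend_nodes` with only TPU nodes still reaches `delete_tpu_instances`, which is the case the original early return missed.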
physicalHost between nodes in a placement policy no longer matches, so just
check if there is a physicalHost at all.
2 sockets when 112 cores, 1 for fewer.
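The rule above amounts to a one-line mapping, sketched here with a hypothetical function name. The commit message only covers 112 cores and fewer, so treating anything above 112 as two sockets is an assumption.

```python
def sockets_for_cores(cores: int) -> int:
    """Return 2 sockets for 112 cores, 1 for fewer.

    Assumption: core counts above 112 are also treated as 2 sockets;
    the source only states the 112-core and smaller cases.
    """
    if cores < 1:
        raise ValueError("cores must be positive")
    return 2 if cores >= 112 else 1
```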
By changing this dependency to only the base or core one, apache no longer
gets installed as a dependency, making the image lighter.
Also removed the -except parameter from the GitLab CI packer build, as the
docker build is in another directory.
This separation allows building a base image first and then installing a
tensorflow version on top of the generated base image. All the tensorflow
images therefore share the previous layers, which significantly reduces the
space used in the GCP registry. Created a helper script to generate and push
the images.

Also disabled the lustre install in the packer example vars of the docker
build, as compiling it inside a container was causing issues.

Added a note in the TPU ansible role that states how to get the list of
.whl and .so files in the Google artifact repository for Cloud TPU.
4 participants