forked from SchedMD/slurm-gcp
Update the feature_wekafs branch #2
fluidnumerics-joe wants to merge 430 commits into feature_wekafs from master
Conversation
Continuation of commit 0059b58.
This allows an external node to be configured and registered in Slurm as a dynamic node, using node features to place the node into a pre-configured partition. Issue #283
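A minimal sketch of the configuration this describes, assuming Slurm 22.05+ dynamic node support; the `wekafs` feature name and `wekafs` partition are hypothetical placeholders, not names taken from this PR:

```ini
# slurm.conf: allow dynamic nodes and route them into a partition by feature
MaxNodeCount=64
NodeSet=wekafs_nodes Feature=wekafs
PartitionName=wekafs Nodes=wekafs_nodes State=UP
```

The external node would then register itself dynamically with a matching feature, e.g. `slurmd -Z --conf "Feature=wekafs"`.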
Make testsuite marks for arch filtering. See merge request SchedMD/slurm-gcp!61
This example shows how to customize slurm.conf. Issue #286
Issue #286
This allows the use of shielded VMs with GPUs.
Instead of supporting multiple images, the module only builds a single image. This simplifies the module, allowing easier support for custom image names and ansible vars.
Properly removing the 'packer' user should allow this user to be reused on this image.
This reverts commit 8fdceb5.
Issue #286
Issue #286
Issue #287
Issue #287
There are build issues unrelated to slurm-gcp at this time. At the time of commit, we cannot release this as a public image. This is subject to change in the future. Issue #287
Issue #287
In practical use, the retry count itself is not particularly important; the initial wait and the total timeout matter more. So the default count is now derived from those values.
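One way the derivation could look; this is a sketch under assumed names (`default_count`, `init_wait`, `total_timeout`, `max_wait` are illustrative, not the PR's actual identifiers), assuming an exponential backoff capped at a maximum wait:

```python
def default_count(init_wait: float, total_timeout: float, max_wait: float = 30.0) -> int:
    """Number of polls whose cumulative (doubling, capped) waits fit in total_timeout."""
    count, elapsed, wait = 0, 0.0, init_wait
    while elapsed + wait <= total_timeout:
        elapsed += wait                 # spend this wait interval
        wait = min(wait * 2, max_wait)  # exponential backoff, capped
        count += 1
    return count
```

With this shape, callers tune only the initial wait and overall timeout, and the count follows.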
Warning from Packer: This template relies on the use of plugins bundled into the Packer binary. The practice of bundling external plugins into Packer will be removed in an upcoming version.
Lookup the service account like we do with instance templates.
This reverts commit ff210cd. Using 0 to denote implicit default value behavior is not the correct solution in this case. Rather null should be used.
TPU operations, create and destroy, take longer than for regular nodes, so we want good defaults even for TPU nodes.
This function was not working; went back to serial creation of TPU and normal nodes, and moved TPU creation after normal node creation so that normal nodes are not slowed down when TPU nodes are also being created.
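The ordering described above can be sketched as follows; the function and helper names (`resume_nodes`, `create_instances`, `create_tpu_instances`, `is_tpu`) are hypothetical stand-ins, not the PR's actual code:

```python
def resume_nodes(nodelist, create_instances, create_tpu_instances, is_tpu):
    """Create regular nodes first so they are not delayed by slow TPU creation."""
    regular = [n for n in nodelist if not is_tpu(n)]
    tpus = [n for n in nodelist if is_tpu(n)]
    if regular:
        create_instances(regular)       # fast path runs first
    if tpus:
        create_tpu_instances(tpus)      # slower TPU operations run after
```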
The subnetwork was not being propagated to the TPU creation process, which made it fail for non-default subnetworks. Also included a check that TPUs either have a public IP, or are in a subnetwork with Private Google Access enabled, as one of the two is needed for the TPU to start all the required services.
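The network check could be sketched like this; the function and parameter names are illustrative assumptions, not the PR's code:

```python
def validate_tpu_network(has_public_ip: bool, private_google_access: bool) -> None:
    """Fail fast if a TPU node would be unable to reach Google services."""
    if not (has_public_ip or private_google_access):
        raise ValueError(
            "TPU nodes need a public IP or a subnetwork with "
            "Private Google Access enabled"
        )
```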
Included a list of supported TPU types and TensorFlow versions. Included usage information covering how to make static nodes and heterogeneous jobs, as well as information regarding multi-rank TPU nodes.
Just run --fix-broken after CUDA install. I don't know why dpkg is failing.
CUDA on Ubuntu 20.04 is proving to be a pain point, so we need to know sooner if it is failing.
In suspend.py, the function to suspend TPU nodes was called from the same function as regular nodes (delete_instances), but after the check for any valid nodes to suspend; that early return meant TPU nodes were never suspended. Moved the splitting of the list into regular and TPU nodes, along with the call to delete the TPU instances, up into the calling function (suspend_nodes), so the early return no longer affects the deletion of TPU instances.
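A sketch of the restructure described above, with hypothetical helper signatures (the real suspend.py differs): the split happens in the caller, so an early return inside the regular-node path cannot skip TPU deletion.

```python
def suspend_nodes(nodelist, is_tpu, delete_instances, delete_tpu_instances):
    """Split the list here so each path is handled independently."""
    regular = [n for n in nodelist if not is_tpu(n)]
    tpus = [n for n in nodelist if is_tpu(n)]
    if regular:                          # empty regular list no longer short-circuits
        delete_instances(regular)
    if tpus:                             # TPUs are always handled
        delete_tpu_instances(tpus)
```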
physicalHost between nodes in a placement policy is no longer matching, so just check that a physicalHost is present at all.
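The relaxed check could look like this sketch; it assumes instance descriptions shaped like the GCE API response, where `physicalHost` lives under `resourceStatus` (an assumption about the data layout, not code from this PR):

```python
def placement_ok(instances: list) -> bool:
    """Only require that each instance reports some physicalHost,
    rather than requiring the values to match across instances."""
    return all(
        bool(inst.get("resourceStatus", {}).get("physicalHost"))
        for inst in instances
    )
```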
Use 2 sockets when there are 112 cores, 1 for fewer.
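The heuristic is small enough to state directly; `default_sockets` is an illustrative name, not the PR's identifier:

```python
def default_sockets(cores: int) -> int:
    """Report 2 sockets at 112 cores or more, otherwise 1."""
    return 2 if cores >= 112 else 1
```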
By changing this dependency to only the base or core package, apache no longer gets installed as a dependency, making the image lighter.
Also removed the -except parameter from the GitLab CI Packer build, as the Docker build is in another directory.
This separation allows building a base image first and then installing a TensorFlow version on top of the generated base image, so all the TensorFlow images share the earlier layers, significantly reducing the space used in the GCP registry. Created a helper script to generate and push the images. Also disabled the Lustre install in the Packer example vars for the Docker build, as it was causing issues when compiled inside a container. Added a helpful note in the TPU Ansible role that states how to get the list of .whl and .so files in the Google Artifact Registry for Cloud TPU.
No description provided.