Update the feature_wekafs branch #2

Open: wants to merge 430 commits into base feature_wekafs
430 commits
fc43da9
feat(chore): pre-commit run --all-files
SkylerMalinowski Apr 17, 2023
bc3d0fa
feat: add partition_feature for external dynamic nodes
SkylerMalinowski Apr 17, 2023
f695656
feat: add setup.py --slurmd-feature option
SkylerMalinowski Apr 19, 2023
04a5b99
Make testsuite marks for arch filtering
bsngardner Apr 20, 2023
b82f0a5
Merge branch 'marks' into 'dev'
bsngardner Apr 20, 2023
63f7636
docs: move modifiable options out of warning section
SkylerMalinowski Apr 21, 2023
8fdceb5
feat: add CommitDelay to slurmdbd.conf.tpl
SkylerMalinowski Apr 21, 2023
97088ba
fix(terraform): example missing dependencies
SkylerMalinowski Apr 21, 2023
5b93421
chore: pre-commit run --all-files
SkylerMalinowski Apr 24, 2023
b3a1f27
feat: add htc cloud example
SkylerMalinowski Apr 21, 2023
35d9cae
docs: add HTC guide
SkylerMalinowski Apr 21, 2023
4a087af
Add nvidia install from repo for ubuntu2004
bsngardner Apr 17, 2023
c9d8dd1
Refactor packer module for single images
bsngardner Apr 19, 2023
4c01b62
testsuite: retry tests once
bsngardner Apr 18, 2023
6f6770f
Improve packer clean-up
bsngardner Apr 20, 2023
7aa6baa
Revert "feat: add CommitDelay to slurmdbd.conf.tpl"
SkylerMalinowski Apr 27, 2023
e107b8a
feat: add slurmdbd.conf.tpl to htc example
SkylerMalinowski Apr 27, 2023
8f610ab
feat: add bf_continue to SchedulerParameters
SkylerMalinowski Apr 27, 2023
6d2247e
feat: adjust htc example slurm.conf.tpl
SkylerMalinowski Apr 27, 2023
d44c3fd
chore: terraform init -upgrade
SkylerMalinowski Apr 27, 2023
5ef66e3
docs: add example htc hardware requirements
SkylerMalinowski Apr 27, 2023
0b42326
docs: OS customization for HTC section
SkylerMalinowski Apr 27, 2023
899025c
docs: fix rocky linux name
SkylerMalinowski Apr 27, 2023
9b13df4
docs: improve image matrix
SkylerMalinowski Apr 27, 2023
0c417c8
docs: reorder images by project
SkylerMalinowski Apr 27, 2023
7a27231
docs: clarify general base image support
SkylerMalinowski Apr 27, 2023
1d49d3e
docs: remove schedmd-v5-slurm-22-05-8-ubuntu-2204-lts
SkylerMalinowski Apr 27, 2023
2ff1b72
docs: add link to published images
SkylerMalinowski Apr 27, 2023
4a16a9b
Fix 'full' tf example for partition_feature
bsngardner Apr 25, 2023
ec37599
Retune backoff_delay and allow for default count
bsngardner Apr 26, 2023
95dd2a2
Switch packer ssh key format to ed25519
bsngardner Apr 27, 2023
d0f4aaf
Stop building ubuntu-2004-lts-arm64
bsngardner Apr 27, 2023
4f33fbf
doc: Shielded VM support
bsngardner Apr 27, 2023
13a0800
Improve packer example vars file
bsngardner Apr 27, 2023
b395bd5
doc: Add packer, k80s, and shielded VM GPUs to changelog
bsngardner Apr 27, 2023
12fd049
testsuite: Add shielded test
bsngardner Apr 25, 2023
f024def
doc: Adjust wording in images.md
bsngardner Apr 28, 2023
e96f500
Do not build or publish debian-11-arm64 for now
bsngardner Apr 28, 2023
6698be5
docs: add image section in main readme
SkylerMalinowski Apr 28, 2023
c43240b
docs: add deprecation/EOL
SkylerMalinowski Apr 28, 2023
7cf8999
docs: add back ubuntu images previously released
SkylerMalinowski Apr 28, 2023
330cdc3
changelog: add HTC example
SkylerMalinowski Apr 28, 2023
3ac17de
Add --project_id to notify_cluster script
jvilarru Apr 13, 2023
58c1fc1
Add --project_id to destroy_nodes script
jvilarru Apr 18, 2023
2331ea6
Add --project_id to destroy_resource_policies script
jvilarru Apr 18, 2023
e29b20c
testsuite: Disable building ubuntu-2204-lts
bsngardner Apr 28, 2023
31f37d2
Release 5.7.0
bsngardner Apr 28, 2023
6ab8210
chore: pre-commit run --all-files
SkylerMalinowski Apr 28, 2023
3337483
fix: add missing partition_feature to example.tfvars
SkylerMalinowski May 1, 2023
97efee4
Fix regression in load_bq.py
bsngardner Apr 28, 2023
9ccd49c
Fix slurmsync handling of subscriptions
bsngardner May 1, 2023
f5280da
Add retries to munge mount
bsngardner Apr 28, 2023
ca329d7
Adjust logic for partition.partition_feature
bsngardner May 2, 2023
4e262a3
fix: partition_feature truthy conditional
SkylerMalinowski May 2, 2023
ad09936
fix: partition generation with both nodes and feature
SkylerMalinowski May 2, 2023
7c27ab5
feat: add prometheus-client package
SkylerMalinowski May 2, 2023
ab79071
fix: regex parser for ops agent ingestion of slurm logs
SkylerMalinowski May 2, 2023
81f9fd1
fix: remove attempting to parse timestamp
SkylerMalinowski May 2, 2023
e483f78
Add wait on placement group creation
bsngardner May 2, 2023
6ab37bc
Release 5.7.1
bsngardner May 3, 2023
191b681
fix: DefMemPerCPU on partitions that only contain dynamic nodes
SkylerMalinowski May 3, 2023
5a7e799
doc: update EOL for debian-10 and rocky-8 images
bsngardner May 4, 2023
2f611ca
doc: update docs to pull some ubuntu images
bsngardner May 4, 2023
c6e70ad
docs: fix grammar
SkylerMalinowski May 5, 2023
aa3593f
Add hpc-rocky-linux-8 build
bsngardner May 4, 2023
6293829
Upgrade default Slurm to 22.05.9
bsngardner May 4, 2023
fd287c2
Fix kernel package errors in rocky-linux-8
bsngardner May 4, 2023
61e0361
Disable all package update for rocky-8
bsngardner May 9, 2023
28e1fe1
Release 5.7.2
bsngardner May 9, 2023
69d9154
Fix detecting gpus on certain machine types
SkylerMalinowski May 15, 2023
48f6edd
Add logging around reading SLURM_RESUME_FILE
SkylerMalinowski May 15, 2023
a9d0666
Update terraform_docs to use custom config file
SkylerMalinowski May 16, 2023
17c00a6
chore: pre-commit run --all
SkylerMalinowski May 16, 2023
10d5b10
Forward more error logging to slurm/srun/salloc
bsngardner May 11, 2023
139b44a
Move httplib2 to pip packages in util.py
bsngardner May 18, 2023
50d2ec3
Fix lustre support in images
bsngardner May 11, 2023
685a883
Try to fix lustre on debian-11
bsngardner May 15, 2023
79e3fa3
Update image name format
bsngardner May 17, 2023
fd6ce13
Alternate gcsfuse repo key is no longer available
bsngardner May 23, 2023
0b7cddb
Disable lustre install on debian-11
bsngardner May 24, 2023
139cf26
Add software versions to image doc
bsngardner May 25, 2023
b072855
Minor update to publish_image.sh
bsngardner May 25, 2023
9862512
Release 5.7.3
bsngardner May 25, 2023
3c5a3e5
Add notice of image name change to main README
bsngardner May 25, 2023
f4f20d3
Add slurm cluster management daemon
jvilarru May 25, 2023
49dc9fe
docs: Fix test doc typos
bsngardner Jun 6, 2023
28a98a8
doc: Update EOL of our published images
bsngardner Jun 6, 2023
d803c35
Gitignore service and timer files
SkylerMalinowski Jun 7, 2023
ace27f3
Update terraform.lock.hcl
SkylerMalinowski Jun 7, 2023
ac23c68
Merge branch 'test219' into 'dev'
SkylerMalinowski Jun 7, 2023
52ed9fb
Add changelog for previous commit
SkylerMalinowski Jun 7, 2023
aae7d38
Make changelog 6.0.0 section
SkylerMalinowski Jun 7, 2023
f6df27d
Add changelog 5.7.4 section
SkylerMalinowski Jun 7, 2023
f17eacd
Allow metadata key slurm_feature to initiate dynamic node setup
SkylerMalinowski Jun 6, 2023
00c77e6
Fix dynamic nodes using cloud_dns instead of cloud_reg_addrs
SkylerMalinowski Jun 6, 2023
c3890e1
Disable TreeWidth when dynamic nodes are configured
SkylerMalinowski Jun 6, 2023
0f6c829
Fix dynamic nodes failing to download custom scripts
SkylerMalinowski Jun 6, 2023
c216893
Fix slurmsync with only dynamic nodes in system
SkylerMalinowski Jun 6, 2023
2f4d0b8
Add dynamic node example
SkylerMalinowski Jun 6, 2023
7666f4b
Fix kernel update on rocky-linux-8
bsngardner Jun 13, 2023
e13262d
ci: Update testsuite docker image
bsngardner Jun 13, 2023
3d55f81
ci: fix pipenv install in docker image
bsngardner Jun 14, 2023
b190381
ci: update docker image
bsngardner Jun 14, 2023
6cc6ad1
doc: EOL centos-7 in Aug
bsngardner Jun 14, 2023
f60794c
Merge branch 'v5' into dev
bsngardner Jun 21, 2023
8090994
Upgrade Slurm to 23.02.2
bsngardner May 31, 2023
9841dca
Fix testsuite for Slurm 23.02
bsngardner Jun 2, 2023
2bda937
Update lustre repo url
bsngardner Jun 21, 2023
3194a28
Change job exclusive to use PowerDownOnIdle
bsngardner Jun 20, 2023
4e3bb86
Add changelog for previous commit
SkylerMalinowski Jun 7, 2023
3477502
slurm_cluster - use terraform 1.3 optional fields
SkylerMalinowski Jun 7, 2023
653925c
Add test_cluster example for slurm_cluster
SkylerMalinowski Jun 8, 2023
14cca22
tests: move tfvars.tpl into directory
SkylerMalinowski Jun 8, 2023
a527c34
ci: use test_cluster example
SkylerMalinowski Jun 8, 2023
3e81574
Flatten and prune slurm_cluster examples
SkylerMalinowski Jun 8, 2023
fb948d6
Prune optional fields from examples
SkylerMalinowski Jun 8, 2023
41fe265
Fix partitions input not compatible on destroy error
SkylerMalinowski Jun 9, 2023
a3ed8dc
Remove reconfigure implementation
SkylerMalinowski May 15, 2023
d5773b5
Move config generation to a new file
SkylerMalinowski Jun 13, 2023
2c14daa
Reimplemented Slurm reconfigure
SkylerMalinowski May 15, 2023
e36a2b8
changelog: add entry for previous commits
SkylerMalinowski May 18, 2023
fc9f809
Use gcs bucket instead of project metadata
SkylerMalinowski May 17, 2023
e84ac75
Use new Makefile for project
SkylerMalinowski Jun 15, 2023
9159497
terraform - refactor partition module
SkylerMalinowski Jun 14, 2023
be97827
feat: use new enable_public_ip to change access_config
SkylerMalinowski Jun 14, 2023
152312a
chore: make pc-autoupdate
SkylerMalinowski Jun 15, 2023
5164da5
chore: make tf-init-upgrade-recursive
SkylerMalinowski Jun 15, 2023
bfbf56b
Add partition options for power save settings and default
SkylerMalinowski Jun 15, 2023
f9b4395
Refactor gcs bucket download functions
bsngardner Jun 22, 2023
a6f7b8e
Refactor custom script download from blobs
bsngardner Jun 22, 2023
7bd5725
Update scripts for nodeset and partition split
bsngardner Jun 27, 2023
05e1373
Fix rebase artifact in changelog
SkylerMalinowski Jun 28, 2023
f9cd599
Increase nodeset_name length to 15 characters (from 7)
SkylerMalinowski Jun 28, 2023
22daec9
Remove partition_name length limit
SkylerMalinowski Jun 28, 2023
dbc1ca5
test: update the tfvars to use the new slurm_cluster
SkylerMalinowski Jun 28, 2023
1bb78b2
chore: make pc-run
SkylerMalinowski Jun 28, 2023
3b7b0d7
Fork terraform-google-vm//modules/instance_template
SkylerMalinowski Jun 23, 2023
b05b05d
Use forked instance-template module
SkylerMalinowski Jun 23, 2023
0d64404
chore: add SchedMD license notice
SkylerMalinowski Jun 26, 2023
34e0a31
Add bandwidth_tier support to instance templates
SkylerMalinowski Jun 23, 2023
0c5a127
Move spot preemptible support to instance template
SkylerMalinowski Jun 26, 2023
d74512b
Fix login template name not using group_name in name schema
SkylerMalinowski Jun 28, 2023
6f48e55
Fix job exclusive VM labels
bsngardner Jun 29, 2023
fff3f8d
test: prune redundant partition_conf from tfvars
SkylerMalinowski Jun 29, 2023
54ef68e
Add enable_login to toggle creation of login node resources
SkylerMalinowski Jul 4, 2023
60969b9
Fix regression in hybrid where prolog/epilog dir must exist
SkylerMalinowski Jul 4, 2023
4ce06c2
Fix hybrid missing conf.py
SkylerMalinowski Jul 4, 2023
5a0b902
hybrid - avoid certain cloud only functions
SkylerMalinowski Jul 4, 2023
0d0322f
hybrid - only run sync_placement_groups on controller
SkylerMalinowski Jul 4, 2023
668de5b
Fix sync_placement_groups usage of slurm_cluster_name
SkylerMalinowski Jul 4, 2023
00cc1db
Fix sync_placement_groups regex
SkylerMalinowski Jul 4, 2023
b342e86
Update to 23.02 SLURM_RESUME_FILE fields
SkylerMalinowski Jul 5, 2023
d2fa9a5
Switch to hostname compare (from nodelist)
SkylerMalinowski Jul 5, 2023
f5439a0
Fix update reconfigure wall message
SkylerMalinowski Jul 5, 2023
6220045
Add enable_public_ip and network_tier to nodeset
SkylerMalinowski Jul 5, 2023
9b7640e
Remove last slurm_depends_on
SkylerMalinowski Jul 5, 2023
e6ff1d6
Remove partition level startup-scripts and network mounts
SkylerMalinowski Jul 5, 2023
32c2323
Fix sync_placement_groups
bsngardner Jul 5, 2023
29194ce
Fix nvidia install on ubuntu-2004-lts
bsngardner Jul 6, 2023
30d9d90
Change partition level placement policy to nodeset level
SkylerMalinowski Jul 7, 2023
0519f37
Use topology.conf to prioritize nodes within nodesets
SkylerMalinowski Jul 7, 2023
bffd610
Fix slurmsync trying to destroy all placement groups
SkylerMalinowski Jul 10, 2023
dc6a0a0
Actually fix nvidia/cuda install on Ubuntu 20.04
bsngardner Jul 11, 2023
5fb0638
Remove debian-10 and rocky-linux-8 from support.
bsngardner Jul 11, 2023
3ca361c
Fix nvidia fix
bsngardner Jul 11, 2023
d237c84
Remove slurm_login_suffix metadata reference
bsngardner Jul 12, 2023
c5095d0
Fix threads per core inference
SkylerMalinowski Jul 11, 2023
a3306b6
Fix test_cluster example using create_service_accounts
SkylerMalinowski Jul 13, 2023
e6fc071
Fix service account binding to bucket
SkylerMalinowski Jul 14, 2023
a5d205a
ci: add custom startup script tests
bsngardner Jul 12, 2023
64cee2f
ci: only build and test core images on push
bsngardner Jul 11, 2023
0fa56e3
ci: Fix tfvars spot nodes
bsngardner Jul 13, 2023
fefaa9c
ci: Fix GPU test
bsngardner Jul 13, 2023
26f70ad
Fix mismatched project on destroy_nodes.py
bsngardner Jul 13, 2023
252ea93
Fix gres.conf generation
bsngardner Jul 14, 2023
5cec91e
ci: fix test_gpu_config test
bsngardner Jul 14, 2023
d6522cd
ci: fix test_preemption for nodesets
bsngardner Jul 17, 2023
31f4b26
ci: test python version increased to 3.10
bsngardner Jul 17, 2023
9e443ec
ci: Change test python environment to workaround
bsngardner Jul 17, 2023
dc487fb
Fix python requirements
bsngardner Jul 17, 2023
a01c6e3
ci: Add variable to only build and test core image
bsngardner Jul 17, 2023
9c41fff
docs: add upgrade_to_v6.md
SkylerMalinowski Jul 18, 2023
72ccea3
docs: update with test_cluster example
SkylerMalinowski Jul 18, 2023
ccfad3a
Improve recursive make targets for tf init
bsngardner Jul 19, 2023
abb74d1
chore: make tf-init-upgrade-recursive
bsngardner Jul 19, 2023
7971dd5
Upgrade Slurm to 23.02.3
bsngardner Jul 19, 2023
00f449d
Update image doc and defaults to 6.0
bsngardner Jul 19, 2023
2e20488
chore: run pre-commit --all-files
bsngardner Jul 19, 2023
9b0e65a
chore: update pre-commit
bsngardner Jul 19, 2023
9847ce6
Fix test_cluster example.tfvars
bsngardner Jul 20, 2023
5b449dd
Do not label local-ssd
bsngardner Jun 29, 2023
b3c2d24
Add on_host_maintenance to packer module
bsngardner Jul 11, 2023
da451df
Move packer image cleanup to shutdown script
bsngardner Jul 11, 2023
65d6a38
Fix retry power up of static nodes
bsngardner Jul 24, 2023
9dc7bb7
Add support for H3 machines and multiple sockets
bsngardner Jul 27, 2023
b9e9c94
Fix munge failing after manual reboot
bsngardner Jul 27, 2023
8de55e9
Ansible TPU docker image creation
jvilarru Jul 20, 2023
3c5f450
Include a flag to not generate the docker image on gitlab
jvilarru Jul 25, 2023
5b56d99
Add nodeset tpu to terraform
jvilarru Jul 20, 2023
9507f18
Adding google cloud tpu as script requirement
jvilarru Jul 23, 2023
59aa115
Add function in startup for the TPU nodes
jvilarru Jul 23, 2023
123b6f6
Add handling for the TPU nodes in the scripts
jvilarru Jul 23, 2023
7c53773
Add data_disk to the nodeset_tpu
jvilarru Jul 24, 2023
3a6fff4
Adding custom topology to tpu nodesets
jvilarru Jul 25, 2023
5503825
Node configuration now done based on node_type
jvilarru Jul 26, 2023
c8cdcbc
Change default docker image to be in schedmd-public
jvilarru Jul 26, 2023
676d2ee
Delete a node if it cannot be stopped
jvilarru Jul 26, 2023
cc13612
Sets to false preemptible and preserve_tpu for non-single nodes
jvilarru Jul 26, 2023
ff210cd
Changes default variable in slurm_partition
jvilarru Jul 27, 2023
f30d863
Removed project_id from nodeset_tpu and add new node type
jvilarru Jul 27, 2023
109033f
Big TPUs update
jvilarru Jul 27, 2023
63b61dc
Add example with TPU nodeset
jvilarru Jul 27, 2023
5f3b1d0
Change tag name structure for docker images
jvilarru Jul 27, 2023
4f5b0b0
Add TPU documentation
jvilarru Jul 27, 2023
aaf7b0c
Merge branch 'tpudev' into 'dev'
bsngardner Jul 28, 2023
01929bc
Add ignore_prefer_validation to SchedulerParameters
bsngardner Jul 27, 2023
6aa9ebe
Remove centos-7 from published images
bsngardner Jul 31, 2023
cb96510
Upgrade Slurm to 23.02.4
bsngardner Jul 31, 2023
e9990c6
docs: fix README_TF title
SkylerMalinowski Jul 31, 2023
c39d422
fix(terraform): tpu example
SkylerMalinowski Jul 31, 2023
e741ef0
chore: add new external ansible dependency
SkylerMalinowski Jul 31, 2023
78455bf
feat: add TPU nodesets to test_cluster example
SkylerMalinowski Jul 31, 2023
c45a264
fix(terraform): allow service_account=null
SkylerMalinowski Jul 31, 2023
e455130
feat(terraform): prevent partition without any nodesets
SkylerMalinowski Jul 31, 2023
9a4fb1e
revert: Changes default variable in slurm_partition
SkylerMalinowski Jul 31, 2023
b63d74b
feat: {resume|suspend}_timeout=null means use smart default
SkylerMalinowski Jul 31, 2023
8968b46
fix(terraform): partition precondition for tpus
SkylerMalinowski Jul 31, 2023
4f3a75d
Remove combined futures function
jvilarru Aug 1, 2023
3e9d5ea
Include subnetwork in the TPU creation
jvilarru Aug 1, 2023
c8b4919
Fix softlink gsutil docker image creation
jvilarru Aug 1, 2023
95b6161
Add documentation for the TPU nodes
jvilarru Aug 1, 2023
88565ed
Add missing tags to TPUs as well as service_account
jvilarru Aug 1, 2023
df54357
Add missing permission for TPU nodesets in bucket
jvilarru Aug 1, 2023
4c08970
fix: TPU not being in centos7 made util.py crash
jvilarru Aug 1, 2023
8d0308a
docs: Add tpu compatibility matrix
jvilarru Aug 1, 2023
cbc242d
Fix CUDA install on Ubuntu 20.04
bsngardner Aug 1, 2023
2b59b88
ci: add ubuntu-2004-lts to core images to build
bsngardner Aug 1, 2023
67d1940
Label 6.1.0 release in changelog
bsngardner Aug 1, 2023
7ff2a70
Update image references to 6.1.0
bsngardner Aug 1, 2023
2e0fc74
fix: in commit 4c089707 some code was unintentionally removed
jvilarru Aug 2, 2023
31d2ddf
fix: not suspending TPU nodes if no regular ones are present
jvilarru Aug 3, 2023
48e701d
ci: change placement policy test
bsngardner Aug 3, 2023
6cb0fd3
Update socket count for c2d
bsngardner Aug 3, 2023
cf8ee00
Change man2html to man2html-base dependency
jvilarru Aug 3, 2023
671e342
Add TPU job example
jvilarru Aug 3, 2023
c64740f
Changed default name of docker image
jvilarru Aug 3, 2023
72cba5a
Changelog for previous 4 commits
jvilarru Aug 3, 2023
7f64cc4
doc: ubuntu-2004-lts lapse in support notice
bsngardner Aug 3, 2023
13ab8f3
Separate packer docker build in two stages
jvilarru Aug 3, 2023
a5be802
Added new docker images to the list of published images
jvilarru Aug 4, 2023
1 change: 1 addition & 0 deletions .gitignore
@@ -1,6 +1,7 @@
### General ###
*.cache
*.log
*.zip

### Ansible ###
*.retry
259 changes: 259 additions & 0 deletions .gitlab-ci.yml
@@ -0,0 +1,259 @@
---
workflow:
  rules:
    # skip pipeline if the source is a tag
    - if: $CI_COMMIT_TAG
      when: never
    # Skip pipeline on merge request event
    # TODO this should be possible, but I'm not sure how it works yet
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
      when: never
    - when: always

stages:
  - build-images
  - test-clusters

variables:
  GOOGLE_APPLICATION_CREDENTIALS: $CI_PROJECT_DIR/sa.json
  SLURM_VERSION: 23.02.4
  BRANCH: $CI_COMMIT_BRANCH

default:
  image:
    name: registry.gitlab.com/schedmd/slurm-gcp/ci-image:0.0.7

  before_script:
    - set +o pipefail
    - packer --version
    - echo $SERVICE_ACCOUNT > $GOOGLE_APPLICATION_CREDENTIALS
    - GCP_PROJECT_ID=$(jq -r .project_id < $GOOGLE_APPLICATION_CREDENTIALS)
    - GCP_SA_EMAIL=$(jq -r .client_email < $GOOGLE_APPLICATION_CREDENTIALS)
    - GCP_SA_ID=$(jq -r .client_id < $GOOGLE_APPLICATION_CREDENTIALS)

    - 'echo Slurm version to build: $SLURM_VERSION'
    - SLURM_VERSION_ALT=$(tr \. - <<< $SLURM_VERSION)
    - IMAGE_FAMILY_ROOT=slurm-gcp-$BRANCH

    - gcloud auth activate-service-account --key-file $GOOGLE_APPLICATION_CREDENTIALS
    - gcloud config set project $GCP_PROJECT_ID

.build-image:
  stage: build-images
  retry: 2
  artifacts:
    name: image-manifest-$BRANCH-$IMAGE_OS
    paths:
      - packer/manifest.json
  script:
    - echo "Building $IMAGE_OS image for $BRANCH"
    - cd packer
    - packer init .
    - >
      packer build -var-file builds/$IMAGE_OS.hcl
      -var "project_id=$GCP_PROJECT_ID"
      -var "slurm_version=$SLURM_VERSION"
      -var "slurmgcp_version=$BRANCH"
      $EXTRA_VARS
      .
    - echo "packer build $IMAGE_OS completed"

.build-core-image:
  extends: .build-image
  rules:
    - changes:
        paths:
          - .gitlab-ci.yml
          - scripts/*.{py,sh}
          - scripts/Pipfile
          - scripts/requirements.txt
          - ansible/**/*
          - packer/**/*

# skip image build and test for all except a couple images on push
# build and test all on manual/scheduled pipelines
.build-extra-image:
  extends: .build-image
  rules:
    - if: $CI_PIPELINE_SOURCE == "push"
      when: never
    - if: $ONLY_CORE
      when: never
    - changes:
        paths:
          - .gitlab-ci.yml
          - scripts/*.{py,sh}
          - scripts/Pipfile
          - scripts/requirements.txt
          - ansible/**/*
          - packer/**/*

build-hpc-centos-7:
  extends: .build-extra-image
  variables:
    IMAGE_OS: hpc-centos-7

build-hpc-centos-7-k80:
  extends: .build-extra-image
  variables:
    IMAGE_OS: hpc-centos-7-k80

build-hpc-rocky-linux-8:
  extends: .build-core-image
  variables:
    IMAGE_OS: hpc-rocky-linux-8

build-debian-11:
  extends: .build-extra-image
  variables:
    IMAGE_OS: debian-11

# build-debian-11-arm64:
#   extends: .build-extra-image
#   variables:
#     IMAGE_OS: debian-11-arm64

build-ubuntu-2004-lts:
  extends: .build-core-image
  variables:
    IMAGE_OS: ubuntu-2004-lts

# build-ubuntu-2004-lts-arm64:
#   extends: .build-extra-image
#   variables:
#     IMAGE_OS: ubuntu-2004-lts-arm64

# build-ubuntu-2204-lts:
#   extends: .build-extra-image
#   variables:
#     IMAGE_OS: ubuntu-2204-lts

build-ubuntu-2204-lts-arm64:
  extends: .build-extra-image
  variables:
    IMAGE_OS: ubuntu-2204-lts-arm64

.test-image:
  stage: test-clusters
  retry: 1
  artifacts:
    name: $CI_COMMIT_REF_NAME-$CI_JOB_NAME-logs
    when: always
    paths:
      - test/cluster_logs/
  script:
    - IMAGE_FAMILY=$IMAGE_FAMILY_ROOT-$IMAGE_OS
    - echo "Image family $IMAGE_FAMILY"
    - >
      IMAGE_NAME=$(
      jq -r '.last_run_uuid as $uuid | .builds | map(select(.packer_run_uuid == $uuid))
      | .[].artifact_id' < packer/manifest.json || true
      )
    - echo "Testing ${IMAGE_NAME:-$IMAGE_FAMILY}"
    - cd test
    - CLUSTER_NAME="test$(tr -dc a-z </dev/urandom | head -c2)"
    - echo $CLUSTER_NAME | tee cluster_name
    - export PATH=/venv/bin:$PATH
    - pip3 install -r requirements.txt
    - >
      pytest -vs
      --project-id=$GCP_PROJECT_ID
      --cluster-name=$CLUSTER_NAME
      --image-project=$GCP_PROJECT_ID
      --image-family=$IMAGE_FAMILY
      --image=${IMAGE_NAME:-null}
      -m "${ARCH}"
  # after_script:
  #   - cd test
  #   - pipenv install
  #   - pipenv run ./cleanup.py $(cat cluster_name)

.test-extra-image:
  extends: .test-image
  rules:
    - if: $CI_PIPELINE_SOURCE == "push"
      when: never
    - if: $ONLY_CORE
      when: never
    - when: always

test-hpc-centos-7:
  extends: .test-extra-image
  needs:
    - job: build-hpc-centos-7
      optional: true
  variables:
    IMAGE_OS: hpc-centos-7
    ARCH: x86_64

test-hpc-centos-7-k80:
  extends: .test-extra-image
  needs:
    - job: build-hpc-centos-7-k80
      optional: true
  variables:
    IMAGE_OS: hpc-centos-7-k80
    ARCH: x86_64

test-hpc-rocky-linux-8:
  extends: .test-image
  needs:
    - job: build-hpc-rocky-linux-8
      optional: true
  variables:
    IMAGE_OS: hpc-rocky-linux-8
    ARCH: x86_64

test-debian-11:
  extends: .test-extra-image
  needs:
    - job: build-debian-11
      optional: true
  variables:
    IMAGE_OS: debian-11
    ARCH: x86_64

# test-debian-11-arm64:
#   extends: .test-extra-image
#   needs:
#     - job: build-debian-11-arm64
#       optional: true
#   variables:
#     IMAGE_OS: debian-11-arm64
#     ARCH: arm64

test-ubuntu-2004-lts:
  extends: .test-image
  needs:
    - job: build-ubuntu-2004-lts
      optional: true
  variables:
    IMAGE_OS: ubuntu-2004-lts
    ARCH: x86_64

# test-ubuntu-2004-lts-arm64:
#   extends: .test-extra-image
#   needs:
#     - job: build-ubuntu-2004-lts-arm64
#       optional: true
#   variables:
#     IMAGE_OS: ubuntu-2004-lts-arm64
#     ARCH: arm64

# test-ubuntu-2204-lts:
#   extends: .test-extra-image
#   needs:
#     - job: build-ubuntu-2204-lts
#       optional: true
#   variables:
#     IMAGE_OS: ubuntu-2204-lts
#     ARCH: x86_64

test-ubuntu-2204-lts-arm64:
  extends: .test-extra-image
  needs:
    - job: build-ubuntu-2204-lts-arm64
      optional: true
  variables:
    IMAGE_OS: ubuntu-2204-lts-arm64
    ARCH: arm64
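The IMAGE_NAME step in .test-image uses jq to pick the image built by the most recent Packer run out of packer/manifest.json, and the later ${IMAGE_NAME:-$IMAGE_FAMILY} expansion falls back to the image family when nothing was found. The Python sketch below restates that logic for readers less familiar with jq. It assumes the manifest carries the last_run_uuid, builds[].packer_run_uuid and builds[].artifact_id fields the filter reads; the function names are hypothetical and not part of slurm-gcp.

```python
import json
from pathlib import Path
from typing import List


def latest_artifact_ids(manifest: dict) -> List[str]:
    """Python restatement of the jq filter used in .test-image:

    .last_run_uuid as $uuid | .builds
      | map(select(.packer_run_uuid == $uuid)) | .[].artifact_id

    Assumption: every matching build carries an artifact_id key.
    """
    uuid = manifest.get("last_run_uuid")
    return [
        build["artifact_id"]
        for build in manifest.get("builds", [])
        if build.get("packer_run_uuid") == uuid
    ]


def resolve_image_name(manifest_path: str) -> str:
    """Approximates `jq -r ... < packer/manifest.json || true`.

    Returns "" when the manifest is missing or is not valid JSON, which is
    what the `|| true` guard leaves in IMAGE_NAME.
    """
    try:
        manifest = json.loads(Path(manifest_path).read_text())
    except (OSError, json.JSONDecodeError):
        return ""
    # jq -r prints one result per line and $(...) strips the final newline.
    return "\n".join(latest_artifact_ids(manifest))


def pick_image(image_name: str, image_family: str) -> str:
    """Mirrors ${IMAGE_NAME:-$IMAGE_FAMILY}: use the family if the name is empty."""
    return image_name or image_family
```

Filtering on last_run_uuid matters because Packer appends to an existing manifest file, so older builds can still be listed; only the artifacts from the current run are handed to pytest.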
11 changes: 6 additions & 5 deletions .pre-commit-config.yaml
@@ -1,7 +1,7 @@
 ---
 repos:
   - repo: https://github.com/pre-commit/pre-commit-hooks
-    rev: v4.3.0
+    rev: v4.4.0
     hooks:
       - id: check-merge-conflict
       - id: check-executables-have-shebangs
@@ -26,7 +26,7 @@ repos:
           - mdformat-toc
           - mdformat-tables
   - repo: https://github.com/jumanjihouse/pre-commit-hook-yamlfmt
-    rev: 0.2.2
+    rev: 0.2.3
     hooks:
       - id: yamlfmt
         args:
@@ -44,23 +44,24 @@ repos:
         types: [file, text]
         pass_filenames: false
   - repo: https://github.com/antonbabenko/pre-commit-terraform
-    rev: v1.76.0
+    rev: v1.81.0
     hooks:
       - id: terraform_fmt
       - id: terraform_validate
       - id: terraform_tflint
       - id: terraform_docs
         args:
+          - --args=--config=.terraform-docs.yaml
           - --hook-config=--create-file-if-not-exist=true
           - --hook-config=--path-to-file=README_TF.md
   - repo: https://github.com/psf/black
-    rev: 22.10.0
+    rev: 23.7.0
     hooks:
       - id: black
         exclude: ^dm/
         language_version: python3
   - repo: https://github.com/pycqa/flake8
-    rev: 5.0.4
+    rev: 6.0.0
     hooks:
       - id: flake8
         exclude: ^dm/
10 changes: 10 additions & 0 deletions .terraform-docs.yaml
@@ -0,0 +1,10 @@
---
formatter: markdown
output:
  mode: inject
  template: |-
    <!-- BEGINNING OF PRE-COMMIT-TERRAFORM DOCS HOOK -->
    {{ .Content }}
    <!-- END OF PRE-COMMIT-TERRAFORM DOCS HOOK -->
settings:
  lockfile: false
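With output.mode set to inject, terraform-docs writes its generated Markdown between the two HTML comment markers from the template and leaves the rest of the target file (README_TF.md, per the --path-to-file hook argument) untouched. The Python sketch below illustrates only that replacement step. inject_docs is a hypothetical helper, not terraform-docs' own implementation, and raising an error when a marker is missing is a choice made for this sketch.

```python
BEGIN = "<!-- BEGINNING OF PRE-COMMIT-TERRAFORM DOCS HOOK -->"
END = "<!-- END OF PRE-COMMIT-TERRAFORM DOCS HOOK -->"


def inject_docs(readme: str, content: str) -> str:
    """Replace everything from BEGIN through END with a freshly rendered block.

    The block mirrors the template in .terraform-docs.yaml: BEGIN, the
    generated content, then END, each on its own line. Text outside the
    markers is returned unchanged.
    """
    start = readme.find(BEGIN)
    if start == -1:
        raise ValueError("begin marker not found")
    end = readme.find(END, start + len(BEGIN))
    if end == -1:
        raise ValueError("end marker not found")
    block = f"{BEGIN}\n{content}\n{END}"
    return readme[:start] + block + readme[end + len(END):]
```

Because the markers are written back on every run, applying the same content twice yields the same file, which keeps the pre-commit hook from reporting spurious changes.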