CASMINST-5657 Add common WorkflowTemplate to sync secret to Argo namespace #5611

Merged
merged 18 commits into from
Dec 20, 2024
ff29b09
CASMINST-6711 - Command Mismatch When Setting up SNMP on Dell and Mel…
spillerc-hpe Dec 5, 2024
036ce1d
CASMTRIAGE-7545: handle idempotent annotation addition (#5586)
bo-quan Dec 5, 2024
f4eb37b
Automated API docs swagger to md conversion (https://jenkins.algol60.…
Dec 5, 2024
0a051c3
CASMINST-7041 add manual NCN upgrade documentation for use in case of…
leliasen-hpe Nov 21, 2024
a1cdf74
CASMINST-7041 Update troubleshooting/README.md with correct file name…
leliasen-hpe Nov 22, 2024
af223fa
CASMINST-7041 Apply formatting improvement suggestions from code review
leliasen-hpe Dec 4, 2024
d9ea6b4
CASMINST-7088-release-1.6 small docs adjustment to IUF deliver produc…
leliasen-hpe Dec 6, 2024
98f490a
TECHPUBS-4619: HPE Slingshot Network Operator docs
nrockershousen Nov 22, 2024
96e0d2a
TECHPUBS-4619: added link to HPESC
nrockershousen Dec 2, 2024
f158f86
CASMCMS-9226 - fix misspelling. (#5587)
dlaine-hpe Dec 6, 2024
524e529
CASMNET-2238 - Add switch firmware upgrade step to CSM upgrade proced…
spillerc-hpe Dec 10, 2024
315e536
SSI-14310 Update docs to reflect that USS now contains multiple compo…
don-bahls-hpe Dec 10, 2024
f4cd39c
CASMTRIAGE-7607: Grab docker-kubectl image from cacheImages for nexus…
studenym-hpe Dec 17, 2024
237547a
CASMTRIAGE-7504: Fix kubectl examples to use cray-console-data-postgr…
studenym-hpe Dec 17, 2024
ffc55b8
CASMTRIAGE-7616 make certmanager upgrade more robust, redo upgrade if…
leliasen-hpe Dec 17, 2024
9c15d64
CASMINST-7039 cleanup old images and overlayFS on upgrade (#5605)
jpdavis-prof Dec 17, 2024
b0ebab6
CASMTRIAGE-7577: Update to add cray-spire-jwks (#5593)
ndavidson-hpe Dec 18, 2024
b99c826
CASMINST-5657 Add common WorkflowTemplate to sync secret to Argo name…
Srinivas-Anand-HPE Dec 18, 2024
296 changes: 249 additions & 47 deletions api/cfs.md

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions operations/README.md
@@ -835,6 +835,7 @@ these backups.
- [Modifying a Tenant](multi-tenancy/Modify_a_Tenant.md)
- [Removing a Tenant](multi-tenancy/Remove_a_Tenant.md)
- [Slurm Operator](multi-tenancy/SlurmOperator.md)
- [HPE Slingshot Network Operator](multi-tenancy/hpe_slingshot_network_operator.md)
- [Tenant and Partition Management System (TAPMS) Overview](multi-tenancy/Tapms.md)
- [TAPMS Tenant Status API](../api/tapms-operator.md)
- [Global Tenant Hooks](multi-tenancy/GlobalTenantHooks.md)
2 changes: 1 addition & 1 deletion operations/image_management/Configure_IMS_to_Use_DKMS.md
@@ -7,7 +7,7 @@ tool. This allows kernel modules to be built for the specific kernel used in the
access to the running kernel that is not usually allowed by the Image Management Service (IMS). In order to safely
allow the expanded access, the IMS configuration must be modified to enable the feature.

## Requirements of DMKS
## Requirements of DKMS

Many DKMS build and install scripts require access to the system `/proc`, `/dev`, and `/sys` directories which
allows access to running processes and system services. The IMS jobs run as an administrator user since preparing
4 changes: 2 additions & 2 deletions operations/iuf/workflows/admin_directory.md
@@ -98,15 +98,15 @@ the HPC CSM Software Recipe with the existing content in `${ADMIN_DIR}`.

Example output:

```text
```yaml
default:
network_type: "cassini"
suffix: "-test01"
system_name: "my-system"
site_domain: "my-site-domain.net"
uss:
deploy_slurm: true
deploy_pbs: true
deploy_pbs: false
```

3. Ensure the expected files are present in the admin directory after performing the steps in this section.
37 changes: 17 additions & 20 deletions operations/iuf/workflows/configuration.md
@@ -91,14 +91,23 @@ The following highlights some of the areas that require manual configuration cha
required for initial installation scenarios.

- USS
- Configure DVS and LNet with appropriate Slingshot settings
- Configure DVS and LNet for use on application nodes
- Enable site-specific file system mounts
- Set the USS root password in HashiCorp Vault
- UAN
- Enable CAN, LDAP, and set MOTD
- Move DVS and LNet settings to USS branch
- Set the UAN root password in HashiCorp Vault
- Compute Configuration
- Configure DVS and LNet with appropriate Slingshot settings
- Configure DVS and LNet for use on application nodes
- Enable site-specific file system mounts
- Set the USS root password in HashiCorp Vault
- UAN Configuration
- Enable CAN, LDAP, and set MOTD
- Move DVS and LNet settings to USS branch
- Set the UAN root password in HashiCorp Vault
- Enable UAIs on UAN
- SLURM Configuration
- CSM Diags
- Update CSM Diags network attachment definition
- PBS Pro Configuration
- CSM Diags
- Update CSM Diags network attachment definition

- SHS
- Update release information in `group_vars` (done for each product release)
- CPE
@@ -109,18 +118,6 @@ required for initial installation scenarios.
- Configure SAT authentication via `sat auth`
- Generate SAT S3 credentials
- Configure system revision information via `sat setrev`
- SLURM
- UAS
- Configure UAS network settings
- The network settings for UAS must match the SLURM WLM to allow job submission from UAIs
- CSM Diags
- Update CSM Diags network attachment definition
- PBS Pro
- UAS
- Configure UAS network settings
- The network settings for UAS must match the PBS Pro WLM to allow job submission from UAIs
- CSM Diags
- Update CSM Diags network attachment definition

Once this step has completed:

64 changes: 40 additions & 24 deletions operations/iuf/workflows/management_rollout.md
@@ -3,14 +3,15 @@
This section updates the software running on management NCNs.

- [1. Perform Slingshot switch firmware updates](#1-perform-slingshot-switch-firmware-updates)
- [2. Update management host firmware (FAS)](#2-update-management-host-firmware-fas)
- [3. Execute the IUF `management-nodes-rollout` stage](#3-execute-the-iuf-management-nodes-rollout-stage)
- [3.1 `management-nodes-rollout` with CSM upgrade](#31-management-nodes-rollout-with-csm-upgrade)
- [3.2 `management-nodes-rollout` without CSM upgrade](#32-management-nodes-rollout-without-csm-upgrade)
- [3.3 NCN worker nodes](#33-ncn-worker-nodes)
- [4. Restart `goss-servers` on all NCNs](#4-restart-goss-servers-on-all-ncns)
- [5. Update management host Slingshot NIC firmware](#5-update-management-host-slingshot-nic-firmware)
- [6. Next steps](#6-next-steps)
- [2. Perform management network switch firmware updates](#2-perform-management-network-switch-firmware-updates)
- [3. Update management host firmware (FAS)](#3-update-management-host-firmware-fas)
- [4. Execute the IUF `management-nodes-rollout` stage](#4-execute-the-iuf-management-nodes-rollout-stage)
- [4.1 `management-nodes-rollout` with CSM upgrade](#41-management-nodes-rollout-with-csm-upgrade)
- [4.2 `management-nodes-rollout` without CSM upgrade](#42-management-nodes-rollout-without-csm-upgrade)
- [4.3 NCN worker nodes](#43-ncn-worker-nodes)
- [5. Restart `goss-servers` on all NCNs](#5-restart-goss-servers-on-all-ncns)
- [6. Update management host Slingshot NIC firmware](#6-update-management-host-slingshot-nic-firmware)
- [7. Next steps](#7-next-steps)

## 1. Perform Slingshot switch firmware updates

@@ -22,7 +23,22 @@ Once this step has completed:

- Slingshot switch firmware has been updated

## 2. Update management host firmware (FAS)
## 2. Perform management network switch firmware updates

**`NOTE`** This section is optional and can be skipped or deferred unless network configuration that requires updated firmware is being applied to the system.

Management network switch firmware is shipped in the HPC Firmware Pack (HFP) product tarball.

Refer to [Update Management Network Firmware](../../network/management_network/firmware/update_management_network_firmware.md) for instructions on performing the switch firmware update.

**`NOTE`** The firmware on spine, leaf, and CDU switches can be updated without disruption. Air-cooled compute nodes, their BMCs, and other air-cooled devices
such as Slingshot switches will lose connectivity while the leaf-bmc switch to which they are connected restarts.

Once this step has been completed:

- Management network switch firmware has been updated

## 3. Update management host firmware (FAS)

**`NOTE`** This subsection is optional and can be skipped if upgrading only CSM through IUF.

@@ -32,7 +48,7 @@ Once this step has completed:

- Host firmware has been updated on management nodes

## 3. Execute the IUF `management-nodes-rollout` stage
## 4. Execute the IUF `management-nodes-rollout` stage

This section describes how to update software on management nodes. It describes how to test a new image and CFS configuration on a single node first to ensure they work as expected before rolling the changes out to the other management
nodes. This initial test node is referred to as the "canary node". Modify the procedure as necessary to accommodate site preferences for rebuilding management nodes. The images and CFS configurations used are created by the
@@ -49,10 +65,10 @@ being upgraded, then NCN storage nodes and NCN master nodes will not be upgraded
upgraded, the NCN storage nodes and NCN master nodes will be upgraded with new images and the new CFS configuration. Both procedures use the same steps for rebuilding/upgrading NCN worker nodes. Select **one** of the following
procedures based on whether or not CSM is being upgraded:

- [`management-nodes-rollout` with CSM upgrade](#31-management-nodes-rollout-with-csm-upgrade)
- [`management-nodes-rollout` without CSM upgrade](#32-management-nodes-rollout-without-csm-upgrade)
- [`management-nodes-rollout` with CSM upgrade](#41-management-nodes-rollout-with-csm-upgrade)
- [`management-nodes-rollout` without CSM upgrade](#42-management-nodes-rollout-without-csm-upgrade)

### 3.1 `management-nodes-rollout` with CSM upgrade
### 4.1 `management-nodes-rollout` with CSM upgrade

All management nodes will be upgraded to a new image because CSM itself is being upgraded.
This section describes how to test a new image and CFS configuration on a single canary node first before rolling it out to the other management nodes of the same management type.
@@ -152,7 +168,7 @@ Refer to that table and any corresponding product documents before continuing to
cray cfs components describe "${XNAME}"
```

1. Perform the NCN worker node upgrade. To upgrade worker nodes, follow the procedure in section [3.3 NCN worker nodes](#33-ncn-worker-nodes) and then return to this procedure to complete the next step.
1. Perform the NCN worker node upgrade. To upgrade worker nodes, follow the procedure in section [4.3 NCN worker nodes](#43-ncn-worker-nodes) and then return to this procedure to complete the next step.

1. Perform the NCN master node upgrade of `ncn-m001`.

@@ -200,9 +216,9 @@ Refer to that table and any corresponding product documents before continuing to
- All management NCNs have been upgraded to the image and CFS configuration created in the previous steps of this workflow
- Per-stage product hooks have executed for the `management-nodes-rollout` stage

Continue to the next section [4. Restart `goss-servers` on all NCNs](#4-restart-goss-servers-on-all-ncns).
Continue to the next section [5. Restart `goss-servers` on all NCNs](#5-restart-goss-servers-on-all-ncns).

### 3.2 `management-nodes-rollout` without CSM upgrade
### 4.2 `management-nodes-rollout` without CSM upgrade

This is the procedure to roll out management nodes if CSM is not being upgraded. NCN worker node images contain kernel module content from non-CSM products and need to be rebuilt as part of the workflow.
Unlike NCN worker nodes, NCN master nodes and storage nodes do not contain kernel module content from non-CSM products. However, user-space non-CSM product content is still provided on NCN master nodes and storage nodes and thus the `prepare-images` and `update-cfs-config`
@@ -215,7 +231,7 @@ Follow the following steps to complete the `management-nodes-rollout` stage.
section of the _HPE Cray EX System Software Stack Installation and Upgrade Guide for CSM (S-8052)_ provides a table that summarizes which product documents contain information or actions for the `management-nodes-rollout` stage.
Refer to that table and any corresponding product documents before continuing to the next step.

1. Rebuild the NCN worker nodes. Follow the procedure in section [3.3 NCN worker nodes](#33-ncn-worker-nodes) and then return to this procedure to complete the next step.
1. Rebuild the NCN worker nodes. Follow the procedure in section [4.3 NCN worker nodes](#43-ncn-worker-nodes) and then return to this procedure to complete the next step.

1. Configure NCN master nodes.

@@ -339,9 +355,9 @@ Once this step has completed:
- Management NCN storage and NCN master nodes have been updated with the CFS configuration created in the previous steps of this workflow.
- Per-stage product hooks have executed for the `management-nodes-rollout` stage

Continue to the next section [4. Restart `goss-servers` on all NCNs](#4-restart-goss-servers-on-all-ncns).
Continue to the next section [5. Restart `goss-servers` on all NCNs](#5-restart-goss-servers-on-all-ncns).

### 3.3 NCN worker nodes
### 4.3 NCN worker nodes

NCN worker node images contain kernel module content from non-CSM products and need to be rebuilt as part of the workflow. This section describes how to test a new image and CFS configuration on a single canary node (`ncn-w001`) first before
rolling it out to the other NCN worker nodes. Modify the procedure as necessary to accommodate site preferences for rebuilding NCN worker nodes.
@@ -432,10 +448,10 @@ Once this step has completed:
- Management NCN worker nodes have been rebuilt with the image and CFS configuration created in previous steps of this workflow
- Per-stage product hooks have executed for the `management-nodes-rollout` stage

Return to the procedure that was being followed for `management-nodes-rollout` to complete the next step, either [Management-nodes-rollout with CSM upgrade](#31-management-nodes-rollout-with-csm-upgrade) or
[Management-nodes-rollout without CSM upgrade](#32-management-nodes-rollout-without-csm-upgrade).
Return to the procedure that was being followed for `management-nodes-rollout` to complete the next step, either [Management-nodes-rollout with CSM upgrade](#41-management-nodes-rollout-with-csm-upgrade) or
[Management-nodes-rollout without CSM upgrade](#42-management-nodes-rollout-without-csm-upgrade).

## 4. Restart `goss-servers` on all NCNs
## 5. Restart `goss-servers` on all NCNs

**`NOTE`** Skip this step if the CSM version is 1.6.1 or above. This step will cause no harm if done on CSM 1.6.1 or higher, but it is unnecessary.

@@ -449,7 +465,7 @@ ncn_nodes=${ncn_nodes%,}
pdsh -S -b -w $ncn_nodes 'systemctl restart goss-servers'
```
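
The `${ncn_nodes%,}` expansion visible in the hunk context strips a single trailing comma so that `pdsh -w` receives a clean comma-separated host list. The following is a minimal sketch of that idiom only; how the real procedure builds `ncn_nodes` is not shown in this hunk, so the `join_nodes` helper and the hostnames are assumptions for illustration.

```bash
# Hypothetical helper: join the given names with commas.
# The real procedure derives the node list elsewhere (not shown in this hunk).
join_nodes() {
    list=""
    for node in "$@"; do
        list="${list}${node},"    # append each name followed by a comma
    done
    # ${list%,} removes the shortest suffix matching "," (one trailing comma)
    printf '%s\n' "${list%,}"
}

# Example hostnames are placeholders, not a statement about any real system.
ncn_nodes=$(join_nodes ncn-m001 ncn-w001 ncn-s001)
echo "$ncn_nodes"
```

The resulting string can then be passed to `pdsh -w` exactly as in the command above.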

## 5. Update management host Slingshot NIC firmware
## 6. Update management host Slingshot NIC firmware

**`NOTE`** This subsection is optional and can be skipped if upgrading only CSM through IUF.

@@ -464,7 +480,7 @@ Once this step has completed:
- Service checks have been run to verify product microservices are executing as expected
- Per-stage product hooks have executed for the `deploy-product` and `post-install-service-check` stages

## 6. Next steps
## 7. Next steps

- If performing an initial install or an upgrade of non-CSM products only, return to the
[Install or upgrade additional products with IUF](install_or_upgrade_additional_products_with_iuf.md)
4 changes: 2 additions & 2 deletions operations/iuf/workflows/product_delivery.md
@@ -67,8 +67,8 @@ Refer to that table and any corresponding product documents before continuing to
Additional arguments are available to control the behavior of the `deliver-product` stage (for example, `-rv`). See the [`deliver-product` stage documentation](../stages/deliver_product.md)
for details and adjust the example below if necessary.

**`NOTE`** When installing USS 1.1 or higher, select either SLURM or PBS Pro Products to use on the system before running this stage. For more information, see the `deliver-product` stage
details in the "Install and Upgrade Framework" section of the _HPE Cray Supercomputing User Services Software Administration Guide: CSM on HPE Cray Supercomputing EX Systems (S-8063)_.
**`NOTE`** When installing USS 1.1 or higher, select either Slurm or PBS Pro Products to use on the system before running this stage. This should be specified in `site_vars.yaml`.
For more information, see the `deliver-product` stage details in the "Install and Upgrade Framework" section of the _HPE Cray Supercomputing User Services Software Administration Guide: CSM on HPE Cray Supercomputing EX Systems (S-8063)_.
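
As a rough pre-flight check before running the stage, the workload manager selection in `site_vars.yaml` can be inspected from the shell. This is only a sketch under stated assumptions: it takes the `deploy_slurm` and `deploy_pbs` keys from the `admin_directory.md` example in this change, assumes they appear as simple `key: value` lines, and reads "select either Slurm or PBS Pro" as meaning exactly one flag is `true`. The function names are hypothetical and this is not a YAML parser.

```bash
# Print the WLM flags found in a site_vars.yaml-style file as key=value lines.
# Assumption: flags are plain "key: value" lines with no trailing comments.
wlm_flags() {
    awk -F': *' '
        /^[ \t]*deploy_(slurm|pbs):/ {
            gsub(/^[ \t]+/, "", $1)   # drop leading indentation from the key
            print $1 "=" $2
        }' "$1"
}

# Succeed only when exactly one of deploy_slurm / deploy_pbs is set to true.
exactly_one_wlm() {
    [ "$(wlm_flags "$1" | grep -c '=true$')" -eq 1 ]
}

# Demo against a temporary file; the nesting under "uss:" is assumed.
demo=$(mktemp)
printf 'uss:\n  deploy_slurm: true\n  deploy_pbs: false\n' > "$demo"
if exactly_one_wlm "$demo"; then
    echo "exactly one WLM selected"
else
    echo "check deploy_slurm and deploy_pbs"
fi
rm -f "$demo"
```

In practice the function would be pointed at `${ADMIN_DIR}/site_vars.yaml` rather than a temporary file.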

(`ncn-m001#`) Execute the `deliver-product` stage. Use site variables from the `site_vars.yaml` file found in `${ADMIN_DIR}` and recipe variables from the `product_vars.yaml` file found in `${ADMIN_DIR}`.

6 changes: 3 additions & 3 deletions operations/kubernetes/Troubleshoot_Postgres_Database.md
@@ -306,7 +306,7 @@ For example:
Re-run the following command until it succeeds and reports that the leader pod is `running`.

```bash
kubectl exec keycloak-postgres-0 -c postgres -n services -it -- patronictl list
kubectl exec cray-console-data-postgres-0 -c postgres -n services -it -- patronictl list
```

Example output:
@@ -328,7 +328,7 @@ For example:
1. (`ncn-mw#`) Determine which pods are reporting lag.

```bash
kubectl exec cray-console-postgres-0 -c postgres -n services -it -- patronictl list
kubectl exec cray-console-data-postgres-0 -c postgres -n services -it -- patronictl list
```

Example output:
@@ -352,7 +352,7 @@ For example:
1. (`ncn-mw#`) Once the pods restart, verify that the lag has resolved.

```bash
kubectl exec cray-console-postgres-0 -c postgres -n services -it -- patronictl list
kubectl exec cray-console-data-postgres-0 -c postgres -n services -it -- patronictl list
```

Example output:
14 changes: 14 additions & 0 deletions operations/multi-tenancy/hpe_slingshot_network_operator.md
@@ -0,0 +1,14 @@
# HPE Slingshot Network Operator

Starting in the HPE Slingshot 2.3.0 release, the HPE Slingshot Network Operator is installed as part of the Fabric Manager install.
It is a Kubernetes operator that is designed to support multi-tenancy in CSM 1.6 and later releases.

For more information on the HPE Slingshot Network Operator, see the "HPE Slingshot Network Operator for CSM Multi-Tenancy" section in the _HPE Slingshot Administration Guide_. Search for this document on the [HPE Support Center](https://support.hpe.com/hpesc/public/home).

The HPE Slingshot documentation outlines several critical tasks, including:

- Enabling the HPE Slingshot Network Operator
- Creating HPE Slingshot tenants
- Modifying HPE Slingshot tenants
- Updating VNI and tenant node component names (xnames)
- Removing HPE Slingshot tenants
54 changes: 44 additions & 10 deletions operations/network/management_network/dell/snmp-community.md
@@ -1,25 +1,59 @@
# Configure SNMPv2c Community
# Configure SNMPv2c community

The switch supports SNMPv2c community-based security for read-only access.
The switch supports SNMPv2c community-based security for read-only and read-write access.

## Configuration Commands
## Configuration commands

Configure an SNMPv2c community name:
### Configure the SNMP community

1. Enter configuration mode.

```console
configure terminal
```

1. Configure the SNMPv2c community name.

```console
snmp-server community community-name access-mode
```

Parameters:

| Parameter | Description |
|------------------|----------------------------------------------------------------------------------------------|
| `community-name` | The user-defined name for this community. |
| `access-mode` | The access level for this community. Can be `ro` for read-only or `rw` for read-write access. |

### Example

The following command configures a read-only SNMP community called "public".

```text
snmp-server community community-name
snmp-server community public ro
```

Show commands to validate functionality:
When successful, this command returns no output.

```text
### Show configured SNMP community

The following command displays information about any SNMP community that may have been configured.

```console
show snmp community
```

Example output:

```text
Community : public
Access : read-only
```

## Expected Results

1. Administrators can configure the community name
2. Administrators can bind the SNMP server to the default VRF
3. Administrators can connect from the workstation using the community name
1. Administrators can configure the community name.
2. Administrators can bind the SNMP server to the default VRF.
3. Administrators can connect from the workstation using the community name.
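
The third expected result can be checked from a workstation with a standard SNMP client. The sketch below assumes the Net-SNMP `snmpget` tool is installed; the switch hostname `sw-leaf-bmc-001` is a placeholder, and `public` matches the example community above. To stay safe to paste, the helper (a hypothetical name) only prints the command rather than running it.

```bash
# Build (but do not run) a Net-SNMP query for sysDescr.0 over SNMPv2c.
# Assumptions: snmpget from Net-SNMP is available on the workstation;
# the hostname below is a placeholder for a real switch.
snmp_check_cmd() {
    host=$1
    community=$2
    # 1.3.6.1.2.1.1.1.0 is the numeric OID for SNMPv2-MIB::sysDescr.0
    printf 'snmpget -v2c -c %s %s 1.3.6.1.2.1.1.1.0\n' "$community" "$host"
}

snmp_check_cmd sw-leaf-bmc-001 public
```

Running the printed command against a reachable switch should return the system description if the community is configured and the workstation is permitted to query it.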

[Back to Index](../README.md)