MTL-1695 Overhaul 1.3 Documentation (Cray-HPE#1554)
MTL-1695

Consolidate the Install path, define a new flow that incorporates the
automation calls for CSI input files. Ensure cable and SHCDs are checked
before deploying NCNs and right after.

Use a new command line context convention; make the code snippets
"copy-pasteable."

Add README.md symbolic links for rendering index.md on GitHub where
index.md exists.

Linting of extra white space.

Linting of headers; all headers use hyphens to allow native matching to
header references while also matching to anchor refs.
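
The hyphenated-header convention can be checked mechanically. The sketch below is a minimal approximation, assuming GitHub derives a heading's anchor by lowercasing it, dropping punctuation other than hyphens and underscores, and turning spaces into hyphens. `heading_anchor` and `broken_fragment_links` are hypothetical helper names, not part of this repository, and duplicate headings and other edge cases are not handled.

```python
import re


def heading_anchor(heading):
    """Approximate the anchor GitHub generates for a Markdown heading.

    Assumption: lowercase, drop punctuation except '-' and '_',
    then replace each space with a hyphen.
    """
    text = heading.strip().lower()
    text = re.sub(r"[^\w\- ]", "", text)
    return text.replace(" ", "-")


def broken_fragment_links(markdown):
    """Return in-page link fragments that match no heading in the text."""
    headings = re.finditer(r"^#{1,6}[ \t]+(.+)$", markdown, re.MULTILINE)
    anchors = {heading_anchor(m.group(1)) for m in headings}
    # Only same-file links of the form [text](#fragment) are considered.
    fragments = re.findall(r"\]\(#([^)]+)\)", markdown)
    return [fragment for fragment in fragments if fragment not in anchors]


if __name__ == "__main__":
    sample = (
        "### Topics:\n"
        "* [NCN Images](#ncn-images)\n"
        "* [NCN Packages](#ncn_packages)\n"
        "\n"
        "### NCN Images\n"
        "\n"
        "### NCN Packages\n"
    )
    print(broken_fragment_links(sample))
```

With the sample input, the underscore-style reference `#ncn_packages` is reported because the heading renders with hyphens, which is the mismatch this commit removes by converting references and dropping the explicit `<a name>` tags.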
rustydb authored Jun 10, 2022
1 parent 64ee899 commit 421aa9b
Showing 755 changed files with 10,323 additions and 14,158 deletions.
9 changes: 7 additions & 2 deletions .github/CODEOWNERS
@@ -1,2 +1,7 @@
# docs-csm pull request review team
* @Cray-HPE/docs-csm-reviewers

* @Cray-HPE/docs-csm-reviewers

background/ncn_* @Cray-HPE/metal
install/* @Cray-HPE/metal
install/livecd @Cray-HPE/metal
operations/network @Cray-HPE/management-network
17 changes: 17 additions & 0 deletions .github/PULL_REQUEST_TEMPLATE.md
@@ -0,0 +1,17 @@
# Description

<!--- Describe what this change is and what it is for. -->

# Checklist Before Merging

<!--- An empty checkbox is two brackets with a space in between; a checked checkbox is two brackets with an x in between -->
<!--- unchecked checkbox: [ ] -->
<!--- checked checkbox: [x] -->
<!--- invalid checkbox: [] -->

- [ ] If I added any command snippets, the steps they belong to follow the prompt conventions (see [example][1]).
- [ ] If I added a new directory, I also updated `.github/CODEOWNERS` with the corresponding team in [Cray-HPE][2].
- [ ] My commits or Pull-Request Title contain my JIRA information, or I don't have a JIRA.

[1]: https://github.com/Cray-HPE/docs-csm/blob/MTL-1695/introduction/documentation_conventions.md#using-prompts
[2]: https://github.com/Cray-HPE/teams
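
The checkbox rules spelled out in the template comments can be expressed as a small classifier. This is a sketch only: `classify_checkboxes` is a hypothetical helper, not something this commit adds, and treating an uppercase `X` as checked is an assumption beyond what the template states.

```python
import re

# A list item that starts with a bracket pair holding at most one character.
CHECKBOX = re.compile(r"^[ \t]*[-*][ \t]+\[([^\]]?)\]", re.MULTILINE)


def classify_checkboxes(body):
    """Label each checkbox in a PR description, in order of appearance."""
    states = []
    for match in CHECKBOX.finditer(body):
        mark = match.group(1)
        if mark == " ":
            states.append("unchecked")
        elif mark in ("x", "X"):  # uppercase accepted as an assumption
            states.append("checked")
        else:
            # Covers "[]" (no space) and any other single character.
            states.append("invalid")
    return states


if __name__ == "__main__":
    description = "- [ ] prompts follow conventions\n- [x] CODEOWNERS updated\n- [] JIRA linked\n"
    print(classify_checkboxes(description))
```

The HTML comment lines in the template are not matched because they do not begin with a list marker.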
2 changes: 1 addition & 1 deletion .version
@@ -1 +1 @@
-1.13.15
+1.14.0
6 changes: 4 additions & 2 deletions Jenkinsfile.github
@@ -75,8 +75,10 @@ pipeline {
} else {
RELEASE_FOLDER = ""
}
-publishCsmRpms(component: env.GIT_REPO_NAME + RELEASE_FOLDER, pattern: "dist/rpmbuild/RPMS/noarch/*.rpm", arch: "noarch", isStable: env.IS_STABLE)
-publishCsmRpms(component: env.GIT_REPO_NAME + RELEASE_FOLDER, pattern: "dist/rpmbuild/SRPMS/*.rpm", arch: "src", isStable: env.IS_STABLE)
+publishCsmRpms(component: env.GIT_REPO_NAME + RELEASE_FOLDER, pattern: "dist/rpmbuild/RPMS/noarch/*.rpm", os: "sle-15sp2", arch: "noarch", isStable: env.IS_STABLE)
+publishCsmRpms(component: env.GIT_REPO_NAME + RELEASE_FOLDER, pattern: "dist/rpmbuild/RPMS/noarch/*.rpm", os: "sle-15sp3", arch: "noarch", isStable: env.IS_STABLE)
+publishCsmRpms(component: env.GIT_REPO_NAME + RELEASE_FOLDER, pattern: "dist/rpmbuild/SRPMS/*.rpm", os: "sle-15sp2", arch: "src", isStable: env.IS_STABLE)
+publishCsmRpms(component: env.GIT_REPO_NAME + RELEASE_FOLDER, pattern: "dist/rpmbuild/SRPMS/*.rpm", os: "sle-15sp3", arch: "src", isStable: env.IS_STABLE)
}
}
}
150 changes: 130 additions & 20 deletions README.md
@@ -1,20 +1,130 @@
# Cray System Management (CSM) - README

The documentation included here describes how to install or upgrade the Cray System Management (CSM)
software and related supporting operational procedures. CSM software is the foundation upon which
other software product streams for the HPE Cray EX system depend.

This documentation is in Markdown format. Although much of it can be viewed with any text editor,
a richer experience will come from using a tool which can render the Markdown to show different font
sizes, the use of bold and italics formatting, inclusion of diagrams and screen shots as image files,
and to follow navigational links within a topic file and to other files.

There are many tools which can render the Markdown format to get these advantages. Any Internet search
for Markdown tools will provide a long list of these tools. Some of the tools are better than others
at displaying the images and allowing you to follow the navigational links.

The exploration of the CSM documentation begins with
the [Cray System Management Documentation](index.md) which introduces
topics related to CSM software installation, upgrade, and operational use. Notice that the
previous sentence had a link to the index.md file for the Cray System Management Documentation.
If the link does not work, then a better Markdown viewer is needed.
# Cray System Management Documentation

## Scope and Audience

The documentation included here describes the Cray System Management (CSM) software, how to install
or upgrade CSM software, and related supporting operational procedures to manage an HPE Cray EX system.
CSM software is the foundation upon which other software product streams for the HPE Cray EX system depend.

The CSM installation prepares and deploys a distributed system across a group of management
nodes organized into a Kubernetes cluster which uses Ceph for utility storage. These nodes
perform their function as Kubernetes master nodes, Kubernetes worker nodes, or utility storage
nodes with the Ceph storage.

System services on these nodes are provided as containerized micro-services packaged for deployment
via Helm charts. Kubernetes orchestrates these services and schedules them on Kubernetes worker
nodes with horizontal scaling. Horizontal scaling increases or decreases the number of service instances as
demand for them varies, such as when booting many compute nodes or application nodes.

This information is intended for system installers, system administrators, and network administrators
of the system. It assumes some familiarity with standard Linux and open source tools, such as shell
scripts, revision control with git, configuration management with Ansible, YAML, JSON, and TOML file formats, etc.

## Table of Contents

1. [Introduction to CSM Installation](introduction/README.md)

This chapter introduces the use of CSM software to manage the HPE Cray EX system. It describes
the scenarios for installation and upgrade of CSM software, how product stream updates
for CSM are delivered, the operational activities done after installation for ongoing management
of the HPE Cray EX system, differences between the previous release and this release, and the conventions
used in this documentation.

1. [Bare-Metal Steps](operations/bare_metal/Bare-Metal.md)

This chapter outlines how to set up default credentials for River BMCs and
ServerTech PDUs, which must be done before the initial installation of
CSM, in order to enable HSM software to interact with River Redfish BMCs
and PDUs.

1. [Update CSM Product Stream](update_product_stream/README.md)

This chapter explains how to get the CSM product release, get any patches, update to the latest
documentation, and check for any Field Notices or Hotfixes.

1. [Install CSM](install/README.md)

This chapter provides an ordered list of procedures for CSM software installation or reinstallation,
indicating when to do operational tasks as part of the installation workflow. Updating software is in another chapter.
Installation of the CSM product stream has many steps in multiple procedures which should be done in a
specific order. Information about the HPE Cray EX system and the site is used to prepare the configuration
payload. The initial node used to bootstrap the installation process is called the PIT node because the
Pre-Install Toolkit is installed there. Once the management network switches have been configured, the other
management nodes can be deployed with an operating system and the software to create a Kubernetes cluster
utilizing Ceph storage. The CSM services provide essential software infrastructure including the API gateway
and many micro-services with REST APIs for managing the system. Once administrative access has been configured,
the installation of CSM software and nodes can be validated with health checks before doing operational tasks
like the check and update of firmware on system components or the preparation of compute nodes.

1. [Upgrade CSM](upgrade/README.md)

This chapter provides an ordered list of procedures for upgrading CSM software, indicating when
to do operational tasks as part of the software upgrade workflow. There are procedures to prepare the
HPE Cray EX system for the upgrade, and to update the management network, the management nodes, and the CSM services.
After the upgrade of CSM software, the CSM health checks are used to validate the system before doing any other
operational tasks like the check and update of firmware on system components.

1. [CSM Operational Activities](operations/README.md)

This chapter provides an unordered set of administrative procedures required to operate an HPE Cray EX system with CSM software, grouped into several major areas:
* CSM Product Management
* Artifact Management
* Boot Orchestration
* Compute Rolling Upgrade
* Configuration Management
* Console Management
* Firmware Management
* Hardware State Manager
* Image Management
* Kubernetes
* Network Management
* Node Management
* Package Repository Management
* Power Management
* Resiliency
* River Endpoint Discovery Service
* Security And Authentication
* System Configuration Service
* System Layout Service
* System Management Health
* UAS User And Admin Topics
* Utility Storage
* Validate CSM Health

1. [CSM Troubleshooting Information](troubleshooting/README.md)

This chapter provides information about some known issues in the system and tips for troubleshooting Kubernetes.

1. [CSM Background Information](background/README.md)

This chapter provides background information about the NCNs (non-compute nodes) which function as
management nodes for the HPE Cray EX system. This information is not normally needed to install
or upgrade software, but provides background which might be helpful for troubleshooting an installation.

1. [Glossary](glossary.md)

This chapter provides explanations of terms and acronyms used throughout the rest of this documentation.

## Copyright and License

MIT License

(C) Copyright [2020-2022] Hewlett Packard Enterprise Development LP

Permission is hereby granted, free of charge, to any person obtaining a
copy of this software and associated documentation files (the "Software"),
to deal in the Software without restriction, including without limitation
the rights to use, copy, modify, merge, publish, distribute, sublicense,
and/or sell copies of the Software, and to permit persons to whom the
Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included
in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR
OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
OTHER DEALINGS IN THE SOFTWARE.
29 changes: 10 additions & 19 deletions background/index.md → background/README.md
@@ -5,19 +5,18 @@ management nodes for the HPE Cray EX system. This information is not normally ne
software, but provides background which might be helpful for troubleshooting an installation.

### Topics:
-* [Cray Site Init Files](#cray_site_init_files)
-* [Certificate Authority](#certificate_authority)
-* [NCN Images](#ncn_images)
-* [NCN Boot Workflow](#ncn_boot_workflow)
-* [NCN Networking](#ncn_networking)
-* [NCN Mounts and File Systems](#ncn_mounts_and_file_systems)
-* [NCN Packages](#ncn_packages)
-* [NCN Operating System Releases](#ncn_operating_system_releases)
-* [cloud-init Basecamp Configuration](#cloud-init_basecamp_configuration)
+* [Cray Site Init Files](#cray-site-init-files)
+* [Certificate Authority](#certificate-authority)
+* [NCN Images](#ncn-images)
+* [NCN Boot Workflow](#ncn-boot-workflow)
+* [NCN Networking](#ncn-networking)
+* [NCN Mounts and File Systems](#ncn-mounts-and-file-systems)
+* [NCN Packages](#ncn-packages)
+* [NCN Operating System Releases](#ncn-operating-system-releases)
+* [cloud-init Basecamp Configuration](#cloud-init-basecamp-configuration)

## Details

<a name="cray_site_init_files"></a>
### Cray Site Init Files

The Cray Site Init (`csi`) command has several files which describe pre-configuration data needed during
@@ -32,7 +31,6 @@ software, but provides background which might be helpful for troubleshooting an
In addition, after running `csi` with those pre-config files, `csi` creates an output `system_config.yaml`
file which can be passed to `csi` when reinstalling this software release.

<a name="certificate_authority"></a>
### Certificate Authority

While a system is being installed for the first time, a certificate authority (CA) is needed. This can be
@@ -46,7 +44,6 @@ software, but provides background which might be helpful for troubleshooting an
* "Customize Platform Generated CA"
* "Use an External CA"

<a name="ncn_images"></a>
### NCN Images

The management nodes boot from NCN images which are created as layers on top of a common base image.
@@ -56,7 +53,6 @@ software, but provides background which might be helpful for troubleshooting an

See [NCN Images](ncn_images.md)

<a name="ncn_boot_workflow"></a>
### NCN Boot Workflow

The boot workflow for management nodes (NCNs) is different from compute nodes or application nodes.
@@ -73,37 +69,32 @@ software, but provides background which might be helpful for troubleshooting an
* Reverting Changes
* Locating USB Device

<a name="ncn_networking"></a>
### NCN Networking

Non-compute nodes and compute nodes have different network interfaces used for booting.
The NCN network interfaces, device naming, and vendor and bus identification are described in this topic.

* [NCN Networking](ncn_networking.md)

<a name="ncn_mounts_and_file_systems"></a>
### NCN Mounts and File Systems

The management nodes have specific file systems and mounts and use overlayfs.

See [NCN Mounts and File Systems](ncn_mounts_and_file_systems.md)

<a name="ncn_packages"></a>
### NCN Packages

The management nodes boot from images which have many (RPM) packages installed. The packages
installed differ between the Kubernetes master and worker nodes versus the utility storage nodes.

-* [NCN Packages](ncn_packages.md)
+See [csm-rpms](https://github.com/Cray-HPE/csm-rpms) for a list of packages per node type.

<a name="ncn_operating_system_releases"></a>
### NCN Operating System Releases

All management nodes have an operating system based on SLE_HPC (SUSE Linux Enterprise High Performance Computing).

* [NCN Operating System Releases](ncn_operating_system_releases.md)

<a name="cloud-init_basecamp_configuration"></a>
### cloud-init Basecamp Configuration

Metal Basecamp is a cloud-init DataSource available on the LiveCD. Basecamp's configuration file offers many inputs for various cloud-init scripts embedded within the NCN images.
10 changes: 3 additions & 7 deletions background/certificate_authority.md
@@ -6,11 +6,10 @@ installation, there is no supported method to rotate or change the platform CA i

### Topics:
* [Overview](#overview)
-* [Use Default Platform Generated CA](#use_default_platform_generated_ca)
-* [Customize Platform Generated CA](#customize_platform_generated_ca)
-* [Use External CA](#use_external_ca)
+* [Use Default Platform Generated CA](#use-default-platform-generated-ca)
+* [Customize Platform Generated CA](#customize-platform-generated-ca)
+* [Use External CA](#use-external-ca)

<a name="overview"></a>
## Overview

At *install time*, a PKI certificate authority (CA) can either be generated for a system, or a customer can opt to supply their own (intermediate) CA.
@@ -23,7 +22,6 @@ The resulting CA will be used to sign multiple workloads on the platform (Ingres

> Management of Sealed Secrets should ideally take place on a secure workstation.
<a name="use_default_platform_generated_ca"></a>
## Use Default Platform Generated CA

In shasta-cfg, there is a Sealed Secret generator named `platform_ca`. By default, the `customizations.yaml` file will contain a generation template to use this generator, and will create a sealed secret named `generated-platform-ca-1`. The `cray-vault` overrides in `customizations.yaml` contain a) a templated reference to expand the `generated-platform-ca-1` Sealed Secret and b) directives instructing vault to load the CA material on start-up -- ultimately initializing a HashiCorp Vault PKI Engine instance with the material.
@@ -65,14 +63,12 @@ spec:

> The `platform_ca` generator will produce RSA CAs with a 3072-bit modulus, using SHA256 as the base signature algorithm.
<a name="customize_platform_generated_ca"></a>
## Customize Platform Generated CA

The `platform_ca` generator inputs can be customized, if desired. Notably, the `root_days`, `int_days`, `root_cn`, and `int_cn` fields can be modified. While the shasta-cfg documentation on the use of generators supplies additional detail, the `*_days` settings control the validity period and the `*_cn` settings control the common name value for the resulting CA certificates. Ensure the Sealed Secret name reference in `spec.kubernetes.services.cray-vault.sealedSecrets` is updated if you opt to use a different name.

> Outside of a new installation, there is currently no supported method to rotate (change) the platform CA. Please set validity periods accordingly. The ability to rotate CAs is anticipated as part of a future release.
<a name="use_external_ca"></a>
## Use External CA

The `static_platform_ca` generator, part of shasta-cfg, can be used to supply an external CA private key, certificate, and associated upstream CAs that form the trust chain. The generator will attempt to prevent you from supplying a root CA. You must also supply the entire trust chain up to the root CA certificate.
3 changes: 1 addition & 2 deletions background/ncn_bios.md
@@ -6,7 +6,6 @@ This page denotes BIOS settings that are desirable for non-compute nodes.
> **`NOTE`** The table below declares desired settings; unlisted settings should remain at vendor-default. This table may be expanded as new settings are adjusted.

| Common Name | Common Value | Description | Value Rationale | Common Menu Location
| --- | --- | --- | --- | --- |
| Intel® Hyper-Threading (e.g. HT) | `Enabled` | Enables two-threads per physical core. | Leverage the full performance of the CPU, the higher thread-count assists with parallel tasks within the processor(s). | Within the Processor or the PCH Menu.
@@ -15,4 +14,4 @@ This page denotes BIOS settings that are desirable for non-compute nodes.
| PXE Timeout | 5 Seconds (or less, never more) | The time that the PXE ROM will wait for a DHCP handshake to complete before moving on to the next boot device. | If DHCP is working nominally, then the DHCP handshake should not take longer than 5 seconds. This timeout could be increased where networking faults cannot be reconciled, but ideally this should be tuned to 3 or 2 seconds. |
| Continuous Boot | `Disabled` | Whether boot-group (e.g. all network devices, or all disk devices) should continuously retry. This prevents fall-through to the fallback disks. | We want deterministic nodes in Shasta, if the boot fails the first tier we want the node to try the next tier of boot mediums before failing at a shell or menu for intervention. |

-> **`NOTE`** **PCIe** options can be found in [PCIe : Setting Expected Values](../install/switch_pxe_boot_from_onboard_nic_to_pcie.md#setting-expected-values).
+> **`NOTE`** **PCIe** options can be found in [PCIe : Setting Expected Values](../operations/node_management/Switch_PXE_Boot_From_Onboard_NICs_to_PCIe.md#setting-expected-values).