diff --git a/.spelling b/.spelling
index f9d1e69de..e10f6cd0d 100644
--- a/.spelling
+++ b/.spelling
@@ -116,6 +116,7 @@ update-vcs-config
 # HPE publication numbers
 S-8000
 S-8001
+S-8029
 S-8052
 # CVE IDs
 CVE-2021-33503
diff --git a/docs/README.md b/docs/README.md
index df877a00e..56eae1437 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -32,11 +32,6 @@
 - [Authenticate SAT Commands](external_system.md#authenticate-sat-commands)
 - [Generate SAT S3 Credentials](external_system.md#generate-sat-s3-credentials)
 
-## [SAT Dashboards](dashboards/README.md)
-
-- [SAT Kibana Dashboards](dashboards/SAT_Kibana_Dashboards.md)
-- [SAT Grafana Dashboards](dashboards/SAT_Grafana_Dashboards.md)
-
 ## [SAT Usage](usage/README.md)
 
 - [SAT Bootprep](usage/sat_bootprep.md)
diff --git a/docs/about_sat/introduction.md b/docs/about_sat/introduction.md
index 5e73754d0..eb082735b 100644
--- a/docs/about_sat/introduction.md
+++ b/docs/about_sat/introduction.md
@@ -9,23 +9,6 @@ components.
 SAT offers a command line utility which uses subcommands. There are similarities between SAT commands and `xt` commands
 used on the Cray XC platform. For more information on SAT commands, see [SAT Command Overview](#sat-command-overview).
 
-Six Kibana Dashboards are included with SAT. They provide organized output for system health information.
-
-- [AER Kibana Dashboard](../dashboards/SAT_Kibana_Dashboards.md#aer-kibana-dashboard)
-- [ATOM Kibana Dashboard](../dashboards/SAT_Kibana_Dashboards.md#atom-kibana-dashboard)
-- [Heartbeat Kibana Dashboard](../dashboards/SAT_Kibana_Dashboards.md#heartbeat-kibana-dashboard)
-- [Kernel Kibana Dashboard](../dashboards/SAT_Kibana_Dashboards.md#kernel-kibana-dashboard)
-- [MCE Kibana Dashboard](../dashboards/SAT_Kibana_Dashboards.md#mce-kibana-dashboard)
-- [RAS Daemon Kibana Dashboard](../dashboards/SAT_Kibana_Dashboards.md#ras-daemon-kibana-dashboard)
-
-Four Grafana Dashboards are included with SAT. They display messages that are generated by the HSN (High Speed Network) and
-are reported through Redfish.
-
-- [Grafana Fabric Congestion Dashboard](../dashboards/SAT_Grafana_Dashboards.md#grafana-fabric-congestion-dashboard)
-- [Grafana Fabric Errors Dashboard](../dashboards/SAT_Grafana_Dashboards.md#grafana-fabric-errors-dashboard)
-- [Grafana Fabric Port State Dashboard](../dashboards/SAT_Grafana_Dashboards.md#grafana-fabric-port-state-dashboard)
-- [Grafana Fabric RFC3635 Dashboard](../dashboards/SAT_Grafana_Dashboards.md#grafana-fabric-rfc3635-dashboard)
-
 In CSM 1.3 and newer, the `sat` command is automatically available on all the Kubernetes control plane. For more
 information, see [SAT in CSM](sat_in_csm.md). Older versions of CSM do not have the `sat` command automatically
 available, and SAT
diff --git a/docs/dashboards/README.md b/docs/dashboards/README.md
deleted file mode 100644
index 123db754f..000000000
--- a/docs/dashboards/README.md
+++ /dev/null
@@ -1,4 +0,0 @@
-# SAT Dashboards
-
-- [SAT Kibana Dashboards](SAT_Kibana_Dashboards.md)
-- [SAT Grafana Dashboards](SAT_Grafana_Dashboards.md)
diff --git a/docs/dashboards/SAT_Grafana_Dashboards.md b/docs/dashboards/SAT_Grafana_Dashboards.md
deleted file mode 100644
index 53165463d..000000000
--- a/docs/dashboards/SAT_Grafana_Dashboards.md
+++ /dev/null
@@ -1,119 +0,0 @@
-# SAT Grafana Dashboards
-
-The SAT Grafana Dashboards display messages that are generated by the HSN (High Speed Network) and reported through
-Redfish. The messages are displayed based on severity.
-
-Grafana can be accessed via web browser at the following URL:
-`https://sma-grafana.cmn.`.
-
-(`ncn-m001#`) The value of `site-domain` can be obtained as follows:
-
-```bash
-kubectl get secret site-init -n loftsman -o jsonpath='{.data.customizations\.yaml}' | \
- base64 -d | grep "external:"
-```
-
-That command will produce the following output, for example:
-
-```text
- external: EXAMPLE_DOMAIN.com
-```
-
-This would result in the address for Grafana being `https://sma-grafana.cmn.EXAMPLE_DOMAIN.com`.
-
-For more information on accessing the Grafana Dashboards, refer to **Access the Grafana Monitoring UI** in the
-SMA product documentation.
-
-For more information on the interpretation of metrics for the SAT Grafana Dashboards, refer to "Fabric Telemetry
-Kafka Topics" in the SMA product documentation.
-
-## Navigate SAT Grafana Dashboards
-
-There are four Fabric Telemetry dashboards used in SAT that report on the HSN. Two contain chart panels and two display
-telemetry in a tabular format.
-
-|Dashboard Name|Display Type|
-|--------------|------------|
-|Fabric Congestion|Chart Panels|
-|Fabric RFC3635|Chart Panels|
-|Fabric Errors|Tabular Format|
-|Fabric Port State|Tabular Format|
-
-The tabular format presents a single point of telemetry for a given location and metric, either because the telemetry
-is not numerical or that it changes infrequently. The value shown is the most recently reported value for that location
-during the time range selected, if any. The interval setting is not used for tabular dashboards.
-
-## SAT Grafana Interval and Locations Options
-
-Shows the Interval and Locations Options for the available telemetry.
-
-![Grafana Interval and Locations Options](../img/SAT_Grafana_Fabric_Vars.png)
-
-The value of the **Interval** option sets the time resolution of the received telemetry. This works a bit like a
-histogram, with the available telemetry in an interval of time going into a "bucket" and averaging out to a single
-point on the chart or table. The special value *auto* will choose an interval based on the time range selected.
-
-For more information, refer to [Grafana Templates and Variables](https://grafana.com/docs/grafana/latest/reference/templating/#interval-variables).
-
-The **Locations** option allows restriction of the telemetry shown by locations, either individual links or all links
-in a switch. The selection presented updates dynamically according to time range, except for the errors dashboard,
-which always has entries for all links and switches, although the errors shown are restricted to the selected time
-range.
-
-The chart panels for the RFC3635 and Congestion dashboards allow selection of a single location from the chart's legend
-or the trace on the chart.
-
-## Grafana Fabric Congestion Dashboard
-
-![Grafana Fabric Congestion Dashboard](../img/Grafana_Fabric_Congestion.png)
-
-SAT Grafana Dashboards provide system administrators a way to view fabric telemetry data across all Rosetta switches in
-the system and assess the past and present health of the high-speed network. It also allows the ability to drill down
-to view data for specific ports on specific switches.
-
-This dashboard contains the variable, **Port Type** not found in the other dashboards. The possible values are *edge*,
-*local*, and *global* and correspond to the link's relationship to the network topology. The locations presented in the
-panels are restricted to the values (any combination, defaults to "all") selected.
-
-The metric values for links of a given port type are similar in value to each other but very distinct from the values of
-other types. If the values for different port types are all plotted together, the values for links with lower values are
-indistinguishable from zero when plotted.
-
-The port type of a link is reported as a port state "subtype" event when defined at port initialization.
-
-## Grafana Fabric Errors Dashboard
-
-![Grafana HSN Errors Dashboard](../img/Grafana_HSN_Errors.png)
-
-This dashboard reports error counters in a tabular format in three panels.
-
-There is no **Interval** option because this parameter is not used to set a coarseness of the data. Only a single value
-is presented that displays the most recent value in the time range.
-
-Unlike other dashboards, the locations presented are all locations in the system rather than having telemetry within
-the time range selected. However, the values are taken from telemetry within the time range.
-
-## Grafana Fabric Port State Dashboard
-
-![Grafana Fabric Port State Dashboard](../img/Fabric_PortState_Locations_UI.png)
-
-There is no **Interval** option because this parameter is not used to set a coarseness of the data. Only a single value
-is presented that displays the most recent value in the time range.
-
-The Fabric Port State telemetry is distinct because it typically is not numeric. It also updates infrequently, so a
-long time range may be necessary to obtain any values. Port State is refreshed daily, so a time range of 24 hours
-results in all states for all links in the system being shown.
-
-The three columns named, *group*, *switch*, and *port* are not port state events, but extra information included with
-all port state events.
-
-## Grafana Fabric RFC3635 Dashboard
-
-![Grafana Fabric RFC3635 Dashboard](../img/Grafana_rfc3635.png)
-
-For more information on performance counters, refer to
-[Definitions of Managed Objects for the Ethernet-like Interface Types](https://tools.ietf.org/html/rfc3635),
-an Internet standards document.
-
-Because these metrics are counters that only increase over time, the values plotted are the change in the counter's
-value over the interval setting.
diff --git a/docs/dashboards/SAT_Kibana_Dashboards.md b/docs/dashboards/SAT_Kibana_Dashboards.md
deleted file mode 100644
index 1fbb7a871..000000000
--- a/docs/dashboards/SAT_Kibana_Dashboards.md
+++ /dev/null
@@ -1,177 +0,0 @@
-# SAT Kibana Dashboards
-
-Kibana is an open source analytics and visualization platform designed to search, view, and interact with data stored
-in Elasticsearch indices. Kibana runs as a web service and has a browser-based interface. It offers visual output of
-node data in the forms of charts, tables and maps that display real-time Elasticsearch queries. Viewing system data in
-this way breaks down the complexity of large data volumes into easily understood information.
-
-Kibana can be accessed via web browser at the following URL:
-`https://sma-kibana.cmn.`.
-
-(`ncn-m001#`) The value of `site-domain` can be obtained as follows:
-
-```bash
-kubectl get secret site-init -n loftsman -o jsonpath='{.data.customizations\.yaml}' | \
- base64 -d | grep "external:"
-```
-
-That command will produce the following output, for example:
-
-```text
- external: EXAMPLE_DOMAIN.com
-```
-
-This would result in the address for Kibana being `https://sma-kibana.cmn.EXAMPLE_DOMAIN.com`.
-
-For more information on accessing the Kibana Dashboards, refer to **View Logs Via Kibana** in the SMA product
-documentation.
-
-Additional details about the AER, ATOM, Heartbeat, Kernel, MCE, and RAS Daemon Kibana Dashboards are included in this
-table.
-
-|Dashboard|Short Description|Long Description|Kibana Visualization and Search Name|
-|---------|-----------------|----------------|------------------------------------|
-|`sat-aer`|AER corrected|Corrected Advanced Error Reporting messages from PCI Express devices on each node.|Visualization: `aer-corrected` Search: `sat-aer-corrected`|
-|`sat-aer`|AER fatal|Fatal Advanced Error Reporting messages from PCI Express devices on each node.|Visualization: `aer-fatal` Search: `sat-aer-fatal`|
-|`sat-atom`|ATOM failures|Application Task Orchestration and Management tests are run on a node when a job finishes. Test failures are logged.|`sat-atom-failed`|
-|`sat-atom`|ATOM `admindown`|Application Task Orchestration and Management test failures can result in nodes being marked `admindown`. An `admindown` node is not available for job launch.|`sat-atom-admindown`|
-|`sat-heartbeat`|Heartbeat loss events|Heartbeat loss event messages reported by the `hbtd` pods that monitor for heartbeats across nodes in the system.|`sat-heartbeat`|
-|`sat-kernel`|Kernel assertions|The kernel software performs a failed assertion when some condition represents a serious fault. The node goes down.|`sat-kassertions`|
-|`sat-kernel`|Kernel panics|The kernel panics when something is seriously wrong. The node goes down.|`sat-kernel-panic`|
-|`sat-kernel`|Lustre bugs (LBUGs)|The Lustre software in the kernel stack performs a failed assertion when some condition related to file system logic represents a serious fault. The node goes down.|`sat-lbug`|
-|`sat-kernel`|CPU stalls|CPU stalls are serous conditions that can reduce node performance, and sometimes cause a node to go down. Technically these are Read-Copy-Update stalls where software in the kernel stack holds onto memory for too long. Read-Copy-Update is a vital aspect of kernel performance and rather esoteric.|`sat-cpu-stall`|
-|`sat-kernel`|Out of memory|An Out Of Memory (OOM) condition has occurred. The kernel must kill a process to continue. The kernel will select an expendable process when possible. If there is no expendable process the node usually goes down in some manner. Even if there are expendable processes the job is likely to be impacted. OOM conditions are best avoided.|`sat-oom`|
-|`sat-mce`|MCE|Machine Check Exceptions (MCE) are errors detected at the processor level.|`sat-mce`|
-|`sat-rasdaemon`|`rasdaemon` errors|Errors from the `rasdaemon` service on nodes. The `rasdaemon` service is the Reliability, Availability, and Serviceability Daemon, and it is intended to collect all hardware error events reported by the Linux kernel, including PCI and MCE errors. This may include certain HSN errors in the future.|`sat-rasdaemon-error`|
-|`sat-rasdaemon`|`rasdaemon` messages|All messages from the `rasdaemon` service on nodes.|`sat-rasdaemon`|
-
-## Disable Search Highlighting in Kibana Dashboard
-
-By default, search highlighting is enabled. This procedure instructs how to disable search highlighting.
-
-The Kibana Dashboard should be open on the system.
-
-1. Navigate to **Management**
-
-1. Navigate to **Advanced Settings** in the Kibana section, below the Elastic search section
-
-1. Scroll down to the **Discover** section
-
-1. Change **Highlight results** from *on* to *off*
-
-1. Click **Save** to save changes
-
-## AER Kibana Dashboard
-
-The AER Dashboard displays errors that come from the PCI Express Advanced Error Reporting (AER) driver. These errors
-are split up into separate visualizations depending on whether they are fatal or corrected errors.
-
-### View the AER Kibana Dashboard
-
-1. Go to the dashboard section.
-
-1. Select `sat-aer` dashboard.
-
-1. Choose the time range of interest.
-
-1. View the Corrected and Fatal Advanced Error Reporting messages from PCI Express devices on each node. View the
- matching log messages in the panel(s) on the right, and view the counts of each message per NID in the panel(s) on
- the left. If desired, results can be filtered by NID by clicking the icon showing a **+** inside a magnifying glass
- next to each NID.
-
-## ATOM Kibana Dashboard
-
-The ATOM (Application Task Orchestration and Management) Dashboard displays node failures that occur during health
-checks and application test failures. Some test failures are of *possible* interest even though a node is not marked
-`admindown` or otherwise fails. They are of *clear* interest if a node is marked `admindown`, and might provide
-clues if a node otherwise fails. They might also show application problems.
-
-### View the ATOM Kibana Dashboard
-
-HPE Cray EX is installed on the system along with the System Admin Toolkit, which contains the ATOM Kibana Dashboard.
-
-1. Go to the dashboard section.
-
-1. Select `sat-atom` dashboard.
-
-1. Choose the time range of interest.
-
-1. View any nodes marked `admindown` and any ATOM test failures. These failures occur during health checks and
- application test failures. Test failures marked `admindown` are important to note. View the matching log messages
- in the panel(s) on the right, and view the counts of each message per NID in the panel(s) on the left. If desired,
- results can be filtered by NID by clicking the icon showing a **+** inside a magnifying glass next to each NID.
-
-## Heartbeat Kibana Dashboard
-
-The Heartbeat Dashboard displays heartbeat loss messages that are logged by the `hbtd` pods in the system. The `hbtd`
-pods are responsible for monitoring nodes in the system for heartbeat loss.
-
-### View the Heartbeat Kibana Dashboard
-
-1. Go to the dashboard section.
-
-1. Select `sat-heartbeat` dashboard.
-
-1. Choose the time range of interest.
-
-1. View the heartbeat loss messages that are logged by the `hbtd` pods in the system. The `hbtd` pods are responsible
- for monitoring nodes in the system for heartbeat loss. View the matching log messages in the panel.
-
-## Kernel Kibana Dashboard
-
-The Kernel Dashboard displays compute node failures such as kernel assertions, kernel panics, and Lustre LBUG messages.
-The messages reveal if Lustre has experienced a fatal error on any compute nodes in the system. A CPU stall is a serious
-problem that might result in a node failure. Out-of-memory conditions can be due to applications or system problems and
-may require expert analysis. They provide useful clues for some node failures and may reveal if an application is using
-too much memory.
-
-### View the Kernel Kibana Dashboard
-
-1. Go to the dashboard section.
-
-1. Select `sat-kernel` dashboard.
-
-1. Choose the time range of interest.
-
-1. View the compute node failures such as kernel assertions, kernel panics, and Lustre LBUG messages. View the matching
- log messages in the panel(s) on the right, and view the counts of each message per NID in the panel(s) on the left.
- If desired, results can be filtered by NID by clicking the icon showing a **+** inside a magnifying glass next to
- each NID.
-
-## MCE Kibana Dashboard
-
-The MCE Dashboard displays CPU detected processor-level hardware errors.
-
-### View the MCE Kibana Dashboard
-
-1. Go to the dashboard section.
-
-1. Select `sat-mce` dashboard.
-
-1. Choose the time range of interest.
-
-1. View the Machine Check Exceptions (MCEs) listed including the counts per NID (node). For an MCE, the CPU number and
- DIMM number can be found in the message, if applicable. View the matching log messages in the panel(s) on the right,
- and view the counts of each message per NID in the panel(s) on the left. If desired, results can be filtered by NID
- by clicking the icon showing a **+** inside a magnifying glass next to each NID.
-
-## RAS Daemon Kibana Dashboard
-
-The RAS Daemon Dashboard displays errors that come from the Reliability, Availability, and Serviceability (RAS) daemon
-service on nodes in the system. This service collects all hardware error events reported by the Linux kernel, including
-PCI and MCE errors. As a result there may be some duplication between the messages presented here and the messages
-presented in the MCE and AER dashboards. This dashboard splits up the messages into two separate visualizations, one
-for only messages of severity `emerg` or `err` and another for all messages from `rasdaemon`.
-
-### View the RAS Daemon Kibana Dashboard
-
-1. Go to the dashboard section.
-
-1. Select `sat-rasdaemon` dashboard.
-
-1. Choose the time range of interest.
-
-1. View the errors that come from the Reliability, Availability, and Serviceability (RAS) daemon service on nodes in
- the system. View the matching log messages in the panel(s) on the right, and view the counts of each message per NID
- in the panel(s) on the left. If desired, results can be filtered by NID by clicking the icon showing a **+** inside
- a magnifying glass next to each NID.
diff --git a/docs/img/Fabric_PortState_Locations_UI.png b/docs/img/Fabric_PortState_Locations_UI.png
deleted file mode 100644
index 704511ebc..000000000
Binary files a/docs/img/Fabric_PortState_Locations_UI.png and /dev/null differ
diff --git a/docs/img/Grafana_Fabric_Congestion.png b/docs/img/Grafana_Fabric_Congestion.png
deleted file mode 100644
index dbf481d94..000000000
Binary files a/docs/img/Grafana_Fabric_Congestion.png and /dev/null differ
diff --git a/docs/img/Grafana_HSN_Errors.png b/docs/img/Grafana_HSN_Errors.png
deleted file mode 100644
index f43b7d02a..000000000
Binary files a/docs/img/Grafana_HSN_Errors.png and /dev/null differ
diff --git a/docs/img/Grafana_rfc3635.png b/docs/img/Grafana_rfc3635.png
deleted file mode 100644
index dff176c82..000000000
Binary files a/docs/img/Grafana_rfc3635.png and /dev/null differ
diff --git a/docs/img/SAT_Grafana_Fabric_Vars.png b/docs/img/SAT_Grafana_Fabric_Vars.png
deleted file mode 100644
index 194d75b12..000000000
Binary files a/docs/img/SAT_Grafana_Fabric_Vars.png and /dev/null differ
diff --git a/docs/release_notes/sat_2.6_release_notes.md b/docs/release_notes/sat_2.6_release_notes.md
index 09784d9bb..535b30c4c 100644
--- a/docs/release_notes/sat_2.6_release_notes.md
+++ b/docs/release_notes/sat_2.6_release_notes.md
@@ -62,6 +62,11 @@ No new `sat` commands were added in SAT 2.6.
 
 ## Other SAT Changes
 
+- The SAT Kibana and Grafana dashboards were moved to the System Monitoring
+ Application (SMA) beside other dashboards. For more information on how to
+ view these dashboards going forward, see the *HPE Cray EX System Monitoring
+ Application Administration Guide (S-8029)*.
+
 - Add the new `s3.cert_verify` option to the SAT configuration file to control
  whether certificate verification is performed when accessing S3.
 