Document changes to Monitor dashboards for 3.16 (#53126)
Co-authored-by: Isaac Brown <101839405+isaacmbrown@users.noreply.github.com>
manue1 and isaacmbrown authored Nov 22, 2024
1 parent f748cba commit 1e24ec2
Showing 14 changed files with 97 additions and 58 deletions.
@@ -50,7 +50,7 @@ During the migration, the CPU and memory usage for your instance will increase.

After the migration, storage pressure on your instance will increase due to the duplication of image files in the Docker registry and the {% data variables.product.prodname_container_registry %}. A future release of {% data variables.product.product_name %} will remove the duplicated files when all migrations are complete.

For more information about monitoring the performance and storage of {% data variables.location.product_location %}, see "[AUTOTITLE](/admin/enterprise-management/monitoring-your-appliance/accessing-the-monitor-dashboard)."
For more information about monitoring the performance and storage of {% data variables.location.product_location %}, see "[AUTOTITLE](/admin/monitoring-and-managing-your-instance/monitoring-your-instance/about-the-monitor-dashboards)."

### Starting a migration

2 changes: 1 addition & 1 deletion content/admin/guides.md
@@ -64,7 +64,7 @@ includeGuides:
- /admin/configuring-settings/hardening-security-for-your-enterprise/troubleshooting-tls-errors
- /admin/configuring-settings/configuring-network-settings/using-github-enterprise-server-with-a-load-balancer
- /admin/monitoring-and-managing-your-instance/configuring-high-availability/about-high-availability-configuration
- /admin/monitoring-and-managing-your-instance/monitoring-your-instance/accessing-the-monitor-dashboard
- /admin/monitoring-and-managing-your-instance/monitoring-your-instance/about-the-monitor-dashboards
- /admin/monitoring-and-managing-your-instance/configuring-high-availability/creating-a-high-availability-replica
- /admin/monitoring-and-managing-your-instance/configuring-clustering/differences-between-clustering-and-high-availability-ha
- /admin/upgrading-your-instance/preparing-to-upgrade/enabling-automatic-update-checks
@@ -72,7 +72,7 @@ You may be hitting the CPU or memory limits if you notice that jobs are not star

### 1. Check the overall CPU and memory usage in the management console

Access the management console and use the monitor dashboard to inspect the overall CPU and memory graphs under "System Health". For more information, see "[AUTOTITLE](/admin/enterprise-management/monitoring-your-appliance/accessing-the-monitor-dashboard)."
Access the management console and use the monitor dashboard to inspect the overall CPU and memory graphs under "System Health". For more information, see "[AUTOTITLE](/admin/monitoring-and-managing-your-instance/monitoring-your-instance/about-the-monitor-dashboards)."

If the overall "System Health" CPU usage is close to 100%, or there is no free memory left, then {% data variables.location.product_location %} is running at capacity and needs to be scaled up. For more information, see "[AUTOTITLE](/admin/enterprise-management/updating-the-virtual-machine-and-physical-resources/increasing-cpu-or-memory-resources)."

@@ -1,13 +1,15 @@
---
title: Accessing the monitor dashboard
intro: '{% data variables.product.prodname_ghe_server %} includes a web-based monitoring dashboard that displays historical data about your {% data variables.product.prodname_ghe_server %} appliance, such as CPU and storage usage, application and authentication response times, and general system health.'
title: 'About the monitor {% ifversion ghes > 3.15 %}dashboards{% else %}dashboard{% endif %}'
allowTitleToDifferFromFilename: true
intro: 'View historical data for details like CPU and storage usage, application and authentication response times, and general system health.'
redirect_from:
- /enterprise/admin/installation/accessing-the-monitor-dashboard
- /enterprise/admin/enterprise-management/accessing-the-monitor-dashboard
- /admin/enterprise-management/accessing-the-monitor-dashboard
- /admin/enterprise-management/monitoring-your-appliance/accessing-the-monitor-dashboard
- /admin/monitoring-managing-and-updating-your-instance/monitoring-your-appliance/accessing-the-monitor-dashboard
- /admin/monitoring-managing-and-updating-your-instance/monitoring-your-instance/accessing-the-monitor-dashboard
- /admin/monitoring-and-managing-your-instance/monitoring-your-instance/accessing-the-monitor-dashboard
versions:
ghes: '*'
type: how_to
@@ -17,17 +19,40 @@ topics:
- Infrastructure
- Monitoring
- Performance
shortTitle: Access the monitor dashboard
shortTitle: About the monitor {% ifversion ghes > 3.15 %}dashboards{% else %}dashboard{% endif %}
---
## Accessing the monitor dashboard
## Accessing the monitor {% ifversion ghes > 3.15 %}dashboards{% else %}dashboard{% endif %}

{% data reusables.enterprise_site_admin_settings.access-settings %}
{% data reusables.enterprise_site_admin_settings.management-console %}
1. In the top navigation bar, click **Monitor**.

![Screenshot of the header of the {% data variables.enterprise.management_console %}. A tab, labeled "Monitor", is highlighted with an orange outline.](/assets/images/enterprise/management-console/monitor-dash-link.png)
![Screenshot of the header of the {% data variables.enterprise.management_console %}. A tab, labeled "Monitor", is highlighted with an orange outline.](/assets/images/enterprise/management-console/{% ifversion ghes > 3.15 %}monitor-dash-link.png{% else %}monitor-dash-link-old.png{% endif %})

1. In HA and cluster environments you can switch between nodes using the dropdown and clicking on a different hostname.
{% ifversion ghes > 3.15 %}

## Using the monitor dashboards

The dashboards visualize metrics which can be useful for troubleshooting performance issues and better understanding how your {% data variables.product.prodname_ghe_server %} appliance is being used. The data behind the graphs is gathered by the `collectd` service and sampled every 10 seconds.

Within the pre-built dashboards you can find various sections grouping graphs of different types of system resources. Use the links on the page to navigate between the dashboards.

![Screenshot of the {% data variables.enterprise.management_console %} header. The dashboard navigation links provided at the top right are highlighted in orange.](/assets/images/enterprise/management-console/monitor-dash-navigation.png)

### "Operational Health" dashboard

This is the default dashboard displayed on the "Monitor" page. It visualizes key metrics that help you to get a quick overview of the health of your {% data variables.product.prodname_ghe_server %} appliance.

### "System & Application Insights" dashboard

On this more detailed dashboard you can get further insights into all aspects of the services that are running on your appliance.

## Creating new dashboards

Building your own dashboard and alerts requires the data to be forwarded to an external instance, by enabling `collectd` forwarding. For more information, see "[AUTOTITLE](/admin/monitoring-and-managing-your-instance/monitoring-your-instance/configuring-collectd-for-your-instance)."

{% else %}

## Using the monitor dashboard

@@ -36,12 +61,26 @@ The page visualizes metrics which can be useful for troubleshooting performance
Within the pre-built dashboard you can find various sections grouping graphs of different types of system resources.

Building your own dashboard and alerts requires the data to be forwarded to an external instance, by enabling `collectd` forwarding. For more information, see "[AUTOTITLE](/admin/monitoring-and-managing-your-instance/monitoring-your-instance/configuring-collectd-for-your-instance)."
{% endif %}

## About the metrics on the monitor dashboard
## About the metrics on the monitor dashboards

### System health
### System Health

The system health graphs provide a general overview of services and system resource utilization. The CPU, memory, and load average graphs are useful for identifying trends or times where provisioned resource saturation has occurred. For more information, see "[AUTOTITLE](/admin/monitoring-and-managing-your-instance/monitoring-your-instance/recommended-alert-thresholds)."
{% ifversion ghes > 3.15 %}

### Application Health

These graphs include key metrics for the resource utilization of services that power {% data variables.product.prodname_ghe_server %}. They help visualize ongoing issues while processing requests.

* **Nomad jobs**: The CPU and memory usage of individual services. {% data variables.product.prodname_ghe_server %} utilizes Nomad internally as the workload orchestrator.
* **Response code**: The number of responses by status code returned across {% data variables.product.prodname_ghe_server %} services.
* **Response time**: The speed of web requests at the 90th percentile in milliseconds.
* **Active workers**: The number of web workers busy per {% data variables.product.prodname_ghe_server %} application.
* **Queued requests**: The number of web requests queued per {% data variables.product.prodname_ghe_server %} application. It is expected for this panel to display 'No data' when no requests are queued up.
* **ElasticSearch Cluster Health**: The health status of the ElasticSearch cluster, based on the state of its primary and replica shards. This cluster powers {% data variables.product.prodname_ghe_server %} search.
{% endif %}
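
To make the "Response time" panel concrete, here is a minimal Python sketch of how a 90th-percentile value can be derived from a set of request durations. It assumes the nearest-rank definition of a percentile. The source does not say which method the dashboard uses, and the function name `percentile_nearest_rank` is hypothetical.

```python
import math


def percentile_nearest_rank(samples_ms, pct=90.0):
    """Return the nearest-rank percentile of request durations in milliseconds.

    Assumption: nearest-rank method; the dashboard may aggregate differently.
    """
    if not samples_ms:
        raise ValueError("at least one sample is required")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(samples_ms)
    # Nearest rank: the smallest value with at least pct% of samples at or below it.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]
```

With ten samples of 10, 20, ..., 100 ms, the 90th percentile under this definition is the ninth-smallest value, which means 90% of requests completed at or below that duration.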

### Processes

@@ -65,7 +104,7 @@ The **App request/response** section looks at the rate of requests, how quickly

### Actions

The graphs break down different metrics about {% data variables.product.prodname_actions %} on {% data variables.location.product_location %} including an overview of {% data variables.product.prodname_actions %} services web requests.
The graphs break down different metrics about {% data variables.product.prodname_actions %} on {% data variables.location.product_location %} including an overview of {% data variables.product.prodname_actions %} services web requests {% ifversion ghes > 3.15 %} and MSSQL database transaction log size{% endif %}.

### Background jobs

@@ -26,7 +26,7 @@ topics:

`collectd` is a service that runs on {% data variables.location.product_location %} to gather and provide metrics about the system's performance. Common metrics that `collectd` gathers include CPU utilization, memory and disk consumption, network interface traffic and errors, and a system's overall load. You can also forward the data to another `collectd` server. For more information, see the [collectd wiki](https://github.com/collectd/collectd/wiki).

Your instance uses metrics from `collectd` to display graphs in the {% data variables.enterprise.management_console %}'s monitor dashboard. For more information, see "[AUTOTITLE](/admin/monitoring-and-managing-your-instance/monitoring-your-instance/accessing-the-monitor-dashboard)."
Your instance uses metrics from `collectd` to display graphs in the {% data variables.enterprise.management_console %}'s monitor dashboard. For more information, see "[AUTOTITLE](/admin/monitoring-and-managing-your-instance/monitoring-your-instance/about-the-monitor-dashboards)."

You can review a list of the metrics that `collectd` gathers on {% data variables.location.product_location %}. For more information, see "[AUTOTITLE](/admin/monitoring-and-managing-your-instance/monitoring-your-instance/collectd-metrics-for-github-enterprise-server)."

@@ -14,7 +14,7 @@ versions:
topics:
- Enterprise
children:
- /accessing-the-monitor-dashboard
- /about-the-monitor-dashboards
- /recommended-alert-thresholds
- /setting-up-external-monitoring
- /configuring-collectd-for-your-instance
@@ -24,7 +24,7 @@ shortTitle: Recommended alert thresholds

## About recommended alert thresholds

You can configure external monitoring systems to alert you to storage, CPU, and memory usage that may cause problems with {% data variables.location.product_location %}. For more information, see "[AUTOTITLE](/admin/enterprise-management/monitoring-your-appliance/setting-up-external-monitoring)" and "[AUTOTITLE](/admin/monitoring-and-managing-your-instance/monitoring-your-instance/accessing-the-monitor-dashboard)."
You can configure external monitoring systems to alert you to storage, CPU, and memory usage that may cause problems with {% data variables.location.product_location %}. For more information, see "[AUTOTITLE](/admin/enterprise-management/monitoring-your-appliance/setting-up-external-monitoring)" and "[AUTOTITLE](/admin/monitoring-and-managing-your-instance/monitoring-your-instance/about-the-monitor-dashboards)."

## Monitoring storage

@@ -36,7 +36,7 @@ For system-critical issues, and prior to making modifications to your appliance,
* CPU of your instance is under-provisioned for your workload.
* Upgrading to a new {% data variables.product.prodname_ghe_server %} release often increases CPU and memory usage due to new features. Additionally, post-upgrade migration or reconciliation background jobs can temporarily degrade performance until they complete.
* Elevated requests against Git or API. Increased requests to Git or API can occur due to various factors, such as excessive repository cloning, CI/CD processes, or unintentional usage by API scripts or new workloads.
* Increased number of [GitHub Actions jobs](/admin/monitoring-and-managing-your-instance/monitoring-your-instance/accessing-the-monitor-dashboard#actions).
* Increased number of [GitHub Actions jobs](/admin/monitoring-and-managing-your-instance/monitoring-your-instance/about-the-monitor-dashboards#actions).
* Elevated number of Git commands executed against a large repository.

### Recommendations
@@ -60,10 +60,10 @@ For system-critical issues, and prior to making modifications to your appliance,
### Recommendations

* Memory of your instance may be under-provisioned for your workload and data volume. Usage over time may exceed the [minimum recommended requirements](/admin/installing-your-enterprise-server/setting-up-a-github-enterprise-server-instance/installing-github-enterprise-server-on-aws#minimum-recommended-requirements).
* Within the Nomad graphs, identify services with out of memory trends which are often followed by free memory trends after they get restarted. For more information, see "[AUTOTITLE](/enterprise-server@3.14/admin/monitoring-and-managing-your-instance/monitoring-your-instance/accessing-the-monitor-dashboard#appliance-specific-system-services)."
* Within the Nomad graphs, identify services with out of memory trends which are often followed by free memory trends after they get restarted. For more information, see "[AUTOTITLE](/enterprise-server@3.14/admin/monitoring-and-managing-your-instance/monitoring-your-instance/about-the-monitor-dashboards#appliance-specific-system-services)."
* Check logs for processes going out of memory by running `rg -z 'kernel: Out of memory: Killed process' /var/log/syslog*` (for this, first log in to the administrative shell using SSH - see "[AUTOTITLE](/enterprise-server@3.14/admin/administering-your-instance/administering-your-instance-from-the-command-line/accessing-the-administrative-shell-ssh).")
* Ensure the correct ratio of memory to CPU services is met (at least `6.5:1`).
* Check the amount of tasks queued for background processing - see "[AUTOTITLE](/enterprise-server@3.14/admin/monitoring-and-managing-your-instance/monitoring-your-instance/accessing-the-monitor-dashboard#background-jobs)."
* Check the amount of tasks queued for background processing - see "[AUTOTITLE](/enterprise-server@3.14/admin/monitoring-and-managing-your-instance/monitoring-your-instance/about-the-monitor-dashboards#background-jobs)."
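
The out-of-memory check in the list above can also be scripted. The following Python sketch is a rough equivalent of the `rg -z` command: it scans files matching `/var/log/syslog*` for the same kernel message and reads `.gz` rotations transparently. The helper names (`open_text`, `oom_lines`, `scan_syslogs`) are hypothetical, and unlike `rg -z` the sketch only handles gzip compression.

```python
import glob
import gzip

# The exact kernel message searched for by the rg command in the list above.
OOM_MARKER = "kernel: Out of memory: Killed process"


def open_text(path):
    """Open a plain or gzip-compressed log file in text mode."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt", errors="replace")
    return open(path, "rt", errors="replace")


def oom_lines(lines):
    """Return the lines that report a process killed by the OOM killer."""
    return [line.rstrip("\n") for line in lines if OOM_MARKER in line]


def scan_syslogs(pattern="/var/log/syslog*"):
    """Collect OOM-kill lines from every file matching the glob pattern."""
    hits = []
    for path in sorted(glob.glob(pattern)):
        with open_text(path) as handle:
            hits.extend(oom_lines(handle))
    return hits


if __name__ == "__main__":
    for hit in scan_syslogs():
        print(hit)
```

As with the original command, this would need to run from the administrative shell, and it assumes a Python 3 interpreter is available there.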

## Low disk space availability

@@ -101,7 +101,7 @@ Keep in mind that the root storage volume is split into two equally-sized partit
* Check the database logs for slow queries in `/var/log/github/exceptions.log` (for this, first log in to the administrative shell using SSH - see "[AUTOTITLE](/enterprise-server@3.14/admin/administering-your-instance/administering-your-instance-from-the-command-line/accessing-the-administrative-shell-ssh)"), for example by checking for Top 10 slow requests by URL: `grep SlowRequest github-logs/exceptions.log | jq '.url' | sort | uniq -c | sort -rn | head`.
* Check the **Queued requests** graph for certain workers and consider adjusting their active worker count.
* Increase the storage disks to ones with higher IOPS/throughput.
* Check the amount of tasks queued for background processing - see "[AUTOTITLE](/enterprise-server@3.14/admin/monitoring-and-managing-your-instance/monitoring-your-instance/accessing-the-monitor-dashboard#background-jobs)."
* Check the amount of tasks queued for background processing - see "[AUTOTITLE](/enterprise-server@3.14/admin/monitoring-and-managing-your-instance/monitoring-your-instance/about-the-monitor-dashboards#background-jobs)."
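
The "Top 10 slow requests by URL" pipeline in the list above can be approximated in Python when `jq` is not at hand. This sketch assumes, as the `jq '.url'` step implies, that each matching line of `exceptions.log` is a JSON object with a `url` field. The function name `top_slow_request_urls` is hypothetical, and unlike the shell pipeline it skips lines that are not valid JSON or have no `url`.

```python
import json
from collections import Counter


def top_slow_request_urls(lines, limit=10):
    """Count SlowRequest log entries per URL and return the most frequent ones."""
    counts = Counter()
    for line in lines:
        # Mirrors `grep SlowRequest`: a plain substring match on the raw line.
        if "SlowRequest" not in line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # Skip lines that are not valid JSON.
        url = record.get("url") if isinstance(record, dict) else None
        if url is not None:
            counts[url] += 1
    # Mirrors `sort | uniq -c | sort -rn | head`.
    return counts.most_common(limit)
```

For example, passing an open file handle for the log, as in `top_slow_request_urls(open("exceptions.log"))`, returns a list of `(url, count)` pairs ordered from most to least frequent.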

## Elevated error rates

@@ -42,7 +42,7 @@ Collect the baseline data before upgrading to {% data variables.product.prodname

You may not be able to simulate the load that your instance experiences in a production environment. However, it's useful if you can collect baseline data while simulating patterns of usage from your production environment on the staging instance.

1. Browse to your instance's monitor dashboard. For more information, see "[AUTOTITLE](/admin/enterprise-management/monitoring-your-appliance/accessing-the-monitor-dashboard)."
1. Browse to your instance's monitor dashboard. For more information, see "[AUTOTITLE](/admin/monitoring-and-managing-your-instance/monitoring-your-instance/about-the-monitor-dashboards)."
1. From the monitor dashboard, monitor relevant graphs.

* Under "Processes", monitor the graphs for "I/O operations (Read IOPS)" and "I/O operations (Write IOPS)", filtering for `mysqld`. These graphs display I/O operations for all of the node's services.
@@ -52,7 +52,7 @@ You may not be able to simulate the load that your instance experiences in a pro

After the upgrade to {% data variables.product.prodname_ghe_server %} 3.9, review the instance's I/O utilization. {% data variables.product.company_short %} recommends that you upgrade a staging instance of {% data variables.product.prodname_ghe_server %} running 3.7 or 3.8 that includes restored data from your production instance, or that you restore data from your production instance to a new staging instance running 3.9. For more information, see "[AUTOTITLE](/admin/installation/setting-up-a-github-enterprise-server-instance/setting-up-a-staging-instance)" and "[AUTOTITLE](/admin/configuration/configuring-your-enterprise/configuring-backups-on-your-appliance)."

1. Browse to your instance's monitor dashboard. For more information, see "[AUTOTITLE](/admin/enterprise-management/monitoring-your-appliance/accessing-the-monitor-dashboard)."
1. Browse to your instance's monitor dashboard. For more information, see "[AUTOTITLE](/admin/monitoring-and-managing-your-instance/monitoring-your-instance/about-the-monitor-dashboards)."
1. From the monitor dashboard, monitor relevant graphs.

* Under "Processes", monitor the graphs for "I/O operations (Read IOPS)" and "I/O operations (Write IOPS)", filtering for `mysqld`. These graphs display I/O operations for all of the node's services.