Commit

Add Uptime monitoring content (#162) (#173)
* Add uptime content

* Add shared set up cloud content

* Add anomaly alert

* Minor edits

* Edits following review

* Edits following review
EamonnTP authored Oct 13, 2020
1 parent bd5f0b5 commit 35443cd
Showing 44 changed files with 545 additions and 20 deletions.
2 changes: 1 addition & 1 deletion docs/en/observability/analyze-metrics.asciidoc
@@ -16,7 +16,7 @@ Using {metricbeat} modules, you can ingest and analyze
metrics from servers, Docker containers, and Kubernetes orchestrations, explore and
analyze Prometheus-style metrics or application telemetries, and more.

To view the {metrics-app}, in the side navigation, expand *Observability*, and then click *Metrics*.
To view the {metrics-app}, go to *Observability > Metrics*.

[role="screenshot"]
image::images/metrics-app.png[Metrics app in Kibana]
60 changes: 60 additions & 0 deletions docs/en/observability/analyze-monitors.asciidoc
@@ -0,0 +1,60 @@
[[analyze-monitors]]
= Analyze monitors

To access this page, go to *Observability > Uptime*. From the *Overview* page,
click a listed monitor to view more details and analyze the data further.

The monitor detail screen displays several panels of information.

[[uptime-status-panel]]
== Status panel

The *Status* panel displays a summary of the latest information regarding your monitor.
You can view its availability, click a link to visit the targeted URL, view when the
TLS certificate expires, and determine the amount of time that has elapsed since the last check.

[role="screenshot"]
image::images/uptime-status-panel.png[Uptime status panel]

The *Monitoring from* list displays service availability per monitoring location,
along with the amount of time elapsed since data was received from that location.
The availability percentage is the percentage of successful checks made during
the selected time period.
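For example, if a location ran 200 checks during the selected time period and 196 of them
succeeded, that location's availability displays as 98%.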

To display a map with each location as a pinpoint, you can toggle the availability view from list
view to map view.

[[uptime-monitor-duration]]
== Monitor duration

The *Monitor duration* chart displays the timing for each check that was performed. The visualization
helps you gain insight into how quickly requests are resolved by the targeted endpoint and gives you a
sense of how frequently a host or endpoint was down in your selected timespan.

Included on this chart is the anomaly detection ({ml}) integration. For more information, see
<<inspect-uptime-duration-anomalies,Inspect Uptime duration anomalies>>.

[role="screenshot"]
image::images/monitor-duration-chart.png[Monitor duration chart]

[[uptime-pings-chart]]
== Pings over time

The *Pings over time* chart is a graphical representation of the check statuses over time.
Hover over the charts to display crosshairs with specific numeric data.

[role="screenshot"]
image::images/pings-over-time.png[Pings over time chart]

[[uptime-history-panel]]
== Check history

The *History* table lists the total count of this monitor’s checks for the selected date range.
To help find recent problems on a per-check basis, you can filter by `status`
and `location`.
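For example, filtering on a `status` of `down` narrows the table to failed checks only,
which makes it easy to locate the first failure in a sequence of checks.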

This table can help you gain insights into more granular details
about recent individual data points that {heartbeat} is logging about your host or endpoint.

[role="screenshot"]
image::images/uptime-history.png[Monitor history list]
4 changes: 2 additions & 2 deletions docs/en/observability/configure-logs-sources.asciidoc
@@ -15,7 +15,7 @@ default configuration settings.
[[edit-config-settings]]
== Edit configuration settings

. In the side navigation, expand *Observability*, and then click *Logs*.
. To access this page, go to *Observability > Logs*.
+
. Click *Settings*.
+
@@ -58,7 +58,7 @@ base field, `message`, is used.

|===

1. To add a new column to the logs stream, in the *Settings* tab, click *Add column*.
1. To add a new column to the logs stream, select *Settings > Add column*.
2. In the list of available fields, select the field you want to add.
To filter the field list by that name, you can start typing a field name in the search box.
3. To remove an existing column, click the *Remove this column* icon.
2 changes: 1 addition & 1 deletion docs/en/observability/configure-metrics-sources.asciidoc
@@ -9,7 +9,7 @@ and container names.
[[metrics-config-settings]]
== Override configuration settings

. In the side navigation, expand *Observability*, and then click *Metrics*.
. To access this page, go to *Observability > Metrics*.
+
. Click *Settings*.
+
75 changes: 75 additions & 0 deletions docs/en/observability/configure-uptime-settings.asciidoc
@@ -0,0 +1,75 @@
[[configure-uptime-settings]]
= Configure settings

The *Settings* page enables you to change which {heartbeat} indices are displayed
by the {uptime-app}, configure alert connectors, and set expiration/age thresholds
for TLS certificates.

Uptime settings apply to the current space only. To segment
different uptime use cases and domains, use different settings in other spaces.

. To access this page, go to *Observability > Uptime*.
. Click *Settings*.
+
[IMPORTANT]
=====
To modify items on this page, you must have the {kibana-ref}/space-rbac-tutorial.html[`all`]
privilege granted to your role. The `all` privilege grants the ability to perform cluster administration
operations, such as snapshotting, node shutdown/restart, settings updates, rerouting, and managing users and roles.
=====

[[configure-uptime-indices]]
== Configure indices

Specify a comma-separated list of index patterns to match indices in {es} that contain {heartbeat} data.

[NOTE]
=====
The pattern set here only restricts what the {uptime-app} displays. You can still query {es} for
data outside of this pattern.
=====
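
For example, `heartbeat-*` matches the default {heartbeat} indices, while a list such as
`heartbeat-*,custom-uptime-*` (the second pattern is a hypothetical custom index) also picks up
data that you route to indices of your own.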

[role="screenshot"]
image::images/heartbeat-indices.png[Heartbeat indices]

[[configure-uptime-alert-connectors]]
== Configure alert connectors

*Alerts* work by running checks on a schedule to detect conditions. When a condition is met, the alert tracks
it as an *alert instance* and responds by triggering one or more *actions*.
Actions typically involve interaction with {kib} services or third-party integrations. *Connectors* allow actions
to talk to these services and integrations.

Click *Create connector* and follow the prompts to select a connector type and configure its properties.
After you create a connector, it's available to you anytime you set up an alert action in the current space.

For more information about each connector, see {kibana-ref}/action-types.html[action types and connectors].

[role="screenshot"]
image::images/alert-connector.png[Alert connector]

[[configure-cert-thresholds]]
== Configure certificate thresholds

You can modify certificate thresholds to control how Uptime displays your TLS values on
the <<view-certificate-status,Certificates>> page. These settings also determine which certificates are
selected by any TLS alert you create.

|===

| *Expiration threshold* | The `expiration` threshold specifies when you are notified
about certificates that are approaching their expiration dates. When the number of valid days remaining for a certificate falls
below the `Expiration threshold`, the certificate is considered to be in a warning state. When you define a
<<tls-certificate-alert,TLS alert>>, you receive a notification about the certificate.

| *Age limit* | The `age` threshold specifies when you are notified about certificates
that have been valid for too long.

|===

A standard security requirement is to make sure that your TLS certificates have not been
valid for longer than a year. To help you keep track of which certificates you may want to refresh,
set the *Age limit* value to `365` days.

[role="screenshot"]
image::images/cert-expiry-settings.png[Certificate expiry settings]
6 changes: 6 additions & 0 deletions docs/en/observability/create-alerts.asciidoc
@@ -15,3 +15,9 @@ include::logs-threshold-alert.asciidoc[leveloffset=+1]
include::infrastructure-threshold-alert.asciidoc[leveloffset=+1]

include::metrics-threshold-alert.asciidoc[leveloffset=+1]

include::monitor-status-alert.asciidoc[leveloffset=+1]

include::uptime-tls-alert.asciidoc[leveloffset=+1]

include::uptime-duration-anomaly-alert.asciidoc[leveloffset=+1]
2 changes: 1 addition & 1 deletion docs/en/observability/explore-metrics.asciidoc
@@ -9,7 +9,7 @@ for one or more resources that you are monitoring.
Additionally, for detailed analyses of your metrics, you can annotate and save visualizations for
your custom dashboards by using the {kibana-ref}/dashboard.html#tsvb[Time Series Visual Builder (TSVB)] within {kib}.

In the side navigation, expand *Observability*, click *Metrics*, and then click *Metrics Explorer*.
To access this page, go to *Observability > Metrics*, and then click *Metrics Explorer*.

By default, the Metrics Explorer page displays the CPU usage for hosts, Kubernetes pods, and Docker containers.
The initial configuration has the *Average* aggregation selected, the *of* field is populated with the default metrics,
Binary file added docs/en/observability/images/alert-connector.png
Binary file added docs/en/observability/images/monitors-chart.png
Binary file added docs/en/observability/images/monitors-list.png
Binary file added docs/en/observability/images/pings-over-time.png
Binary file added docs/en/observability/images/tls-alert.png
Binary file added docs/en/observability/images/uptime-app.png
Binary file added docs/en/observability/images/uptime-history.png
10 changes: 10 additions & 0 deletions docs/en/observability/index.asciidoc
@@ -60,6 +60,16 @@ include::explore-metrics.asciidoc[leveloffset=+2]

include::configure-metrics-sources.asciidoc[leveloffset=+2]

include::monitor-uptime.asciidoc[leveloffset=+1]

include::view-monitor-status.asciidoc[leveloffset=+2]

include::analyze-monitors.asciidoc[leveloffset=+2]

include::inspect-uptime-duration-anomalies.asciidoc[leveloffset=+2]

include::configure-uptime-settings.asciidoc[leveloffset=+2]

include::create-alerts.asciidoc[leveloffset=+1]

include::fields-reference.asciidoc[leveloffset=+1]
4 changes: 2 additions & 2 deletions docs/en/observability/infrastructure-threshold-alert.asciidoc
@@ -8,8 +8,8 @@ resource or for a group of resources within your infrastructure.
Additionally, each alert can be defined using multiple
conditions that combine metrics and thresholds to create precise notifications and reduce false positives.

. In the side navigation, expand *Observability*, and then click *Metrics*.
. On the *Inventory* page, click *Alerts*, and then select *Create alert*.
. To access this page, go to *Observability > Metrics*.
. On the *Inventory* page, click *Alerts > Create alert*.

[role="screenshot"]
image::images/inventory-create-alert.png[Closeup of the open Alerts menu on the Inventory page]
8 changes: 7 additions & 1 deletion docs/en/observability/ingest-logs.asciidoc
@@ -18,6 +18,12 @@ If you haven't already, you need to install {es} for storing and searching your
managing it. For more information, see <<install-observability,Get started>>.
=====

Install and configure {filebeat} on your servers to collect log events. {filebeat} allows you to ship log data from sources that come
in the form of files. It monitors the log files or locations that you specify,
collects log events, and forwards them to {es}. To ease the collection and parsing of
log formats for common applications such as Apache, MySQL, and Kafka, a number of
{filebeat-ref}/filebeat-modules.html[modules] are available.
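
As a quick orientation, here is a minimal sketch of a `filebeat.yml` configuration
(the log path and the {es} host are placeholders; the steps below cover the real setup):

[source,yaml]
----
filebeat.inputs:
- type: log                     # tail plain log files
  enabled: true
  paths:
    - /var/log/*.log            # placeholder; point this at your own logs

output.elasticsearch:
  hosts: ["localhost:9200"]     # placeholder; your {es} endpoint
----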

[[install-filebeat]]
== Step 1: Install {beatname_uc}

@@ -174,7 +180,7 @@ Let's confirm your data is correctly streaming to your cloud instance.
include::{beats-repo-dir}/tab-widgets/open-kibana-widget.asciidoc[]
--

. In the side navigation, expand *{kib}*, and then click *Discover*.
. In the side navigation, click *{kib} > Discover*.
+
. Select `filebeat-*` as your index pattern.
+
12 changes: 11 additions & 1 deletion docs/en/observability/ingest-metrics.asciidoc
@@ -17,6 +17,16 @@ If you haven't already, you need to install {es} for storing and searching your
managing it. For more information, see <<install-observability,Get started>>.
=====

Install and configure {metricbeat} on your servers to collect and preprocess system
and service metrics, such as information about running processes, as well as CPU, memory,
disk, and network utilization numbers.

{metricbeat} comes with predefined assets for parsing, indexing, and
visualizing your data. To load these assets, {metricbeat} uses
{metricbeat-ref}/metricbeat-modules.html[modules]. Each
module defines the basic logic for collecting data from a specific service, such as
Redis or MySQL, and consists of metricsets that fetch and structure the data before
sending it to {es}.
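
For example, a minimal sketch of enabling the system module directly in `metricbeat.yml`
(the collection interval and the {es} host are illustrative placeholders):

[source,yaml]
----
metricbeat.modules:
- module: system                  # host-level metrics
  metricsets: ["cpu", "memory", "network", "process"]
  period: 10s                     # illustrative collection interval

output.elasticsearch:
  hosts: ["localhost:9200"]       # placeholder; your {es} endpoint
----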

[[install-metricbeat]]
== Step 1: Install {metricbeat}

@@ -135,7 +145,7 @@ Let's confirm your data is correctly ingested to your cluster.
include::{beats-repo-dir}/tab-widgets/open-kibana-widget.asciidoc[]
--

. In the side navigation, expand *{kib}*, and then click *Discover*.
. In the side navigation, click *{kib} > Discover*.
+
. Select `metricbeat-*` as your index pattern.
+
47 changes: 45 additions & 2 deletions docs/en/observability/ingest-uptime.asciidoc
@@ -15,6 +15,47 @@ If you haven't already, you need to install {es} for storing and searching your
managing it. For more information, see <<install-observability,Get started>>.
=====

Install and configure {heartbeat} on your servers to periodically check the status of your
services. {heartbeat} uses probing to monitor the availability of services and helps
verify that you’re meeting your service level agreements for service uptime.
You typically install {heartbeat} as part of a monitoring service that runs on a separate machine
and possibly even outside of the network where the services that you want to monitor are running.
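
A minimal `heartbeat.yml` monitor definition looks something like this sketch (the URL,
schedule, and {es} host are placeholders; the steps below cover the real setup):

[source,yaml]
----
heartbeat.monitors:
- type: http                    # probe the endpoint over HTTP(S)
  id: my-service-http           # placeholder monitor ID
  name: My service
  urls: ["https://example.com"] # placeholder; the endpoint to check
  schedule: '@every 10s'        # run the check every 10 seconds

output.elasticsearch:
  hosts: ["localhost:9200"]     # placeholder; your {es} endpoint
----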

[[deployment-considerations]]
== Deployment considerations

There are multiple ways to deploy Uptime and Heartbeat. A guiding principle is that when
an outage takes down the service being monitored, it should not take down {heartbeat}.

{heartbeat} is commonly run as a centralized service within a data center.
While it's possible to run it as a separate "sidecar" process paired with each process/container,
we recommend against it. Running {heartbeat} centrally ensures you will still be able to see
monitoring data in the event of an overloaded, disconnected, or otherwise malfunctioning server.

For further redundancy, you may want to deploy multiple instances of {heartbeat} across geographic and network boundaries
to provide more data.

For example:

* A site served from a content delivery network (CDN) with points of presence (POPs) around the globe.
+
To check if your site is reachable via CDN POPs, deploy multiple {heartbeat} instances at
different data centers around the world.
+
* A service within a single data center that is accessed across multiple VPNs.
+
Set up one {heartbeat} instance within the VPN the service operates from, and another within an additional
VPN that users access the service from. In the event of an outage, having both instances helps pinpoint
the network errors.
+
* A single service running primarily in a US east coast data center, with a hot failover located in
a US west coast data center.
+
In each data center, run a {heartbeat} instance that checks both the local
copy of the service and its counterpart across the country. Set up two monitors in each region, one for
the local service, and one for the remote service (see the sketch after this list). In the event of a data center failure, it is
immediately apparent whether the service has a connectivity issue to the outside world or the failure is only internal.
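
To make the last scenario concrete, the monitors on the east coast instance might look like
the following sketch (the hostnames are hypothetical):

[source,yaml]
----
heartbeat.monitors:
- type: http
  id: service-us-east           # the local copy of the service
  name: Service (us-east)
  urls: ["https://us-east.example.com/health"]   # hypothetical hostname
  schedule: '@every 10s'
- type: http
  id: service-us-west           # the remote counterpart
  name: Service (us-west)
  urls: ["https://us-west.example.com/health"]   # hypothetical hostname
  schedule: '@every 10s'
----

The west coast instance runs the mirror-image configuration, so each data center watches both endpoints.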

[[install-heartbeat]]
== Step 1: Install {beatname_uc}

@@ -150,15 +191,17 @@ include::{beats-repo-dir}/tab-widgets/start-widget.asciidoc[]
[[view-uptime-kibana]]
== Step 6: View your data in {kib}

To view the <<observability-ui,Observability Overview>> page:
Let's confirm your data is correctly ingested to your cluster.

. Launch {kib}:
+
--
include::{beats-repo-dir}/tab-widgets/open-kibana-widget.asciidoc[]
--

. In the side navigation, expand *Observability*, and then click *Overview*.
. In the side navigation, click *Observability > Uptime*.

Now let's have a look at the <<monitor-uptime,Uptime app>>.

// Add Javascript and CSS for tabbed panels
include::{beats-repo-dir}/tab-widgets/code.asciidoc[]
31 changes: 31 additions & 0 deletions docs/en/observability/inspect-uptime-duration-anomalies.asciidoc
@@ -0,0 +1,31 @@
[[inspect-uptime-duration-anomalies]]
= Inspect uptime duration anomalies

Each monitor location is modeled, and when a monitor runs
for an unusual amount of time at a particular time, an anomaly is recorded and highlighted
on the *Monitor duration* chart.

[[uptime-anomaly-detection]]
== Enable uptime duration anomaly detection

Create a machine learning job to detect anomalous monitor duration rates automatically.

1. To access this page, go to *Observability > Uptime*, and then click a monitor to view its details.
2. In the *Monitor duration* panel, click *Enable anomaly detection*.
+
[NOTE]
=====
If anomaly detection is already enabled, click *Anomaly detection* and select whether to view duration anomalies directly in the
{ml-docs}/ml-gs-results.html[Machine Learning app], enable an <<duration-anomaly-alert,anomaly alert>>,
or disable anomaly detection.
=====
+
3. You are prompted to create a <<duration-anomaly-alert,response duration anomaly alert>> for the machine learning job that carries
out the analysis, and you can configure the severity level for which the alert is created.

When an anomaly is detected, it is displayed on the *Monitor duration*
chart, along with the duration times. The colors represent the criticality of the anomaly: red
(critical) and yellow (minor).

[role="screenshot"]
image::images/inspect-uptime-duration-anomalies.png[]
2 changes: 1 addition & 1 deletion docs/en/observability/install-observability.asciidoc
@@ -9,7 +9,7 @@ data, and {kib} for visualizing and managing it.
[[set-up-on-cloud]]
== Set up on Cloud

include::{docs-root}/shared/cloud/ess-getting-started.asciidoc[]
include::{docs-root}/shared/cloud/ess-getting-started-obs.asciidoc[]

[float]
[[self-manage]]