
docs: Tail-based sampling #7317

Merged (10 commits) on Mar 9, 2022
108 changes: 108 additions & 0 deletions docs/apm-input-settings.asciidoc
@@ -434,3 +434,111 @@
Anonymous Event rate limit (event limit)
// end::anonymous_rate_limit_event_limit-setting[]

// =============================================================================

// tag::tail_sampling_enabled-setting[]
|
[id="input-{input-type}-tail_sampling_enabled"]
Enable tail-based sampling

| (bool) Enable or disable tail-based sampling.

*Default:* `false`
// end::tail_sampling_enabled-setting[]

// =============================================================================

// tag::tail_sampling_interval-setting[]
|
[id="input-{input-type}-tail_sampling_interval"]
Interval

| (duration) Synchronization interval for multiple APM Servers.
Should be on the order of tens of seconds to a few minutes.

*Default:* `1m`
// end::tail_sampling_interval-setting[]

// =============================================================================

// tag::tail_sampling_policies-setting[]
|
[id="input-{input-type}-tail_sampling_policies"]
Policies

| (`[]policy`) Criteria used to match a root transaction to a sample rate.
Order is important; the first policy on the list that an event matches is the winner.
Each policy list must conclude with a default policy that only specifies a sample rate.
The default policy is used to catch remaining trace events that don’t match a stricter policy.

Required when tail-based sampling is enabled.

// end::tail_sampling_policies-setting[]

// =============================================================================

// tag::sample_rate-setting[]
|
[id="input-{input-type}-sample_rate"]
Sample rate

`sample_rate`

| (float) The sample rate to apply to trace events matching this policy.
Required in each policy.

// end::sample_rate-setting[]

// =============================================================================

// tag::trace_name-setting[]
|
[id="input-{input-type}-trace_name"]
Trace name

`trace.name`

| (string) The trace name for events to match a policy.

// end::trace_name-setting[]

// =============================================================================

// tag::trace_outcome-setting[]
|
[id="input-{input-type}-trace_outcome"]
Trace outcome

`trace.outcome`

| (string) The trace outcome for events to match a policy.
Trace outcome can be `success`, `failure`, or `unknown`.

// end::trace_outcome-setting[]

// =============================================================================

// tag::service_name-setting[]
|
[id="input-{input-type}-service_name"]
Service name

`service.name`

| (string) The service name for events to match a policy.

// end::service_name-setting[]

// =============================================================================

// tag::service_env-setting[]
|
[id="input-{input-type}-service_env"]
Service environment

`service.environment`

| (string) The service environment for events to match a policy.

// end::service_env-setting[]

// =============================================================================
Binary file added docs/images/dt-sampling-example-1.png
Binary file added docs/images/dt-sampling-example-2.png
Binary file added docs/images/dt-sampling-example-3.png
Binary file removed docs/images/dt-sampling-example.png
24 changes: 24 additions & 0 deletions docs/input-apm.asciidoc
@@ -78,4 +78,28 @@
include::./apm-input-settings.asciidoc[tag=anonymous_rate_limit_ip_limit-setting]
include::./apm-input-settings.asciidoc[tag=anonymous_rate_limit_event_limit-setting]
|===

[float]
[[apm-input-tail-sampling-settings]]
=== Tail-based sampling

**Top-level tail-based sampling settings:**

[cols="2*<a"]
|===
include::./apm-input-settings.asciidoc[tag=tail_sampling_enabled-setting]
include::./apm-input-settings.asciidoc[tag=tail_sampling_interval-setting]
include::./apm-input-settings.asciidoc[tag=tail_sampling_policies-setting]
|===

**Policy settings:**

[cols="2*<a"]
|===
include::./apm-input-settings.asciidoc[tag=sample_rate-setting]
include::./apm-input-settings.asciidoc[tag=trace_name-setting]
include::./apm-input-settings.asciidoc[tag=trace_outcome-setting]
include::./apm-input-settings.asciidoc[tag=service_name-setting]
include::./apm-input-settings.asciidoc[tag=service_env-setting]
|===

:input-type!:
3 changes: 3 additions & 0 deletions docs/legacy/guide/trace-sampling.asciidoc
@@ -12,6 +12,9 @@
For example, a sampling value of `.2` indicates a transaction sample rate of `20%`.
This means that only `20%` of traces will send and retain all of their associated information.
The remaining traces will drop contextual information to reduce the transfer and storage size of the trace.

TIP: The APM integration supports both head-based and tail-based sampling.
Learn more in <<sampling>>.

[float]
==== Why sample?

181 changes: 140 additions & 41 deletions docs/sampling.asciidoc
@@ -1,27 +1,77 @@
[[sampling]]
=== Transaction sampling

Elastic APM supports head-based, probability sampling.
_Head-based_ means the sampling decision for each trace is made when that trace is initiated.
_Probability sampling_ means that each trace has a defined and equal probability of being sampled.
Distributed tracing can generate a substantial amount of data.
More data can mean higher costs and more noise to sift through.
Sampling aims to lower the amount of data ingested and the effort required to analyze that data --
all while still making it easy to find anomalous patterns in your applications, detect outages, track errors,
and lower MTTR.

Elastic APM supports two types of sampling:

* <<head-based-sampling>>
* <<tail-based-sampling>>

[float]
[[head-based-sampling]]
==== Head-based sampling

In head-based sampling, the sampling decision for each trace is made when that trace is initiated.
Each trace has a defined and equal probability of being sampled.

For example, a sampling value of `.2` indicates a transaction sample rate of `20%`.
This means that only `20%` of traces will send and retain all of their associated information.
The remaining traces will drop contextual information to reduce the transfer and storage size of the trace.

Head-based sampling is quick and easy to set up.
Its downside is that it's entirely random -- good
data might be discarded purely due to chance.

See <<configure-head-based-sampling>> to get started.

**Distributed tracing with head-based sampling**

In a distributed trace, the sampling decision is still made when the trace is initiated.
Each subsequent service respects the initial service's sampling decision, regardless of its configured sample rate;
the result is a sampling percentage that matches the initiating service.

In this example, `Service A` initiates four transactions and has sample rate of `.5` (`50%`).
The sample rates of `Service B` and `Service C` are ignored.

image::./images/dt-sampling-example-1.png[Distributed tracing and head based sampling example one]

In this example, `Service A` initiates four transactions and has a sample rate of `1` (`100%`).
Again, the sample rates of `Service B` and `Service C` are ignored.

image::./images/dt-sampling-example-2.png[Distributed tracing and head based sampling example two]

[float]
==== Why sample?
[[tail-based-sampling]]
==== Tail-based sampling

Distributed tracing can generate a substantial amount of data,
and storage can be a concern for users running `100%` sampling -- especially as they scale.
In tail-based sampling, the sampling decision for each trace is made after the trace has completed.
This means all traces will be analyzed against a set of rules, or policies, which will determine the rate at which they are sampled.

The goal of probability sampling is to provide you with a representative set of data that allows
you to make statistical inferences about the entire group of data.
In other words, in most cases, you can still find anomalous patterns in your applications, detect outages, track errors,
and lower MTTR, even when sampling at less than `100%`.
Tail-based sampling reduces the risk of discarding important data, because the sampling decision is only made _after_
each trace has been analyzed.
However, because traces are all initially observed,
storage and transfer costs may be higher than with head-based sampling.

See <<configure-tail-based-sampling>> to get started.

**Distributed tracing with tail-based sampling**

With tail-based sampling, all traces are observed and a sampling decision is only made once a trace completes.

In this example, `Service A` initiates four transactions.
If our sample rate is `.5` (`50%`) for traces with a `success` outcome,
and `1` (`100%`) for traces with a `failure` outcome,
the sampled traces would look something like this:

image::./images/dt-sampling-example-3.png[Distributed tracing and tail based sampling example one]
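Expressed as tail-based sampling policies, this example might be configured as follows (a sketch; the default policy's rate is illustrative, and a default policy is required to catch remaining traces):

[source,yml]
----
- sample_rate: .5        # traces with a `success` outcome
  trace.outcome: success
- sample_rate: 1         # traces with a `failure` outcome
  trace.outcome: failure
- sample_rate: .5        # default policy for any remaining traces (rate illustrative)
----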

[float]
==== What data is sampled?
=== Sampled data

A sampled trace retains all data associated with it.

@@ -37,37 +87,25 @@
This means the following data will always accurately reflect *all* of your applications:
* Transaction breakdown metrics
* Errors, error occurrence, and error rate

// To turn off the sending of all data, including transaction and error data, set `active` to `false`.

[float]
==== Sample rates
=== Sample rates

What's the best sampling rate? Unfortunately, there isn't one.
Sampling is dependent on your data, the throughput of your application, data retention policies, and other factors.
Any sampling rate from `.1%` to `100%` can be considered normal.
You may even decide to have a unique sample rate per service -- for example, if a certain service
experiences considerably more or less traffic than another.
You'll likely decide on a unique sample rate for different scenarios.
Here are some examples:

// Regardless, cost conscious customers are likely to be fine with a lower sample rate.
* Services with considerably more traffic than others might be safe to sample at lower rates
* Routes that are more important than others might be sampled at higher rates
* A production service environment might warrant a higher sampling rate than a development environment

[float]
==== Sampling with distributed tracing

The initiating service makes the sampling decision in a distributed trace,
and all downstream services respect that decision.

In each example below, `Service A` initiates four transactions.
In the first example, `Service A` samples at `.5` (`50%`). In the second, `Service A` samples at `1` (`100%`).
Each subsequent service respects the initial sampling decision, regardless of their configured sample rate.
The result is a sampling percentage that matches the initiating service:

image::./images/dt-sampling-example.png[How sampling impacts distributed tracing]
Regardless of the above, cost-conscious customers are likely to be fine with a lower sample rate.

[float]
==== APM app implications
=== APM app implications

Because the transaction sample rate is respected by downstream services,
the APM app always knows which transactions have and haven't been sampled.
The APM app always knows which transactions have and haven't been sampled.
This prevents the app from showing broken traces.
In addition, because transaction and error data is never sampled,
you can always expect metrics and errors to be accurately reflected in the APM app.
@@ -78,24 +116,24 @@
Service maps rely on distributed traces to draw connections between services.
A minimum APM agent version is required for service maps to work.
See {kibana-ref}/service-maps.html[Service maps] for more information.

// Follow-up: Add link from https://www.elastic.co/guide/en/kibana/current/service-maps.html#service-maps-how
// to this page.
[[configure-head-based-sampling]]
==== Configure head-based sampling

[float]
==== Adjust the sample rate
There are three ways to adjust the head-based sampling rate of your APM agents:

There are three ways to adjust the transaction sample rate of your APM agents:
===== Dynamic configuration

Dynamic::
The transaction sample rate can be changed dynamically (no redeployment necessary) on a per-service and per-environment
basis with {kibana-ref}/agent-configuration.html[APM Agent Configuration] in Kibana.

Kibana API::
===== Kibana API configuration

APM Agent configuration exposes an API that can be used to programmatically change
your agents' sampling rate.
An example is provided in the {kibana-ref}/agent-config-api.html[Agent configuration API reference].
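As a sketch, such an API call might look like the following (endpoint and payload shape per the linked Kibana reference; the host, credentials, service name, and rate are placeholders):

[source,sh]
----
# Set a 20% sample rate for one service via the Kibana API (placeholder host).
curl -X PUT "http://localhost:5601/api/apm/settings/agent-configuration" \
  -H "Content-Type: application/json" \
  -H "kbn-xsrf: true" \
  -d '{
    "service": {"name": "my-service", "environment": "production"},
    "settings": {"transaction_sample_rate": "0.2"}
  }'
----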

Configuration::
===== APM agent configuration

Each agent provides a configuration value used to set the transaction sample rate.
See the relevant agent's documentation for more details:

@@ -105,4 +143,65 @@
* Node.js: {apm-node-ref-v}/configuration.html#transaction-sample-rate[`transactionSampleRate`]
* PHP: {apm-php-ref-v}/configuration-reference.html#config-transaction-sample-rate[`transaction_sample_rate`]
* Python: {apm-py-ref-v}/configuration.html#config-transaction-sample-rate[`transaction_sample_rate`]
* Ruby: {apm-ruby-ref-v}/configuration.html#config-transaction-sample-rate[`transaction_sample_rate`]
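Several agents also read this option from the environment. For example (variable name as documented for the Python and Node.js agents; the value is illustrative):

[source,sh]
----
# Sample 20% of transactions; equivalent to setting transaction_sample_rate=0.2.
export ELASTIC_APM_TRANSACTION_SAMPLE_RATE=0.2
----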

[[configure-tail-based-sampling]]
==== Configure tail-based sampling

Enable tail-based sampling in the <<input-apm,APM integration settings>>.
When enabled, trace events are mapped to sampling policies.
Each sampling policy must specify a sample rate, and can optionally specify other conditions.
All of the policy conditions must be true for a trace event to match it.

Trace events are matched to policies in the order specified.
Each policy list should conclude with a default policy -- one that only specifies a sample rate.
This default policy is used to catch remaining trace events that don't match a stricter policy.
Requiring this default policy ensures that traces are only dropped intentionally.
If you enable tail-based sampling and send a transaction that does not match any of the policies,
APM Server will reject the transaction with the error `no matching policy`.
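The first-match behavior described above can be sketched in a few lines (an illustrative model, not APM Server code; the flattened event and policy shapes are assumptions):

[source,python]
----
# Illustrative model of first-match policy selection -- not APM Server code.
def match_policy(event, policies):
    """Return the sample rate of the first policy whose conditions all match the event."""
    for policy in policies:
        # Every key other than `sample_rate` is a matching condition.
        conditions = {k: v for k, v in policy.items() if k != "sample_rate"}
        if all(event.get(k) == v for k, v in conditions.items()):
            return policy["sample_rate"]
    raise ValueError("no matching policy")

policies = [
    {"sample_rate": 1, "service.environment": "production",
     "trace.name": "GET /very_important_route"},
    {"sample_rate": 0.01, "service.environment": "production",
     "trace.name": "GET /not_important_route"},
    {"sample_rate": 0.1},  # default policy: no conditions, matches everything
]
----

With the default policy removed, a trace that matches no stricter policy would raise `no matching policy`, mirroring the rejection described above.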

===== Example configuration

This example defines three tail-based sampling policies:

[source,yml]
----
- sample_rate: 1 <1>
service.environment: production
trace.name: "GET /very_important_route"
- sample_rate: .01 <2>
service.environment: production
trace.name: "GET /not_important_route"
- sample_rate: .1 <3>
----
<1> Samples 100% of traces in `production` with the trace name `"GET /very_important_route"`
<2> Samples 1% of traces in `production` with the trace name `"GET /not_important_route"`
<3> Default policy to sample all remaining traces at 10%, e.g. traces in a different environment, like `dev`,
or traces with any other name

===== Configuration reference

:input-type: tbs
**Top-level tail-based sampling settings:**

// This looks like the root service name/env, trace name/env, and trace outcome

[cols="2*<a"]
|===
include::./apm-input-settings.asciidoc[tag=tail_sampling_enabled-setting]
include::./apm-input-settings.asciidoc[tag=tail_sampling_interval-setting]
include::./apm-input-settings.asciidoc[tag=tail_sampling_policies-setting]
|===

**Policy settings:**

[cols="2*<a"]
|===
include::./apm-input-settings.asciidoc[tag=sample_rate-setting]
include::./apm-input-settings.asciidoc[tag=trace_name-setting]
include::./apm-input-settings.asciidoc[tag=trace_outcome-setting]
include::./apm-input-settings.asciidoc[tag=service_name-setting]
include::./apm-input-settings.asciidoc[tag=service_env-setting]
|===

:input-type!: