143 changes: 143 additions & 0 deletions content/en/agent/guide/agent-retry.md
@@ -0,0 +1,143 @@
---
title: Datadog Disaster Recovery
private: true
further_reading:
- link: "agent/remote_config/?tab=configurationyamlfile"
tag: "Documentation"
text: "Remote Configuration"
- link: "/getting_started/site/"
tag: "Documentation"
text: "Getting Started with Datadog Sites"
- link: "https://www.datadoghq.com/blog/ddr-mitigates-cloud-provider-outages/"
tag: "Blog"
text: "Datadog Disaster Recovery mitigates cloud provider outages"
---

## Overview

This brief is a high-level description of how the Agent behaves when it cannot successfully post an HTTP request to Datadog's intake. It covers the retry strategy (for example, exponential backoff), the nature and size of the buffers, and the drop strategy.

A failed HTTP request is any HTTP request that doesn't result in a 2xx HTTP response.


## Metrics
Failed HTTP requests are retried on timeouts and on 4xx/5xx HTTP response status codes, unless the status code is 400, 403, or 413 (code). When an endpoint is down, the Agent retries using an exponential backoff strategy with randomized jitter. By default, the maximum backoff time is 64 seconds, which the Agent reaches after 6 attempts (calculated here, using a base backoff time of 2).
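
The following minimal sketch illustrates how a base of 2 reaches the 64-second cap after 6 failed attempts. It is not the Agent's implementation: the function names are hypothetical, and the jitter range (uniform between half the ceiling and the ceiling) is an assumption made for illustration only.

```python
import random


def backoff_ceiling(failures: int, base: float = 2.0, max_backoff: float = 64.0) -> float:
    """Upper bound of the wait (seconds) after `failures` consecutive failures."""
    return min(max_backoff, base ** failures)


def backoff_with_jitter(failures: int, base: float = 2.0,
                        max_backoff: float = 64.0, rng=random) -> float:
    """Randomized wait. Assumption: uniform between ceiling / 2 and ceiling."""
    ceiling = backoff_ceiling(failures, base, max_backoff)
    return rng.uniform(ceiling / 2, ceiling)


# Ceilings for the first 8 consecutive failures with the default values.
ceilings = [backoff_ceiling(n) for n in range(1, 9)]
print(ceilings)  # [2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 64.0, 64.0]
```

The ceiling doubles on each failure and stops growing at the sixth attempt, which matches the default maximum described above.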

If the HTTP response is 404, the Agent retries the transaction (see code here). This means that the Agent typically retries requests to endpoints that do not exist in a Datadog region or proxy.

Failed payloads are kept compressed in an in-memory buffer with a default size of 15 MB (code). The size is configurable with forwarder_retry_queue_payloads_max_size.

Customers can enable an opt-in on-disk buffer (public docs, Storing Agent Infra Payloads on the Disk). When it is enabled, the Agent first uses the in-memory buffer until it is full. After the buffer is full, the Agent removes the oldest payloads (transactions) from the in-memory retry queue, serializes them to disk, and accepts the new payloads. The Agent then retries the payloads held in memory and, after that queue is empty, retries the payloads stored on disk. It always tries the newest payload first and removes it on success. The rationale is to send live and most recent metrics before backfilling older ones. More details on retry logic can be found here.
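
The ordering described above can be sketched as a two-tier queue. This is an illustrative model only: the `RetryQueue` class and its methods are hypothetical, the "disk" tier is a plain list standing in for serialized files, and a payload is removed as soon as it is handed out, whereas a real forwarder would keep it until the send succeeds.

```python
from collections import deque


class RetryQueue:
    """Hypothetical two-tier retry queue: memory first, overflow to 'disk'."""

    def __init__(self, max_memory_bytes: int):
        self.max_memory_bytes = max_memory_bytes
        self.memory = deque()  # oldest on the left, newest on the right
        self.disk = []         # stand-in for on-disk storage, newest last
        self.memory_bytes = 0

    def add(self, payload: bytes) -> None:
        # Make room by moving the oldest in-memory payloads to disk.
        while self.memory and self.memory_bytes + len(payload) > self.max_memory_bytes:
            oldest = self.memory.popleft()
            self.memory_bytes -= len(oldest)
            self.disk.append(oldest)
        self.memory.append(payload)
        self.memory_bytes += len(payload)

    def next_to_retry(self):
        # Newest in-memory payload first; disk only after memory is empty.
        if self.memory:
            payload = self.memory.pop()
            self.memory_bytes -= len(payload)
            return payload
        if self.disk:
            return self.disk.pop()
        return None


queue = RetryQueue(max_memory_bytes=10)
for payload in (b"aaaa", b"bbbb", b"cccc"):
    queue.add(payload)

spilled = list(queue.disk)  # [b"aaaa"]: the oldest payload went to disk
order = [queue.next_to_retry() for _ in range(4)]
print(order)  # [b'cccc', b'bbbb', b'aaaa', None]
```

Retrying from the newest end of each tier is what lets recent data reach the intake before older data is backfilled.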

On Datadog infrastructure, on-disk buffering is configured with a maximum size of 2 GB and a maximum disk ratio of 0.8. The maximum in-memory buffer is also set to 15 MB.

To calculate the buffer size, use the metric datadog.agent.retry_queue_duration.bytes_per_sec, which is available by default when on-disk buffering is enabled. With this data, you can estimate how much disk space to allocate to the Agent in case of an outage.
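
As a rough illustration of that estimate, the sketch below multiplies an observed rate by the outage duration you want to survive. It assumes the metric reports the rate at which payload bytes enter the retry queue. The function name and the sample rate of 50,000 bytes per second are hypothetical.

```python
def required_buffer_bytes(bytes_per_sec: float, outage_seconds: float) -> float:
    """Bytes needed to hold everything produced during an outage."""
    return bytes_per_sec * outage_seconds


# Hypothetical reading of datadog.agent.retry_queue_duration.bytes_per_sec.
rate = 50_000
# Size the buffer for a 4-hour outage.
needed = required_buffer_bytes(rate, 4 * 3600)
print(f"{needed / 1e9:.2f} GB")  # 0.72 GB
```

In practice you may want headroom above this figure, because the rate can vary over time.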

When the Agent restarts, all metrics held in memory are lost. If on-disk buffering is enabled, the metrics stored on disk are not lost: the Agent resubmits them. However, metric payloads are written to disk only after the in-memory buffer is full, so the payloads that were in memory when the Agent restarted are lost.

On shutdown, only in-flight requests are flushed and not those in the retry queue.

The Agent reports the number of points it drops to the customer's org. More details in Number of points dropped by the Agent.


## Logs

Failed HTTP requests are retried indefinitely (with backoff) until the endpoint comes back. A failed request is defined here as any response with a status code of 400 or above, excluding 400, 401, 403, and 413. The Logs Agent uses the same exponential backoff strategy with randomized jitter as metrics, except that the default maximum backoff time is 120 seconds. The base backoff time is also 2.
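
The status code rule above can be expressed as a small predicate. This is a sketch of the rule as stated in this section, not the Agent's code; the names `NON_RETRYABLE` and `should_retry_log_payload` are hypothetical.

```python
# Status codes that are treated as permanent failures and never retried.
NON_RETRYABLE = {400, 401, 403, 413}


def should_retry_log_payload(status_code: int) -> bool:
    """True when a response counts as a failed request that is retried."""
    return status_code >= 400 and status_code not in NON_RETRYABLE


for code in (200, 400, 404, 413, 429, 503):
    print(code, should_retry_log_payload(code))
```

With this rule, responses such as 404, 429, and 503 are retried, while 400, 401, 403, and 413 are not.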

Because the Agent guarantees log delivery, a failed payload creates back pressure through the Agent: the Agent stops reading from the log source. As long as the log source is still available when the intake comes back, the Agent resumes reading from where it stopped. In Kubernetes, the file may be rotated before the intake is back online, in which case the rest of the logs are lost. On other systems, tools such as logrotate can remove files being tailed before the Agent is able to read and send the data.

The Logs Agent keeps track of log sources and of its current position in each log source. A "registry" holds this information. The registry is flushed to disk every second (not configurable) and is loaded by the Agent on restart, so after a restart the Agent reads from the position recorded in the registry. This means there is a small chance of duplicate logs if a payload was sent but the registry was not flushed before the restart.
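
The duplicate-log scenario can be illustrated with a toy registry. Everything here is an assumption made for illustration: the `Registry` class, the JSON file format, and the offsets are hypothetical and do not describe the Agent's actual on-disk format.

```python
import json
import os
import tempfile


class Registry:
    """Hypothetical registry mapping a log source to the last read offset."""

    def __init__(self, path: str):
        self.path = path
        self.offsets = {}
        # On startup, load whatever was last flushed to disk.
        if os.path.exists(path):
            with open(path) as f:
                self.offsets = json.load(f)

    def set_offset(self, source: str, offset: int) -> None:
        self.offsets[source] = offset  # in memory only until flush()

    def flush(self) -> None:
        with open(self.path, "w") as f:
            json.dump(self.offsets, f)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "registry.json")
    registry = Registry(path)
    registry.set_offset("/var/log/app.log", 100)
    registry.flush()                               # periodic flush
    registry.set_offset("/var/log/app.log", 250)   # payload sent, not yet flushed
    restarted = Registry(path)                     # simulate an Agent restart
    resume_from = restarted.offsets["/var/log/app.log"]

print(resume_from)  # 100
```

In this toy run, the data between offsets 100 and 250 is read and sent a second time after the restart, which is the source of the possible duplicates.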

When the intake is down, the Logs Agent stores a maximum of approximately 20 MB of compressed payloads in memory. This limit is currently not configurable.

For TCP, the Agent buffers only 100 log lines because it sends them one by one. TCP is still used in some charts in EU1.


### Monitoring retries in the Logs Agent
Telemetry is available for the Logs Agent (for example, this config). Our KB here provides more details on this telemetry.

This telemetry can also be found in the telemetry.log file within a flare.

These metrics can vary widely. The following metrics monitor buffer health, performance, and potential data loss:


```sh
// Buffer Health & Data Loss
logs.bytes_missed - Bytes lost before consumption (log rotation, etc.)
logs.dropped - Total logs dropped per destination

// Performance & Latency
logs.sender_latency - HTTP sender latency histogram (ms)
logs.retry_count - Total retried payloads
logs.network_errors - Total network errors

// Throughput & Volume
logs.decoded - Total decoded logs
logs.processed - Total processed logs
logs.sent - Total sent logs
logs.bytes_sent - Bytes sent before encoding
logs.encoded_bytes_sent - Bytes sent after encoding

// Connection Health
logs_client_http_destination__idle_ms - Time spent idle (ms) by sender
logs_client_http_destination__in_use_ms - Time spent sending (ms) by sender

// Data Loss & Errors
logs_client_http_destination__payloads_dropped - Payloads dropped due to unrecoverable errors
logs_client_http_destination__send - Send attempts by endpoint_host and error type

// HTTP Response Health
logs.destination_http_resp - HTTP responses by status_code and url

// Buffer Capacity & Health (processor, sender, strategy)
logs_component_utilization__ratio - Utilization ratio (0-1) by component name and instance
logs_component_utilization__items - Items in buffer/queue by component name and instance
logs_component_utilization__bytes - Bytes in buffer/queue by component name and instance

```

### Dual shipping
When dual shipping, the Agent blocks on both destinations. A problem arises when one endpoint fails while the Agent can still send to the first endpoint. In that case, the Agent drops all payloads to the failing endpoint and continues to read from the source, because the logs have been successfully delivered to at least one endpoint.

Dual shipping also has an is_reliable option for additional endpoints. When it is set to true, the Agent sends to both endpoints and treats both with the same priority. This means that if all your reliable endpoints are unavailable, the Agent stops sending data until one reliable endpoint is available. Unreliable endpoints only receive data if at least one reliable endpoint is available. They also have lower priority than endpoints marked as reliable. More details can be found here.
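
The reliable and unreliable rules can be modeled as a small decision function. This is a simplified sketch of the behavior described above, not the Agent's implementation: the `Endpoint` type, the `available` flag, and the `endpoints_to_send` function are hypothetical, and priority handling is left out.

```python
from dataclasses import dataclass


@dataclass
class Endpoint:
    name: str
    is_reliable: bool
    available: bool


def endpoints_to_send(endpoints: list) -> list:
    """Names of the endpoints that receive the next payload."""
    reliable_up = any(e.is_reliable and e.available for e in endpoints)
    if not reliable_up:
        # No reliable endpoint is reachable: stop sending until one returns.
        return []
    return [e.name for e in endpoints if e.available]


main = Endpoint("main", is_reliable=True, available=False)
backup = Endpoint("backup", is_reliable=False, available=True)

blocked = endpoints_to_send([main, backup])
print(blocked)  # []

main.available = True
resumed = endpoints_to_send([main, backup])
print(resumed)  # ['main', 'backup']
```

The first call returns nothing even though the unreliable endpoint is up, because unreliable endpoints only receive data while a reliable endpoint is available.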


## APM
Failed HTTP requests are retried with an exponential backoff. A failed HTTP request is defined as any network connectivity problem or a status code of 408 or 5xx (code here and here). The maximum backoff time is 10 seconds (we believe this limit was chosen arbitrarily), using a base backoff time of 2. Neither the set of retriable status codes nor the backoff rates are configurable.

Failed payloads are kept compressed in memory. If the payload queue is full, the Agent starts dropping older payloads.

### For traces
The queue size is configurable with apm_config.trace_writer.queue_size and defaults to int(max(1, max memory / max payload size)). In most cases, this evaluates to int(max(1, (500 * 1024 * 1024) / 3200000)) = 163 payloads (code).

### For stats
The queue size is configurable with apm_config.stats_writer.queue_size and defaults to int(max(1, max memory / payload size)). In most cases, this evaluates to int(max(1, (250 * 1024 * 1024) / 1500000)) = 174 payloads (code).
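
The default queue sizes for traces and stats follow from the same formula, which can be checked with a few lines. The function name `default_queue_size` is hypothetical; the constants are the ones quoted in this section.

```python
def default_queue_size(max_memory_bytes: int, max_payload_bytes: int) -> int:
    """int(max(1, max memory / max payload size)), as given in this section."""
    return int(max(1, max_memory_bytes / max_payload_bytes))


trace_queue = default_queue_size(500 * 1024 * 1024, 3_200_000)
stats_queue = default_queue_size(250 * 1024 * 1024, 1_500_000)
print(trace_queue, stats_queue)  # 163 174
```

The `max(1, ...)` term guarantees that the queue always holds at least one payload, even when the memory budget is smaller than a single payload.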

When dual shipping, each target endpoint has its own sender instance and queue.

## Other payload types and products

### Processes
The Process Agent uses the metric payload forwarder. Check results are stored in a queue before they are sent to the forwarder.

Approximately 30 minutes of data can be buffered on the Agent for process payloads (with the check running every 10s), as determined by DefaultProcessQueueSize. This can be lower if the buffered payloads hit the maximum of 60 MB controlled by DefaultProcessQueueBytes. Before Agent 7.39, these limits apply to the combination of process and connections (NPM) payloads, because a single queue is used for both. Since Agent 7.39, separate queues are used, which allows more data to be buffered in flight: approximately 40 minutes of process payloads can be buffered with these default settings.

https://github.com/DataDog/datadog-agent/blob/main/pkg/config/process.go#L22-L33

```sh
// Assuming we generate ~8 checks/minute (for process/network),
// this should allow buffering of ~30 minutes of data assuming
// it fits within the queue bytes memory budget
DefaultProcessQueueSize = 256

// DefaultProcessQueueBytes is the default amount of process-agent
// check data (in bytes) that can be buffered in memory
// Allow buffering up to 60 megabytes of payload data in total
DefaultProcessQueueBytes = 60 * 1000 * 1000
```
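
The buffering durations quoted above can be reproduced from these defaults. The sketch below is illustrative arithmetic only: the function name is hypothetical, the rate of 8 checks per minute comes from the comment in the constants above, and the rate of 6 checks per minute for a process-only queue is an assumption derived from the 10s check interval.

```python
DEFAULT_PROCESS_QUEUE_SIZE = 256
DEFAULT_PROCESS_QUEUE_BYTES = 60 * 1000 * 1000


def buffered_minutes(queue_size: int, checks_per_minute: float) -> float:
    """Minutes of check results that fit in the queue at a given rate."""
    return queue_size / checks_per_minute


# Shared queue (before Agent 7.39): ~8 checks/minute for process and network.
shared = buffered_minutes(DEFAULT_PROCESS_QUEUE_SIZE, 8)
# Separate queue (Agent 7.39 and later), assuming one process check every 10s.
process_only = buffered_minutes(DEFAULT_PROCESS_QUEUE_SIZE, 6)
# Average payload size above which the byte budget is reached before the count.
byte_limit_threshold = DEFAULT_PROCESS_QUEUE_BYTES / DEFAULT_PROCESS_QUEUE_SIZE

print(shared)                   # 32.0
print(round(process_only, 1))   # 42.7
print(byte_limit_threshold)     # 234375.0
```

These figures line up with the approximate 30 and 40 minute durations, and show that the 60 MB limit only matters when payloads average more than about 234 KB.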

These settings are rarely overridden by customers.

Downstream, the Metrics forwarder is used (see above section), and the behavior is therefore similar (with the exception that on-disk buffering is not enabled).