Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add documentation on remote recovery #39483

Merged
merged 9 commits into from
Mar 5, 2019
Merged
Show file tree
Hide file tree
Changes from 3 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 5 additions & 0 deletions docs/reference/ccr/getting-started.asciidoc
Original file line number Diff line number Diff line change
Expand Up @@ -253,6 +253,11 @@ PUT /server-metrics-copy/_ccr/follow?wait_for_active_shards=1

//////////////////////////

The follower index is bootstrapped using the <<remote-recovery, remote recovery>>
Tim-Brooks marked this conversation as resolved.
Show resolved Hide resolved
process. The remote recovery process transfers the existing Lucene segment files
from the leader to the follower. When the remote recovery process is complete,
the index following will be initiated.
Tim-Brooks marked this conversation as resolved.
Show resolved Hide resolved

Now when you index documents into your leader index, you will see these
documents replicated in the follower index. You can
inspect the status of replication using the
Expand Down
57 changes: 57 additions & 0 deletions docs/reference/ccr/remote-recovery.asciidoc
Original file line number Diff line number Diff line change
@@ -0,0 +1,57 @@
[[remote-recovery]]
=== Remote Recovery

Tim-Brooks marked this conversation as resolved.
Show resolved Hide resolved
Remote recovery is the process used to build a new copy of a shard on a follower
Tim-Brooks marked this conversation as resolved.
Show resolved Hide resolved
node by copying data from the primary shard in the leader cluster. {es} uses this
remote recovery process to bootstrap a follower index using the data from the
leader index.
Tim-Brooks marked this conversation as resolved.
Show resolved Hide resolved

Remote recovery is a network intensive process that transfers all of the Lucene
segment files from the leader cluster to the follower cluster. The follower
requests that a recovery session be initiated on the primary shard in the leader
cluster. The follower then requests file chunks concurrently from the leader. By
default, the the process concurrently requests `5` large `1mb` file chunks as remote
Tim-Brooks marked this conversation as resolved.
Show resolved Hide resolved
recovery is designed to support leader and follower clusters with high network
Tim-Brooks marked this conversation as resolved.
Show resolved Hide resolved
latency between them.

Information about an in-progress remote recovery can be obtained using the
Tim-Brooks marked this conversation as resolved.
Show resolved Hide resolved
<<cat-recovery,recovery>> api. Remote recoveries are implemented using the
Tim-Brooks marked this conversation as resolved.
Show resolved Hide resolved
<<modules-snapshots,snapshot and restore>> infrastructure. This means that on-going
remote recoveries will be labelled as type `snapshot` in the recovery api.

The following _expert_ setting can be set to manage the resources consumed by
remote recoveries:

`ccr.indices.recovery.max_bytes_per_sec`::
Tim-Brooks marked this conversation as resolved.
Show resolved Hide resolved
Limits the total inbound and outbound remote recovery traffic on each node.
Since this limit applies on each node, but there may be many nodes
performing remote recoveries concurrently, the total amount of remote recovery bytes
may be much higher than this limit. If you set this limit too high then there
is a risk that ongoing remote recoveries will consume an excess of bandwidth
(or other resources) which could destabilize the cluster. Defaults to `40mb`.
Tim-Brooks marked this conversation as resolved.
Show resolved Hide resolved

`ccr.indices.recovery.max_concurrent_file_chunk`::
Controls the number of file chunk requests that can be sent in parallel per recovery.
As multiple remote recoveries might already running in parallel, increasing this
expert-level setting might only help in situations where remote recovery of a single shard
is not reaching the total inbound and outbound remote recovery traffic as configured by
`ccr.indices.recovery.max_bytes_per_sec`. Defaults to `5`. The maximum allowed value is
`10`.

`ccr.indices.recovery.chunk_size`::
Controls the chunk size requested by the follower during file transfer. Defaults to
`1mb`.

`ccr.indices.recovery.recovery_activity_timeout`::
Controls the timeout for recovery activity. This timeout primarily applies on the leader
cluster. The leader cluster must open resources in-memory to supply data to the follower
during the recovery process. If the leader does not receive recovery requests from the
follower for this period of time, it will close the resources. Defaults to `60 seconds`.

`ccr.indices.recovery.internal_action_timeout`::
Controls the timeout for individual network requests during the remote recovery
process. An individual action timing out can fail the recovery. Defaults to `60 seconds`.


These settings can be dynamically updated on a live cluster with the
<<cluster-update-settings,cluster-update-settings>> API.