
[Docs] Step-by-step tutorial for uni-directional CCR failover #84854

Closed

Leaf-Lin opened this issue Mar 10, 2022 · 11 comments
Labels
:Distributed Indexing/CCR (Issues around the Cross Cluster State Replication features) · >enhancement · Team:Distributed (Obsolete) (Meta label for distributed team; replaced by Distributed Indexing/Coordination)

Comments

@Leaf-Lin
Contributor

Description

As of this writing, CCR does not offer automatic failover. Can we please add the following tutorial for the failover scenario?

The initial setup can be skipped as it's similar to Tutorial: Set up cross-cluster replication. Adding it here for completeness.

Initial setup (uni-directional CCR, with the DR cluster following the Production cluster)

Step 1: Create a remote cluster on DR pointing to Production

### On DR cluster ###
PUT /_cluster/settings
{
  "persistent" : {
    "cluster" : {
      "remote" : {
        "production" : {
          "seeds" : [
            "127.0.0.1:9300" 
          ]
        }
      }
    }
  }
}
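
Before creating the follower, it may be worth confirming that DR can actually reach Production. A minimal check, assuming the remote info API is available on your version:

### On DR cluster ###
GET /_remote/info

### The response should list "production" with "connected": true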

Step 2: Create an index on Production

### On Production cluster ###
PUT /my_index
POST /my_index/_doc/1
{
  "foo":"bar"
}

Step 3: Create a follower index on DR

### On DR cluster ###
PUT /my_index/_ccr/follow 
{ 
  "remote_cluster" : "production", 
  "leader_index" : "my_index" 
}
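
Optionally, confirm the follower is active and pointing at the right remote. A quick sketch using the follower info API:

### On DR cluster ###
GET /my_index/_ccr/info

### Expect "status": "active" and "remote_cluster": "production" in the response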

Step 4: Test the follower index on DR

### On DR cluster ###
GET /my_index/_search

### This should return the document created on Production (foo/bar)

⚠️ Ingestion should only be directed to the Production cluster; search queries can be directed to either the Production or DR cluster.

When Production goes down:

Step 1: On the client side, pause ingestion of my_index into Production.

Step 2: On the Elasticsearch side, turn the follower indices in DR into regular indices:

Ensure no writes are occurring on the leader index (if the data centre is down or the cluster is unavailable, no action is needed)
On DR: convert the follower index to a normal index that can accept writes

### On DR cluster ###
POST /my_index/_ccr/pause_follow   ### stop replication from the leader
POST /my_index/_close              ### the index must be closed before unfollowing
POST /my_index/_ccr/unfollow       ### convert the follower into a regular index
POST /my_index/_open               ### reopen the index for reads and writes

Step 3: On the client side, manually re-enable ingestion of my_index to the DR cluster. You can test that the index is now writable:

### On DR cluster ###
POST my_index/_doc/2
{
  "foo": "new"
}  
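
If you want to double-check, read the document back to confirm the write succeeded (a trivial verification, nothing CCR-specific):

### On DR cluster ###
GET my_index/_doc/2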

⚠️ Make sure all traffic is redirected to the DR cluster during this time.

Once Production comes back:

Step 1: On the client side, stop writes to my_index on the DR cluster.

Step 2: Create a remote cluster on Production pointing to DR

### On Production cluster ###
PUT _cluster/settings
{
  "persistent" : {
    "cluster" : {
      "remote" : {
        "dr" : {
          "seeds" : [
            "127.0.0.2:9300" 
          ]
        }
      }
    }
  }
}
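
As in the initial setup, you can optionally verify the connection before continuing:

### On Production cluster ###
GET /_remote/info

### The response should list "dr" with "connected": true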

Step 3: Create follower indices on Production, connecting them to the leaders in DR. The former leader indices on Production now contain outdated data and need to be deleted first. Wait for the Production follower indices to catch up; once they are caught up, you can turn them back into regular indices.

### On Production cluster ###
DELETE my_index

### Create follower index on Production to follow from DR cluster
PUT /my_index/_ccr/follow 
{ 
  "remote_cluster" : "dr", 
  "leader_index" : "my_index" 
}

### Wait for my_index to catch up with DR and contain all the documents.
GET my_index/_search
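
### Optionally, check the follower stats; replication has caught up when the
### follower_global_checkpoint reaches the leader_global_checkpoint in the
### per-shard fields of the CCR stats response
GET my_index/_ccr/stats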

### Stop following from DR to turn my_index into a regular index.
POST /my_index/_ccr/pause_follow
POST /my_index/_close
POST /my_index/_ccr/unfollow
POST /my_index/_open 

Step 4: Delete the former DR writeable indices, which now contain outdated data, and create follower indices on DR again so that all changes from Production are streamed to DR. (This is the same as the initial setup.)

### On DR cluster ###
DELETE my_index

### Create follower index on `DR` to follow from the `Production` cluster
PUT /my_index/_ccr/follow 
{ 
  "remote_cluster" : "production", 
  "leader_index" : "my_index" 
}
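
Optionally, verify that DR is following Production again. A minimal sketch using the cluster-wide CCR stats API:

### On DR cluster ###
GET /_ccr/stats

### follow_stats should list my_index with "production" as the remote cluster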

Step 5: On the client side, manually re-enable ingestion to the Production cluster.

⚠️ Ingestion should only be directed to Production; search queries can be directed to either the Production or DR cluster.

Leaf-Lin added the >enhancement and Team:Docs (Meta label for docs team) labels on Mar 10, 2022
@elasticmachine
Collaborator

Pinging @elastic/es-docs (Team:Docs)

@jasonslater2000

How should .kibana be handled?

@Leaf-Lin
Contributor Author

We removed the auto-follow pattern for system indices in 8.0.0, but you can still create a follower index to replicate a specific leader_index. For example, to replicate .kibana_8.0.0_001 from clusterA to clusterB, execute this on clusterB:

DELETE .kibana_8.0.0_001
PUT .kibana_8.0.0_001/_ccr/follow?wait_for_active_shards=1
{
  "remote_cluster" : "clusterA",
  "leader_index" : ".kibana_8.0.0_001"
}
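
On clusterB, you can then confirm that the system index is being replicated (the exact index name depends on your Kibana version):

GET .kibana_8.0.0_001/_ccr/info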

The DR flow for system indices is the same as in my initial post, except that it should avoid using the _ccr/auto_follow API.

@jasonslater2000

Thanks Leaf. Can we add this to the documentation as well, as it pertains to the DR use case and covers failover/failback scenarios?

Customers in this situation would certainly want to understand what can be synced to their DR cluster vs. what cannot.

@Arnovandevelde

As a follow-up question, what are the effects of setting up .kibana as a follower index? How does this interact with the various configuration elements, and with the restriction on direct access to system indices? For example, does this also work with the task manager, to handle the risk of duplicate tasks/alerts? If the documentation also covered this, so that customers understand the impact, it would go a long way toward the right DR setup in combination with CCR.

@a03nikki
Contributor

a03nikki commented Apr 14, 2022

@Leaf-Lin: Another follow-up question to add to the list:

Will users be able to continue to explicitly follow system indices and complete the steps to convert them to regular indices?

Over the last few months there have been a number of changes that make it more difficult to work with system indices, for example #72815, #63513, and #74212. When our users plan their DR strategies, they want to know how forward-compatible their plans are, as they need to continually patch their deployments. So knowing whether explicit CCR of system indices is planned to be taken away is important.

Leaf-Lin added the :Distributed Indexing/CCR (Issues around the Cross Cluster State Replication features) label on Apr 19, 2022
elasticmachine added the Team:Distributed (Obsolete) label on Apr 19, 2022
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@Leaf-Lin
Contributor Author

  1. On following the .kibana index:

Can we add this to the documentation as well, as it pertains to the DR use case, and covering failover/failback scenarios?

Although the workaround above allows us to replicate this particular system index, it is not ideal. One still needs to manually delete the .kibana indices on the follower cluster before replicating them, and it does not cover situations where the cluster gets upgraded (which changes .kibana to new index names). Furthermore, this workaround may work for .kibana, but it is not guaranteed to work for all system indices. For example, there seems to be little value in replicating the .async-search or .tasks index outside the cluster running those tasks. Unfortunately, there is currently no systematic way to select the right set of system indices to replicate. You can see a similar comment in [1].

what ARE the effects of setting up the .kibana as a follower index? How does this pertain to all kinds of configuration elements as well as not allowing direct access to system indices? For example, does this also work with the taskmanager part to handle the dimension of duplicate tasks/alerts?

These questions are spot-on. For the reasons you have mentioned (plus upgrade handling), the follower cluster must have .kibana set to read-only mode, so the follower will not be able to set up alerts, visualizations, or any other task management features that currently require writes to the .kibana or .kibana_task_manager index.

  2. On system indices + CCR planning:

Will users be able to continue to explicitly follow system indices and complete the steps to convert them to regular indices?

We agree that disaster recovery for system indices today is not well-implemented. I have raised an enhancement request [2], which would require a cross-team effort to address.

I am not aware of any planned changes in the short term.

Footnotes

  [1] https://github.com/elastic/elasticsearch/issues/81750#issuecomment-1075284190

  [2] https://github.com/elastic/elasticsearch/issues/86168

@tlrx
Member

tlrx commented Aug 9, 2022

@Leaf-Lin the doc team is very busy, do you think you can provide the documentation change you proposed?

@tlrx
Member

tlrx commented Aug 9, 2022

@Leaf-Lin the doc team is very busy, do you think you can provide the documentation change you proposed?

Sorry for the noise, I just saw you provided #87099

@shainaraskas
Contributor

resolved by #91491
