
[Docs] Step-by-step tutorial for uni-directional CCR failover #84854

Closed

Leaf-Lin opened this issue Mar 10, 2022 · 11 comments
Labels
:Distributed Indexing/CCR (Issues around the Cross Cluster State Replication features) · >enhancement · Team:Distributed (Obsolete) (Meta label for distributed team; replaced by Distributed Indexing/Coordination)

Comments

@Leaf-Lin
Contributor

Description

As of this writing, CCR does not offer automatic failover. Can we please add the following tutorial for the failover scenario?

The initial setup can be skipped as it's similar to Tutorial: Set up cross-cluster replication. Adding it here for completeness.

Initial setup (uni-directional CCR, with the DR cluster following the Production cluster)

Step 1: Create a remote cluster on DR pointing to Production

### On DR cluster ###
PUT /_cluster/settings
{
  "persistent" : {
    "cluster" : {
      "remote" : {
        "production" : {
          "seeds" : [
            "127.0.0.1:9300" 
          ]
        }
      }
    }
  }
}
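
Before creating the follower, it may be worth confirming that DR can actually reach Production. A minimal check, assuming the remote info API is available on your version:

### On DR cluster ###
GET /_remote/info

### The response should list "production" with "connected": true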

Step 2: Create an index on Production

### On Production cluster ###
PUT /my_index
POST /my_index/_doc/1
{
  "foo":"bar"
}

Step 3: Create a follower index on DR

### On DR cluster ###
PUT /my_index/_ccr/follow 
{ 
  "remote_cluster" : "production", 
  "leader_index" : "my_index" 
}
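
Optionally, confirm the follower is active and pointing at the right remote. A quick sketch using the follower info API:

### On DR cluster ###
GET /my_index/_ccr/info

### Expect "status": "active" and "remote_cluster": "production" in the response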

Step 4: Test the follower index on DR

### On DR cluster ###
GET /my_index/_search

### This should return the document created on Production (foo/bar)

⚠️ Ingestion should only be directed to the Production cluster; search queries can be directed to either the Production or DR cluster.

When Production goes down:

Step 1: On the client side, pause ingestion of my_index into Production.

Step 2: On the Elasticsearch side, turn the follower indices in DR into regular indices:

Ensure no writes are occurring on the leader index (if the data centre is down or the cluster is unavailable, no action is needed)
On DR: convert the follower index to a normal index that can accept writes

### On DR cluster ###
POST /my_index/_ccr/pause_follow   ### stop replication from the leader
POST /my_index/_close              ### the index must be closed before unfollowing
POST /my_index/_ccr/unfollow       ### convert the follower into a regular index
POST /my_index/_open               ### reopen the index for reads and writes

Step 3: On the client side, manually re-enable ingestion of my_index to the DR cluster. You can test that the index is now writable:

### On DR cluster ###
POST my_index/_doc/2
{
  "foo": "new"
}  
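
If you want to double-check, read the document back to confirm the write succeeded (a trivial verification, nothing CCR-specific):

### On DR cluster ###
GET my_index/_doc/2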

⚠️ Make sure all traffic is redirected to the DR cluster during this time.

Once Production comes back:

Step 1: On the client side, stop writes to my_index on the DR cluster.

Step 2: Create a remote cluster on Production pointing to DR

### On Production cluster ###
PUT _cluster/settings
{
  "persistent" : {
    "cluster" : {
      "remote" : {
        "dr" : {
          "seeds" : [
            "127.0.0.2:9300" 
          ]
        }
      }
    }
  }
}
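
As in the initial setup, you can optionally verify the connection before continuing:

### On Production cluster ###
GET /_remote/info

### The response should list "dr" with "connected": true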

Step 3: Create follower indices on Production, connecting them to the leaders in DR. The former leader indices on Production now contain outdated data and need to be deleted first. Wait for the Production follower indices to catch up; once they are caught up, you can turn them back into regular indices.

### On Production cluster ###
DELETE my_index

### Create follower index on Production to follow from DR cluster
PUT /my_index/_ccr/follow 
{ 
  "remote_cluster" : "dr", 
  "leader_index" : "my_index" 
}

### Wait for my_index to catch up with DR and contain all the documents.
GET my_index/_search
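
### Optionally, check the follower stats; replication has caught up when the
### follower_global_checkpoint reaches the leader_global_checkpoint in the
### per-shard fields of the CCR stats response
GET my_index/_ccr/stats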

### Stop following from DR to turn my_index into a regular index.
POST /my_index/_ccr/pause_follow
POST /my_index/_close
POST /my_index/_ccr/unfollow
POST /my_index/_open 

Step 4: Delete the former DR writeable indices, which now contain outdated data, and create follower indices on DR again so that all changes from Production are streamed to DR. (This is the same as the initial setup.)

### On DR cluster ###
DELETE my_index

### Create follower index on `DR` to follow from the `Production` cluster
PUT /my_index/_ccr/follow 
{ 
  "remote_cluster" : "production", 
  "leader_index" : "my_index" 
}
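
Optionally, verify that DR is following Production again. A minimal sketch using the cluster-wide CCR stats API:

### On DR cluster ###
GET /_ccr/stats

### follow_stats should list my_index with "production" as the remote cluster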

Step 5: On the client side, manually re-enable ingestion to the Production cluster.

⚠️ Ingestion should only be directed to Production; search queries can be directed to either the Production or DR cluster.

Leaf-Lin added the >enhancement and Team:Docs (Meta label for docs team) labels on Mar 10, 2022
@elasticmachine
Collaborator

Pinging @elastic/es-docs (Team:Docs)

@jasonslater2000

How should .kibana be handled?

@Leaf-Lin
Contributor Author

We removed the auto-follow pattern for system indices in 8.0.0, but you can still create a follower index to replicate a specific leader_index. For example, to replicate .kibana_8.0.0_001 from clusterA to clusterB, execute this on clusterB:

DELETE .kibana_8.0.0_001
PUT .kibana_8.0.0_001/_ccr/follow?wait_for_active_shards=1
{
  "remote_cluster" : "clusterA",
  "leader_index" : ".kibana_8.0.0_001"
}
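
On clusterB, you can then confirm that the system index is being replicated (the exact index name depends on your Kibana version):

GET .kibana_8.0.0_001/_ccr/info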

The DR flow for system indices is the same as in my initial post, except that it should avoid using the _ccr/auto_follow API.

@jasonslater2000

Thanks Leaf. Can we add this to the documentation as well, as it pertains to the DR use case and covers failover/failback scenarios?

Customers in this situation would certainly want to understand what can be synced to their DR cluster vs. what cannot.

@Arnovandevelde

As a follow-up question, what are the effects of setting up .kibana as a follower index? How does this interact with the various configuration elements, and with the restriction on direct access to system indices? For example, does this also work with the task manager, to handle the risk of duplicate tasks/alerts? If the documentation also covered this, so that customers understand the impact, it would go a long way toward the right DR setup in combination with CCR.

@a03nikki
Contributor

a03nikki commented Apr 14, 2022

@Leaf-Lin: Another follow-up question to add to the list:

Will users be able to continue to explicitly follow system indices and complete the steps to convert them to regular indices?

Over the last few months there have been a number of changes that make it more difficult to work with system indices, for example #72815, #63513, and #74212. When our users plan their DR strategies, they want to know how forward-compatible their plans are, as they need to continually patch their deployments. So knowing whether explicit CCR of system indices is planned to be taken away is important.

Leaf-Lin added the :Distributed Indexing/CCR (Issues around the Cross Cluster State Replication features) label on Apr 19, 2022
elasticmachine added the Team:Distributed (Obsolete) label on Apr 19, 2022
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@Leaf-Lin
Contributor Author

  1. On following the .kibana index:

Can we add this to the documentation as well, as it pertains to the DR use case, and covering failover/failback scenarios?

Although the workaround above allows us to replicate this particular system index, it is not ideal. One still needs to manually delete the .kibana indices on the follower cluster before replicating them, and it does not cover situations where the cluster gets upgraded (which changes .kibana to new index names). Furthermore, this workaround may work for .kibana, but it is not guaranteed to work for all system indices. For example, there seems to be little value in replicating the .async-search or .tasks index outside the cluster running those tasks. Unfortunately, there is currently no systematic way to select the right set of system indices to replicate. You can see a similar comment in [1].

what ARE the effects of setting up the .kibana as a follower index? How does this pertain to all kinds of configuration elements as well as not allowing direct access to system indices? For example, does this also work with the taskmanager part to handle the dimension of duplicate tasks/alerts?

These questions are spot-on. For the reasons you have mentioned (plus upgrade handling), the follower cluster must have .kibana set to read-only mode, so the follower will not be able to set up alerts, visualizations, or any other task management features that currently require writes to the .kibana or .kibana_task_manager index.

  2. On system indices + CCR planning:

Will users be able to continue to explicitly follow system indices and complete the steps to convert them to regular indices?

We agree that disaster recovery for system indices today is not well-implemented. I have raised an enhancement request [2], which would require a cross-team effort to address.

I am not aware of any planned changes in the short term.

Footnotes

  [1] https://github.com/elastic/elasticsearch/issues/81750#issuecomment-1075284190

  [2] https://github.com/elastic/elasticsearch/issues/86168

@tlrx
Member

tlrx commented Aug 9, 2022

@Leaf-Lin the doc team is very busy, do you think you can provide the documentation change you proposed?

@tlrx
Member

tlrx commented Aug 9, 2022

@Leaf-Lin the doc team is very busy, do you think you can provide the documentation change you proposed?

Sorry for the noise, I just saw you provided #87099

@shainaraskas
Contributor

resolved by #91491
