[Design Proposal] Using Segment Replication for cross-cluster-replication #4090

Open
ankitkala opened this issue Aug 2, 2022 · 4 comments
Labels
enhancement Enhancement or improvement to existing feature or request Indexing:Replication Issues and PRs related to core replication framework eg segrep

Comments

@ankitkala (Member)

The current implementation of CCR uses logical replication, where we replay all of the leader shard's operations on the follower's primary shard. With the ongoing effort for Segment Replication, a local replica simply syncs the segments stored on disk from the primary shard, offering significantly better throughput (documented here (#2229)). This document proposes the design for Cross Cluster Replication using Segment Replication.

Why Segment Replication for CCR

Pros:

  • Lower CPU utilisation on the follower cluster, as the follower doesn't need to execute the same operations again.
  • Pause and resume:
    • Ability to pause & resume even after 12 hours: With logical replication, we only support resume within 12 hours (the retention lease TTL). After 12 hours, since the translog operations on the leader shard won't be available, the user has to delete all the data on the follower and restart the replication. This issue becomes obsolete with Segment Replication.
    • Easy pause/resume on ingestion-heavy clusters: With logical replication, after resume, the follower cluster can have a huge backlog of translog operations to replay and can struggle to keep up with the leader. It can also make the cluster unstable, as the operations are fetched on the follower and executed there. With segment replication, we can resume anytime and the follower cluster will only request the diff in segments.
  • Lower overhead on the leader cluster: With Segment Replication and remote storage integration, CCR can work with almost zero overhead on the leader cluster, where the follower communicates directly with the remote store for fetching the segments.

Cons:

  • Higher network utilisation due to Segment merges.
  • Might not be able to do Active-Active replication with Segment Replication. We can still keep supporting logical replication for this though.
  • Reliance on refresh interval can affect the overall performance.

Design Tenets

  • Do not break logical replication. All changes should be backward compatible across different OS versions.
  • Do not add additional APIs for Segment Replication. From the customer's POV it should be just an internal implementation swap. Responses for select APIs might have to change though (particularly for stats/status APIs).
  • Reduce the coupling between the CCR Plugin and OS Core. The contract between the CCR Plugin and OS core should be strict and hard to break. Keep the low-level logic in OS core as much as possible. This will also decouple the efforts for Segment Replication and the migration of CCR to core.
  • The CCR implementation should be easy to maintain. With 2 separate implementations in CCR, code logic can become complicated very easily. Strive to keep most of the logic ReplicationType agnostic. For components requiring ReplicationType-aware logic, segregate such logic into ReplicationType-specific sub-packages.

How Segment Replication (local) works
Segment Replication is triggered on primary shard refresh. Upon refresh, all replica shards are notified with a replication checkpoint (seqno of the latest doc, latest commit generation, and primary term).
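The checkpoint described above can be modelled as a small immutable value object. This is a hypothetical sketch; the field and method names here are illustrative, not the actual OpenSearch class:

```java
// Sketch of the data carried in a replication checkpoint: the seqno of the
// latest doc, the latest commit generation, and the primary term.
// (Illustrative names, not the real ReplicationCheckpoint class.)
class CheckpointSketch {
    final long seqNo;
    final long commitGen;
    final long primaryTerm;

    CheckpointSketch(long seqNo, long commitGen, long primaryTerm) {
        this.seqNo = seqNo;
        this.commitGen = commitGen;
        this.primaryTerm = primaryTerm;
    }

    /** A replica only acts on a checkpoint that is ahead of its local state. */
    boolean isAheadOf(CheckpointSketch other) {
        return this.primaryTerm > other.primaryTerm
            || (this.primaryTerm == other.primaryTerm && this.seqNo > other.seqNo);
    }
}
```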

For each notification, the replica shard performs the following operations:

  1. Request the segment metadata from the primary shard.
  2. Figure out the diff against its current segments and request the new segments.
  3. After receiving the segments, the replica performs validation, stores the segments, removes unreferenced segments, and finally refreshes the DirectoryReader with the new segmentInfos so that changes are available for search. This is achieved by creating a dedicated NRTReplicationEngine for replica shards.
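Step 2 above is essentially a metadata diff: the replica compares the primary's segment file listing against its own and fetches only what is missing or changed. A simplified sketch, assuming metadata is a file-name-to-checksum map (real segment metadata also carries file lengths and is validated after copy):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Simplified sketch of the diff computation: given the primary's metadata
// (file name -> checksum) and the replica's, return the files to fetch.
class SegmentDiff {
    static List<String> missingFiles(Map<String, String> primaryMeta,
                                     Map<String, String> replicaMeta) {
        List<String> missing = new ArrayList<>();
        for (Map.Entry<String, String> e : primaryMeta.entrySet()) {
            String local = replicaMeta.get(e.getKey());
            // fetch files the replica lacks, or whose checksum differs
            if (local == null || !local.equals(e.getValue())) {
                missing.add(e.getKey());
            }
        }
        return missing;
    }
}
```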

(Diagram "How it works", taken from the linked Segment Replication design proposal.)

Proposal for CCR Segment Replication:

(Diagram: SegRepCCR)

CCR replication type selection logic:
We aren't planning to expose this as a choice to the customer. CCR will simply mimic the replication model used by the leader's primary and replicas. So if the leader cluster relies on segment replication for its replicas, CCR will also use segment replication.
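The "mimic the leader" rule above can be sketched as a simple lookup on the leader index's settings. This is a hedged sketch: the setting key and helper names are assumptions for illustration, not the actual CCR code:

```java
import java.util.Map;

// Hypothetical sketch of the selection logic: the follower exposes no
// choice; it mirrors the replication type configured on the leader index.
enum ReplicationType { DOCUMENT, SEGMENT }

class ReplicationTypeSelector {
    // "index.replication.type" is the leader index setting consulted here;
    // anything other than SEGMENT falls back to logical (document) replication.
    static ReplicationType forFollower(Map<String, String> leaderIndexSettings) {
        String type = leaderIndexSettings.getOrDefault("index.replication.type", "DOCUMENT");
        return "SEGMENT".equals(type) ? ReplicationType.SEGMENT : ReplicationType.DOCUMENT;
    }
}
```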

Deviation from existing Segment Replication:

  • For local segment replication, the replica shard triggers the replication event using IndexEventListeners to listen for refresh on the primary. This requires bi-directional communication between primary and replica. For cross cluster replication, we will keep using a uni-directional connection (follower to leader) and will model the implementation as a pull mechanism, where the follower shard periodically polls the leader for the checkpoint and triggers the segment copy.
  • Segment replication assumes that the NRTReplicationEngine is used only for replica shards, which won't be true for cross cluster cases, as the follower's primary shard would now also rely on the same engine implementation.
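The uni-directional pull described in the first bullet can be sketched as a periodic poll loop on the follower: fetch the leader's latest checkpoint and trigger a segment copy only when it advances. The names below are hypothetical; a real implementation would hook into SegmentReplicationTargetService rather than raw callbacks:

```java
import java.util.function.LongConsumer;
import java.util.function.LongSupplier;

// Hypothetical sketch of the follower-side pull loop. Instead of the primary
// pushing a checkpoint on refresh (as in local segrep), the follower polls
// the leader over the existing follower->leader connection.
class FollowerPoller {
    private long lastSeenSeqNo = -1;

    /**
     * One poll iteration: fetch the leader checkpoint (a follower->leader
     * call) and start a segment copy if it is newer than what we have.
     * Returns true if a copy was triggered.
     */
    boolean pollOnce(LongSupplier fetchLeaderCheckpoint, LongConsumer startSegmentCopy) {
        long leaderSeqNo = fetchLeaderCheckpoint.getAsLong();
        if (leaderSeqNo > lastSeenSeqNo) {
            startSegmentCopy.accept(leaderSeqNo);  // reuse the local segrep copy path
            lastSeenSeqNo = leaderSeqNo;
            return true;
        }
        return false;  // nothing new; wait for the next poll interval
    }
}
```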

Compatibility with Segment Replication and Remote Storage integration:
After this integration, replica shards will sync segments directly from the remote store instead of the primary shard. We'll need to build additional support in CCR so that follower cluster shards can sync the data from the leader's remote store.
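One way to keep this pluggable is to put the "where segments come from" decision behind a single abstraction, with one implementation per source. This is a sketch under assumed names (the real SegmentReplicationSource interface in core differs):

```java
import java.util.List;

// Hypothetical sketch: abstract the origin of segment files so the follower
// can fetch from the leader's primary today and from the leader's remote
// store after the integration, without changing the rest of the flow.
interface SegmentCopySource {
    String describe();
    List<String> fetchFiles(List<String> fileNames);
}

class LeaderPrimarySource implements SegmentCopySource {
    public String describe() { return "leader-primary"; }
    public List<String> fetchFiles(List<String> fileNames) {
        // would issue cross-cluster transport calls to the leader's primary shard
        return fileNames;
    }
}

class LeaderRemoteStoreSource implements SegmentCopySource {
    public String describe() { return "leader-remote-store"; }
    public List<String> fetchFiles(List<String> fileNames) {
        // would download the same files directly from the leader's remote store,
        // taking the leader cluster out of the data path entirely
        return fileNames;
    }
}
```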

@ankitkala ankitkala added enhancement Enhancement or improvement to existing feature or request untriaged labels Aug 2, 2022
@ankitkala (Member Author) commented Aug 2, 2022


@minalsha minalsha removed the Plugins label Aug 16, 2022
@Bukhtawar (Collaborator)

Thanks @ankitkala, a few initial thoughts:

  1. For cross cluster replication, we will keep using uni-directional connection(follower to leader) and will model the implementation as a pull mechanism where follower shard periodically polls the leader for checkpoint and trigger the segment copy.

Maybe we need to discuss this further where we could think more about a pub-sub based mechanism to let the follower cluster know about a checkpoint post which the follower can start fetching data from leader. We could reuse the same pub-sub model for remote store integration to pull data directly from remote store.

  2. The API flow from SegmentReplicationTargetService#onNewCheckpoint (follower) -> SegmentReplicationSourceService (leader) seems out of place, since this is the opposite of how control flows in a local cluster. The concern here is we might just be force-fitting the CCR use case into segrep.

  3. Rather than having a LeaderClusterSegmentReplicationSource in core, we should think about extension points that CCR could leverage.

@ankitkala (Member Author)

Maybe we need to discuss this further where we could think more about a pub-sub based mechanism to let the follower cluster know about a checkpoint post which the follower can start fetching data from leader. We could reuse the same pub-sub model for remote store integration to pull data directly from remote store.

The bi-directional cross cluster connection is harder to maintain. Currently we can't support replication if the leader is on a higher OS version than the follower, as the segments might not be readable on an older Lucene version. With a bi-directional connection, the user would be able to create replication in both directions, which would create a cyclic dependency for version upgrades (technically the same issue is relevant for segment replication from primary to replica, and there is still no clear way forward).
I'll need to spend some more time on this and then can come up with a list of pros and cons to consider when making this design choice.
Like we discussed, I'll explore if it's possible to have this bi-directional connection while still having a clear distinction between leader and follower, so this issue of cyclic dependencies can be avoided.
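The version constraint above amounts to a guard the follower would have to enforce before allowing replication: segments written on a newer version may be unreadable on an older one. A minimal sketch, assuming versions are compared by major/minor only:

```java
// Hypothetical sketch of the version guard described above: replication is
// only allowed when the leader's version is not ahead of the follower's,
// since newer-Lucene segments may be unreadable on an older follower.
class VersionGuard {
    static boolean canFollow(int leaderMajor, int leaderMinor,
                             int followerMajor, int followerMinor) {
        if (leaderMajor != followerMajor) {
            return leaderMajor < followerMajor;
        }
        return leaderMinor <= followerMinor;
    }
}
```

With a bi-directional connection, both sides would act as leader for some indices, so this guard could not be satisfied in both directions across an upgrade, which is the cyclic-dependency problem noted above.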

The API flow from SegmentReplicationTargetService#onNewCheckpoint(follower) -> SegmentReplicationSourceService(leader) seems out of place since this is the opposite of how control flows in a local cluster. The concern here is we might just be force-fitting CCR use case into segrep.

The way I look at it, we'll refactor and re-use most of the logic that exists for local segment replication. The only difference would be the entry point: the follower would poll periodically and invoke onNewCheckpoint, while local replicas would rely on the refresh-driven checkpoint instead.
But yeah, for the rest of the logic, we'll have to align with the long-term state of local segment replication.

Rather than having LeaderClusterSegmentReplicationSource in core we should think about extension points that CCR could leverage

If CCR moving to core is the end state we want, then it makes sense to add the new logic directly in core rather than in CCR. Otherwise, we might end up rewriting these transport actions again during the migration.
