[Segment Replication] Support shard promotion. #2212
How do you handle duplicate file names after replica shard promotion?
Hi @hydrogen666, thanks for the question! In the scenario you describe, once the tlog recovery completes on A it would publish a new checkpoint notification to replicas. The published checkpoint data also includes a primaryTerm field. When this is incremented, the replicas would process this checkpoint and fetch the new segments from the primary, even if the latest sequence numbers match. The copy process starts with replicas fetching a list of the latest segment metadata available on the primary and then computing a diff against their own that includes different and missing files. In this case A' will recognize its version of _4.si is different and replace it with the primary's version. With that said, I haven't yet put together a failover section to be included in the proposal on what failover will look like with segrep. If you have suggestions on how you think this should work, please share them, thanks!
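For illustration, here is a minimal sketch of that metadata-diff step. The FileMetadata and SegmentDiff types below are hypothetical stand-ins, not the actual OpenSearch Store classes; the idea is only that the replica fetches the primary's file metadata and classifies each file as missing or different before copying.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical per-file metadata as reported by the primary (name, length, checksum).
record FileMetadata(String name, long length, String checksum) {}

class SegmentDiff {
    final List<String> missing = new ArrayList<>();    // files the replica does not have at all
    final List<String> different = new ArrayList<>();  // files whose length/checksum differ (e.g. A's _4.si)

    static SegmentDiff compute(Map<String, FileMetadata> primary, Map<String, FileMetadata> replica) {
        SegmentDiff diff = new SegmentDiff();
        for (FileMetadata p : primary.values()) {
            FileMetadata local = replica.get(p.name());
            if (local == null) {
                diff.missing.add(p.name());
            } else if (local.length() != p.length() || !local.checksum().equals(p.checksum())) {
                diff.different.add(p.name()); // must be re-fetched and replaced, never overwritten in place
            }
        }
        return diff;
    }
}
```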
Hi @mch2, I can only think of these ways to deal with it:
These are awesome questions about segment replication! This is indeed a challenging situation for segrep, but it is solvable with one of the three proposed options. Lucene is fundamentally a write-once (at the OS filesystem layer) index, so we must ensure that no node (primary, or a replica that was just newly promoted to primary) ever attempts to overwrite an existing index file on any of the existing/prior replicas.

Is solution 3) too hard to do? I would think distributed consensus algorithms could do the right thing here -- all replicas that are currently reachable could respond with how far they had replicated, and we pick the furthest one to promote to the new primary? Any detached (network partitioning) replicas that come online too late must possibly delete "future" segment files before re-joining the cluster, which should be fine since those detached replicas are not in use...

Also note that scan/scroll is a weak implementation in ES/OS today, with the hard routing back to a single replica which might drop offline -- with segrep, any replica can accurately/precisely service the scan/scroll request (since all replicas share precise point-in-time views of the shard), and hold that point-in-time view allocated (i.e. the replica cannot delete those segments while it keeps an open reader on them).

With segrep, every replica can search accurate point-in-time snapshots of the index, unlike ES with its inefficient document replication today, where every replica is searching "slightly" different point-in-time views of the index. This can make for very accurate pagination, versus today where users paging through results might easily see duplicates on page N+1, or might also miss results, eroding trust.

Failing that (if option 3 is too hard), I think option 2 is best -- bump the next segment number to a number clearly larger than any replica might have already copied. Lucene uses a simple monotonic counter to name new segments, so there is plenty of room to jump it forward.
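A minimal sketch of option 2, under the assumption that each replica can report the highest segment counter it has seen during failover. The method name and the safety margin are hypothetical; the counter is simply modeled as a long here.

```java
import java.util.Collection;

class SegmentCounterPromotion {
    /**
     * On promotion, advance the segment naming counter past anything the old
     * primary could have written and anything any reachable replica has copied.
     * The margin guards against replicas we could not reach during failover.
     */
    static long promotedSegmentCounter(long localCounter, Collection<Long> reportedReplicaCounters, long safetyMargin) {
        long max = localCounter;
        for (long c : reportedReplicaCounters) {
            max = Math.max(max, c);
        }
        return max + safetyMargin; // new primary starts naming segments from here, so names never collide
    }
}
```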
Noting here that there's a Point in Time API design issue and a corresponding PR to incrementally improve the scroll shortcomings in OpenSearch using PIT, but it will most certainly need to be refactored under segrep. Thinking out loud (I haven't mulled this one over as much), I like option 3, but it seems like a lot of coordination and we'll certainly want to benchmark? Option 2 might be an "easier" near-term implementation, and we can migrate to the coordinated approach in a follow-on major release.
Also thinking out loud, I do kind of like option 2 as a good guarantee that we'll never reuse a segment number no matter what weird distributed-system edge case is encountered. We could add the logic to choose a new primary based on which replica has made the most progress as an optimization, while still keeping the segment high-water-mark logic. It would allow the "choose farthest-ahead replica" logic to be best effort, as opposed to being critical for correctness.
Thanks all for your thoughts here. I like the idea of using both 2 and 3. I don't think implementing 3 will be all that hard, because each replica already updates its local checkpoint and publishes an updated global checkpoint to cluster state after a copy event completes. I wonder if we could reuse that global checkpoint mechanism in reverse to fetch the furthest-ahead replica.
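As an illustration of that idea, a sketch of picking the furthest-ahead in-sync replica from published checkpoints. ReplicaCheckpoint and its fields are hypothetical stand-ins for values replicas already report, not actual OpenSearch types.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical snapshot of what a replica last published.
record ReplicaCheckpoint(String allocationId, long primaryTerm, long segmentInfosVersion, long seqNo) {}

class FailoverTargetChooser {
    /** Prefer the highest primary term, then the most advanced segment checkpoint, then seqNo. */
    static Optional<ReplicaCheckpoint> choose(List<ReplicaCheckpoint> inSyncReplicas) {
        return inSyncReplicas.stream()
            .max(Comparator.comparingLong(ReplicaCheckpoint::primaryTerm)
                .thenComparingLong(ReplicaCheckpoint::segmentInfosVersion)
                .thenComparingLong(ReplicaCheckpoint::seqNo));
    }
}
```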
Thanks for calling this out. For this guarantee, I'm assuming we will need to block primaries from opening a new reader until we have confirmed all replicas have received and opened the latest copy? Our implementation isn't doing this right now and only initiates a copy event after the primary has refreshed. I wonder if this is something worth gating with an additional setting?
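If such gating were added, one hypothetical shape for it might look like the sketch below. The class, the ack flow, and any setting that would enable it are invented for illustration; nothing here is an existing OpenSearch API.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical gate: hold the primary's new reader until replicas ack the copy event. */
class RefreshGate {
    private final Set<String> pendingReplicas = ConcurrentHashMap.newKeySet();
    private final CountDownLatch allAcked;

    RefreshGate(Set<String> replicaAllocationIds) {
        pendingReplicas.addAll(replicaAllocationIds);
        allAcked = new CountDownLatch(replicaAllocationIds.size());
    }

    /** Called when a replica reports it has received and opened the checkpoint's segments. */
    void onReplicaAck(String allocationId) {
        if (pendingReplicas.remove(allocationId)) {
            allAcked.countDown();
        }
    }

    /** Primary waits (bounded) before exposing the new reader; false means the wait timed out. */
    boolean awaitReplicas(long timeoutMillis) throws InterruptedException {
        return allAcked.await(timeoutMillis, TimeUnit.MILLISECONDS);
    }
}
```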
Is the segment number the only factor in deciding which replica to promote? Do we need to take shard allocations, node load, etc. into consideration?
Hmm no, that should not be necessary? You should instead use Lucene's
Do Primaries handle searching too? Or are they dedicated to indexing (which would give awesome physical isolation of the JVMs that do searching versus indexing, but would typically require more total cluster resources)? If Primaries do searching too, then I agree we may want to take shard allocations / indexing rate across those shards into account. We have some freedom to promote a Replica on a relatively underutilized node, instead of a Replica on a node that is already doing too much indexing/searching? In fact, this should be a powerful freedom for balancing load across the cluster ... we may want to do it (promote a different Replica as Primary for this shard) proactively (not just during node-failure cases) to keep the cluster load balanced?
This sounds like a better mechanism to use within the ... I think we should look at refactoring away from the old Scroll scaffolding and move toward using this mechanism in that PR?
For now, yes. This was discussed for "priority queries" (e.g., alerting use cases) where the primary would have documents available for search before replicas.
This is the desired next step: to isolate the reader/writer JVMs. At a minimum, I think we'll want to look at leveraging some of the concurrency improvements being made in Lucene to help in decoupling the indexing- and search-time geometries?
/cc @vigyasharma to join the discussion!
Excellent discussion here, with some really good ideas and hard questions about segment replication and failover scenarios. I love the overall idea, and mostly agree with the approach being discussed here. My early thoughts on this are --

++ on the idea of shard promotion, not just for failure recovery, but also for load balancing. For example, consider a 2-node cluster with 1 pri and 1 rep per index. If one node drops, all replicas on the single remaining node get promoted to primaries. When the failed node comes back (or is replaced), it will get all the replica copies for every shard (based on current shard recovery logic). This leaves us with a 2-node cluster with all primaries on one node and all replicas on the other. This is okay in document replication, since pri and rep have roughly the same load, but it would lead to a very skewed cluster in segment replication. The only way to balance such a cluster (since there are only 2 nodes) is to demote some pri to rep on node 1 and promote the corresponding rep on node 2.
Re: the failover protocol, I agree with @mch2 - using checkpoints to identify the replica with the most progress should be doable. For option (2), around making the newly promoted primary jump segment numbers to avoid overlap - I wonder if it can create some confusion (either in the protocol implementation or in manual debugging) about whether some segments got dropped or missed.
With segrep, CPU and IO consumption on the Primary is much higher than on a Replica. Do we need to provide a new balancing algorithm for the master node that takes Primary shard balancing into consideration? For example, if one data node goes down and comes back, all shards on that data node will be Replica shards.
Today there is no differentiation in the shard balancer algorithm in terms of placing primary and replica copies; both have equal weights. @hydrogen666 yes, you are correct, we would most likely need the master to load-balance primaries and replicas. This, however, can be solved in general via a placement algorithm that can factor in shard heat (resource consumption) while also aiming at maximizing throughput. I really like @vigyasharma's approach of swapping primary and replica atomically; however, we would need to think about a graceful mechanism to achieve it, e.g. in-flight requests might not be able to replicate if roles are changed atomically on the fly, existing peer recoveries might have to be gracefully handed off, etc.
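To make the primary-aware balancing concrete, here is a toy weight function in the spirit of a shards-per-node balancer, with an extra primaryBalanceFactor term. The names and structure are hypothetical and are not how the actual OpenSearch balancer is configured; the sketch only shows how primaries could be given their own weight so segrep hot spots spread out.

```java
/** Toy per-node weight: add a primary-count term so segrep primaries spread across nodes. */
class PrimaryAwareWeight {
    final float shardBalanceFactor;    // existing notion: total shards per node
    final float indexBalanceFactor;    // existing notion: shards of one index per node
    final float primaryBalanceFactor;  // hypothetical: primaries per node (hot under segrep)

    PrimaryAwareWeight(float shard, float index, float primary) {
        this.shardBalanceFactor = shard;
        this.indexBalanceFactor = index;
        this.primaryBalanceFactor = primary;
    }

    /** Higher weight means the node is more loaded relative to the cluster average. */
    float weight(int shardsOnNode, float avgShardsPerNode,
                 int indexShardsOnNode, float avgIndexShardsPerNode,
                 int primariesOnNode, float avgPrimariesPerNode) {
        return shardBalanceFactor * (shardsOnNode - avgShardsPerNode)
             + indexBalanceFactor * (indexShardsOnNode - avgIndexShardsPerNode)
             + primaryBalanceFactor * (primariesOnNode - avgPrimariesPerNode);
    }
}
```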
Also note that in ES (document replication), every Replica is as hot as the Primary, whereas with segrep only the Primary is that hot -- Replicas are much cooler (just doing a periodic rsync plus lighting up the new segments for searching). And so the existing shard allocation system that ES uses will be no worse in OS/segrep; it's just that many nodes will be substantially under-loaded since they are no longer doing the wasteful indexing operations.
Agreed. While this is true for CPU, I/O consumption on the primary is expected to be higher than on a replica due to segment transfer at every refresh/merge. Network bandwidth on the node hosting primaries will eventually be constrained once more replicas get assigned on other nodes.
Thinking about this more, having primaries only do indexing can really simplify the whole setup and eliminate the need for shard promotion. When a primary goes down, the leader could just allocate a new, empty primary on any available node. Since primaries won't do searches, this new shard doesn't need to be up to date with the latest segments created. I believe it would be a rather fast recovery model - just allocate a new empty primary and connect all replicas to it. No need to find the latest and greatest replica shard. In terms of data loss, this is no worse than the currently proposed model, since all segments that had not yet been copied off the failed primary would be lost in either model.

One downside I see is that users now have to create an indexing shard (primary) and a search shard (replica), and costs can increase for small clusters. But given the simplicity of this recovery model, and the benefits of index/search compute separation, I think it has a lot of value for larger, more serious use cases. We could even keep both models and give users the option to enable this in index settings... like a
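To outline that recovery flow, a very rough sketch under the assumptions of the comment above (indexing-only primaries, replicas only search). The ClusterService interface and every method on it are invented for illustration and do not correspond to actual OpenSearch APIs.

```java
/** Hypothetical outline of the "allocate an empty indexing primary" failover flow. */
class EmptyPrimaryFailover {
    interface ClusterService {
        String pickLeastLoadedNode();
        void allocateEmptyPrimary(String shardId, String nodeId, long newPrimaryTerm);
        void pointReplicasAt(String shardId, String nodeId); // replicas keep serving searches meanwhile
    }

    static void onPrimaryFailure(ClusterService cluster, String shardId, long oldPrimaryTerm) {
        // No need to find the furthest-ahead replica: the new primary starts empty,
        // bumps the primary term, and replicas simply start following it.
        String node = cluster.pickLeastLoadedNode();
        cluster.allocateEmptyPrimary(shardId, node, oldPrimaryTerm + 1);
        cluster.pointReplicasAt(shardId, node);
    }
}
```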
In this decoupled index/search world, cluster load balancing can also be done without shard promotions and relocations. The leader could put the current primary on ...

For the above, and as a general rule, I feel that segment replication needs to have the ability to identify segments that come from different primaries. As mentioned above, I feel leveraging ...
Reviving this issue as the first phase of segrep is nearly merged into main, tracked with #2355. When that issue is complete we will have a basic implementation with primary shards as the source of replication (all functionality is in the POC branch feature/segment-replication). Thanks for all the input here. I'd like to convert this issue to a meta issue and split the work of supporting failover into a few smaller issues. Here is what I'm thinking...
The leader does something similar today to choose which nodes have an up to date copy to assign a primary to. We might need to modify this to factor in checkpoints.
It appears the shard promotion logic is also applied eagerly by the coordinator when a node is identified as faulty by the FollowersChecker. The coordinator uses the NodeRemovalClusterStateTaskExecutor task, which fails shards using RoutingNodes.failShard, which in turn chooses the replica with the highest node version.
More details in #3988
Hello @mikemccand / @mch2, I would like to understand, in the context of the below proposals, how shard promotion (leader election) works. Why not consider a distributed consensus algorithm like
What happens when the primary shard dies first inside the cluster and the newly promoted primary shard has the same segment number? How does the promotion happen?
Using a distributed consensus algorithm like
Support failover scenario of a replica re-assigned as a primary.
Right now this flow fails because of how we wire up the engine as readOnly by setting IndexWriter to null.
Issues remaining (updated 7/22):