[Remote Segment Store] Failure handling #3906
Comments
During peer recovery, we may need to handle cases where the primary source (relocating) and the new primary target (initialising) have the same primary term. We would need to guarantee that they don't concurrently upload to the same remote store location, which would cause conflicts or corruption. Uploads to the remote store from both nodes should be strictly serialised. To make sure segrep handles these potential cases, I opened #3923.
@Bukhtawar Thanks for pointing out the peer recovery use case. Using only the primary term as a disambiguator may not work. The idea is to make sure that a segment file is never overwritten in the remote segment store. We also need a mechanism to compare files based on primary term and generation in order to resolve conflicts. This is what I am thinking. Prefix (or suffix) to a segment file name stored in the remote segment store:
NodeId will be used only as a disambiguator and will not play any role in resolving conflicts.
Thanks Sachin, I'm on the fence for using just the Having said that, we can always have unique segment names (preferably using a UUID) on the same primary term to avoid conflicting file names, with the segments_N and/or metadata file referring to them.
@Bukhtawar Using UUID makes sense and avoids all the corner cases (known and unknown). Will use UUID.
Goal
Identify failure scenarios that can occur while uploading data to or restoring data from the remote segment store, and propose a solution to handle these failures. To provide durability guarantees, we need to ensure the persistence and integrity of data in the remote store.
Invariant
The following is the invariant that we track for the remote store feature (remote translog + remote segment store):
At any given time, Remote Segment Store and Remote Translog will contain all the indexed data (Data Completeness) and with the data present in these stores, a Lucene index can be successfully built (Data Integrity).
Scope
In this doc, we discuss failure scenarios that can occur while working with the remote segment store, although these scenarios may not be limited to the remote segment store. For some of the scenarios, we assume a specific behaviour of the remote translog and of segment replication. Failure scenarios specific to the remote translog or to segment replication are not discussed.
Existing Flow
This flow works fine if the following hold true (let’s call them the Happy Flow Assumptions):
Failure Handling
Any deviation from the happy flow assumptions mentioned above can create completeness or integrity issues for the data stored in the remote segment store. A deviation will not always create an issue, but it can lead to one under certain circumstances. These failure scenarios are listed in the section:
Appendix: Failure Scenarios
Failure Buckets
The following are the deviations from the happy flow assumptions. The corresponding failure scenarios are explained in a later section. Even though replicas do not directly interact with the remote segment store, in the case of fail-over, when a replica is promoted, the state of the replica at that time can create issues.
Potential Approaches
Based on the above failure scenarios, data in the remote segment store will be either missing (the remote store is lagging and the primary goes down) or corrupted (an uploaded segment file is corrupted, or an existing segment file is overwritten by another primary).
To maintain the invariant, the solution for writing data to and reading data from the segment store needs to consider the following:
Handling Missing Data
Here, the remote segment store needs to work with the remote translog to make sure that indexed data is always present in the remote store.
Handling Corrupt Data
Data corruption means that a Lucene index can’t be built from the data in the remote segment store. It can happen for various reasons:
Recommended Approach
High Level Idea
Filename Conventions
Add a <primary_term>_<checkpoint generation>_<UUID> suffix (or prefix) to the segment filename before uploading.
Remove the <primary_term>_<checkpoint generation>_<UUID> suffix before downloading the segment to the local filesystem.
Algorithm Steps
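As a rough illustration of how the upload and download steps could apply the filename convention above, here is a minimal sketch. It is not the OpenSearch implementation: the "__" separator, the RemoteSegmentName type, and the helper names to_remote and parse_remote are all hypothetical, and the checkpoint generation is assumed to be a non-negative integer. Following the discussion in the comments, the UUID only prevents overwrites; conflict resolution compares primary term and generation.

```python
import re
import uuid
from dataclasses import dataclass

# Hypothetical separator between the local filename and the suffix.
SEPARATOR = "__"

# <local name>__<primary_term>_<generation>_<UUID>
_REMOTE_NAME_RE = re.compile(
    r"^(?P<name>.+)__(?P<term>\d+)_(?P<gen>\d+)_(?P<uuid>[0-9a-f-]{36})$"
)


@dataclass(frozen=True)
class RemoteSegmentName:
    local_name: str    # name on the local filesystem, e.g. "_0.cfs"
    primary_term: int
    generation: int    # assumed: checkpoint generation as an integer
    unique_id: str     # UUID, used only to avoid overwriting files

    def remote_name(self) -> str:
        """Name to use when uploading to the remote segment store."""
        return (
            f"{self.local_name}{SEPARATOR}"
            f"{self.primary_term}_{self.generation}_{self.unique_id}"
        )

    def sort_key(self):
        """Ordering for conflict resolution; the UUID is deliberately ignored."""
        return (self.primary_term, self.generation)


def to_remote(local_name, primary_term, generation):
    """Attach a fresh suffix before upload so no existing file is overwritten."""
    return RemoteSegmentName(local_name, primary_term, generation, str(uuid.uuid4()))


def parse_remote(remote_name):
    """Split a remote name so the suffix can be removed before download."""
    match = _REMOTE_NAME_RE.match(remote_name)
    if match is None:
        raise ValueError(f"not a suffixed remote segment name: {remote_name!r}")
    return RemoteSegmentName(
        match["name"], int(match["term"]), int(match["gen"]), match["uuid"]
    )
```

With this sketch, two primaries that upload the same local file under the same primary term and generation still produce distinct remote names, and parse_remote(...).local_name gives the name to write on the local filesystem during restore.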
On refresh
On restore
On failover
Periodic Jobs
<ToDo: Add details on how the recommended approach takes care of all the failure scenarios mentioned in this issue>
Appendix: Failure Scenarios
The following list is not an exhaustive list of all failure scenarios, but it describes the types of failures that can occur. Each scenario can be broken down further into more specific failure scenarios.