[ILM] Allow ILM and CCR to work well together #34648
Comments
Pinging @elastic/es-core-infra |
Pinging @elastic/es-distributed |
@gwbrown and I started to think about a high-level plan for how to make ILM work with CCR. It is quite abstract, since it is not yet known how certain primitives will be implemented in CCR. The high-level plan is built on the following assumptions: ILM is currently targeted at time-series use cases. CCR will add a mechanism to indicate in the leader cluster that a leader index is being followed by one or more indices from another cluster. A follower cluster should indicate this prior to following a leader index in the leader cluster. CCR will also add a mechanism for ILM to specify that a leader index is “done indexing” and will be read-only thereafter; that is, it has left the write period and entered the read-only period. The fact that a leader index is read-only is also replicated to the follower index. Once a leader index has been marked as “done indexing”, follower clusters will continue following until they have replicated all updates, then automatically unfollow (1). CCR in the follower cluster will then indicate to the leader cluster that this follower index no longer follows the leader index. On the leader cluster:
On the follower cluster:
1: This would be an operation that pauses index following, closes the index, unfollows, and then reopens the index. |
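The sequence in footnote 1 can be sketched as an ordered list of REST calls. This is only an illustration of the ordering, not how ILM performs it: the function name `unfollow_requests` and the index name in the usage line are hypothetical, the paths are assumed to be the Elasticsearch pause-follow, close, unfollow, and open endpoints, and nothing is sent over the network.

```python
from typing import List, Tuple


def unfollow_requests(index: str) -> List[Tuple[str, str]]:
    """Return the (method, path) pairs for unfollowing `index`, in order.

    The order matters: following must be paused and the index closed
    before it can be unfollowed, and it is reopened afterwards.
    """
    return [
        ("POST", f"/{index}/_ccr/pause_follow"),  # stop the shard follow tasks
        ("POST", f"/{index}/_close"),             # unfollow requires a closed index
        ("POST", f"/{index}/_ccr/unfollow"),      # convert to a regular index
        ("POST", f"/{index}/_open"),              # make the index usable again
    ]


if __name__ == "__main__":
    # "test-000001" is a made-up index name used only for illustration.
    for method, path in unfollow_requests("test-000001"):
        print(method, path)
```

Keeping the calls as data makes the required ordering easy to inspect before wiring it to a real client.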
We discussed yesterday in more detail what the read only attribute is, in order to let ILM safely operate describe operations in the follower side. ILM can use the readonly action to set the This part of the CCR ILM integration can already be implemented, so we should start soon. |
We discussed using the |
Is this saying that follower indices should never run shrink or delete? If so, I don't think this restriction is necessary, since the index will no longer be following by the time it gets to the warm phase. EDIT: actually I see what this means now. We should not perform destructive operations if the follower index is still following. In this case we should go to the ERROR step. If things have worked as expected, the index would no longer be a follower index but a regular index, so it would be fine to run destructive operations. |
This was discussed yesterday and the conclusion was that this is also a problem in other scenarios. Something like a readonly API should be built that turns an index into a read-only index forever. But that should be tackled outside CCR / ILM.
Well not when it is actively following a leader index. The index would first need to be unfollowed before that could be done. |
👍 |
I agree that we should build a readonly API that makes the index readonly forever, but I don't think using the write block is a good idea in the interim. Users may already be used to using the write block as part of maintenance workflows. If the index reacts to that setting at a time the user does not expect, the results are pretty bad: the index will unfollow the leader once it is up to date, meaning it will no longer be a copy of the leader. I think we should use something other than the write block. |
I see, that makes sense. So instead of relying on |
This is definitely up for discussion, but I think we need something with a clear intention and meaning: that the index is not intended to be written to again. |
In the ILM sync, we decided that we're going to move ahead with a solution as proposed in #35944 ( |
Small comment here - I was thinking that we add a post rollover/pre-shrink wait condition that waits for the index to have no followers. It may take a bit for all the followers to catch up with the leader index and unsubscribe. |
++ I was thinking the same thanks for commenting on it explicitly |
@martijnvg and I discussed this briefly on Slack, and while the high-level concepts for handling ILM and CCR are sound, there are a few options for how we can handle the details of doing so, particularly in the interface that will be used for writing ILM policies for indices which interact with CCR.
Option 1: Add an "unfollow_when_ready" action
This is the simplest option in that it keeps the logic for unfollowing an index contained to one new explicit action. This would likely take the form of a new Hot phase action which could be used in place of Rollover, which would move to the Error step if applied to an index that is not a follower. This would require checks in certain actions (Shrink, possibly Readonly and ForceMerge) to verify that the index is not currently following, as we would not be guaranteed to be able to safely perform those operations otherwise, and may involve the Rollover action verifying that it is not being used on a following index.
Pros:
Cons:
Option 2: Unfollow automatically in the Hot phase or as part of Rollover
This option would either add steps to the Rollover action or inject an action into the Hot phase to automatically unfollow an index once the leader has signaled that indexing is complete. This would allow policies to be more easily reused between leader and follower clusters, and may be more intuitive for the user: once indexing is complete, the follower would automatically decouple itself from the leader and each would proceed with its policy completely independently.
Pros:
Cons:
Option 3: Unfollow automatically before dangerous actions
This option is similar to Option 2, but would perform unfollowing if and only if an action which is unsafe to perform on a following index is specified in the policy, immediately before performing the operation. This would be implemented by adding several steps to each action which cannot be safely performed on a following index (Shrink, possibly Readonly and ForceMerge) to automatically unfollow the index before performing the action.
Pros:
Cons:
Currently, my personal preference is for Option 2, though I think I could easily be swayed to Option 1. I think Option 3 is "too magical" - it's difficult to explain, less predictable, and adds a lot of complexity to the code. |
My preferences largely follow @gwbrown's. Option 3 is quite hard to explain to a user and would add a lot of complexity to the code. Also, I think for most use cases the user will not want to move to the warm phase until replication has finished, meaning that every action would need to unfollow first, and that's essentially Option 2 anyway. I think Option 1 will lead to frustration for users. I'm not sure users will see "Requires different policies to be used on the leader and on the follower" as a pro; they will find it hard to understand why the policies need to differ between the leader and the follower. My preference is therefore for Option 2. Additionally, I think it might be better to have an implicit action rather than having the logic built into the rollover action. My reasoning here is:
|
I agree, option 2 seems to be the clearest. We have a history of injecting implicit |
After talking with @talevy and @martijnvg a bit, I'd like to have a real-time discussion about which option we should go with when people are available to do so. Additionally, @martijnvg uncovered another thing we'll have to make a decision about: CCR does not respect index templates when creating follower indices, so that approach for setting policies on new indices won't work for follower indices. There are a couple of ways we could handle this, which aren't necessarily mutually exclusive:
This is also something I think we should discuss as a team when everyone is available again. I don't think this blocks any work, as we can test for the moment assuming Option 1 very easily, but we do need to make a final decision before shipping. |
This change adds the unfollow action for CCR follower indices. This is needed for the shrink action in case an index is a follower index. It gives the follower index the opportunity to fully catch up with the leader index, pause index following, and unfollow the leader index. After this, the shrink action can safely perform the ILM shrink. The unfollow action needs to be added to the hot phase and acts as a barrier for going to the next phase (warm or delete), so that follower indices are unfollowed properly before indices are expected to go into read-only mode. This allows the force merge action to execute its steps safely. The unfollow action has three steps:
* `wait-for-indexing-complete` step: waits for the `index.lifecycle.indexing_complete` setting of the index in question to be set to `true`.
* `wait-for-follow-shard-tasks` step: waits for all the shard follow tasks for the index being handled to report that the leader shard global checkpoint is equal to the follower shard global checkpoint.
* `unfollow-index` step: actually performs the unfollow. This consists of multiple operations executed on the index being handled: pause index following, close index, unfollow, and open index. (A follower index can only be unfollowed when it is closed, because the underlying engine is changed.)
In the case of the last two steps, if the index being handled is a regular index, the steps act as a no-op. Relates to elastic#34648
Following a Zoom discussion with @jakelandis, @martijnvg, @dakrone, and @talevy, we have come to a decision on the above questions. Regarding how to specify the Unfollow action: In order to give the user maximum flexibility while also maintaining ease of use, the Unfollow action will be available as an explicit action in the Hot, Warm, and Cold phases, and will automatically run before the Shrink action (and in the future, any other actions which require it) [edit: and the Rollover action, see below]. If the index is not a follower index, Unfollow is a no-op, so we do not have concerns about this impacting non-follower indices or problems with policies that specify the Unfollow action multiple times. Regarding policy names on follower indices: We are going to require follower indices to have the same policy name as their leader index for now, while keeping in mind the possibility to add the ability to specify a different policy name to the CCR APIs if and when we determine that this is a needed feature. This would be a non-breaking change and could be easily added at that point. |
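The decision above (explicit Unfollow where the user asks for it, plus an implicit Unfollow before Shrink and Rollover) can be modelled with a small sketch. It is a toy model under stated assumptions, not ILM's code: the function name `with_implicit_unfollow`, the lowercase action names, and the representation of a phase as a flat list of actions are all hypothetical.

```python
from typing import List

# Actions that, per the decision above, get an automatic Unfollow in front of them.
REQUIRES_UNFOLLOW = {"shrink", "rollover"}


def with_implicit_unfollow(actions: List[str]) -> List[str]:
    """Return a copy of `actions` with "unfollow" inserted before each action that needs it.

    If the user already placed an explicit "unfollow" directly before such an
    action, no second one is added. The input list is not modified.
    """
    result: List[str] = []
    for action in actions:
        needs_unfollow = action in REQUIRES_UNFOLLOW
        already_there = bool(result) and result[-1] == "unfollow"
        if needs_unfollow and not already_there:
            result.append("unfollow")
        result.append(action)
    return result


if __name__ == "__main__":
    print(with_implicit_unfollow(["allocate", "shrink"]))
    print(with_implicit_unfollow(["unfollow", "rollover"]))
```

Because Unfollow is described as a no-op on non-follower indices, inserting it unconditionally in this model does not change behaviour for regular indices, which is what lets the same policy be reused on leader and follower.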
IIRC we also discussed having it automatically run before the Rollover action too. If this is not the case we should revisit this part of the discussion. (also +1 on explicit + implicit unfollow) |
Ah, yes, you are correct, I just forgot to write it. Doing that on rollover as well allows for reuse of policies between leader and follower, but not doing it automatically in the Hot phase gives the flexibility to control when in the lifecycle the follower is decoupled. |
This change adds the unfollow action for CCR follower indices. This is needed for the shrink action in case an index is a follower index. It gives the follower index the opportunity to fully catch up with the leader index, pause index following, and unfollow the leader index. After this, the shrink action can safely perform the ILM shrink. The unfollow action needs to be added to the hot phase and acts as a barrier for going to the next phase (warm or delete), so that follower indices are unfollowed properly before indices are expected to go into read-only mode. This allows the force merge action to execute its steps safely. The unfollow action has the following steps:
* `wait-for-indexing-complete` step: waits for the `index.lifecycle.indexing_complete` setting of the index in question to be set to `true`.
* `wait-for-follow-shard-tasks` step: waits for all the shard follow tasks for the index being handled to report that the leader shard global checkpoint is equal to the follower shard global checkpoint.
* `pause-follower-index` step: pauses index following, necessary to unfollow.
* `close-follower-index` step: closes the index, necessary to unfollow.
* `unfollow-follower-index` step: actually unfollows the index using the CCR Unfollow API.
* `open-follower-index` step: reopens the index now that it is a normal index.
* `wait-for-yellow` step: waits for primary shards to be allocated after reopening the index, to ensure the index is ready for the next step.
In the case of the last two steps, if the index being handled is a regular index, the steps act as a no-op. Relates to #34648
Co-authored-by: Martijn van Groningen <martijn.v.groningen@gmail.com>
Co-authored-by: Gordon Brown <gordon.brown@elastic.co>
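The condition checked by the `wait-for-follow-shard-tasks` step can be illustrated with a short sketch. It assumes each shard follow task is represented as a dictionary with `leader_global_checkpoint` and `follower_global_checkpoint` keys, modelled on the CCR follow stats response; the function name `follower_caught_up` is hypothetical and this is not the step's actual implementation.

```python
from typing import Dict, List


def follower_caught_up(shards: List[Dict[str, int]]) -> bool:
    """Return True when every shard follow task reports equal global checkpoints.

    Note: an empty list also returns True (all() over nothing), which in this
    sketch stands in for an index that has no shard follow tasks.
    """
    return all(
        shard["leader_global_checkpoint"] == shard["follower_global_checkpoint"]
        for shard in shards
    )


if __name__ == "__main__":
    # Made-up checkpoint values: shard 1 is still one operation behind.
    stats = [
        {"leader_global_checkpoint": 120, "follower_global_checkpoint": 120},
        {"leader_global_checkpoint": 98, "follower_global_checkpoint": 97},
    ]
    print(follower_caught_up(stats))
```

A single lagging shard is enough to keep the step waiting, which is why the check uses `all` rather than comparing totals across shards.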
All of the tasks listed above are complete & backported to 6.x, which means the outstanding concerns we have around ILM and CCR operating on the same indices have been addressed. |
Given a recent discovery, I'm reopening this until work on #37165 progresses until this item is complete: Until then, the work already done on ILM to utilize shard history retention leases is effectively a no-op. |
If an index is a CCR leader or follower index then the delete and shrink actions should wait before proceeding with any operations. This is to avoid the problems described under the original problem description.
Leader indices
ILM needs to query the indices stats api and check the shard history retention leases in order to determine whether an index is a leader index.
If an index is a leader index then the delete and shrink actions first need to execute the following steps:
`index.lifecycle.indexing_complete` index setting to `true`.
After this it is safe to proceed with any steps that are part of the ILM delete and shrink actions.
Follower indices
ILM needs to check an index's custom index metadata to check whether an index is a follower index.
If an index is a follower index then the shrink action first needs to execute the following steps:
`index.lifecycle.indexing_complete` index setting to be replicated from the leader index.
After this it is safe to proceed with any steps that are part of the ILM shrink action.
Tasks
Original problem description
Currently if a user wishes to use CCR and ILM together on the same index they can run into problems. To help describe these problems, imagine we have two clusters (for this discussion I'm going to call them `leader` and `follower`) and we are using CCR's auto-follow on the follower cluster to follow any indices on the leader cluster matching `test-*`.
Now, because in our scenario we have a time series use case it would also be good to have ILM manage the indices, so we set up a policy on the leader cluster which uses rollover, warm allocation, forcemerge, and shrink. Then we add the policy name to the index template for `test-*`, bootstrap ILM by creating the first index, and now we have ILM working on our leader cluster and managing the `test-*` indices.
Problem 1 - Setting up a policy for the following indices
Having the `test-*` indices managed by ILM on the leader cluster is great, but equally we would like ILM to manage the following indices on the follower cluster too. However, we can't use the exact same policy on the follower cluster because the following index will not have the write alias, and even if it did we don't want the following index to roll over on its own criteria; we want it to mirror the leader index. This means the following index needs an indication that the leader has rolled over and moved to the warm phase, so the following index also knows it can move to the warm phase.
Problem 2 - The leader index and the shrink action
In ILM the shrink action allocates one copy of each shard to a single node, then performs the shrink operation and then deletes the original (un-shrunk) index and sets an alias on the new (shrunken) index with the same name as the original index. This allows the naive user to search the index as if it was still the same index but under the covers the index is a different index.
The problem when combining this with CCR is that the following index may not be completely up to date with the leader index at the point the shrink action is performed. It may suddenly discover that the leader index no longer exists and be unable to progress, since there is no way for it to know that the index is equivalent to the shrunken index on the leader. As a result, the follower and leader clusters are indefinitely out of sync.
One solution to this would be for the un-shrunken following index to delete itself and for there to be a separate auto-follow rule to sync the shrunken indices from the leader. The problem with this is that it requires the follower's shrunken index to be synced from scratch, copying all the same data it already had in the un-shrunken index. This is a waste of resources, but more importantly it means there is a period where the follower cluster will actually be getting further out of sync with the leader, since it has thrown away the un-shrunken index and is waiting to fully sync the shrunken index from the leader.