-
Notifications
You must be signed in to change notification settings - Fork 24.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TSDB: Implement Downsampling ILM Action for time-series indices #87269
Conversation
Pinging @elastic/es-data-management (Team:Data Management) |
Pinging @elastic/es-analytics-geo (Team:Analytics) |
1 similar comment
Pinging @elastic/es-analytics-geo (Team:Analytics) |
Hi @csoulios, I've created a changelog YAML for you. |
Fix tests
Removed the `config` object and added `fixed_interval` param right in the `rollup` action
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the changes Christos! I think this generally looks good to me, I left a few more comments and I think once we handle the rollup failure this will be ready.
x-pack/plugin/core/src/main/java/org/elasticsearch/xpack/core/ilm/RollupILMAction.java
Outdated
Show resolved
Hide resolved
x-pack/plugin/core/src/main/java/org/elasticsearch/xpack/core/ilm/RollupILMAction.java
Outdated
Show resolved
Hide resolved
WaitForIndexColorStep waitForGreenIndexHealthStep = new WaitForIndexColorStep( | ||
waitForGreenRollupIndexKey, | ||
copyMetadataKey, | ||
ClusterHealthStatus.GREEN, | ||
(indexName, lifecycleState) -> lifecycleState.rollupIndexName() | ||
); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this needs to be a bit more complicated, for example, in shrink
we use ClusterStateWaitUntilThresholdStep
to make the shrink "retryable", and to match that, I think this should be something like:
ClusterStateWaitUntilThresholdStep rollupAllocated = new ClusterStateWaitUntilThresholdStep(
new WaitForIndexColorStep(
waitForGreenRollupIndexKey,
copyMetadataKey,
ClusterHealthStatus.YELLOW,
(indexName, lifecycleState) -> lifecycleState.rollupIndexName()
),
cleanupRollupIndexKey
);
I also changed this to "yellow" because I don't think it's necessary for us to wait the rollup to be green for 12 hours (in a single-node cluster for instance) before retrying.
I'm not 100% sure of all the failure scenarios of the rollup action, so what do you think—does this make sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It makes sense. I only wonder what would happen if rollup index is recovered (status turns yellow), we delete the source index and then we lose one of the nodes before the rollup index is replicated?
Is this scenario possible? Will it lead to data loss?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we delete the source index and then we lose one of the nodes before the rollup index is replicated? Is this scenario possible? Will it lead to data loss?
That is possible, but waiting for green would mean we are subject to waiting forever for single-node clusters, which is also out I think.
} | ||
|
||
// Here is where the actual rollup action takes place | ||
RollupStep rollupStep = new RollupStep(rollupKey, waitForGreenRollupIndexKey, client, fixedInterval); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The RollupStep
will fail when the rollup target index exists but doesn't have a "SUCCESS" status, but what we really need to do in this step is to go back to the cleanupRollupIndexKey
so that the index can be deleted and we can try the rollup again. I think this should be similar to a branching step then, so that the failure scenario can provide a custom StepKey
for going back to the cleanup step.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's good catch. I preferred to delete the index at the same step, rather than redirecting to another step. At that point, I have all the information I need to remove the rollup index. I think this would be cleaner than redirecting to a previous step and adding a cleanup-step there.
Let me know if you think otherwise
); | ||
listener.onResponse(null); | ||
} else { | ||
listener.onFailure( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For example, here we'd want to jump back to the cleanup step (as I mentioned in my other comment)
…ilm/RollupILMAction.java Co-authored-by: Lee Hinman <dakrone@users.noreply.github.com>
…into tsdb-rollup-ilm
After merging PR elastic#88565, that delegates this task to the actual rollup step
|
||
public CopySettingsStep(StepKey key, StepKey nextStepKey, String indexPrefix, String... settingsKeys) { | ||
private final BiFunction<String, LifecycleExecutionState, String> targetIndexNameSupplier; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that I have removed the copy-settings step from rollup ILM action, I can revert the CopySettingsStep
so that it uses indexPrefix
.
I only thought that making it more generic using an index name supplier function may be needed in the future. I don't have a strong opinion about keeping it this way or reverting it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's fine to keep for now
|
||
public CopySettingsStep(StepKey key, StepKey nextStepKey, String indexPrefix, String... settingsKeys) { | ||
private final BiFunction<String, LifecycleExecutionState, String> targetIndexNameSupplier; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's fine to keep for now
if (lifecycleState.lifecycleDate() == null) { | ||
throw new IllegalStateException("source index [" + indexMetadata.getIndex().getName() + "] is missing lifecycle date"); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Out of curiosity, why do we need to ensure that the lifecycle date is set here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have a good answer for this. I only copied it from the ShrinkStep
thinking this may be needed later sinceShrinkStep
and RollupStep
are pretty similar in their operation.
Do you think I should remove this check?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No I don't think it's necessarily a bad check, I was just curious where it came from.
createIndexWithSettings(client(), index, alias, settings, mapping); | ||
} | ||
|
||
public void testRollupIndex() throws Exception { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should these tests also perform a simple query to ensure that they can query the data stream without any problems after the rollup?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added an assertion that the IndexMetadata.INDEX_ROLLUP_STATUS
index setting is RollupTaskStatus.SUCCESS
at the end of the process: 7345a57
For queries on rollup indexes there is the RollupActionSingleNodeTests
class that validates the result of the rollup action.
WaitForIndexColorStep waitForGreenIndexHealthStep = new WaitForIndexColorStep( | ||
waitForGreenRollupIndexKey, | ||
copyMetadataKey, | ||
ClusterHealthStatus.GREEN, | ||
(indexName, lifecycleState) -> lifecycleState.rollupIndexName() | ||
); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we delete the source index and then we lose one of the nodes before the rollup index is replicated? Is this scenario possible? Will it lead to data loss?
That is possible, but waiting for green would mean we are subject to waiting forever for single-node clusters, which is also out I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, I left a few really trivial comments/questions
This PR adds support for an ILM action that downsamples a time-series index by invoking the
_rollup
endpoint (#85708)A policy that includes the rollup action will look like the following
Relates to #74660
Fixes #68609
Finally, this PR is based on the code written for #68184