Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TSDB: Implement Downsampling ILM Action for time-series indices #87269

Merged
merged 52 commits into from
Jul 26, 2022

Conversation

csoulios
Copy link
Contributor

@csoulios csoulios commented May 31, 2022

This PR adds support for an ILM action that downsamples a time-series index by invoking the _rollup endpoint (#85708)

A policy that includes the rollup action will look like the following

PUT _ilm/policy/my_policy
{
  "policy": {
    "phases": {
      "warm": {
        "actions": {
          "rollup": {
  	    "fixed_interval": "24h"
  	  }
  	}
      }
    }
  }
}

Relates to #74660
Fixes #68609

Finally, this PR is based on the code written for #68184

@csoulios csoulios added >feature :Data Management/ILM+SLM Index and Snapshot lifecycle management :StorageEngine/Rollup Turn fine-grained time-based data into coarser-grained data :StorageEngine/TSDB You know, for Metrics v8.4.0 labels May 31, 2022
@elasticmachine elasticmachine added Team:Data Management Meta label for data/management team Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) labels May 31, 2022
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-analytics-geo (Team:Analytics)

1 similar comment
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-analytics-geo (Team:Analytics)

@csoulios csoulios marked this pull request as draft May 31, 2022 19:21
@elasticsearchmachine
Copy link
Collaborator

Hi @csoulios, I've created a changelog YAML for you.

@csoulios csoulios requested a review from martijnvg June 16, 2022 10:25
csoulios added 2 commits July 13, 2022 16:35
Removed the `config` object and added `fixed_interval`
param right in the `rollup` action
Copy link
Member

@dakrone dakrone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the changes Christos! I think this generally looks good to me, I left a few more comments and I think once we handle the rollup failure this will be ready.

Comment on lines 148 to 153
WaitForIndexColorStep waitForGreenIndexHealthStep = new WaitForIndexColorStep(
waitForGreenRollupIndexKey,
copyMetadataKey,
ClusterHealthStatus.GREEN,
(indexName, lifecycleState) -> lifecycleState.rollupIndexName()
);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this needs to be a bit more complicated, for example, in shrink we use ClusterStateWaitUntilThresholdStep to make the shrink "retryable", and to match that, I think this should be something like:

ClusterStateWaitUntilThresholdStep rollupAllocated = new ClusterStateWaitUntilThresholdStep(
    new WaitForIndexColorStep(
        waitForGreenRollupIndexKey,
        copyMetadataKey,
        ClusterHealthStatus.YELLOW,
        (indexName, lifecycleState) -> lifecycleState.rollupIndexName()
    ),
    cleanupRollupIndexKey
);

I also changed this to "yellow" because I don't think it's necessary for us to wait the rollup to be green for 12 hours (in a single-node cluster for instance) before retrying.

I'm not 100% sure of all the failure scenarios of the rollup action, so what do you think—does this make sense?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes sense. I only wonder what would happen if rollup index is recovered (status turns yellow), we delete the source index and then we lose one of the nodes before the rollup index is replicated?
Is this scenario possible? Will it lead to data loss?

Copy link
Member

@dakrone dakrone Jul 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we delete the source index and then we lose one of the nodes before the rollup index is replicated? Is this scenario possible? Will it lead to data loss?

That is possible, but waiting for green would mean we are subject to waiting forever for single-node clusters, which is also out I think.

}

// Here is where the actual rollup action takes place
RollupStep rollupStep = new RollupStep(rollupKey, waitForGreenRollupIndexKey, client, fixedInterval);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The RollupStep will fail when the rollup target index exists but doesn't have a "SUCCESS" status, but what we really need to do in this step is to go back to the cleanupRollupIndexKey so that the index can be deleted and we can try the rollup again. I think this should be similar to a branching step then, so that the failure scenario can provide a custom StepKey for going back to the cleanup step.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's good catch. I preferred to delete the index at the same step, rather than redirecting to another step. At that point, I have all the information I need to remove the rollup index. I think this would be cleaner than redirecting to a previous step and adding a cleanup-step there.

Let me know if you think otherwise

);
listener.onResponse(null);
} else {
listener.onFailure(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For example, here we'd want to jump back to the cleanup step (as I mentioned in my other comment)

@csoulios csoulios requested a review from dakrone July 19, 2022 16:38
@elasticsearchmachine elasticsearchmachine changed the base branch from master to main July 22, 2022 23:07
@csoulios
Copy link
Contributor Author

@dakrone After merging #88565, I have removed the copy-settings step.
Also, I have addressed all the other comments you have left.

Can you please have another look?


public CopySettingsStep(StepKey key, StepKey nextStepKey, String indexPrefix, String... settingsKeys) {
private final BiFunction<String, LifecycleExecutionState, String> targetIndexNameSupplier;
Copy link
Contributor Author

@csoulios csoulios Jul 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that I have removed the copy-settings step from rollup ILM action, I can revert the CopySettingsStep so that it uses indexPrefix.

I only thought that making it more generic using an index name supplier function may be needed in the future. I don't have a strong opinion about keeping it this way or reverting it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's fine to keep for now


public CopySettingsStep(StepKey key, StepKey nextStepKey, String indexPrefix, String... settingsKeys) {
private final BiFunction<String, LifecycleExecutionState, String> targetIndexNameSupplier;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's fine to keep for now

Comment on lines +57 to +59
if (lifecycleState.lifecycleDate() == null) {
throw new IllegalStateException("source index [" + indexMetadata.getIndex().getName() + "] is missing lifecycle date");
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of curiosity, why do we need to ensure that the lifecycle date is set here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a good answer for this. I only copied it from the ShrinkStep thinking this may be needed later sinceShrinkStep and RollupStep are pretty similar in their operation.

Do you think I should remove this check?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No I don't think it's necessarily a bad check, I was just curious where it came from.

createIndexWithSettings(client(), index, alias, settings, mapping);
}

public void testRollupIndex() throws Exception {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should these tests also perform a simple query to ensure that they can query the data stream without any problems after the rollup?

Copy link
Contributor Author

@csoulios csoulios Jul 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added an assertion that the IndexMetadata.INDEX_ROLLUP_STATUS index setting is RollupTaskStatus.SUCCESS at the end of the process: 7345a57

For queries on rollup indexes there is the RollupActionSingleNodeTests class that validates the result of the rollup action.

Comment on lines 148 to 153
WaitForIndexColorStep waitForGreenIndexHealthStep = new WaitForIndexColorStep(
waitForGreenRollupIndexKey,
copyMetadataKey,
ClusterHealthStatus.GREEN,
(indexName, lifecycleState) -> lifecycleState.rollupIndexName()
);
Copy link
Member

@dakrone dakrone Jul 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we delete the source index and then we lose one of the nodes before the rollup index is replicated? Is this scenario possible? Will it lead to data loss?

That is possible, but waiting for green would mean we are subject to waiting forever for single-node clusters, which is also out I think.

Copy link
Member

@dakrone dakrone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, I left a few really trivial comments/questions

@csoulios csoulios merged commit 55c0d1a into elastic:main Jul 26, 2022
@csoulios csoulios deleted the tsdb-rollup-ilm branch July 26, 2022 15:27
joegallo added a commit that referenced this pull request Jul 26, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Data Management/ILM+SLM Index and Snapshot lifecycle management >feature :StorageEngine/Rollup Turn fine-grained time-based data into coarser-grained data :StorageEngine/TSDB You know, for Metrics Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) Team:Data Management Meta label for data/management team v8.4.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[CI] RollupActionIT testRollupIndexAndSetNewRollupPolicy failing
5 participants