
Add built-in ILM policies for common user use cases #76791

Merged
8 commits merged into elastic:master on Sep 23, 2021

Conversation

dakrone
Member

@dakrone dakrone commented Aug 20, 2021

This commit adds five built-in ILM policies ranging from 7 days to 365 days of retention. These are
intended to be starting points for a user to use until they switch to a custom-built ILM policy.
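
For reference, each of these policies is built from standard ILM phases and actions. A minimal sketch of the shape of one of them, expressed as an ILM put-policy request (the phase actions and values below are illustrative, not the exact contents of the shipped policy):

PUT _ilm/policy/30-days-hot-warm
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_primary_shard_size": "50gb",
            "max_age": "30d"
          }
        }
      },
      "warm": {
        "actions": {
          "forcemerge": { "max_num_segments": 1 }
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}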
@jpountz
Contributor

jpountz commented Aug 30, 2021

I like the idea of shipping base templates that users can pick from, so that they don't have to worry about configuring an ILM policy before ingesting data and still have sane defaults. Some questions/thoughts:

  • Do we need a min_age on the warm tier? Something like 1d on the 7-days-hot-warm template and 7d on other templates?
  • Should we force-merge again in cold? Since we document that indices may still receive rare inserts/updates/deletes in the warm tier, we might no longer have a single segment when we reach cold even though data has been force-merged before entering the warm tier.
  • It's a pity these base templates can't use the searchable_snapshots action. I understand it's not possible to do this today, but we should look into making it possible to ship templates that leverage this action eventually?

@mostlyjason

mostlyjason commented Aug 30, 2021

@dakrone our current default policy for o11y is hot forever. I think the reason we had it that way was to deliver good performance out of the box and to avoid deleting data without the user's consent. Is that still the best policy for new clusters? If so, should we include it here? Also, will adding warm require adding more nodes? If so, will that make it harder to get started or increase their bill? What about including policies that match the tiers available without adding nodes?

@dakrone
Member Author

dakrone commented Aug 30, 2021

Thanks for taking a look!

Do we need a min_age on the warm tier? Something like 1d on the 7-days-hot-warm template and 7d on other templates?

I don't think so, generally all of the users I've encountered thus far want to get data off of the hot tier as quickly as possible, as those are the most expensive machines.

Should we force-merge again in cold? Since we document that indices may still receive rare inserts/updates/deletes in the warm tier, we might no longer have a single segment when we reach cold even though data has been force-merged before entering the warm tier.

I think this is more of an expert scenario, since indices using these policies are almost certain to be part of a data stream, and thus not likely to receive writes once they roll over. I suspect this will meet the 80% case; users with a more advanced configuration will want to configure their own policies.

It's a pity these base templates can't use the searchable_snapshots action. I understand it's not possible to do this today but we should look into making it possible to ship templates that leverage this action eventually?

Yes, in order to do this we'll need to implement #66040 first.

our current default policy for o11y is hot forever. I think the reason we had it that way was to deliver good performance out of the box and to avoid deleting data without the user's consent. Is that still the best policy for new clusters?

None of these policies are used by a template by default; they are intended for the user to pick from whilst adding an integration. So we are not deleting data without the user's consent: the user has specifically picked one of these policies (or a different one, like our logs or metrics policies, which are still hot-forever).

If so, should we include it here?

We already install logs, metrics, and synthetics templates that are all hot-forever by default.

Also, will adding warm require adding more nodes?

Using the warm phase doesn't necessitate adding more nodes; however, if a user is running this on Cloud and has autoscaling enabled, it will add those nodes automatically.

If so, will that make it harder to get started or increase their bill?

Making something "harder to get started" is very difficult to quantify, since I could also argue that doing hot-forever by default makes it harder to get started with different data tiers. In the future I suspect we will want to add searchable snapshots (see the prerequisite I mentioned above), which will also necessitate moving away from a hot-only configuration.

Adding nodes does indeed increase their bill, though this is something a user on Cloud controls through their autoscaling limits.

What about including policies that match the tiers available without adding nodes?

If a user disables or limits autoscaling, then these policies will indeed use the tiers that are already part of the cluster without adding nodes.
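
For reference, the reason a warm phase doesn't strictly require dedicated warm nodes is that ILM's data-tier migration sets a tier preference rather than a hard allocation requirement, so warm data can remain on hot nodes if no warm nodes exist. A sketch of the index setting that the warm phase results in (shown here only for illustration):

index.routing.allocation.include._tier_preference: "data_warm,data_hot"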

@jpountz
Contributor

jpountz commented Aug 30, 2021

Thanks Lee for your replies. They make sense to me, except this one:

Do we need a min_age on the warm tier? Something like 1d on the 7-days-hot-warm template and 7d on other templates?

I don't think so, generally all of the users I've encountered thus far want to get data off of the hot tier as quickly as possible, as those are the most expensive machines.

I understand why users want to flow data relatively quickly through tiers to manage costs, but I think that it's also important to retain data on the Hot tier for some minimal amount of time in order to give a good user experience with Discover or live-tailing of logs in the Logs UI. While the default of keeping data in Hot forever is just a waste of money, I worry about setting the cursor too far in the direction of optimizing for storage costs to the detriment of search experience. FWIW I'd be fine with even low-ish values for the min_age like 1d that ensure good search experience with very recent data.

@dakrone
Member Author

dakrone commented Aug 30, 2021

I worry about setting the cursor too far in the direction of optimizing for storage costs to the detriment of search experience. FWIW I'd be fine with even low-ish values for the min_age like 1d that ensure good search experience with very recent data.

For discover or live-tailing of logs, how low is too low? What about something like 2h? Is this something you think we should do for each of the policies, or only certain ones?

@jpountz
Contributor

jpountz commented Aug 31, 2021

2h would certainly be better than 0. But if I put myself in the shoes of a user who is using our integrations to monitor their infrastructure and they need to debug a production issue, I think I would like to have fast loading dashboards for at least 24h of data in order to be able to compare how things looked before/after the incident. Maybe even a couple days in case of daily patterns (e.g. a website that receives more traffic in the day than in the night, and even more traffic in the evening than in the middle of the day) so that one could compare how things looked around the time of the incident with another day under similar conditions.

Separately it's unclear to me that a min_age of 0 or 2h for the Warm tier actually helps reduce the number of nodes that are needed for the Hot tier vs. a min_age of a couple days. With the instances that we are currently using in the Hot tier, we seem to be producing at most in the order of 500GB of index data per day per node while nodes have ~2TB of storage. So aggressively moving data to Warm with a min_age of 0 or 2h wouldn't help reduce the number of nodes in the Hot tier vs. a min_age of 3d? @danielmitterdorfer Please correct me if this is wrong, and feel free to add more color.

@dakrone
Member Author

dakrone commented Sep 1, 2021

Separately it's unclear to me that a min_age of 0 or 2h for the Warm tier actually helps reduce the number of nodes that are needed for the Hot tier vs. a min_age of a couple days.

Are you basing this off of the calculation below?

With the instances that we are currently using in the Hot tier, we seem to be producing at most in the order of 500GB of index data per day per node while nodes have ~2TB of storage.

I'm not sure where you got those numbers; can you elaborate on that? In my experience, users of ILM have had big problems when rollover failed even for a day (for instance, before we moved to retrying all ILM actions): their data was kept on the hot tier longer than intended, and that caused cascading failures for the rest of their indexing.

So aggressively moving data to Warm with a min_age of 0 or 2h wouldn't help reduce the number of nodes in the Hot tier vs. a min_age of 3d?

I would really like to avoid a min_age of 3d if possible, as I think it's going to be far too high for higher-ingestion workloads. If we do want a non-zero value, I think 1d is a reasonable compromise between the two.

public static final String ILM_30_DAYS_POLICY_NAME = "30-days-hot-warm";
public static final String ILM_90_DAYS_POLICY_NAME = "90-days-hot-warm-cold";
public static final String ILM_180_DAYS_POLICY_NAME = "180-days-hot-warm-cold";
public static final String ILM_365_DAYS_POLICY_NAME = "365-days-hot-warm-cold";
Contributor

We might want to avoid spelling out which phases we are using in the name of the policy, so that we could later change this, e.g. if we later realize that it would make sense to use frozen when users keep data for 365 days?

Member Author

I think it'd be nice to avoid those names, but we currently don't have a concept of a "title" for a policy within ILM or the UI, so if we made the names more generic, like just "365-days", it would be much harder for a user to determine what the policy does (especially since these are intended for users who aren't that familiar with ILM).

Do you think leaving the name unspecific, so we have room to change the phases later, is better, or would it be better to be more specific so a user can tell what the policy is doing?

Contributor

I'd rather be less specific so that we can more easily change our mind about what we think are good defaults. Maybe the name could be something like 180-days-default instead of 180-days, to make it clearer that there are multiple ways to retain data for 180 days and that this is just one way of doing it?

Member Author

Sounds good, I've renamed the policies.
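
For reference, assuming the "-default" suffix suggested above (e.g. 30-days-default), a user would typically attach one of these policies to a data stream through an index template, along these lines (the template name and index pattern here are hypothetical):

PUT _index_template/my-logs-template
{
  "index_patterns": ["logs-myapp-*"],
  "data_stream": {},
  "template": {
    "settings": {
      "index.lifecycle.name": "30-days-default"
    }
  }
}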

@dakrone
Member Author

dakrone commented Sep 20, 2021

@jpountz do you have any additional feedback for this after investigating the performance side?

@jpountz
Contributor

jpountz commented Sep 21, 2021

@dakrone and I discussed offline about what should be the default value for the Warm tier's min_age, here's our thinking:

It takes around 3 days to fill the disks of Hot nodes up to the disk low watermark when indexing at full throttle with our solutions/logs benchmark. But we also want to factor in that users will generally have tens of data streams, with a couple of data streams indexing very quickly and most data streams indexing slowly. The data streams that index slowly and roll over on age (rather than size) might require up to 100GB per data stream (50GB rollover size, times 2 for replicas) across the Hot tier. This is disk space that the fast-indexing data streams cannot use, so disks would likely fill up before 3 days. How much sooner is highly use-case dependent: number of data streams, number of nodes, etc. We agreed to go with a default min_age of 2 days, since our gut feeling is that it should generally be an OK trade-off.
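
Concretely, that decision corresponds to a warm phase with a two-day floor in each policy, roughly like the following (a sketch; the forcemerge shown is just a typical warm-phase action, not necessarily the exact shipped contents):

"warm": {
  "min_age": "2d",
  "actions": {
    "forcemerge": { "max_num_segments": 1 }
  }
}

Since min_age is measured from rollover, each backing index moves to warm two days after it rolls over, so the hot tier keeps roughly the currently written index plus the last two days of rolled-over data per data stream.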

@dakrone dakrone marked this pull request as ready for review September 21, 2021 22:24
@dakrone dakrone added the :Data Management/ILM+SLM Index and Snapshot lifecycle management label Sep 21, 2021
@elasticmachine elasticmachine added the Team:Data Management Meta label for data/management team label Sep 21, 2021
@elasticmachine
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@dakrone dakrone requested a review from jpountz September 21, 2021 22:25
Contributor

@jpountz jpountz left a comment

LGTM!

@dakrone
Member Author

dakrone commented Sep 22, 2021

@elasticmachine update branch

@dakrone
Member Author

dakrone commented Sep 22, 2021

@elasticmachine update branch

@dakrone
Member Author

dakrone commented Sep 22, 2021

@elasticmachine update branch

@dakrone dakrone merged commit 073584d into elastic:master Sep 23, 2021
@dakrone dakrone deleted the built-in-ilm-policies branch September 23, 2021 19:38
dakrone added a commit to dakrone/elasticsearch that referenced this pull request Sep 23, 2021

elasticsearchmachine pushed a commit that referenced this pull request Sep 23, 2021
masseyke added a commit that referenced this pull request Oct 28, 2021
…n SmokeTestMultiNodeClientYamlTestSuiteIT (#79946)

In #76791 several new default ILM policies were added. EsRestTestCase deletes all ILM policies that it does
not know about one-at-a-time. Each of these deletions causes a cluster state change that needs to be propagated
to all nodes. In a large test on a multi-node cluster (like SmokeTestMultiNodeClientYamlTestSuiteIT) this eats up a
significant amount of time -- about 30% of the runtime of the test. This was causing
SmokeTestMultiNodeClientYamlTestSuiteIT to fail with timeouts. This commit adds the new standard ILM policies to
the list of known policies not to delete.
Closes #77025
Relates #76791
masseyke added a commit to masseyke/elasticsearch that referenced this pull request Oct 28, 2021
…n SmokeTestMultiNodeClientYamlTestSuiteIT (elastic#79946)

masseyke added a commit to masseyke/elasticsearch that referenced this pull request Oct 28, 2021
…n SmokeTestMultiNodeClientYamlTestSuiteIT (elastic#79946)

elasticsearchmachine pushed a commit that referenced this pull request Oct 28, 2021
…n SmokeTestMultiNodeClientYamlTestSuiteIT (#79946) (#80052)
elasticsearchmachine pushed a commit that referenced this pull request Oct 29, 2021
…low down SmokeTestMultiNodeClientYamlTestSuiteIT (#79946) (#80053)

* Preventing unnecessary ILM policy deletions that drastically slow down SmokeTestMultiNodeClientYamlTestSuiteIT (#79946)
* fixing backported code for 7.16
* allowing type removal warnings
Labels
:Data Management/ILM+SLM, >enhancement, Team:Data Management, v7.16.0, v8.0.0-beta1