
Add built-in ILM policies for common user use cases #76791

Merged
8 commits merged into elastic:master on Sep 23, 2021

Conversation

dakrone
Member

@dakrone dakrone commented Aug 20, 2021

This commit adds five built-in ILM policies ranging from 7 days to 365 days of retention. These are
intended to be starting points for a user to use until they switch to a custom-built ILM policy.
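
For reference, each of these policies is built from standard ILM phases and actions. A minimal sketch of the shape of one of them, expressed as an ILM put-policy request (the phase actions and values below are illustrative, not the exact contents of the shipped policy):

PUT _ilm/policy/30-days-hot-warm
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_primary_shard_size": "50gb",
            "max_age": "30d"
          }
        }
      },
      "warm": {
        "actions": {
          "forcemerge": { "max_num_segments": 1 }
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}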
@jpountz
Contributor

jpountz commented Aug 30, 2021

I like the idea of shipping base templates that users can pick from, so that they don't have to worry about configuring an ILM policy before ingesting data and still have sane defaults. Some questions/thoughts:

  • Do we need a min_age on the warm tier? Something like 1d on the 7-days-hot-warm template and 7d on other templates?
  • Should we force-merge again in cold? Since we document that indices may still receive rare inserts/updates/deletes in the warm tier, we might no longer have a single segment when we reach cold even though data has been force-merged before entering the warm tier.
  • It's a pity these base templates can't use the searchable_snapshots action. I understand it's not possible to do this today, but we should look into making it possible to ship templates that leverage this action eventually?

@mostlyjason

mostlyjason commented Aug 30, 2021

@dakrone our current default policy for o11y is hot forever. I think the reason we had it that way was to deliver good performance out of the box and to avoid deleting data without the user's consent. Is that still the best policy for new clusters? If so, should we include it here? Also, will adding warm require adding more nodes? If so, will that make it harder to get started or increase their bill? What about including policies that match the tiers available without adding nodes?

@dakrone
Member Author

dakrone commented Aug 30, 2021

Thanks for taking a look!

Do we need a min_age on the warm tier? Something like 1d on the 7-days-hot-warm template and 7d on other templates?

I don't think so, generally all of the users I've encountered thus far want to get data off of the hot tier as quickly as possible, as those are the most expensive machines.

Should we force-merge again in cold? Since we document that indices may still receive rare inserts/updates/deletes in the warm tier, we might no longer have a single segment when we reach cold even though data has been force-merged before entering the warm tier.

I think this is more of an expert scenario, since indices using these policies are almost certain to be part of a data stream, and thus not likely to receive writes once they roll over. I suspect this will meet the 80% case; users with a more advanced configuration will want to configure their own policies.

It's a pity these base templates can't use the searchable_snapshots action. I understand it's not possible to do this today but we should look into making it possible to ship templates that leverage this action eventually?

Yes, in order to do this we'll need to implement #66040 first.

our current default policy for o11y is hot forever. I think the reason we had it that way was to deliver good performance out of the box and to avoid deleting data without the user's consent. Is that still the best policy for new clusters?

None of these policies are used by a template by default; they are intended for the user to pick from whilst adding an integration. So we are not deleting data without the user's consent: the user has specifically picked one of these policies (or a different one, like our logs or metrics policies, which are still hot-forever).

If so, should we include it here?

We already install logs, metrics, and synthetics templates that are all hot-forever by default.

Also, will adding warm require adding more nodes?

Using the warm phase doesn't necessitate adding more nodes; however, if a user is running this on Cloud and has autoscaling enabled, it will add those nodes automatically.

If so, will that make it harder to get started or increase their bill?

Making something "harder to get started" is very difficult to quantify, since I could also argue that doing hot-forever by default makes it harder to get started with different data tiers. In the future I suspect we will want to add searchable snapshots (see the prerequisite I mentioned above), which will also necessitate moving away from a hot-only configuration.

Adding nodes does indeed increase their bill, though this is something a user on Cloud controls through their autoscaling limits.

What about including policies that match the tiers available without adding nodes?

If a user disables or limits autoscaling, then these policies will indeed use the tiers that are already part of the cluster without adding nodes.
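
For reference, the reason a warm phase doesn't strictly require dedicated warm nodes is that ILM's data-tier migration sets a tier preference rather than a hard allocation requirement, so warm data can remain on hot nodes if no warm nodes exist. A sketch of the index setting that the warm phase results in (shown here only for illustration):

index.routing.allocation.include._tier_preference: "data_warm,data_hot"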

@jpountz
Contributor

jpountz commented Aug 30, 2021

Thanks Lee for your replies. They make sense to me, except this one:

Do we need a min_age on the warm tier? Something like 1d on the 7-days-hot-warm template and 7d on other templates?

I don't think so, generally all of the users I've encountered thus far want to get data off of the hot tier as quickly as possible, as those are the most expensive machines.

I understand why users want to flow data relatively quickly through tiers to manage costs, but I think that it's also important to retain data on the Hot tier for some minimal amount of time in order to give a good user experience with Discover or live-tailing of logs in the Logs UI. While the default of keeping data in Hot forever is just a waste of money, I worry about setting the cursor too far in the direction of optimizing for storage costs to the detriment of search experience. FWIW I'd be fine with even low-ish values for the min_age like 1d that ensure good search experience with very recent data.

@dakrone
Member Author

dakrone commented Aug 30, 2021

I worry about setting the cursor too far in the direction of optimizing for storage costs to the detriment of search experience. FWIW I'd be fine with even low-ish values for the min_age like 1d that ensure good search experience with very recent data.

For discover or live-tailing of logs, how low is too low? What about something like 2h? Is this something you think we should do for each of the policies, or only certain ones?

@jpountz
Contributor

jpountz commented Aug 31, 2021

2h would certainly be better than 0. But if I put myself in the shoes of a user who is using our integrations to monitor their infrastructure and they need to debug a production issue, I think I would like to have fast loading dashboards for at least 24h of data in order to be able to compare how things looked before/after the incident. Maybe even a couple days in case of daily patterns (e.g. a website that receives more traffic in the day than in the night, and even more traffic in the evening than in the middle of the day) so that one could compare how things looked around the time of the incident with another day under similar conditions.

Separately it's unclear to me that a min_age of 0 or 2h for the Warm tier actually helps reduce the number of nodes that are needed for the Hot tier vs. a min_age of a couple days. With the instances that we are currently using in the Hot tier, we seem to be producing at most in the order of 500GB of index data per day per node while nodes have ~2TB of storage. So aggressively moving data to Warm with a min_age of 0 or 2h wouldn't help reduce the number of nodes in the Hot tier vs. a min_age of 3d? @danielmitterdorfer Please correct me if this is wrong, and feel free to add more color.

@dakrone
Member Author

dakrone commented Sep 1, 2021

Separately it's unclear to me that a min_age of 0 or 2h for the Warm tier actually helps reduce the number of nodes that are needed for the Hot tier vs. a min_age of a couple days.

Are you basing this off of the calculation below?

With the instances that we are currently using in the Hot tier, we seem to be producing at most in the order of 500GB of index data per day per node while nodes have ~2TB of storage.

I'm not sure where you got those numbers; can you elaborate on that? In my experience, users of ILM have had big problems when rollover failed even for a day (for instance, before we moved to retrying all ILM actions): their data was kept on the hot tier longer than intended, and that caused cascading failures for the rest of their indexing.

So aggressively moving data to Warm with a min_age of 0 or 2h wouldn't help reduce the number of nodes in the Hot tier vs. a min_age of 3d?

I would really like to avoid a min_age of 3d if possible, as I think it's going to be far too high for higher-ingestion workloads. If we do want a non-zero value, I think 1d is a reasonable compromise between the two.

public static final String ILM_30_DAYS_POLICY_NAME = "30-days-hot-warm";
public static final String ILM_90_DAYS_POLICY_NAME = "90-days-hot-warm-cold";
public static final String ILM_180_DAYS_POLICY_NAME = "180-days-hot-warm-cold";
public static final String ILM_365_DAYS_POLICY_NAME = "365-days-hot-warm-cold";
Contributor

We might want to avoid spelling out which phases we are using in the name of the policy, so that we could later change this, e.g. if we later realize that it would make sense to use frozen when users keep data for 365 days?

Member Author

I think it'd be nice to avoid those names, but we currently don't have a concept of a "title" for a policy within ILM or the UI, so if we made the names more generic, like just "365-days", it would be much harder for a user to determine what the policy does (especially since these are intended for users who aren't that familiar with ILM).

Do you think leaving the name unspecific, so we have room to change the phases later, is better, or would it be better to be more specific so a user can tell what the policy is doing?

Contributor

I'd rather be less specific so that we can more easily change our mind about what we think are good defaults. Maybe the name could be something like 180-days-default instead of 180-days, to make it clearer that there are multiple ways to retain data for 180 days and that this is just one way of doing it?

Member Author

Sounds good, I've renamed the policies.
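
For reference, assuming the "-default" suffix suggested above (e.g. 30-days-default), a user would typically attach one of these policies to a data stream through an index template, along these lines (the template name and index pattern here are hypothetical):

PUT _index_template/my-logs-template
{
  "index_patterns": ["logs-myapp-*"],
  "data_stream": {},
  "template": {
    "settings": {
      "index.lifecycle.name": "30-days-default"
    }
  }
}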

@dakrone
Member Author

dakrone commented Sep 20, 2021

@jpountz do you have any additional feedback for this after investigating the performance side?

@jpountz
Contributor

jpountz commented Sep 21, 2021

@dakrone and I discussed offline about what should be the default value for the Warm tier's min_age, here's our thinking:

It takes around 3 days to fill the disks of Hot nodes up to the disk low watermark when indexing at full throttle with our solutions/logs benchmark. But we also want to factor in that users will generally have tens of data streams, with a couple of data streams indexing very quickly and most data streams indexing slowly. The data streams that index slowly and roll over on age (rather than size) might require up to 100GB per data stream (50GB rollover size, times 2 for replicas) across the Hot tier. This is disk space that the fast-indexing data streams cannot use, so disks would likely fill up before 3 days. How much sooner is highly use-case dependent: number of data streams, number of nodes, etc. We agreed to go with a default min_age of 2 days, since our gut feeling is that it should generally be an OK trade-off.
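
Concretely, that decision corresponds to a warm phase with a two-day floor in each policy, roughly like the following (a sketch; the forcemerge shown is just a typical warm-phase action, not necessarily the exact shipped contents):

"warm": {
  "min_age": "2d",
  "actions": {
    "forcemerge": { "max_num_segments": 1 }
  }
}

Since min_age is measured from rollover, each backing index moves to warm two days after it rolls over, so the hot tier keeps roughly the currently written index plus the last two days of rolled-over data per data stream.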

@dakrone dakrone marked this pull request as ready for review September 21, 2021 22:24
@dakrone dakrone added the :Data Management/ILM+SLM Index and Snapshot lifecycle management label Sep 21, 2021
@elasticmachine elasticmachine added the Team:Data Management Meta label for data/management team label Sep 21, 2021
@elasticmachine
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@dakrone dakrone requested a review from jpountz September 21, 2021 22:25
Contributor

@jpountz jpountz left a comment

LGTM!

@dakrone
Member Author

dakrone commented Sep 22, 2021

@elasticmachine update branch

@dakrone
Member Author

dakrone commented Sep 22, 2021

@elasticmachine update branch

@dakrone
Member Author

dakrone commented Sep 22, 2021

@elasticmachine update branch

@dakrone dakrone merged commit 073584d into elastic:master Sep 23, 2021
@dakrone dakrone deleted the built-in-ilm-policies branch September 23, 2021 19:38
dakrone added a commit to dakrone/elasticsearch that referenced this pull request Sep 23, 2021

elasticsearchmachine pushed a commit that referenced this pull request Sep 23, 2021
masseyke added a commit that referenced this pull request Oct 28, 2021
…n SmokeTestMultiNodeClientYamlTestSuiteIT (#79946)

In #76791 several new default ILM policies were added. EsRestTestCase deletes all ILM policies that it does
not know about one-at-a-time. Each of these deletions causes a cluster state change that needs to be propagated
to all nodes. In a large test on a multi-node cluster (like SmokeTestMultiNodeClientYamlTestSuiteIT) this eats up a
significant amount of time -- about 30% of the runtime of the test. This was causing
SmokeTestMultiNodeClientYamlTestSuiteIT to fail with timeouts. This commit adds the new standard ILM policies to
the list of known policies not to delete.
Closes #77025
Relates #76791
masseyke added a commit to masseyke/elasticsearch that referenced this pull request Oct 28, 2021
…n SmokeTestMultiNodeClientYamlTestSuiteIT (elastic#79946)

masseyke added a commit to masseyke/elasticsearch that referenced this pull request Oct 28, 2021
…n SmokeTestMultiNodeClientYamlTestSuiteIT (elastic#79946)

elasticsearchmachine pushed a commit that referenced this pull request Oct 28, 2021
…n SmokeTestMultiNodeClientYamlTestSuiteIT (#79946) (#80052)
elasticsearchmachine pushed a commit that referenced this pull request Oct 29, 2021
…low down SmokeTestMultiNodeClientYamlTestSuiteIT (#79946) (#80053)

* Preventing unnecessary ILM policy deletions that drastically slow down SmokeTestMultiNodeClientYamlTestSuiteIT (#79946)
* fixing backported code for 7.16
* allowing type removal warnings
Labels
:Data Management/ILM+SLM, >enhancement, Team:Data Management, v7.16.0, v8.0.0-beta1