Add built-in ILM policies for common user use cases #76791
Conversation
This commit adds five built-in ILM policies ranging from 7 days to 365 days of retention. These are intended to be starting points for users until they switch to a custom-built ILM policy.
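For readers less familiar with ILM, the sketch below shows roughly what one of these retention policies could look like. It is only an illustration, not the actual contents of this PR: the policy name, phase actions, rollover thresholds, and min_age values are assumptions chosen for the example.

    # Illustrative sketch only; names and values are hypothetical, not taken from this PR
    PUT _ilm/policy/30-days-example
    {
      "policy": {
        "phases": {
          "hot": {
            "actions": {
              "rollover": {
                "max_age": "30d",
                "max_primary_shard_size": "50gb"
              }
            }
          },
          "warm": {
            "min_age": "2d",
            "actions": {
              "forcemerge": { "max_num_segments": 1 }
            }
          },
          "delete": {
            "min_age": "30d",
            "actions": {
              "delete": {}
            }
          }
        }
      }
    }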
I like the idea of shipping base templates that users can pick from, so that they don't have to worry about configuring an ILM policy before ingesting data and still have sane defaults. Some questions/thoughts:
@dakrone our current default policy for o11y is hot forever. I think the reason we had it that way was to deliver good performance out of the box and to avoid deleting data without the user's consent. Is that still the best policy for new clusters? If so, should we include it here? Also, will adding warm require adding more nodes? If so, will that make it harder to get started or increase their bill? What about including policies that match the tiers available without adding nodes?
Thanks for taking a look!
I don't think so; generally, all of the users I've encountered thus far want to get data off of the hot tier as quickly as possible, as those are the most expensive machines.
I think this is more of an expert scenario, since indices using these policies are almost certain to be part of a data stream, and thus not likely to receive writes once they roll over. I suspect this will meet the 80% case, and users with a more advanced configuration will want to configure their own policies.
Yes, in order to do this we'll need to implement #66040 first.
None of these policies are used by a template by default; these are intended for the user to pick from whilst adding an integration. So we are not deleting data without the user's consent: the user has specifically picked one of these policies (or a different one like our
We already install
Using the
Making something "harder to get started" is very difficult to quantify, since I could also argue that doing hot-forever by default makes it harder to get started with different data tiers. In the future I suspect we will want to add searchable snapshots (see the prerequisite I mentioned above), which will also necessitate moving away from a hot-only configuration. Adding nodes does indeed increase their bill, though this is something a user on Cloud controls through their autoscaling limits.
If a user disables or limits autoscaling, then these policies will indeed use the tiers that are already part of the cluster without adding nodes.
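To make the "user picks a policy" flow concrete: an integration or user would typically point a data stream's index template at whichever policy they selected via the index.lifecycle.name setting. The template name, index pattern, priority, and policy name below are hypothetical, used only to illustrate the mechanism.

    # Hypothetical template; the lifecycle name would be whichever policy the user picked
    PUT _index_template/logs-myapp-template
    {
      "index_patterns": ["logs-myapp-*"],
      "data_stream": {},
      "priority": 200,
      "template": {
        "settings": {
          "index.lifecycle.name": "30-days-example"
        }
      }
    }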
Thanks Lee for your replies. They make sense to me, except this one:
I understand why users want to move data through the tiers relatively quickly to manage costs, but I think it's also important to retain data on the Hot tier for some minimal amount of time in order to give a good user experience in Discover or when live-tailing logs in the Logs UI. While the default of keeping data in Hot forever is just a waste of money, I worry about pushing too far in the direction of optimizing for storage costs to the detriment of the search experience. FWIW I'd be fine with even low-ish values for the
For Discover or live-tailing of logs, how low is too low? What about something like
2h would certainly be better than 0. But if I put myself in the shoes of a user who is using our integrations to monitor their infrastructure and needs to debug a production issue, I think I would like to have fast-loading dashboards for at least 24h of data, in order to compare how things looked before and after the incident. Maybe even a couple of days in case of daily patterns (e.g. a website that receives more traffic during the day than at night, and even more traffic in the evening than in the middle of the day), so that one could compare how things looked around the time of the incident with another day under similar conditions. Separately, it's unclear to me that a
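As a rough illustration of the knob being debated here: how long a rolled-over index stays on the hot tier is controlled by the warm phase's min_age, which is measured from rollover. The fragment below is a sketch of what the 24h suggested above would look like; the value and the placeholder readonly action are assumptions for the example, not a decided default.

    # Fragment of a policy's "phases" object; 24h is the value under discussion, not a decision
    "warm": {
      "min_age": "24h",
      "actions": {
        "readonly": {}
      }
    }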
Are you basing this off of the calculation below?
I'm not sure where you got those numbers; can you elaborate? In my experience, users of ILM have had big problems in the past when rollover failed (for instance, before we moved to retrying all ILM actions), even for a day: their data was kept on the hot tier longer than intended, and that failure caused cascading failures for the rest of their indexing.
I would really like to avoid a min_age of
public static final String ILM_30_DAYS_POLICY_NAME = "30-days-hot-warm";
public static final String ILM_90_DAYS_POLICY_NAME = "90-days-hot-warm-cold";
public static final String ILM_180_DAYS_POLICY_NAME = "180-days-hot-warm-cold";
public static final String ILM_365_DAYS_POLICY_NAME = "365-days-hot-warm-cold";
We might want to avoid spelling out which phases we are using in the name of the policy, so that we could later change this, e.g. if we later realize that it would make sense to use frozen when users keep data for 365 days.
I think it'd be nice to avoid those names, but currently we don't have a concept of a "title" for a policy within ILM or the UI, so if we made the names more generic, like "365-days", it would be much harder for a user to determine what the policy does (especially since these are intended for users who are not that familiar with ILM).
Do you think the lack of specificity is better for future changes, or would it be better to be more specific so a user can tell what the policy is doing?
I'd rather be less specific so that we can more easily change our mind about what we think are good defaults. Maybe the name could be something like 180-days-default instead of 180-days, to make it clearer that there are multiple ways to retain data for 180 days and that this is just one way of doing it?
Sounds good, I've renamed the policies.
@jpountz do you have any additional feedback for this after investigating the performance side?
@dakrone and I discussed offline what the default value for the Warm tier's min_age should be. It takes around 3 days to fill the disks of Hot nodes up to the disk low watermark when indexing at full throttle with our solutions/logs benchmark. But we also want to factor in that users will generally have tens of data streams, with a couple of data streams indexing very quickly and most data streams indexing slowly. These data streams that index slowly and roll over on
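A sketch of the rollover trade-off described above (the thresholds are illustrative assumptions, not values from this PR): whichever rollover condition is hit first wins, so a fast data stream rolls over on size while a slow one sits on the hot tier until the age condition fires, and the warm min_age only adds time on top of that.

    # Fragment of the hot phase; whichever condition is met first triggers rollover.
    # A slow data stream typically rolls over on max_age; a fast one on max_primary_shard_size.
    "hot": {
      "actions": {
        "rollover": {
          "max_age": "30d",
          "max_primary_shard_size": "50gb"
        }
      }
    }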
Pinging @elastic/es-data-management (Team:Data Management)
LGTM!
@elasticmachine update branch
@elasticmachine update branch
@elasticmachine update branch
Preventing unnecessary ILM policy deletions that drastically slow down SmokeTestMultiNodeClientYamlTestSuiteIT (#79946): In #76791 several new default ILM policies were added. EsRestTestCase deletes all ILM policies that it does not know about, one at a time. Each of these deletions causes a cluster state change that needs to be propagated to all nodes. In a large test on a multi-node cluster (like SmokeTestMultiNodeClientYamlTestSuiteIT) this eats up a significant amount of time, about 30% of the runtime of the test, which was causing SmokeTestMultiNodeClientYamlTestSuiteIT to fail with timeouts. This commit adds the new standard ILM policies to the list of known policies not to delete. Closes #77025. Relates #76791.
The same change was backported in #80052 and #80053; the 7.16 backport (#80053) also fixes the backported code for 7.16 and allows type removal warnings.