
[ML] relax throttling on expired data cleanup #56711

Merged

Conversation

benwtrent
Member

Throttling the nightly cleanup as much as we do has been overly cautious.

Nightly cleanup should be more lenient in its throttling. We still
keep the same batch size, but now the requests per second scale
with the number of data nodes. If we have more than 5 data nodes,
we don't throttle at all.

These numbers do seem magical...maybe it is better to not throttle
at all...
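A rough sketch of the scaling rule described above, with an assumed base rate per data node (illustrative only, not the literal code in this PR):

```java
// Illustrative only: captures "requests per second scales with the number of data
// nodes; with more than 5 data nodes we don't throttle at all".
class NightlyCleanupThrottle {
    // Assumed base rate per data node; the real constant in the PR may differ.
    private static final float BASE_REQUESTS_PER_SECOND_PER_NODE = 50.0f;

    static float requestsPerSecond(int dataNodeCount) {
        if (dataNodeCount > 5) {
            return Float.POSITIVE_INFINITY; // effectively unthrottled
        }
        return BASE_REQUESTS_PER_SECOND_PER_NODE * dataNodeCount;
    }
}
```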

@elasticmachine
Collaborator

Pinging @elastic/ml-core (:ml)

@benwtrent
Member Author

@droberts195 it almost seems like we should add a setting for this. I am not too keen on these "magic" values.

@hendrikmuhs

These numbers do seem magical...maybe it is better to not throttle
at all...

FYI

This could be a use case for #37867 in some way, e.g. if the query part of the DBQ is running with lower priority, it will run slower and therefore also delete slower. This could supersede the idea of throttling.

#37867 (comment) contains an interesting answer about "Different thread pool for system indices", which touches our case; however, we want lower priority, not higher. Should we add our use case there? I leave that up to @droberts195.

@droberts195
Contributor

it almost seems like we should add a setting for this. I am not too keen on these "magic" values.

Yes, good idea. We need some configurability.

I think it's best if the DeleteExpiredDataAction takes two request arguments:

  1. Maximum runtime, default 8 hours
  2. Requests per second for results DBQ, default null, i.e. no limit

Then move the logic for setting the requests per second for results DBQ in nightly maintenance to the nightly maintenance code, which supplies it as a request argument when it calls DeleteExpiredDataAction.

This will mean that whatever our nightly maintenance does by default, a user can catch up without throttling if required by directly calling delete expired data themselves.

We could also have a dynamic cluster-wide setting to control the requests per second for results DBQ when called from nightly maintenance. The default could be -1, meaning use the magic value, and any other value would get passed to DeleteExpiredDataAction by the nightly maintenance code.
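For illustration, this is roughly how a caller could supply those two arguments through the high-level REST client; the constructor shape matches the test change further down this review, while the values and import paths are examples/assumptions:

```java
import org.elasticsearch.client.ml.DeleteExpiredDataRequest;
import org.elasticsearch.common.unit.TimeValue;

class DeleteExpiredDataExample {
    static DeleteExpiredDataRequest buildRequest() {
        // Throttle the results DBQ to 1000 requests per second and give up after 8 hours;
        // a null requestsPerSecond would mean "no limit".
        return new DeleteExpiredDataRequest(1000.0f, TimeValue.timeValueHours(8));
    }
}
```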

@droberts195
Contributor

This could be a usecase for #37867 in some way, e.g. if the query part of the DBQ is running with lower priority, it will run slower and therefore also delete slower. This could supersede the idea of throttling.

Yes, true, if we could run the DBQ searches with lower priority then that could be an alternative to throttling in the future. But we need to do something now, as people are running into the problem of us over-throttling in big deployments.

#37867 (comment) contains an interesting answer about "Different thread pool for system indices", which touches our case; however, we want lower priority, not higher. Should we add our use case there?

Our state and results indices will be hidden indices, not system indices. Also, I imagine a lot of the burden of deleting millions of documents is in disk access and Lucene segment management. So I am not sure we should complicate that proposal with our cleanup requirements.

@davidkyle
Member

I think it's best if the DeleteExpiredDataAction takes two request arguments:

Consider job_id as a third option as it would effectively slice the work up and may help users get over the hump

@droberts195
Contributor

Consider job_id as a third option as it would effectively slice the work up and may help users get over the hump

Good idea. It would have to be wildcardable, with a default of *, but it's true that it would be great for the case where one particular job is huge and either the user needs to give that job extra attention or alternatively just clean up the other jobs with little effort while some other (higher-effort) mechanism is used on the huge job.

@davidkyle
Member

I'll raise a separate PR for the job_id change

@davidkyle davidkyle left a comment

LGTM

/**
* The requests allowed per second in the underlying Delete by Query requests executed.
*
* `1.0f` indicates the default behavior where throttling scales according too the number of data nodes
Member

Suggested change
* `1.0f` indicates the default behavior where throttling scales according too the number of data nodes
* `1.0f` indicates the default behavior where throttling scales according to the number of data nodes

Member Author

GRAMMAR! My old nemesis.

Contributor

It looks like the magic value is -1.0f in the core code, not 1.0f. Negative also makes more sense for the magic value.

Member

Good point. I think null (unspecified) means use the default, and in the action that is interpreted as the magic value -1.0f.

Member Author

null means no throttle

-1.0f is our "magic" calculation.
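Summarising the semantics agreed in this thread as a sketch with assumed names; only the null and -1.0f meanings come from the comments above, the base rate is an assumption:

```java
// Sketch: how the request's requests_per_second could be resolved by the action.
static float resolveRequestsPerSecond(Float requested, int dataNodeCount) {
    if (requested == null) {
        return Float.POSITIVE_INFINITY;      // null: no throttling
    }
    if (requested == -1.0f) {
        // -1.0f: the "magic" calculation that scales with the data node count
        return dataNodeCount > 5 ? Float.POSITIVE_INFINITY : 50.0f * dataNodeCount; // 50.0f is assumed
    }
    return requested;                        // any other value is used as given
}
```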

if (this == o) return true;
if (o == null || getClass() != o.getClass()) return false;
Request request = (Request) o;
return Float.compare(
Member

Why not a plain Objects.equals(requestsPerSecond, request.requestsPerSecond)?

I see there is a difference between compare and equals, but only in terms of NaNs and comparing +0 to -0.

https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/lang/Float.html#equals(java.lang.Object)

Member Author

You are correct. This is left over from when I had requestsPerSecond as an unboxed value :).
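A minimal sketch of the shape being suggested, assuming requestsPerSecond is now a boxed Float field (illustrative, not the final class):

```java
import java.util.Objects;

class Request {
    Float requestsPerSecond; // boxed, so it may be null

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (o == null || getClass() != o.getClass()) return false;
        Request request = (Request) o;
        // Objects.equals handles null boxed values; Float.compare was only needed for a primitive field
        return Objects.equals(requestsPerSecond, request.requestsPerSecond);
    }

    @Override
    public int hashCode() {
        return Objects.hash(requestsPerSecond);
    }
}
```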

mlDailyMaintenanceService.start();
clusterService.addLifecycleListener(new LifecycleListener() {
@Override
public void afterStart() {
clusterService.getClusterSettings().addSettingsUpdateConsumer(
Member

Is there a way to deregister the update consumer?

The consumer is added when the node becomes the master node, but when it goes off master, mlDailyMaintenanceService is set to null in uninstallDailyMaintenanceService(); this consumer referencing mlDailyMaintenanceService will prevent it from being garbage collected.

Maybe make this class the consumer; the set method would directly set the value on mlDailyMaintenanceService if it is not null.

Member Author

🤔 good point. The setting updater might have to be in the initialization service itself...

To my knowledge there is no way to deregister a setting consumer. The consumers have no unique identification.

Member Author

Why is this being set to null anyway? I assume so it can be GC'd, but the object itself is not HUGE, and only really has references to things that are referenced by MlInitializationService.java.

I am gonna change the code so that it does not get set to null. It seems like a waste to me.

@dimitris-athanasiou @droberts195 ^ Let me know if you have a prevailing opinion the other way.
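A self-contained sketch of the alternative davidkyle describes, where the long-lived service owns the settings update consumer and only forwards the value while the maintenance service is installed (all names here are illustrative):

```java
// Illustrative wiring: instead of a lambda that captures mlDailyMaintenanceService,
// register a method on the long-lived service and null-check at update time.
class MaintenanceSettingsWiring {
    interface DailyMaintenance {
        void setDeleteExpiredDataRequestsPerSecond(Float requestsPerSecond);
    }

    private volatile DailyMaintenance maintenance; // null while this node is not the master

    // This method would be the one registered via
    // clusterService.getClusterSettings().addSettingsUpdateConsumer(SETTING, this::onRequestsPerSecondUpdate)
    void onRequestsPerSecondUpdate(Float requestsPerSecond) {
        DailyMaintenance current = maintenance;
        if (current != null) {
            current.setDeleteExpiredDataRequestsPerSecond(requestsPerSecond);
        }
    }
}
```

(The alternative benwtrent chose above, i.e. not setting the field to null, avoids the need for the null check entirely.)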

jobIds.addAll(getDataFrameAnalyticsJobIds());
return jobIds;
}

private Set<String> getAnamalyDetectionJobIds() {
private Set<String> getAnomalyDetectionJobIds() {
Member

Good 👀 I stared at this for a long time before I saw the difference


public static final ObjectParser<Request, Void> PARSER = new ObjectParser<>(
"delete_expired_data_request",
true,
Contributor

We would usually error on unknown fields when parsing REST requests. Should this be false on the server side?

Member Author

yes yes, it should be false. Fixing!
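For context, the fix being agreed here is just flipping the parser's ignoreUnknownFields flag. A hedged sketch of the strict form (the Request::new value supplier is assumed from the usual pattern, since it is cut off in the snippet above):

```java
// org.elasticsearch.common.xcontent.ObjectParser: the second argument is ignoreUnknownFields.
// false = strict, so unknown fields in the REST body fail the request instead of being dropped.
public static final ObjectParser<Request, Void> PARSER = new ObjectParser<>(
    "delete_expired_data_request",
    false,
    Request::new);
```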

/**
* The requests allowed per second in the underlying Delete by Query requests executed.
*
* `1.0f` indicates the default behavior where throttling scales according too the number of data nodes
Contributor

It looks like the magic value is -1.0f in the core code, not 1.0f. Negative also makes more sense for the magic value.

@@ -73,12 +74,18 @@ public TransportDeleteExpiredDataAction(ThreadPool threadPool, TransportService
protected void doExecute(Task task, DeleteExpiredDataAction.Request request,
ActionListener<DeleteExpiredDataAction.Response> listener) {
logger.info("Deleting expired data");
Instant timeoutTime = Instant.now(clock).plus(MAX_DURATION);
Instant timeoutTime = Instant.now(clock).plus(
request.getTimeout() == null ? MAX_DURATION : Duration.ofMillis(request.getTimeout().millis())
Contributor

Maybe rename MAX_DURATION to DEFAULT_MAX_DURATION now that it's not always used.

public void testDeleteExpiredData() {
DeleteExpiredDataRequest deleteExpiredDataRequest = new DeleteExpiredDataRequest();
public void testDeleteExpiredData() throws Exception {
DeleteExpiredDataRequest deleteExpiredDataRequest = new DeleteExpiredDataRequest(1.0f, TimeValue.timeValueHours(1));
Contributor

Might be better to test with values other than 1, as 1 is more likely to end up in the output due to the combination of a bug and a fluke.

@benwtrent benwtrent requested a review from droberts195 May 15, 2020 15:07
@benwtrent
Member Author

@elasticmachine update branch

@droberts195 droberts195 left a comment

LGTM

Thanks for doing this impromptu piece of work; I think it's something a few users need urgently.

I saw a couple of nits but am happy for you to merge this without another review.

@@ -93,7 +97,7 @@ public void testDeleteExpiredDataIterationWithTimeout() {

Supplier<Boolean> isTimedOutSupplier = () -> (removersRemaining.getAndDecrement() <= 0);

transportDeleteExpiredDataAction.deleteExpiredData(removers.iterator(), finalListener, isTimedOutSupplier, true);
transportDeleteExpiredDataAction.deleteExpiredData(removers.iterator(), .0f, finalListener, isTimedOutSupplier, true);
Contributor

It seems dodgy to test two error conditions together: a timeout and requests per second = 0. Maybe the .0f was a typo? I think this test should just test the timeout.

/**
* The requests allowed per second in the underlying Delete by Query requests executed.
*
* `-1.0f` indicates the default behavior where throttling scales according to the number of data nodes.
Contributor

"default" is probably the wrong word for this comment. The other HLRC docs say the default is null, and that's what's implemented in the HLRC.

So maybe "default" to "standard nightly maintenance behavior" or something like that.

Member Author

Definitely, saying default is a bit disingenuous.

@benwtrent benwtrent merged commit 8fed077 into elastic:master May 18, 2020
@benwtrent benwtrent deleted the feature/ml-allow-faster-expired-data-cleanup branch May 18, 2020 11:21
benwtrent added a commit to benwtrent/elasticsearch that referenced this pull request May 18, 2020
Throttling the nightly cleanup as much as we do has been overly cautious.

Nightly cleanup should be more lenient in its throttling. We still
keep the same batch size, but now the requests per second scale
with the number of data nodes. If we have more than 5 data nodes,
we don't throttle at all.

Additionally, the API now has `requests_per_second` and `timeout` parameters,
so users calling the API directly can set the throttling.

This commit also adds a new setting `xpack.ml.nightly_maintenance_requests_per_second`.
This will allow users to adjust throttling of the nightly maintenance.
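For reference, a hypothetical sketch of what a dynamic float setting with that key could look like; the key comes from the commit message, while the default value and properties here are assumptions rather than the PR's actual declaration:

```java
import org.elasticsearch.common.settings.Setting;

// Assumed shape only: a dynamic, node-scoped float setting controlling the
// requests-per-second passed to the nightly delete-expired-data call.
public static final Setting<Float> NIGHTLY_MAINTENANCE_REQUESTS_PER_SECOND = Setting.floatSetting(
    "xpack.ml.nightly_maintenance_requests_per_second",
    -1.0f,                       // assumed default: keep the data-node-scaled "magic" behaviour
    Setting.Property.Dynamic,
    Setting.Property.NodeScope);
```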
benwtrent added a commit that referenced this pull request May 18, 2020
benwtrent added a commit to benwtrent/elasticsearch that referenced this pull request May 18, 2020
…c#56895)

benwtrent added a commit that referenced this pull request May 18, 2020