
[ML] relax throttling on expired data cleanup #56711

Merged

Conversation

benwtrent
Member

Throttling the nightly cleanup as much as we do has been overly cautious.

Nightly cleanup should be more lenient in its throttling. We still
keep the same batch size, but now the requests per second scale
with the number of data nodes. If we have more than 5 data nodes,
we don't throttle at all.

These numbers do seem magical...maybe it is better to not throttle
at all...
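A rough sketch of the scaling rule described above, with an assumed base rate per data node (illustrative only, not the literal code in this PR):

```java
// Illustrative only: captures "requests per second scales with the number of data
// nodes; with more than 5 data nodes we don't throttle at all".
class NightlyCleanupThrottle {
    // Assumed base rate per data node; the real constant in the PR may differ.
    private static final float BASE_REQUESTS_PER_SECOND_PER_NODE = 50.0f;

    static float requestsPerSecond(int dataNodeCount) {
        if (dataNodeCount > 5) {
            return Float.POSITIVE_INFINITY; // effectively unthrottled
        }
        return BASE_REQUESTS_PER_SECOND_PER_NODE * dataNodeCount;
    }
}
```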

@elasticmachine
Collaborator

Pinging @elastic/ml-core (:ml)

@benwtrent
Member Author

@droberts195 it almost seems like we should add a setting for this. I am not too keen on these "magic" values.

@hendrikmuhs

These numbers do seem magical...maybe it is better to not throttle
at all...

FYI

This could be a use case for #37867 in some way, e.g. if the query part of the DBQ is running with lower priority, it will run slower and therefore also delete slower. This could supersede the idea of throttling.

#37867 (comment) contains an interesting answer about "Different thread pool for system indices", which touches our case; however, we want lower priority, not higher. Should we add our use case there? I leave that up to @droberts195.

@droberts195
Contributor

it almost seems like we should add a setting for this. I am not too keen on these "magic" values.

Yes, good idea. We need some configurability.

I think it's best if the DeleteExpiredDataAction takes two request arguments:

  1. Maximum runtime, default 8 hours
  2. Requests per second for results DBQ, default null, i.e. no limit

Then move the logic for setting the requests per second for results DBQ in nightly maintenance to the nightly maintenance code, which supplies it as a request argument when it calls DeleteExpiredDataAction.

This will mean that whatever our nightly maintenance does by default, a user can catch up without throttling if required by directly calling delete expired data themselves.

We could also have a dynamic cluster-wide setting to control the requests per second for results DBQ when called from nightly maintenance. The default could be -1, meaning use the magic value, and any other value would get passed to DeleteExpiredDataAction by the nightly maintenance code.
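For illustration, this is roughly how a caller could supply those two arguments through the high-level REST client; the constructor shape matches the test change further down this review, while the values and import paths are examples/assumptions:

```java
import org.elasticsearch.client.ml.DeleteExpiredDataRequest;
import org.elasticsearch.common.unit.TimeValue;

class DeleteExpiredDataExample {
    static DeleteExpiredDataRequest buildRequest() {
        // Throttle the results DBQ to 1000 requests per second and give up after 8 hours;
        // a null requestsPerSecond would mean "no limit".
        return new DeleteExpiredDataRequest(1000.0f, TimeValue.timeValueHours(8));
    }
}
```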

@droberts195
Contributor

This could be a usecase for #37867 in some way, e.g. if the query part of the DBQ is running with lower priority, it will run slower and therefore also delete slower. This could supersede the idea of throttling.

Yes, true, if we could run the DBQ searches with lower priority then that could be an alternative to throttling in the future. But we need to do something now, as people are running into the problem of us over-throttling in big deployments.

#37867 (comment) contains an interesting answer about "Different thread pool for system indices", which touches our case; however, we want lower priority, not higher. Should we add our use case there?

Our state and results indices will be hidden indices, not system indices. Also, I imagine a lot of the burden of deleting millions of documents is in disk access and Lucene segment management. So I am not sure we should complicate that proposal with our cleanup requirements.

@davidkyle
Member

I think it's best if the DeleteExpiredDataAction takes two request arguments:

Consider job_id as a third option as it would effectively slice the work up and may help users get over the hump

@droberts195
Contributor

Consider job_id as a third option as it would effectively slice the work up and may help users get over the hump

Good idea. It would have to be wildcardable, with a default of *, but it's true that it would be great for the case where one particular job is huge and either the user needs to give that job extra attention or alternatively just clean up the other jobs with little effort while some other (higher-effort) mechanism is used on the huge job.

@davidkyle
Member

I'll raise a separate PR for the job_id change

@davidkyle davidkyle left a comment

LGTM

/**
* The requests allowed per second in the underlying Delete by Query requests executed.
*
* `1.0f` indicates the default behavior where throttling scales according too the number of data nodes
Member

Suggested change
* `1.0f` indicates the default behavior where throttling scales according too the number of data nodes
* `1.0f` indicates the default behavior where throttling scales according to the number of data nodes

Member Author

GRAMMAR! My old nemesis.

Contributor

It looks like the magic value is -1.0f in the core code, not 1.0f. Negative also makes more sense for the magic value.

Member

Good point. I think null (unspecified) means use the default, and in the action that is interpreted as the magic value -1.0f.

Member Author

null means no throttle

-1.0f is our "magic" calculation.
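Summarising the semantics agreed in this thread as a sketch with assumed names; only the null and -1.0f meanings come from the comments above, the base rate is an assumption:

```java
// Sketch: how the request's requests_per_second could be resolved by the action.
static float resolveRequestsPerSecond(Float requested, int dataNodeCount) {
    if (requested == null) {
        return Float.POSITIVE_INFINITY;      // null: no throttling
    }
    if (requested == -1.0f) {
        // -1.0f: the "magic" calculation that scales with the data node count
        return dataNodeCount > 5 ? Float.POSITIVE_INFINITY : 50.0f * dataNodeCount; // 50.0f is assumed
    }
    return requested;                        // any other value is used as given
}
```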

if (this == o) return true;
if (o == null || getClass() != o.getClass()) return false;
Request request = (Request) o;
return Float.compare(
Member

Why not a plain Objects.equals(requestsPerSecond, request.requestsPerSecond)?

I see there is a difference between compare and equals, but only in terms of NaNs and comparing +0 to -0.

https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/lang/Float.html#equals(java.lang.Object)

Member Author

You are correct. This is left over from when I had requestsPerSecond as an unboxed value :).
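A minimal sketch of the shape being suggested, assuming requestsPerSecond is now a boxed Float field (illustrative, not the final class):

```java
import java.util.Objects;

class Request {
    Float requestsPerSecond; // boxed, so it may be null

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (o == null || getClass() != o.getClass()) return false;
        Request request = (Request) o;
        // Objects.equals handles null boxed values; Float.compare was only needed for a primitive field
        return Objects.equals(requestsPerSecond, request.requestsPerSecond);
    }

    @Override
    public int hashCode() {
        return Objects.hash(requestsPerSecond);
    }
}
```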

mlDailyMaintenanceService.start();
clusterService.addLifecycleListener(new LifecycleListener() {
@Override
public void afterStart() {
clusterService.getClusterSettings().addSettingsUpdateConsumer(
Member

Is there a way to deregister the update consumer?

The consumer is added when the node becomes the master node, but when it goes off master, mlDailyMaintenanceService is set to null in uninstallDailyMaintenanceService(); this consumer referencing mlDailyMaintenanceService will prevent it from being garbage collected.

Maybe make this class the consumer; the set method would directly set the value on mlDailyMaintenanceService if it is not null.

Member Author

🤔 good point. The setting updater might have to be in the initialization service itself...

To my knowledge there is no way to deregister a setting consumer. The consumers have no unique identification.

Member Author

Why is this being set to null anyway? I assume so it can be GC'd, but the object itself is not HUGE, and only really has references to things that are referenced by MlInitializationService.java.

I am gonna change the code so that it does not get set to null. It seems like a waste to me.

@dimitris-athanasiou @droberts195 ^ Let me know if you have a prevailing opinion the other way.
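A self-contained sketch of the alternative davidkyle describes, where the long-lived service owns the settings update consumer and only forwards the value while the maintenance service is installed (all names here are illustrative):

```java
// Illustrative wiring: instead of a lambda that captures mlDailyMaintenanceService,
// register a method on the long-lived service and null-check at update time.
class MaintenanceSettingsWiring {
    interface DailyMaintenance {
        void setDeleteExpiredDataRequestsPerSecond(Float requestsPerSecond);
    }

    private volatile DailyMaintenance maintenance; // null while this node is not the master

    // This method would be the one registered via
    // clusterService.getClusterSettings().addSettingsUpdateConsumer(SETTING, this::onRequestsPerSecondUpdate)
    void onRequestsPerSecondUpdate(Float requestsPerSecond) {
        DailyMaintenance current = maintenance;
        if (current != null) {
            current.setDeleteExpiredDataRequestsPerSecond(requestsPerSecond);
        }
    }
}
```

(The alternative benwtrent chose above, i.e. not setting the field to null, avoids the need for the null check entirely.)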

jobIds.addAll(getDataFrameAnalyticsJobIds());
return jobIds;
}

private Set<String> getAnamalyDetectionJobIds() {
private Set<String> getAnomalyDetectionJobIds() {
Member

Good 👀 I stared at this for a long time before I saw the difference


public static final ObjectParser<Request, Void> PARSER = new ObjectParser<>(
"delete_expired_data_request",
true,
Contributor

We would usually error on unknown fields when parsing REST requests. Should this be false on the server side?

Member Author

yes yes, it should be false. Fixing!
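For context, the fix being agreed here is just flipping the parser's ignoreUnknownFields flag. A hedged sketch of the strict form (the Request::new value supplier is assumed from the usual pattern, since it is cut off in the snippet above):

```java
// org.elasticsearch.common.xcontent.ObjectParser: the second argument is ignoreUnknownFields.
// false = strict, so unknown fields in the REST body fail the request instead of being dropped.
public static final ObjectParser<Request, Void> PARSER = new ObjectParser<>(
    "delete_expired_data_request",
    false,
    Request::new);
```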

/**
* The requests allowed per second in the underlying Delete by Query requests executed.
*
* `1.0f` indicates the default behavior where throttling scales according too the number of data nodes
Contributor

It looks like the magic value is -1.0f in the core code, not 1.0f. Negative also makes more sense for the magic value.

@@ -73,12 +74,18 @@ public TransportDeleteExpiredDataAction(ThreadPool threadPool, TransportService
protected void doExecute(Task task, DeleteExpiredDataAction.Request request,
ActionListener<DeleteExpiredDataAction.Response> listener) {
logger.info("Deleting expired data");
Instant timeoutTime = Instant.now(clock).plus(MAX_DURATION);
Instant timeoutTime = Instant.now(clock).plus(
request.getTimeout() == null ? MAX_DURATION : Duration.ofMillis(request.getTimeout().millis())
Contributor

Maybe rename MAX_DURATION to DEFAULT_MAX_DURATION now that it's not always used.

public void testDeleteExpiredData() {
DeleteExpiredDataRequest deleteExpiredDataRequest = new DeleteExpiredDataRequest();
public void testDeleteExpiredData() throws Exception {
DeleteExpiredDataRequest deleteExpiredDataRequest = new DeleteExpiredDataRequest(1.0f, TimeValue.timeValueHours(1));
Contributor

Might be better to test with values other than 1, as 1 is more likely to end up in the output due to the combination of a bug and a fluke.

@benwtrent benwtrent requested a review from droberts195 May 15, 2020 15:07
@benwtrent
Member Author

@elasticmachine update branch

@droberts195 droberts195 left a comment

LGTM

Thanks for doing this impromptu piece of work; I think it's something a few users need urgently.

I saw a couple of nits but am happy for you to merge this without another review.

@@ -93,7 +97,7 @@ public void testDeleteExpiredDataIterationWithTimeout() {

Supplier<Boolean> isTimedOutSupplier = () -> (removersRemaining.getAndDecrement() <= 0);

transportDeleteExpiredDataAction.deleteExpiredData(removers.iterator(), finalListener, isTimedOutSupplier, true);
transportDeleteExpiredDataAction.deleteExpiredData(removers.iterator(), .0f, finalListener, isTimedOutSupplier, true);
Contributor

It seems dodgy to test two error conditions together: a timeout and requests per second = 0. Maybe the .0f was a typo? I think this test should just test the timeout.

/**
* The requests allowed per second in the underlying Delete by Query requests executed.
*
* `-1.0f` indicates the default behavior where throttling scales according to the number of data nodes.
Contributor

"default" is probably the wrong word for this comment. The other HLRC docs say the default is null, and that's what's implemented in the HLRC.

So maybe "default" to "standard nightly maintenance behavior" or something like that.

Member Author

Definitely, saying default is a bit disingenuous.

@benwtrent benwtrent merged commit 8fed077 into elastic:master May 18, 2020
@benwtrent benwtrent deleted the feature/ml-allow-faster-expired-data-cleanup branch May 18, 2020 11:21
benwtrent added a commit to benwtrent/elasticsearch that referenced this pull request May 18, 2020
Throttling the nightly cleanup as much as we do has been overly cautious.

Nightly cleanup should be more lenient in its throttling. We still
keep the same batch size, but now the requests per second scale
with the number of data nodes. If we have more than 5 data nodes,
we don't throttle at all.

Additionally, the API now has `requests_per_second` and `timeout` parameters,
so users calling the API directly can set the throttling.

This commit also adds a new setting `xpack.ml.nightly_maintenance_requests_per_second`.
This will allow users to adjust throttling of the nightly maintenance.
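For reference, a hypothetical sketch of what a dynamic float setting with that key could look like; the key comes from the commit message, while the default value and properties here are assumptions rather than the PR's actual declaration:

```java
import org.elasticsearch.common.settings.Setting;

// Assumed shape only: a dynamic, node-scoped float setting controlling the
// requests-per-second passed to the nightly delete-expired-data call.
public static final Setting<Float> NIGHTLY_MAINTENANCE_REQUESTS_PER_SECOND = Setting.floatSetting(
    "xpack.ml.nightly_maintenance_requests_per_second",
    -1.0f,                       // assumed default: keep the data-node-scaled "magic" behaviour
    Setting.Property.Dynamic,
    Setting.Property.NodeScope);
```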
benwtrent added a commit that referenced this pull request May 18, 2020
benwtrent added a commit to benwtrent/elasticsearch that referenced this pull request May 18, 2020
…c#56895)

benwtrent added a commit that referenced this pull request May 18, 2020