[ML] Reduce no-scale events from serverless autoscaling
Recently the Elasticsearch serverless autoscaler has been changed
to reset the stabilization window when it receives a no-scale
event. It needs to receive a continuous stream of downscale events
for the entirety of the stabilization window or it won't scale
down.
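
The stabilization behaviour described above can be sketched as a small model. This is an illustration only, not the serverless autoscaler's code: the class name, the `Event` enum, the `everScalesDown` helper and the 10-minute window are all hypothetical. It assumes a window longer than 5 minutes and a poll every 5 seconds.

```java
import java.time.Duration;

/** Hypothetical model of a stabilization window; NOT the real serverless autoscaler. */
final class StabilizationWindowSketch {

    enum Event { DOWNSCALE, NO_SCALE }

    private final long windowMillis;
    // Time of the first downscale event in the current unbroken run, or -1 if there is none.
    private long runStartMillis = -1;

    StabilizationWindowSketch(Duration window) {
        this.windowMillis = window.toMillis();
    }

    /** Records one event; returns true once downscale events have spanned the whole window. */
    boolean onEvent(Event event, long nowMillis) {
        if (event == Event.NO_SCALE) {
            runStartMillis = -1; // a single no-scale event resets the window
            return false;
        }
        if (runStartMillis < 0) {
            runStartMillis = nowMillis;
        }
        return nowMillis - runStartMillis >= windowMillis;
    }

    /** Polls every pollMillis up to totalMillis, emitting NO_SCALE every noScaleEveryMillis (0 = never). */
    static boolean everScalesDown(Duration window, long pollMillis, long noScaleEveryMillis, long totalMillis) {
        StabilizationWindowSketch tracker = new StabilizationWindowSketch(window);
        for (long t = 0; t <= totalMillis; t += pollMillis) {
            boolean noScale = noScaleEveryMillis > 0 && t > 0 && t % noScaleEveryMillis == 0;
            if (tracker.onEvent(noScale ? Event.NO_SCALE : Event.DOWNSCALE, t)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Duration window = Duration.ofMinutes(10); // assumed window length, for illustration only
        long oneHour = Duration.ofHours(1).toMillis();
        // A no-scale every 5 minutes keeps resetting the window.
        System.out.println(everScalesDown(window, 5_000, 300_000, oneHour));
        // An uninterrupted stream of downscale events eventually fills the window.
        System.out.println(everScalesDown(window, 5_000, 0, oneHour));
    }
}
```

Under these assumptions, a no-scale event every 5 minutes means the run of downscale events never spans the window, so the model never scales down.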

Prior to this change the ML autoscaling stats would flip to a
no-scale every 5 minutes, when the ML memory tracker was considered
to be stale. This prevented the ML tier from ever scaling down.

This change alters the flow so that a stale ML memory tracker does
not automatically cause a no-scale event to be returned. In the
majority of cases the ML memory tracker will only be "stale" by 5
seconds, which is not really worth worrying about. In cases where
the ML memory tracker does not contain all required information
(because, for example, it hasn't even been initialised on a new
master node), the null checks in MlAutoscalingResourceTracker mean
we will still return no-scale events while it is refreshed.
droberts195 committed Nov 3, 2023
1 parent f1c7436 commit ceace17
Showing 1 changed file with 15 additions and 16 deletions.
@@ -27,16 +27,13 @@
 import org.elasticsearch.xpack.ml.autoscaling.MlAutoscalingResourceTracker;
 import org.elasticsearch.xpack.ml.process.MlMemoryTracker;

-import java.util.concurrent.Executor;
-
 /**
  * Internal (no-REST) transport to retrieve metrics for serverless autoscaling.
  */
 public class TransportGetMlAutoscalingStats extends TransportMasterNodeAction<Request, Response> {

     private final MlMemoryTracker mlMemoryTracker;
     private final Settings settings;
-    private final Executor timeoutExecutor;

     @Inject
     public TransportGetMlAutoscalingStats(
@@ -61,26 +58,28 @@ public TransportGetMlAutoscalingStats(
         );
         this.mlMemoryTracker = mlMemoryTracker;
         this.settings = settings;
-        this.timeoutExecutor = threadPool.generic();
     }

     @Override
     protected void masterOperation(Task task, Request request, ClusterState state, ActionListener<Response> listener) {

-        if (mlMemoryTracker.isRecentlyRefreshed()) {
-            MlAutoscalingResourceTracker.getMlAutoscalingStats(
-                state,
-                clusterService.getClusterSettings(),
-                mlMemoryTracker,
-                settings,
-                ActionListener.wrap(autoscalingResources -> listener.onResponse(new Response(autoscalingResources)), listener::onFailure)
-            );
-        } else {
-            // Recent memory statistics aren't available at the moment, trigger a refresh and return a no-scale.
-            // (If a refresh is already in progress, this won't trigger a new one.)
+        if (mlMemoryTracker.isRecentlyRefreshed() == false) {
+            // Recent memory statistics aren't available at the moment, trigger a refresh in the background.
+            // (If a refresh is already in progress, this won't trigger a new one.) We still proceed to return a
+            // scaling response in this case. This API gets called every 5 seconds, so the memory info is likely only
+            // a few seconds stale, and still good enough. If it gets _really_ badly out of date, or has never been
+            // populated in the first place then there are places in MlAutoscalingResourceTracker.getMlAutoscalingStats
+            // that will return a no-scale.
             mlMemoryTracker.asyncRefresh();
-            listener.onResponse(new Response(MlAutoscalingResourceTracker.noScaleStats(state)));
         }
+
+        MlAutoscalingResourceTracker.getMlAutoscalingStats(
+            state,
+            clusterService.getClusterSettings(),
+            mlMemoryTracker,
+            settings,
+            ActionListener.wrap(autoscalingResources -> listener.onResponse(new Response(autoscalingResources)), listener::onFailure)
+        );
     }

     @Override