[ML] Reduce no-scale events from serverless autoscaling

The Elasticsearch serverless autoscaler was recently changed to reset its stabilization window whenever it receives a no-scale event. It must now receive a continuous stream of downscale events for the entire stabilization window, or it won't scale down.

Prior to this change, the ML autoscaling stats would flip to a no-scale every 5 minutes, whenever the ML memory tracker was considered stale. This prevented the ML tier from ever scaling down.

This change alters the flow so that a stale ML memory tracker does not automatically cause a no-scale event to be returned. In the majority of cases the ML memory tracker will only be "stale" by 5 seconds, which is not worth worrying about. In cases where the ML memory tracker does not contain all the required information (for example, because it hasn't yet been initialised on a new master node), we will still return no-scale events while it is refreshed, due to the null checks in MlAutoscalingResourceTracker.
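To illustrate why a periodic no-scale blocks scale-down, here is a minimal sketch of a stabilization window that resets on any no-scale event. It is not the real serverless autoscaler code: the class `StabilizationWindowSketch`, its methods, and the 10-minute window length are all hypothetical, chosen only so that the window is longer than the 5-minute staleness period described above.

```java
import java.time.Duration;

/** Hypothetical model of a stabilization window; NOT the real serverless autoscaler. */
class StabilizationWindowSketch {

    enum Event { DOWNSCALE, NO_SCALE }

    private final long windowMillis;
    private long streakStartMillis = -1; // -1 means no downscale streak is in progress

    StabilizationWindowSketch(Duration window) {
        this.windowMillis = window.toMillis();
    }

    /** Returns true once downscale events have been continuous for the whole window. */
    boolean onEvent(Event event, long nowMillis) {
        if (event == Event.NO_SCALE) {
            streakStartMillis = -1; // a single no-scale discards the streak so far
            return false;
        }
        if (streakStartMillis < 0) {
            streakStartMillis = nowMillis;
        }
        return nowMillis - streakStartMillis >= windowMillis;
    }

    /**
     * Polls every pollInterval for totalTime. If noScaleEvery is non-null, one no-scale
     * event is emitted at each multiple of that period (the old "stale tracker" behaviour);
     * all other polls report a downscale. Returns whether a scale-down was ever allowed.
     */
    static boolean simulate(Duration window, Duration pollInterval, Duration noScaleEvery, Duration totalTime) {
        StabilizationWindowSketch tracker = new StabilizationWindowSketch(window);
        for (long t = 0; t <= totalTime.toMillis(); t += pollInterval.toMillis()) {
            boolean stale = noScaleEvery != null && t > 0 && t % noScaleEvery.toMillis() == 0;
            if (tracker.onEvent(stale ? Event.NO_SCALE : Event.DOWNSCALE, t)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Duration window = Duration.ofMinutes(10); // assumed window length, not from the source
        Duration poll = Duration.ofSeconds(5);
        Duration total = Duration.ofHours(1);
        System.out.println("no-scale every 5 min: " + simulate(window, poll, Duration.ofMinutes(5), total));
        System.out.println("no periodic no-scale: " + simulate(window, poll, null, total));
    }
}
```

Under these assumptions the longest unbroken downscale streak in the first run is just under 5 minutes, so the window is never satisfied. With the periodic no-scale removed, the window is satisfied after 10 minutes.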
droberts195 · Nov 3, 2023 · ceace17
1 parent f1c7436
Showing 1 changed file with 15 additions and 16 deletions.
diff --git a/...in/ml/src/main/java/org/elasticsearch/xpack/ml/action/TransportGetMlAutoscalingStats.java b/...in/ml/src/main/java/org/elasticsearch/xpack/ml/action/TransportGetMlAutoscalingStats.java
@@ -27,16 +27,13 @@
 import org.elasticsearch.xpack.ml.autoscaling.MlAutoscalingResourceTracker;
 import org.elasticsearch.xpack.ml.process.MlMemoryTracker;
 
-import java.util.concurrent.Executor;
-
 /**
  * Internal (no-REST) transport to retrieve metrics for serverless autoscaling.
  */
 public class TransportGetMlAutoscalingStats extends TransportMasterNodeAction<Request, Response> {
 
  private final MlMemoryTracker mlMemoryTracker;
  private final Settings settings;
- private final Executor timeoutExecutor;
 
  @Inject
  public TransportGetMlAutoscalingStats(
@@ -61,26 +58,28 @@ public TransportGetMlAutoscalingStats(
  );
  this.mlMemoryTracker = mlMemoryTracker;
  this.settings = settings;
- this.timeoutExecutor = threadPool.generic();
  }
 
  @Override
  protected void masterOperation(Task task, Request request, ClusterState state, ActionListener<Response> listener) {
 
- if (mlMemoryTracker.isRecentlyRefreshed()) {
- MlAutoscalingResourceTracker.getMlAutoscalingStats(
- state,
- clusterService.getClusterSettings(),
- mlMemoryTracker,
- settings,
- ActionListener.wrap(autoscalingResources -> listener.onResponse(new Response(autoscalingResources)), listener::onFailure)
- );
- } else {
- // Recent memory statistics aren't available at the moment, trigger a refresh and return a no-scale.
- // (If a refresh is already in progress, this won't trigger a new one.)
+ if (mlMemoryTracker.isRecentlyRefreshed() == false) {
+ // Recent memory statistics aren't available at the moment, trigger a refresh in the background.
+ // (If a refresh is already in progress, this won't trigger a new one.) We still proceed to return a
+ // scaling response in this case. This API gets called every 5 seconds, so the memory info is likely only
+ // a few seconds stale, and still good enough. If it gets _really_ badly out of date, or has never been
+ // populated in the first place then there are places in MlAutoscalingResourceTracker.getMlAutoscalingStats
+ // that will return a no-scale.
  mlMemoryTracker.asyncRefresh();
- listener.onResponse(new Response(MlAutoscalingResourceTracker.noScaleStats(state)));
  }
+
+ MlAutoscalingResourceTracker.getMlAutoscalingStats(
+ state,
+ clusterService.getClusterSettings(),
+ mlMemoryTracker,
+ settings,
+ ActionListener.wrap(autoscalingResources -> listener.onResponse(new Response(autoscalingResources)), listener::onFailure)
+ );
  }
 
  @Override