Add new serve autoscaling parameter scaling_function
#47837
Conversation
@Stack-Attack Could you help me understand the PR better? My understanding is this PR does two things:
However, for (1), there is an existing field named `look_back_period_s`. As for (2), it is true we currently don't support min/max as functions for deciding the number of replicas to scale to. Could you go into more detail on what kind of use case you are looking to support?
Good questions!
(1) `look_back_period_s` is a moving average over the ongoing requests, which inherently means that the longer you make it, the less sensitive the autoscaler becomes for all scaling decisions (up or down). In general I think this is an important metric, but it's not a solution to the problem. It could be feasible to add additional aggregate function options here, but since this is calculated upstream of all scaling logic, it doesn't seem preferable. Currently all of the important autoscaling configs (i.e. upscaling delay/smoothing) are calculated based on replica-count decisions, and I think this is a good pattern.
(2) The most basic use case would be to easily ensure you always scale to accommodate the maximum load. If the decision is based on the maximum value, you can more easily ensure availability. Alternatively, avg and min could be used for more conservative scaling strategies.
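To make the difference concrete, here is a standalone sketch (not Ray code; the sample values and the target of 2 ongoing requests per replica are made up) of how an average-based decision and a max-based decision diverge over the same window of ongoing-request samples:

```python
import math

# Hypothetical ongoing-request totals sampled over one look-back window,
# containing a short traffic spike.
samples = [4, 6, 30, 5, 7]
target_per_replica = 2  # assumed target ongoing requests per replica

def desired_replicas(metric: float) -> int:
    return math.ceil(metric / target_per_replica)

avg_decision = desired_replicas(sum(samples) / len(samples))  # smooths the spike away
max_decision = desired_replicas(max(samples))                 # sized for the peak

print(avg_decision, max_decision)  # 6 15 -> avg under-provisions during the spike
```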
The intended goal here is to simplify autoscaling configs for deployments where you aren't sure what the traffic pattern is yet, which is a common problem we've been having when going to production with Serve. Also, I think the fact that `downscale_delay` does not consider interim decisions is rather unintuitive. If you have any suggestions, I'm happy to make some changes!
Ah, one more comment on (1). Increasing `look_back_period_s` also makes tuning `target_num_ongoing_requests` very difficult. Usually you can calculate a good value based on the processing time of one request, but as the look-back period increases this doesn't necessarily hold.
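For reference, the back-of-the-envelope tuning being described looks roughly like this (my own sketch with invented numbers, using Little's law):

```python
# Assume one request takes ~0.2 s of processing on a single replica.
service_time_s = 0.2
throughput_per_replica = 1 / service_time_s   # ~5 requests/s per replica

# If we can tolerate ~0.4 s of time-in-system per request, Little's law
# (L = lambda * W) suggests a target of ~2 ongoing requests per replica.
tolerable_latency_s = 0.4
target_num_ongoing_requests = throughput_per_replica * tolerable_latency_s
print(target_num_ongoing_requests)  # 2.0
```

A long look-back window reports an average that can sit well below the instantaneous queue depth this target was derived from, which is the tuning difficulty described above.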
@zcin Checking in again. This PR would greatly improve our production deployments.
@kyle-v6x Sorry for the delay!
Hmm, if increasing
This makes sense. I also did some research into how Kubernetes does horizontal pod autoscaling, and they also implement a stabilization window (5 minutes by default), during which they select the max value calculated and use that for the next scaling decision. I think it is reasonable to implement this for Serve, and it will provide more options for users. What I'm not sure about, though, is whether we should use the min/max of the averaged metrics sent from the replicas to the controller, which could be strange to reason about, or take the min/max of the raw, un-averaged request metrics. I think the latter might make more sense.
By interim decisions I assume you mean the calculations that are made every control-loop cycle for the desired number of replicas. I think what's confusing here is that although we currently only use the "most recent decision", that decision incorporates request metrics that were sampled over the past (e.g.) 30 seconds. So you can actually view it as making a decision each time we sampled from the replicas, then averaging all those decisions for the final decision at the end of that 30-second window. I do believe it is useful to separate
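For anyone unfamiliar with the Kubernetes behaviour being referenced, a minimal sketch of a stabilization window (assumed names, not Ray or Kubernetes source) looks something like this: every control-loop recommendation from the last few minutes is kept, and the max of them is what's actually applied, so a burst of low recommendations can't trigger an immediate downscale.

```python
import time
from collections import deque

class StabilizationWindow:
    """Keep recent per-control-loop replica recommendations and act on their max."""

    def __init__(self, window_s=300.0):  # 5 minutes, matching the HPA default
        self.window_s = window_s
        self.decisions = deque()  # (timestamp, recommended_replicas)

    def record(self, recommended_replicas, now=None):
        now = time.time() if now is None else now
        self.decisions.append((now, recommended_replicas))
        # Drop recommendations that have aged out of the window.
        while self.decisions and now - self.decisions[0][0] > self.window_s:
            self.decisions.popleft()
        return max(replicas for _, replicas in self.decisions)

window = StabilizationWindow()
for t, recommendation in [(0, 10), (60, 4), (120, 3)]:
    target = window.record(recommendation, now=t)
print(target)  # 10 -> the earlier, higher recommendation still pins the target
```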
@zcin No worries! Some good points.
I hadn't put much thought into the double-average use case, as I tunnel-visioned on the other functions a bit. Fair point. However, the raw values are extremely volatile, which could result in huge upscaling/downscaling decisions when using max/min. It seems like there has to be some averaging before applying such functions. Actually, I like being able to control the
I think there is a strong case for averaging before taking the min or max, meaning we would have to do it somewhere anyway.
For this reason, and because it couples smoothing for upscaling and downscaling, I actually find that
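A tiny illustration of the two options being weighed here (invented numbers, not Ray code): taking the max over raw samples versus over per-push averages.

```python
# Raw ongoing-request samples grouped by metrics push, with one outlier sample.
pushes = [[3, 4, 40], [5, 6, 4], [5, 5, 6]]

raw_samples = [s for push in pushes for s in push]
single_layer = max(raw_samples)                       # 40   -> scales off one outlier
double_layer = max(sum(p) / len(p) for p in pushes)   # ~15.7 -> outlier damped first

print(single_layer, round(double_layer, 1))
```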
Bumping this again in hopes we can get it merged this month. I believe the tests failed for unrelated reasons, so I'll push a commit to have them re-run.
Bumping again. Happy to complete any work needed to get this merged ASAP, but I'm helpless without approval from a code owner.
True, I think request metrics can be pretty volatile. My only concern is that if we ever extend autoscaling to include metrics beyond request metrics, like CPU/memory resource usage, it's more standard not to do any averaging before applying the chosen aggregation function. And it's not as easy to reason about if we have a double layer of aggregation (min/max over an average). Maybe we can try a simple test for both options, and see if removing the averaging actually leads to a lot of volatility?
@kyle-v6x Giving this some more thought, and I am still leaning towards not having nested aggregation. I was previously thinking that the number of ongoing/queued requests at a single replica can be very volatile if we're just taking the min/max, since routing is random, but we should instead be getting the total number of requests, at any single point in time, across all replicas. That is simply the total number of requests being processed in the system, which should be much less volatile and simply indicative of the traffic being sent to the system and how well the system is handling it.
To implement this, we will have to change the way autoscaling metrics are collected. Currently, metrics are aggregated at the deployment handle first, and the aggregated results are sent to the controller. This was fine when the only aggregation method was averaging. However, to do min/max, we will have to send the raw metrics from the deployment handle to the controller and perform the aggregation at the controller. Let me know what you think; if we go down this route I can help support you.
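A toy sketch of the data flow being proposed (class and method names are invented for illustration, not the actual Ray Serve internals): handles forward raw timestamped samples, and the controller sums them across handles per point in time before applying a single min/max/mean over the window.

```python
from collections import defaultdict

class Controller:
    def __init__(self, aggregation=max):
        self.aggregation = aggregation
        # handle_id -> list of (timestamp, ongoing_requests) samples
        self.raw_metrics = defaultdict(list)

    def push_metrics(self, handle_id, raw_samples):
        # Handles now send raw samples instead of a pre-averaged value.
        self.raw_metrics[handle_id].extend(raw_samples)

    def aggregate(self):
        # Total in-flight requests across all handles at each timestamp...
        totals = defaultdict(int)
        for samples in self.raw_metrics.values():
            for ts, n in samples:
                totals[ts] += n
        # ...then a single aggregation step over time, done at the controller.
        return self.aggregation(totals.values())

controller = Controller(aggregation=max)
controller.push_metrics("handle-a", [(0, 3), (1, 7)])
controller.push_metrics("handle-b", [(0, 2), (1, 1)])
print(controller.aggregate())  # 8 -> max over the per-timestamp system totals
```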
@zcin Can you give an example of future hardware metric processing? Since
Overall, whichever method you think is best and gives a clear path to merging these features, I'll try to make some time to implement! I think the choice now is between adding a simple solution that maintains full compatibility with current autoscaling settings, or doing some major refactoring of the aggregation. If we do refactor the aggregation, it will be difficult to maintain current autoscaler settings 1-to-1 (the max of the average over
Whichever decision, I'm just happy to get this moving, as we could really use it now more than ever.
@kyle-v6x Okay, if it works for your use case, then let's do a single layer of aggregation at the controller! We can then define an experimental API for specifying min/max instead of averaging, and that can be swapped out in the controller. I've prototyped the part that requires switching to sending raw metrics from the handles to the controller: essentially, instead of keeping an in-memory metrics store at the handles, keep one per handle at the controller, and push lists of raw metrics collected at the handles, rather than a single aggregated metric, from handle -> controller. See #49412; let me know if this helps!
@zcin I have some core work that will take me to the end of the month, so I plan to work on this in early February. Thanks for the draft PR, it will be very helpful.
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
Still planning to work on this, but haven't had time.
Finally hopping back on this. I'll close this PR as it's no longer the planned implementation.
Why are these changes needed?
Currently, the Serve autoscaler makes scaling decisions based only on the most recent Serve controller computation, even if the controller has made many scaling calculations over the scaling delay period. This results in poor autoscaling when clusters use long upscale/downscale delays.
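For illustration, the intended usage was roughly the following (a sketch only: the field values are placeholders, and the exact shape and accepted values of the proposed scaling_function option were still under discussion in the conversation above before the PR was closed):

```python
from ray import serve

@serve.deployment(
    autoscaling_config={
        "min_replicas": 1,
        "max_replicas": 20,
        "target_num_ongoing_requests_per_replica": 2,
        "look_back_period_s": 30,
        "upscale_delay_s": 30,
        "downscale_delay_s": 600,
        # Proposed in this PR (illustrative value): aggregate the interim
        # replica decisions made during the delay window (e.g. take their max)
        # instead of acting only on the most recent one.
        "scaling_function": "max",
    }
)
class MyModel:
    async def __call__(self, request):
        ...
```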
Related issue number
#46497
Checks
- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I've added a new method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.