Improvements to calendar based auto scaler + Short term options to handle the support requests #4289

Closed
balajialg opened this issue Feb 23, 2023 · 12 comments

@balajialg
Contributor

balajialg commented Feb 23, 2023

Summary

The calendar based scaler, which was scheduled to scale up for the Pol Sci 3 class between 4.30 - 6.30 PM today, did not work as expected. This caused a lot of grief for the instructor and the students because of slow server startup times (Ref #4009). @rylo reported that the logs did not have any information about the scheduled event, which indicates that the calendar scaler never picked the event up.

There appears to have been a rush between 5 and 5.30 PM, when 190+ students accessed the R hub.
image

We provisioned 4 additional placeholders for this class using the calendar scaler. The node count went from 1 to 3 during the scheduled time (which means we over-allocated for this class).

image
image

We should explore alternatives for servicing these time-bound resource increase requests in the short term, until we figure out a way to make the auto scaler's behavior consistent. @rylo had the following suggestion:

I'd favor storing scale up events in a simple yaml file in the datahub repo. I think we have to think about what data we should put in the file and how to represent it then alter the current scaler to parse the data and just emit logs about what it would try to do if it was going to do it. If it seemed like it was working properly, switch the scaler from using the google calendar as the primary source of record to the yaml file.
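A minimal sketch of the "emit logs about what it would do" step, under stated assumptions: the event schema (`name`, `pool`, `start`, `end`, `replicas`) and the pool name are hypothetical and not taken from the current scaler, and the date is an example value. The events are shown as already-parsed Python data, which is the shape a YAML loader would return for a list of mappings.

```python
import logging
from datetime import datetime

log = logging.getLogger("scaler-dry-run")

# Hypothetical schema. In practice this list would be loaded from a YAML
# file in the datahub repo. Timestamps carry an explicit UTC offset so that
# comparisons against "now" are unambiguous.
EVENTS = [
    {
        "name": "Pol Sci 3",
        "pool": "r",  # assumed pool name
        "start": "2023-03-01T16:30:00-08:00",  # example date
        "end": "2023-03-01T18:30:00-08:00",
        "replicas": 4,
    },
]


def planned_replicas(events, now):
    """Return {pool: replicas} for events active at `now` (timezone-aware)."""
    plan = {}
    for event in events:
        start = datetime.fromisoformat(event["start"])
        end = datetime.fromisoformat(event["end"])
        if start <= now < end:
            pool = event["pool"]
            # If events overlap on a pool, keep the largest request.
            plan[pool] = max(plan.get(pool, 0), event["replicas"])
    return plan


def dry_run(events, now):
    """Log the intended changes without touching the cluster."""
    plan = planned_replicas(events, now)
    for pool, replicas in plan.items():
        log.info("would set %s placeholders to %d", pool, replicas)
    return plan
```

Running `dry_run` alongside the existing calendar-driven scaler for a while would let the two sets of logs be compared before switching the source of record.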

Proposed Solutions

Long Term

  • Explore some k8s primitives, like a CronJob

Medium Term

  • Store config changes in a yaml file inside datahub repository

Short Term

  • Manually scale up and scale down resources at 4.30 PM and 5.30 PM respectively (Monday/Wednesday)
  • Permanently bump up resources to 3 nodes for R hub (which was the max nodes required during scale up).

Tasks to be performed

| Action | Status | Notes |
| --- | --- | --- |
| Debug script. Check whether bumping dependencies fixes known problems. Also check whether there are better dependencies, e.g. the ical library. | Not started | Ryan Lovett thinks that fixing the script is the easiest and best option. |
| Determine if there's a better option than using Google Calendar. | Not started | Use a scheduled Google Cloud Function to scale. Ryan Lovett has a modified scaler.py that could be run this way. This is not cloud agnostic, however. Another option is to just have a YAML file listing events. The scaler could fetch it from our git repo. |

@balajialg balajialg self-assigned this Feb 23, 2023
@ryanlovett
Collaborator

What I wrote in Slack was meant as a short term fix. Long term, I think it'd be better to use some k8s primitives, like a CronJob or some such.
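For illustration only, here is one way the CronJob idea could look: a small Python helper that builds a `batch/v1` CronJob manifest which runs `kubectl scale` on a schedule. The deployment name, namespace, service account, and container image are all hypothetical placeholders, the service account would need RBAC permission to scale the deployment, and `spec.timeZone` requires a cluster version that supports it.

```python
import json


def scale_cronjob(name, schedule, deployment, replicas, namespace):
    """Build a batch/v1 CronJob manifest that runs `kubectl scale` on a schedule."""
    command = [
        "kubectl", "scale", f"deployment/{deployment}",
        f"--replicas={replicas}", "-n", namespace,
    ]
    pod_spec = {
        # Hypothetical service account; it needs RBAC rights to scale.
        "serviceAccountName": "placeholder-scaler",
        "restartPolicy": "OnFailure",
        "containers": [
            {
                "name": "scale",
                # Assumption: any image that ships kubectl on its PATH.
                "image": "bitnami/kubectl:latest",
                "command": command,
            }
        ],
    }
    return {
        "apiVersion": "batch/v1",
        "kind": "CronJob",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "schedule": schedule,  # standard five-field cron expression
            "timeZone": "America/Los_Angeles",
            "concurrencyPolicy": "Forbid",
            "jobTemplate": {"spec": {"template": {"spec": pod_spec}}},
        },
    }


if __name__ == "__main__":
    # Monday/Wednesday 4:30 PM scale up and 6:30 PM scale down.
    up = scale_cronjob("polsci3-up", "30 16 * * 1,3", "r-placeholder", 4, "node-placeholder")
    down = scale_cronjob("polsci3-down", "30 18 * * 1,3", "r-placeholder", 0, "node-placeholder")
    print(json.dumps({"apiVersion": "v1", "kind": "List", "items": [up, down]}, indent=2))
```

The JSON output can be piped to `kubectl apply -f -`, since kubectl accepts JSON manifests as well as YAML.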

@ryanlovett
Collaborator

ryanlovett commented Feb 23, 2023

If one downloads the calendar (link via the calendar settings page), it has the events, but according to the logs the scaler is only seeing one event, the evening cool off.
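A rough way to repeat this comparison, sketched with only the standard library: list the VEVENT entries in the downloaded .ics text and check them against what the scaler logs. This is a debugging aid rather than a replacement parser. It does not expand recurring events (RRULE) or interpret time zones, which a real iCalendar library would do.

```python
def unfold(ics_text):
    """Undo RFC 5545 line folding: a line that starts with a space or tab
    continues the previous line."""
    lines = []
    for raw in ics_text.splitlines():
        if raw[:1] in (" ", "\t") and lines:
            lines[-1] += raw[1:]
        else:
            lines.append(raw)
    return lines


def list_events(ics_text):
    """Return one dict per VEVENT with its SUMMARY, DTSTART, DTEND and RRULE
    values as raw strings. Nested components such as VALARM are skipped."""
    wanted = ("SUMMARY", "DTSTART", "DTEND", "RRULE")
    events, current, nested = [], None, 0
    for line in unfold(ics_text):
        if current is None:
            if line == "BEGIN:VEVENT":
                current, nested = {}, 0
        elif line.startswith("BEGIN:"):
            nested += 1  # entering a nested component, e.g. VALARM
        elif line.startswith("END:"):
            if nested:
                nested -= 1
            else:  # END:VEVENT
                events.append(current)
                current = None
        elif nested == 0 and ":" in line:
            key, value = line.split(":", 1)
            name = key.split(";", 1)[0]  # drop parameters such as ;TZID=...
            if name in wanted:
                current[name] = value
    return events
```

Calling `list_events(open("calendar.ics").read())` (the file name is an example) and printing the result shows at a glance whether the calendar export contains more events than the single "evening cool off" entry seen in the logs.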

@balajialg balajialg assigned felder and shaneknapp and unassigned balajialg Feb 23, 2023
@ryanlovett
Collaborator

As requested: in this incident the scaler did not "see" an event that was happening at the time, and on another occasion it thought the previous day's evening cool off event was happening even though it wasn't.

@ryanlovett
Collaborator

Alternative library: https://pypi.org/project/ical/

@balajialg
Contributor Author

balajialg commented Mar 2, 2023

The calendar based auto-scaler did not work today (at least from what I observed in Grafana).

From 4.30 PM today, I manually updated the number of nodes in the R nodepool so that there would not be any pending pods. I went to the Google Cloud console UI and updated the number of nodes to 4 (based on @ryanlovett's input). Even with 4 nodes allocated (this class previously always needed 3 nodes, based on Grafana data), there were still 50+ pending pods for a certain duration of time. I will share the metrics observed during the last 30 minutes.

image
image
image

@ryanlovett Is it possible to update the placeholders directly instead of the actual nodes? Whenever I edited the actual number of nodes, the autoscaler brought it back to the number required by current demand within 2 minutes. Updating placeholders rather than nodes would make it easier for me to support these requests manually in the upcoming days.

@ryanlovett
Collaborator

@balajialg Sounds like the script was using bad values again. There is a way to update the placeholders, but it is through the hub config, https://github.com/berkeley-dsep-infra/datahub/blob/staging/node-placeholder/values.yaml#L186. Rather than commit and run through CI, it might be best to make the changes, manually run chartpress --push, and then manually invoke hubploy the same way it is done in the CI config.

Another option would be to disable the scaler entirely and then just use the cloud console.

@balajialg
Contributor Author

balajialg commented Mar 2, 2023

@ryanlovett Yes, I waited till 4.45 PM to see if the calendar based auto scaler had any impact on the node count. Unfortunately, I couldn't see any change.

I haven't set up hubploy on my local device as I ran into issues with sops. I need help fixing that, but I don't want to spend dev cycles on it when they are already scarce.

I would like to explore your other option of disabling the scaler entirely and using the cloud console. Can you expand on what that would look like?

@ryanlovett
Collaborator

@balajialg Some ways to temporarily disable the scaler:

  1. Create a new calendar with no events on it, then change the calendarUrl in node-placeholder/values.yaml. Then run through CI.
  2. Modify the script to do nothing and run chartpress/hubploy to push out the new image.
  3. I think you could edit the node-placeholder chart to deploy a do-nothing image.

The first one is probably way easier.

@balajialg
Contributor Author

@ryanlovett This is great. Let's explore option 1 during our 4 PM meeting today. Thanks!

@balajialg balajialg changed the title from "Explore short term alternatives for servicing requests fulfilled through calendar based auto scaler" to "Improvements to Calendar based auto scaler + Short term options to handle the support requests" Mar 2, 2023
@balajialg balajialg changed the title from "Improvements to Calendar based auto scaler + Short term options to handle the support requests" to "Improvements to calendar based auto scaler + Short term options to handle the support requests" Mar 2, 2023
@balajialg
Contributor Author

@shaneknapp
Contributor

calendar scaler is fixed and deployed... i believe that we can close this issue!

@balajialg
Contributor Author

Closing this issue. Fantastic work @shaneknapp @ryanlovett !
