Increase compute resources for Course PS 3 January - April 2023 Mondays/Wednesdays 5-6:30 pm #4009

Closed
dbroockman opened this issue Dec 19, 2022 · 43 comments

@dbroockman

Course Name

PS 3

Detailed Requirements

350 students will be using R.Datahub simultaneously from 5-6:30 pm on Mondays and Wednesdays during spring 2023

Semester Details

No

Request Deadline

First class January 23
Last class April 26

@balajialg
Contributor

balajialg commented Jan 5, 2023

@dbroockman We are making major upgrades to the infrastructure which might affect the current functionality of the calendar-based auto scaler. We are looking at a 2-3 week timeline to make these changes. We should be able to look into this request and resolve it once the upgrades are completed.

@balajialg
Contributor

@shaneknapp Can we increase resources for this class, or do we need to tweak the auto scaler script further?

@balajialg
Contributor

balajialg commented Jan 18, 2023

@felder @shaneknapp @ryanlovett Created a recurring calendar event for Pol Sci 3 from 4:30 to 6:30 PM every Monday and Wednesday and allocated 8 nodes for this class of 350 students (~45 pods per node, amounting to ~8 nodes) in the R hub. Let me know whether the allocation makes sense.
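
For reference, the arithmetic behind the ~8 node figure can be sketched as below. This is only an illustration of a total-capacity estimate: the function name `nodes_for_users` is hypothetical, and the density of ~45 pods per node is the figure quoted in this comment, not a measured value.

```python
import math

def nodes_for_users(users: int, pods_per_node: int) -> int:
    """Smallest number of nodes that can hold one pod per user."""
    if pods_per_node <= 0:
        raise ValueError("pods_per_node must be positive")
    return math.ceil(users / pods_per_node)

# 350 students at an assumed density of 45 pods per node.
total_nodes = nodes_for_users(350, 45)
print(total_nodes)  # 8
```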

@ryanlovett
Collaborator

ryanlovett commented Jan 18, 2023

@balajialg That number should not be set to the total number of nodes to allocate. It is the number of available spares ("hotspares" or "placeholders") to keep. It addresses the rate of growth: if a lot of people log in within a short amount of time, the hot spare value should be higher. If all 350 log in at exactly 4:30p, try setting the value to 2 starting at 4:20p. The cluster will make two additional empty nodes available, which can handle 90 logins (based on your pod density figure), not counting those that are already running user pods. The cluster will continue to ramp up nodes, always keeping "+2" empty. Once class is underway and the rate of growth has slowed, the hotspare/placeholder count can be set back to 1 or 0.

If 2 proves to be insufficient, it can be raised; if it is more than enough, it can be lowered. You can verify by checking how many users are pending and how long server startup takes.
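
The "+2 empty nodes" behavior described above can be illustrated with a toy model. It assumes a simplified scaler that always keeps `spares` completely empty nodes on top of the nodes needed to host running user pods. The names (`nodes_required`, `simulate`, `SPARES`), the login pattern, and the perfect packing of pods onto nodes are all assumptions made for illustration; the real cluster adds nodes asynchronously and takes time to bring them up.

```python
import math

def nodes_required(running_users: int, pods_per_node: int, spares: int) -> int:
    """Nodes needed to host the running pods plus `spares` empty nodes."""
    in_use = math.ceil(running_users / pods_per_node)
    return in_use + spares

def simulate(logins_per_step, pods_per_node=45, spares=2):
    """Return a list of (total_users, total_nodes) after each step of a login burst."""
    users = 0
    history = []
    for new_logins in logins_per_step:
        users += new_logins
        history.append((users, nodes_required(users, pods_per_node, spares)))
    return history

# Hypothetical burst: 350 students arriving over four steps.
SPARES = 2
for users, nodes in simulate([0, 90, 170, 90], spares=SPARES):
    print(f"{users:3d} users -> {nodes} nodes ({nodes - SPARES} in use, {SPARES} empty)")
```

In this model the spare count sets the headroom for the next burst (2 empty nodes at 45 pods each absorb 90 logins), while the total node count follows the number of users already running.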

@balajialg
Contributor

balajialg commented Jan 18, 2023

Thanks, @ryanlovett, for the detailed explanation! It makes complete sense. I had assumed that the Google Calendar-based auto-scaler events allocate exactly the number of nodes given in the calendar event. That belief was strengthened when I saw that 10 hot spares were allocated for this class during Spring 2022 (snapshot attached). I changed the calendar entry to two nodes for now; let's tweak it based on real-time metrics, as you suggested.

image

@ryanlovett
Collaborator

@balajialg I made the same assumption the first time I adjusted the calendar. :)

@balajialg
Contributor

balajialg commented Jan 19, 2023

@dbroockman Created a calendar event to handle this request. Please keep us posted about your experience during the classes. We can increase resources if you run into issues.

@dbroockman
Author

Will do. I would ask that you err on the side of more resources for our first class this coming Monday because if everything fails on the first day it will set a really bad tone for this large class of students who have never done coding/data science before.

@balajialg
Contributor

Thanks for the heads up @dbroockman! I will check with the team (once again) if further tweaks are required for the current allocation which seems sufficient. Thanks!

@balajialg
Contributor

@dbroockman How did the classes go yesterday? Did students have a smooth experience using Datahub?

@dbroockman
Author

I think it went fine, thanks for checking in. FYI the class alternates between days when they do group work (so only 1/3 of the class uses Datahub) and when they all do, so please don't reduce resources further by looking at performance tomorrow for example :).

@balajialg
Contributor

balajialg commented Jan 25, 2023

@dbroockman Good to hear. Yes, we won't scale down the allocated resources until the end of the semester. Closing this issue, as the request seems to be fulfilled by the creation of the calendar event. Please re-open this issue if you need anything else. Thanks!

@dbroockman
Author

I'm hearing from students it usually takes around 4-5 minutes for notebooks to load at the beginning of class when all the students are logging in at once (~300 students). Is it possible to up the resources a bit?

@balajialg
Contributor

balajialg commented Feb 8, 2023

@dbroockman Is it happening on only one of the instruction days, or on both Mondays and Wednesdays?

@dbroockman
Author

dbroockman commented Feb 8, 2023 via email

@balajialg
Contributor

@dbroockman Got it. I bumped up the resources for your class today. Can you let me know how it goes? I will replicate the same changes for both days if it works well today or else bump it up further.

@balajialg balajialg reopened this Feb 8, 2023
@dbroockman
Author

dbroockman commented Feb 9, 2023 via email

@balajialg
Contributor

@dbroockman Sounds good.

Modified the calendar events to have placeholder nodes for Monday's event equivalent to what we had yesterday. I will keep the issue open to see the startup times of the R hub on Monday.

@shaneknapp
Contributor

@dbroockman our monitoring is lying to us. i'll be adding some functionality to allow us to delay certain hubs from launching servers, and we will be able to use this to hunt down why we're not seeing it in grafana, and then what we can do to improve performance and monitoring/alerting.

issue: #4237
PR: #4241

@balajialg
Contributor

balajialg commented Feb 14, 2023

@dbroockman How did the class go yesterday? I see that 300+ students logged into the R hub and there was a corresponding increase in the resource allocation.

image
image

@dbroockman
Author

dbroockman commented Feb 14, 2023 via email

@balajialg
Contributor

balajialg commented Feb 14, 2023

that's great to hear @dbroockman. Thanks!

@balajialg
Contributor

balajialg commented Feb 16, 2023

@dbroockman Quick clarification: we are debugging an issue raised by another instructor that involves slow launches of the Jupyter application, and we wanted to check whether what you reported a few weeks back is similar. When students reported server startup times of ~4 to 5 minutes, were they going by the timer shown when you log in to Datahub (something like the snapshot below), which reports the total time taken to load the server? This would help us troubleshoot the issue further. Thanks!

image

@dbroockman
Author

dbroockman commented Feb 17, 2023 via email

@balajialg
Contributor

@dbroockman Sounds good. Thanks for your inputs.

@dbroockman
Author

Around 1/3 of students are experiencing very long waits to get access today. Here are examples of what they're seeing.

image
image
image

@balajialg
Contributor

Sorry for the issue, @dbroockman! It seems there was a rush between 5 and 5:30 PM, when 190+ students accessed the R hub.
image

We provisioned 4 additional placeholders for this class using the calendar scaler. The node count went from 1 to 3 during the scheduled time (which means we did over-allocate for this class). @ryanlovett @felder Any idea what might be going wrong here?

image
image

@dbroockman
Author

dbroockman commented Feb 23, 2023 via email

@balajialg
Contributor

balajialg commented Feb 23, 2023

@dbroockman Really sorry about the experience today. We need to estimate the cost of allocating each node. Having said that, we have no problem allocating more resources starting with the next class. From my understanding, we allocated more than the required resources for this class (based on the node count), and the slow server startup times still persisted. I am trying to understand whether this is an issue with the calendar scaler or whether something else is causing the slow startup times.

@ryanlovett
Collaborator

@balajialg The calendar event for scale up starts at 4:30p, but the graph shows that the nodes don't scale up until after class started. The logs for the pod don't show that it ever parsed the PS3 calendar event. It is not working properly.

I just killed the existing node-placeholder pod (running for 30d).
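
One way to quantify the lag described above is to compare the event start time against the first observed increase in node count. The sketch below is a hedged illustration only: the helper names (`first_scale_up`, `scale_up_delay_minutes`) are hypothetical, and the timestamps and node counts are made-up sample values, not data from the actual graph.

```python
from datetime import datetime

def first_scale_up(samples):
    """Return the timestamp of the first sample whose node count exceeds the previous one."""
    for (_, prev_count), (ts, count) in zip(samples, samples[1:]):
        if count > prev_count:
            return ts
    return None

def scale_up_delay_minutes(samples, event_start):
    """Minutes between the event start and the first observed scale-up (None if it never happened)."""
    ts = first_scale_up(samples)
    if ts is None:
        return None
    return (ts - event_start).total_seconds() / 60

# Made-up samples of (timestamp, node count); the event is assumed to start at 4:30 PM.
event_start = datetime(2023, 1, 1, 16, 30)
samples = [
    (datetime(2023, 1, 1, 16, 20), 1),
    (datetime(2023, 1, 1, 16, 40), 1),
    (datetime(2023, 1, 1, 17, 5), 2),
    (datetime(2023, 1, 1, 17, 15), 3),
]
delay = scale_up_delay_minutes(samples, event_start)
print(delay)
```

A positive delay means nodes were added only after the calendar event had begun, which is the symptom reported here; a working scaler would be expected to show a delay near zero.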

@balajialg
Contributor

balajialg commented Feb 23, 2023

FWIW, all the users in the R hub were placed on the 3 nodes:

image
image
image

@balajialg
Contributor

@dbroockman Going forward, we plan to stop increasing resources for your class through calendar events, to avoid issues like today's, where the calendar event did not work as expected. We are currently exploring alternatives, which might involve a) configuring these values within our code base or b) bumping up resources for the R hub overall. We will get back to you tomorrow or the day after with a plan of action, so that you have some clarity before your next class on the 27th.

@dbroockman
Author

dbroockman commented Feb 23, 2023 via email

@balajialg
Contributor

@dbroockman Manually provisioned resources for your class today (~4 nodes). That seems to be the path forward until we find a better solution.
image

@dbroockman
Author

dbroockman commented Feb 28, 2023 via email

@balajialg
Contributor

@dbroockman Thanks for the clarity! That's helpful.

@balajialg
Contributor

balajialg commented Mar 2, 2023

@dbroockman Manually configured the required resources for today's class by monitoring the metrics. It looks like students are having a smooth experience today (an assumption based on the metrics and on being able to log in to the R hub without any delay). Please let us know if there are any issues.

@balajialg
Contributor

balajialg commented Mar 9, 2023

@dbroockman Closing this issue, as the improvements to the calendar-based scaler seem to have worked during the last two classes. Please feel free to re-open it if you have any specific queries. Thanks!


5 participants