Increase compute resources for Course PS 3 January - April 2023 Mondays/Wednesdays 5-6:30 pm #4009
Comments
@dbroockman We are making major upgrades to the infrastructure which might affect the current functionality of the calendar-based auto scaler. We are looking at a 2-3 week timeline to make these changes. We should be able to look into this request and resolve it once the upgrades are completed. |
@shaneknapp Can we increase resources for this class, or do we need to tweak the autoscaler script further? |
@felder @shaneknapp @ryanlovett I created a recurring calendar event for Pol Sci 3 from 4:30 to 6:30 PM every Monday and Wednesday and allocated 8 nodes for this class of 350 students (~45 pods per node, amounting to ~8 nodes) in the R hub. Let me know whether the allocation makes sense. |
@balajialg That number should not be set to the total number of nodes to allocate. It is how many available spares ("hotspares" or "placeholders") there should be. It addresses the rate of growth: if a lot of people log in within a short amount of time, the hot spare value should be higher. If all 350 log in at exactly 4:30p, try setting the value to 2 starting at 4:20p. The cluster will make two additional empty nodes available, which can handle 90 logins (based on your pod density figure), not counting the nodes that are already running user pods. The cluster will continue to ramp up nodes, always keeping "+2" empty. Once class is underway and the rate of growth has slowed, the hotspare/placeholder count can be set back to 1 or 0. If 2 proves to be insufficient it can be scaled up higher, and if it is more than enough it can be set lower. You can verify by checking how many users are pending and how long server startup takes. |
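The arithmetic in the two comments above can be checked with a minimal sketch. It only restates the figures quoted in the thread (350 students, ~45 pods per node, 2 hot spares); the function names are hypothetical and are not part of the calendar scaler.

```python
import math


def nodes_for_users(users: int, pods_per_node: int) -> int:
    """Total nodes needed if every user runs one pod (hypothetical helper)."""
    return math.ceil(users / pods_per_node)


def spare_login_capacity(hot_spares: int, pods_per_node: int) -> int:
    """Logins that the empty placeholder nodes can absorb before new nodes must boot."""
    return hot_spares * pods_per_node


# Figures quoted in the thread; the pod density is an approximate value.
print(nodes_for_users(350, 45))     # 8
print(spare_login_capacity(2, 45))  # 90
```

The first number is the eventual size of the cluster for the whole class, while the second is the burst that the hot spares can absorb at any moment. The calendar value controls only the second.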
Thanks, @ryanlovett, for the detailed explanation! It makes complete sense. I had mistakenly assumed that Google Calendar-based autoscaler events allocate exactly the number of nodes specified in the calendar event. This belief was reinforced when I saw that 10 hot spares were allocated for this class during Spring 2022 (snapshot attached). I changed the calendar entry to two nodes for now; let's tweak it based on real-time metrics (as you highlighted). |
@balajialg I made the same assumption the first time I adjusted the calendar. :) |
@dbroockman I created a calendar event to handle this request. Please keep us posted about your experience during the classes. We can increase resources if you run into issues. |
Will do. I would ask that you err on the side of more resources for our first class this coming Monday because if everything fails on the first day it will set a really bad tone for this large class of students who have never done coding/data science before. |
Thanks for the heads up @dbroockman! I will check with the team (once again) if further tweaks are required for the current allocation which seems sufficient. Thanks! |
@dbroockman How did the classes go yesterday? Did students have a smooth experience using Datahub? |
I think it went fine, thanks for checking in. FYI the class alternates between days when they do group work (so only 1/3 of the class uses Datahub) and when they all do, so please don't reduce resources further by looking at performance tomorrow for example :). |
@dbroockman Good to hear. Yes, we won't scale down the allocated resources until the end of the semester. Closing this issue, as the request seems to be fulfilled by the creation of the calendar event. Please re-open this issue if you need anything else. Thanks |
I'm hearing from students it usually takes around 4-5 minutes for notebooks to load at the beginning of class when all the students are logging in at once (~300 students). Is it possible to up the resources a bit? |
@dbroockman Is it happening on only one of the instruction days, or on both Mondays and Wednesdays? |
Both, but especially Mondays. After Presidents' Day that will be Wednesdays (the format of the class alternates between days).
|
@dbroockman Got it. I bumped up the resources for your class today. Can you let me know how it goes? I will replicate the same changes for both days if it works well today or else bump it up further. |
Class was better today. Let's see how Monday goes, though. Thank you!
|
@dbroockman Sounds good. I modified the calendar so that Monday's event has the same number of placeholder nodes as yesterday's. I will keep the issue open to see the startup times of the R hub on Monday. |
@dbroockman our monitoring is lying to us. i'll be adding some functionality to allow us to delay certain hubs from launching servers, and we will be able to use this to hunt down why we're not seeing it in grafana, and then what we can do to improve performance and monitoring/alerting. |
@dbroockman How did the class go yesterday? I see that 300+ students logged into the R hub and there was a corresponding increase in the resource allocation. |
No issues yesterday, thank you!
|
that's great to hear @dbroockman. Thanks! |
@dbroockman Quick clarification: we are debugging an issue raised by another instructor that involves the slow launch of the Jupyter application, and we wanted to check whether what you reported a few weeks back is similar. When students reported server startup times of ~4 to 5 minutes, were they going by the timer that appears when you log in to Datahub (something like the snapshot below), which reports the total time taken to load the server? This would help us troubleshoot the issue further. Thanks! |
Unfortunately they told me how long they had to wait, but I don't know which
waiting screen they were stuck on. My next class is Wednesday at 5:00, so if
you try to log in on Wednesday around 5:05 you'll probably experience what
they experience.
|
@dbroockman Sounds good. Thanks for your inputs. |
Sorry for the issue, @dbroockman! It seems there was a rush between 5:00 and 5:30 PM, when 190+ students accessed the R hub. We provisioned 4 additional placeholders for this class using the calendar scaler. Node count went from 1 to 3 during the scheduled time (which means we did over-allocate for this class). @ryanlovett @felder Any idea what might be going wrong here? |
Can we just up the node count to 10 or something? How much does each node
cost…?
|
@dbroockman Really sorry about the experience today. We need to estimate the cost of allocating each node. Having said that, we don't have any problem allocating more resources starting with the next class. From my understanding, we did allocate more than the required resources for this class (based on the node count), and the slow server startup time still persisted. I am trying to understand whether this is an issue with the calendar scaler or whether something else is causing the slow startup times. |
@balajialg The calendar event for scale up starts at 4:30p, but the graph shows that the nodes don't scale up until after class started. The logs for the pod don't show that it ever parsed the PS3 calendar event. It is not working properly. I just killed the existing node-placeholder pod (running for 30d). |
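One way to cross-check what the graph shows is to work out which placeholder count a schedule calls for at a given moment and compare it with what the cluster actually ran. The sketch below is purely illustrative: `ScaleEvent` and `placeholders_at` are hypothetical names, not the node-placeholder scaler's own code, and the event values (4:30 to 6:30 PM on Feb 22, 2023, 4 placeholders) are taken from the comments above.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class ScaleEvent:
    """A scheduled scale-up window (hypothetical structure for illustration)."""
    name: str
    start: datetime
    end: datetime
    placeholders: int


def placeholders_at(now: datetime, events: list[ScaleEvent], default: int = 0) -> int:
    """Return the largest placeholder count among events active at `now`."""
    active = [e.placeholders for e in events if e.start <= now < e.end]
    return max(active, default=default)


ps3 = ScaleEvent(
    name="PS 3",
    start=datetime(2023, 2, 22, 16, 30),
    end=datetime(2023, 2, 22, 18, 30),
    placeholders=4,
)

print(placeholders_at(datetime(2023, 2, 22, 16, 0), [ps3]))   # 0, before the window
print(placeholders_at(datetime(2023, 2, 22, 16, 45), [ps3]))  # 4, inside the window
```

If the expected count at 4:45p is 4 but the node graph stays flat until after 5:00p, the event was most likely never picked up, which matches the missing log entry described above.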
@dbroockman - Going forward, we plan to stop increasing resources for your class through calendar events, to avoid issues like today's, where the calendar event did not work as expected. We are currently exploring alternatives, which might involve a) configuring these values within our code base or b) bumping up resources for the R hub overall. We will get back to you tomorrow or the day after with a plan of action so that you have some clarity before your next class on the 27th. |
Thanks - appreciated!
|
@dbroockman I manually provisioned resources for your class today (~4 nodes). That seems to be the way forward until we find a better solution. |
Thanks, appreciate it. Going forward, Mondays are moderate-high utilization
(they have group assignments in small groups) and Wednesdays are high
utilization (individual assignments so everyone’s machine is on). It was
previously the reverse, but after we missed a Monday due to Presidents’ Day
things switched.
|
@dbroockman Thanks for the clarity! That's helpful. |
@dbroockman I manually configured the required resources for the class today by monitoring the metrics. It looks like students are having a smooth experience during today's class (I am assuming this from the metrics and from being able to log in to the R hub without any delay). Please let us know if there are any issues. |
@dbroockman Closing this issue, as the improvements to the calendar-based scaler seem to have worked during the last two classes. Please feel free to re-open if you have any specific queries. Thanks! |
Course Name
PS 3
Detailed Requirements
350 students will be using R.Datahub simultaneously from 5-6:30 pm on Mondays and Wednesdays during spring 2023
Semester Details
No
Request Deadline
First class January 23
Last class April 26