Reduce cost of existing hub infra #338
Comments
#235 is relevant
/cc @damianavila, who was interested in looking through this. I added you to the projects as well :)
It looks like for the cloudbank hubs, the cost has increased since mid-March. This is probably related to the number of core nodes currently running (4). What's curious is that there is no user node.
Thanks for looking into it, @GeorgianaElena. Can you investigate reducing the number of core nodes for now? I'd suggest identifying nodes that can be eliminated (look for active users in hubs; any core pods in hubs without users are fair game), cordoning them, and then deleting all the pods on them. #278 does look suspicious. Let's revert that.
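For reference, one way that manual cleanup might look with kubectl (a sketch only; the node name is a placeholder, the node-purpose label is the one zero-to-jupyterhub-style deployments usually apply, and the exact drain flags depend on the kubectl version):

```sh
# List nodes in the core pool to find candidates for removal
# (the hub.jupyter.org/node-purpose label is an assumption about how
#  this cluster labels its pools; adjust if it differs)
kubectl get nodes -l hub.jupyter.org/node-purpose=core

# Check which pods are still running on a candidate node
kubectl get pods --all-namespaces --field-selector spec.nodeName=<node-name>

# Cordon the node so nothing new gets scheduled onto it
kubectl cordon <node-name>

# Evict the remaining pods; once the node is empty, the autoscaler
# (or a manual resize of the pool) can remove it
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
```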
Quick question here... (maybe not that quick 😉) So in the pilot-hubs deployment, you have two pools (actually three, I think, but let's focus on the two involved here): the core one and the users one, right? With #278 you were trying to put user pods onto the core nodes until you really need to spin up the users one, correct? @GeorgianaElena implemented the extra affinity, and she said in one of her comments:
I would say that is expected...
I would say what Georgiana was thinking was correct... if the preferred affinity was toward the user pool, why would the pod be scheduled on the core pool? But then @yuvipanda replied with:
Why will the autoscaler spin up a user node when the pressure is on the core node? I may be missing a lot of things here because I just started looking into the pilot-hubs deployment, and I have not played with affinity before (the deployments I have worked on do not differentiate between core and user pods, which I think is a nice idea, although you could end up with empty user nodes, which I presume is what motivated #278).
Great questions, @damianavila!
Yes. I tried to explain the choice of node pools a bit here.
Exactly! We noticed that spinning up a new node takes considerable time, and we thought that this is what made the tests hit their timeout pretty often.
Because the scale-up event would have been triggered by a user pod spawn, and the user pod "prefers" a user node. At least, this is how I understood things. cc @yuvipanda, who can confirm or correct me if I missed anything or said anything off 😅
But is the affinity "strong" enough to trigger the creation of a user node instead of a core one?
But since you are not using the NodeSelector anymore in that PR, can that pod still be "named/classified" as a user pod now? I guess the …
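For context, the "preferred" (soft) affinity being discussed typically looks something like the stanza below in the user pod spec (a minimal sketch; the hub.jupyter.org/node-purpose label and the weight are the defaults zero-to-jupyterhub uses, and the actual pilot-hubs config may differ). Because it is preferredDuringScheduling rather than requiredDuringScheduling, the scheduler may still place the pod on a core node with spare capacity, which matches the behaviour described above:

```yaml
# Sketch of a soft (preferred) node affinity on a single-user pod.
# "preferred" means the scheduler favours user nodes but can still fall
# back to a core node with room; "required" would instead force a
# user-node scale-up.
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
            - key: hub.jupyter.org/node-purpose
              operator: In
              values:
                - user
```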
I'll close this in favor of the more concrete #4024
Description
As of Feb 2021, our GCP hub infrastructure seems to cost around $400 a month. This is too much. We should investigate what is costing so much and reduce it.
Expected timeline
I suspect there are some 'simple fixes' we can do that will bring us to ~$300 a month, and then we can work down to $200.
Tasks to complete