Reduce cost of existing hub infra #338

Closed
2 of 4 tasks
yuvipanda opened this issue Apr 6, 2021 · 8 comments
Labels
Enhancement: An improvement to something or creating something new.

Comments

@yuvipanda
Member

yuvipanda commented Apr 6, 2021

Description

As of Feb 2021, our GCP hub infrastructure seems to cost around $400 a month. This is too much. We should investigate what is costing this much and reduce it.

Expected timeline

I suspect there will be some 'simple fixes' we can do that will bring us to ~$300 a month, and then we can go down to ~$200.

Tasks to complete

@yuvipanda added the goal label Apr 6, 2021
@yuvipanda
Member Author

#235 is relevant

@yuvipanda
Member Author

yuvipanda commented Apr 6, 2021

/cc @damianavila, who was interested in looking through this. I added you to the projects as well :)

@GeorgianaElena
Member

It looks like for the cloudbank hubs, the cost has increased since mid-March:
[Image: cb-costs]

And this is probably related to the number of core nodes currently running (4):
[Image: core-pool-cb]

What's curious is that there is no user node.
@yuvipanda, do you think this way of scheduling all pods on core nodes could be related to this PR: #278?

@yuvipanda
Member Author

Thanks for looking into it, @GeorgianaElena. Can you investigate reducing the number of core nodes for now? I'd suggest identifying nodes that can be eliminated (look for active users in the hubs; any core pods in hubs without users are fair game), cordoning them, and then deleting all the pods on them.
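For reference, a minimal sketch of what cordoning amounts to at the Kubernetes API level: it just marks the Node unschedulable (the node name below is a placeholder; in practice `kubectl cordon <node>` followed by deleting or draining the remaining pods does the same thing).

```yaml
# Sketch only: cordoning sets spec.unschedulable on the Node object, so no new
# pods get scheduled there. Equivalent to `kubectl cordon <node-name>`.
# Existing pods still have to be deleted/evicted (e.g. `kubectl drain <node-name>
# --ignore-daemonsets`) before the autoscaler can remove the node.
apiVersion: v1
kind: Node
metadata:
  name: gke-core-pool-placeholder-node   # placeholder, not a real node in this cluster
spec:
  unschedulable: true
```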

#278 does look suspicious. Let's revert that.

@choldgraf added the Enhancement label and removed the type: goal label Apr 15, 2021
@damianavila
Contributor

Quick question here... (maybe not that quick 😉)

So in the pilot-hubs deployment, you have 2 node pools (actually 3, I think, but let's focus on the 2 involved here): the core one and the user one, right?

With #278 you were trying to put user pods onto the core nodes until you really need to spin up a user node, correct?

@GeorgianaElena implemented the extra affinity and she said in one of her comments:

> I started a user server when only one core node and no user node was available.
> the user server got scheduled on the core node (✔️ this is what should have happened, right?)

I would say that is expected...

> I started a user server when only one core node and one user node were available.
> the user server got scheduled on the core node again (❌ It should have preferred the user node instead, right?)

I would say what Georgiana was thinking was correct... if the preferred affinity was toward the user pool, why would the pod be scheduled on the core pool?

But then @yuvipanda replied with:

> I think the two scenarios you tested are right, @GeorgianaElena! The third one is that core node is full, new node needs to come up. It should spawn a user node, not a core node.

Why would the autoscaler spin up a user node when the pressure is on the core node?

I may be missing a lot of things here because I just started looking into the pilot-hubs deployment, and I have not played with affinity stuff before (the deployments I have worked on do not differentiate between core and user pods, which I think is a nice idea, although you could end up with empty user nodes, which I presume is what prompted #278).

@GeorgianaElena
Member

Great questions, @damianavila!
I tried to provide some answers, to the best of my knowledge :)

> So in the pilot-hubs deployment, you have 2 node pools (actually 3, I think, but let's focus on the 2 involved here): the core one and the user one, right?

Yes. I tried to explain the node pool choices a bit, here.

> With #278 you were trying to put user pods onto the core nodes until you really need to spin up a user node, correct?

Exactly! We noticed that spinning up a new node takes considerable time, and we thought that this was what made the tests hit their timeout pretty often.
To solve this, since there is always at least one core node available, it seemed like a good idea to piggyback on it and schedule user pods there too, until there are no resources left on that core node. Once that node became full, future user-pod spawns would have triggered the creation of a new user node (and not a core node, because of the "affinity" we had just set).
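As a rough illustration of that "preferred" affinity on a user pod, assuming a hypothetical `hub.jupyter.org/node-purpose` label (the exact key and value used in #278 may differ):

```yaml
# Sketch only: a soft scheduling hint. The scheduler prefers nodes labelled as
# user nodes, but will still fall back to any node with capacity (e.g. a core node).
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
            - key: hub.jupyter.org/node-purpose   # assumed label, for illustration
              operator: In
              values:
                - user
```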

> Why would the autoscaler spin up a user node when the pressure is on the core node?

Because the scale-up event would have been triggered by a user pod spawn and the user pod "prefers" a user node. At least, this is how I understood things. In practice, in the cloudbank cluster, we noticed there were 4 core nodes and no user node running... and this is why the PR was reverted. (Also, the core node machines cost more than the user ones.)

cc @yuvipanda, who can confirm, or correct me if I missed anything or said anything off 😅

@damianavila
Contributor

> Once that node became full, future user-pod spawns would have triggered the creation of a new user node (and not a core node, because of the "affinity" we had just set).

But is the affinity "strong" enough to trigger the creation of a user node instead of a core one?
I guess it could be enough if it is required but maybe not if it is preferred, right?
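For context, the "required" form of such a rule is a hard constraint that the scheduler (and the cluster autoscaler's scheduling simulation) cannot violate, whereas the "preferred" form sketched above is only a weighted hint that can be ignored while a core node still has room. A sketch with the same assumed label:

```yaml
# Sketch only: a hard constraint. The pod can only be scheduled on nodes carrying
# this label, so a scale-up triggered by it has to produce a matching (user) node.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: hub.jupyter.org/node-purpose   # assumed label, for illustration
              operator: In
              values:
                - user
```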

> Because the scale-up event would have been triggered by a user pod spawn and the user pod "prefers" a user node.

But since you are not using the nodeSelector anymore in that PR, can that pod still be "named/classified" as a user pod? I guess the preferred affinity in that PR could place the pod onto an already existing user node, but if no user node exists, why would this affinity-enabled user pod trigger the creation of a new user node?
Again, I might be missing a lot of stuff; sorry in advance if the questions have obvious answers (feel free to point me at code/docs to look at).
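For comparison, a nodeSelector (which #278 apparently stopped using) is also a hard constraint, so a pod carrying one is unambiguously tied to a matching node pool. A sketch with the same assumed label:

```yaml
# Sketch only: a nodeSelector is a hard requirement, equivalent to the "required"
# affinity above for simple label matches.
nodeSelector:
  hub.jupyter.org/node-purpose: user   # assumed label, for illustration
```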

@consideRatio
Contributor

I'll go for a close on this in favor of the more concrete #4024
