Reduce cost of existing hub infra #338

Closed
2 of 4 tasks
yuvipanda opened this issue Apr 6, 2021 · 8 comments
Labels
Enhancement: An improvement to something or creating something new.

Comments

@yuvipanda
Member

yuvipanda commented Apr 6, 2021

Description

As of Feb 2021, our GCP hub infrastructure seems to cost around $400 a month. This is too much. We should investigate what is costing this much and reduce it.

Expected timeline

I suspect there will be some 'simple fixes' we can do that will bring us to ~$300 a month, and then we can go down to ~$200.

Tasks to complete

@yuvipanda added the goal label Apr 6, 2021
@yuvipanda
Member Author

#235 is relevant

@yuvipanda
Member Author

yuvipanda commented Apr 6, 2021

/cc @damianavila, who was interested in looking through this. I added you to the projects as well :)

@GeorgianaElena
Member

It looks like for the cloudbank hubs, the cost has increased since mid-March:
[Image: cb-costs]

And this is probably related to the number of core nodes currently running (4):
[Image: core-pool-cb]

What's curious is that there is no user node.
@yuvipanda, do you think this way of scheduling all pods on core nodes could be related to this PR: #278?

@yuvipanda
Member Author

Thanks for looking into it, @GeorgianaElena. Can you investigate reducing the number of core nodes for now? I'd suggest identifying nodes that can be eliminated (look for active users in the hubs; any core pods in hubs without users are fair game), cordoning them, and then deleting all the pods on them.
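For reference, a minimal sketch of what cordoning amounts to at the Kubernetes API level: it just marks the Node unschedulable (the node name below is a placeholder; in practice `kubectl cordon <node>` followed by deleting or draining the remaining pods does the same thing).

```yaml
# Sketch only: cordoning sets spec.unschedulable on the Node object, so no new
# pods get scheduled there. Equivalent to `kubectl cordon <node-name>`.
# Existing pods still have to be deleted/evicted (e.g. `kubectl drain <node-name>
# --ignore-daemonsets`) before the autoscaler can remove the node.
apiVersion: v1
kind: Node
metadata:
  name: gke-core-pool-placeholder-node   # placeholder, not a real node in this cluster
spec:
  unschedulable: true
```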

#278 does look suspicious. Let's revert that.

@choldgraf added the Enhancement label and removed the type: goal label Apr 15, 2021
@damianavila
Contributor

Quick question here... (maybe not that quick 😉)

So in the pilot-hubs deployment, you have 2 node pools (actually 3, I think, but let's focus on the 2 involved here): the core one and the user one, right?

With #278 you were trying to put user pods onto the core nodes until you really need to spin up a user node, correct?

@GeorgianaElena implemented the extra affinity and she said in one of her comments:

> I started a user server when only one core node and no user node was available.
> the user server got scheduled on the core node (✔️ this is what should have happened, right?)

I would say that is expected...

> I started a user server when only one core node and one user node were available.
> the user server got scheduled on the core node again (❌ It should have preferred the user node instead, right?)

I would say what Georgiana was thinking was correct... if the preferred affinity was toward the user pool, why would the pod be scheduled on the core pool?

But then @yuvipanda replied with:

> I think the two scenarios you tested are right, @GeorgianaElena! The third one is that core node is full, new node needs to come up. It should spawn a user node, not a core node.

Why would the autoscaler spin up a user node when the pressure is on the core node?

I may be missing a lot of things here because I just started looking into the pilot-hubs deployment, and I have not played with affinity stuff before (the deployments I have worked on do not differentiate between core and user pods, which I think is a nice idea, although you could end up with empty user nodes, which I presume is what prompted #278).

@GeorgianaElena
Member

Great questions, @damianavila!
I tried to provide some answers, to the best of my knowledge :)

> So in the pilot-hubs deployment, you have 2 node pools (actually 3, I think, but let's focus on the 2 involved here): the core one and the user one, right?

Yes. I tried to explain the node pool choices a bit, here.

> With #278 you were trying to put user pods onto the core nodes until you really need to spin up a user node, correct?

Exactly! We noticed that spinning up a new node takes considerable time, and we thought that this was what made the tests hit their timeout pretty often.
To solve this, since there is always at least one core node available, it seemed like a good idea to piggyback on it and schedule user pods there too, until there are no resources left on that core node. Once that node became full, future user-pod spawns would have triggered the creation of a new user node (and not a core node, because of the "affinity" we had just set).
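As a rough illustration of that "preferred" affinity on a user pod, assuming a hypothetical `hub.jupyter.org/node-purpose` label (the exact key and value used in #278 may differ):

```yaml
# Sketch only: a soft scheduling hint. The scheduler prefers nodes labelled as
# user nodes, but will still fall back to any node with capacity (e.g. a core node).
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
            - key: hub.jupyter.org/node-purpose   # assumed label, for illustration
              operator: In
              values:
                - user
```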

> Why would the autoscaler spin up a user node when the pressure is on the core node?

Because the scale-up event would have been triggered by a user pod spawn and the user pod "prefers" a user node. At least, this is how I understood things. In practice, in the cloudbank cluster, we noticed there were 4 core nodes and no user node running... and this is why the PR was reverted. (Also, the core node machines cost more than the user ones.)

cc @yuvipanda, who can confirm, or correct me if I missed anything or said anything off 😅

@damianavila
Contributor

> Once that node became full, future user-pod spawns would have triggered the creation of a new user node (and not a core node, because of the "affinity" we had just set).

But is the affinity "strong" enough to trigger the creation of a user node instead of a core one?
I guess it could be enough if it is required but maybe not if it is preferred, right?
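For context, the "required" form of such a rule is a hard constraint that the scheduler (and the cluster autoscaler's scheduling simulation) cannot violate, whereas the "preferred" form sketched above is only a weighted hint that can be ignored while a core node still has room. A sketch with the same assumed label:

```yaml
# Sketch only: a hard constraint. The pod can only be scheduled on nodes carrying
# this label, so a scale-up triggered by it has to produce a matching (user) node.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: hub.jupyter.org/node-purpose   # assumed label, for illustration
              operator: In
              values:
                - user
```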

> Because the scale-up event would have been triggered by a user pod spawn and the user pod "prefers" a user node.

But since you are not using the nodeSelector anymore in that PR, can that pod still be "named/classified" as a user pod? I guess the preferred affinity in that PR could place the pod onto an already existing user node, but if no user node exists, why would this affinity-enabled user pod trigger the creation of a new user node?
Again, I might be missing a lot of stuff; sorry in advance if the questions have obvious answers (feel free to point me at code/docs to look at).
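For comparison, a nodeSelector (which #278 apparently stopped using) is also a hard constraint, so a pod carrying one is unambiguously tied to a matching node pool. A sketch with the same assumed label:

```yaml
# Sketch only: a nodeSelector is a hard requirement, equivalent to the "required"
# affinity above for simple label matches.
nodeSelector:
  hub.jupyter.org/node-purpose: user   # assumed label, for illustration
```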

@consideRatio
Contributor

I'll go for a close on this in favor of the more concrete #4024
