Reduce GCP Fixed Costs by 50% #2453

Adam-D-Lewis · 2024-05-07T19:43:07Z

Reference Issues or PRs

Fixes #2452

What does this implement/fix?

Put an x in the boxes that apply

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds a feature)
Breaking change (fix or feature that would cause existing features not to work as expected)
Documentation Update
Code style update (formatting, renaming)
Refactoring (no functional changes, no API changes)
Build related changes
Other (please describe): Make cost-optimized E2 instances on GCP the default for node groups.

Testing

Did you test the pull request locally?
Did you add new tests?

Any other comments?

dcmcand

❓ Are we sure that the e2 family is suitable for the user and worker node groups? It seems like, for the worker group especially, it might handicap performance. Additionally, e2 instances do not support GPUs. I think it might be better to run the General node group on the e2-highmem-4 and the user and worker node groups on the n4-standard-4. Especially with them scaling down to zero now, I think that would be an acceptable tradeoff for performance vs price.

Adam-D-Lewis · 2024-05-07T20:57:16Z

> ❓ Are we sure that the e2 is suitable for the user and worker node groups? It seems like for the worker group especially, they might handicap performance. Additionally, they do not support GPUs. I think it might be better to run the General node group on the e2-highmem-4 and the user and worker node groups on the n4-standard-4. Especially with them scaling down to zero now, I think that would be an acceptable tradeoff for performance vs price.

Great points @dcmcand , I looked into it a bit more.

It looks like CPU performance (Coremark score) is roughly equal between the 2 types.

Also, for GPU instances, we usually create new node groups specifically for the GPU profiles, although this is not fully documented at the moment (e.g. image below). So I don't see that as an issue for the user or worker node defaults, since users would not use these default node groups for their GPU instances.

The only disadvantage I see is that the maximum egress drops from 10 to 8 Gbps. I believe the cost savings are worth the 20% reduction in bandwidth for most users, though that's just a hunch.
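For anyone who wants to check the bandwidth figure, here is a minimal sketch of the arithmetic. The helper name `percent_reduction` is made up for illustration; the two egress values are the ones quoted above.

```python
def percent_reduction(old: float, new: float) -> float:
    """Return how much smaller `new` is than `old`, as a percentage of `old`."""
    if old <= 0:
        raise ValueError("old must be positive")
    return 100 * (old - new) / old


# Maximum egress figures quoted in this thread, in Gbps.
egress_reduction = percent_reduction(old=10, new=8)
print(f"{egress_reduction:.0f}% less egress bandwidth")
```

The same helper could be applied to hourly instance prices to compare the cost side of the tradeoff, once those numbers are looked up.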

Adam-D-Lewis · 2024-05-09T12:38:11Z

@dcmcand any further concerns or comments?

dcmcand · 2024-05-09T12:50:05Z

I still feel a bit of concern, specifically around the dask worker. However, I have no data to actually justify my concern.

If someone upgrades and then applies the new config, it will result in the nodes being replaced. Do we have any concerns about that?

I also feel like we should make sure we document this change and how to restore the original functionality. Maybe in the FAQ? But probably also in the release notes.

Adam-D-Lewis · 2024-05-09T14:51:34Z

> I still feel a bit of concern, specifically around the dask worker. However, I have no data to actually justify my concern.
> If someone upgrades and then applies the new config, it will result in the nodes being replaced. Do we have any concerns about that?

We've added node types to the Nebari config that is created when running nebari init, so if people have node types in their config (as is the default), they won't be affected by this change. If they don't have node types in their config, the nodes will be replaced. This makes Nebari unusable for about 15 minutes while the nodes are switched out, but shouldn't cause a problem otherwise. I tested this on a deployment and it worked as expected.
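As a rough illustration of that rule, the sketch below checks an already-parsed config for node groups that do not pin an instance type. Everything named here is an assumption for the example rather than Nebari's actual code: the `google_cloud_platform` / `node_groups` / `instance` key layout, the group names, and the idea that a group without an explicit instance falls back to the default.

```python
from typing import Any, Dict, List

# Assumed group names for illustration; a real deployment may define others.
DEFAULT_GROUPS = ("general", "user", "worker")


def groups_using_defaults(config: Dict[str, Any]) -> List[str]:
    """List node groups with no explicit instance type in the parsed config."""
    gcp = config.get("google_cloud_platform") or {}
    node_groups = gcp.get("node_groups") or {}
    return [
        name
        for name in DEFAULT_GROUPS
        if not (node_groups.get(name) or {}).get("instance")
    ]


# Hypothetical configs: one pins every node type, the other pins none.
pinned = {
    "google_cloud_platform": {
        "node_groups": {
            "general": {"instance": "example-type-a"},
            "user": {"instance": "example-type-b"},
            "worker": {"instance": "example-type-b"},
        }
    }
}
unpinned = {"google_cloud_platform": {}}

print(groups_using_defaults(pinned))
print(groups_using_defaults(unpinned))
```

Under these assumptions, an empty result means the config pins every node type and the new defaults would not apply; any group listed would have its nodes replaced on the next deploy.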

> I also feel like we should make sure we document this change and how to restore the original functionality. Maybe in the FAQ? But probably also in the release notes.

I'll document it in the Nebari upgrade command so that users are notified if this affects them and told what to add to their config to avoid it. We can copy something similar into the release notes.

dcmcand · 2024-05-09T20:02:27Z

sounds good. thanks @Adam-D-Lewis

Adam-D-Lewis · 2024-05-14T18:12:49Z

We don't know yet what the next Nebari version will be (2024.5.2 vs 2024.6.1), so I opened a separate PR (#2466) and assigned it to the 2024.5.2 milestone. My thought is that we merge this as is and make sure to merge the other one with the appropriate version number during the next release step.

change default gcp instances to cost optimized e2 family instances

2ff3012

Adam-D-Lewis requested review from dcmcand, viniciusdc and marcelovilla May 7, 2024 19:44

dcmcand reviewed May 7, 2024

Adam-D-Lewis requested a review from dcmcand May 7, 2024 20:58

dcmcand approved these changes May 9, 2024

Adam-D-Lewis added 6 commits May 14, 2024 08:47

Merge branch 'develop' into reduce_gcp_costs_by_50_percent

516899b

add upgrade message

6bbb3b7

make upgrade step 2024.5.2

75270ba

add upgrade message

c45b083

add link

6471048

remove upgrade step

58e05c4

Adam-D-Lewis mentioned this pull request May 14, 2024

upgrade instructions for PR 2453 #2466

Merged

10 tasks

Adam-D-Lewis merged commit 2a2f2ee into develop May 14, 2024
25 of 26 checks passed

Adam-D-Lewis deleted the reduce_gcp_costs_by_50_percent branch May 14, 2024 18:13
