[Proposal] Cloud cost estimation for JupyterHubs on shared clusters #277

Closed
1 task
Tracked by #262
choldgraf opened this issue Oct 28, 2021 · 13 comments

@choldgraf
Member

choldgraf commented Oct 28, 2021

Description

In our alpha service doc (#262) we need to be able to estimate the cloud resource usage for each community's hub. This is more difficult on shared clusters, because we do not have a 1-to-1 mapping of billing account to hub community.

Here is a proposal to estimate cloud costs for our alpha service, I'd love comments and ideas from others:

It makes the following assumptions, and I'd love feedback about whether these are correct:

  1. The largest driver of cost for most communities will be interactive user sessions
  2. All other cloud costs, like support infrastructure, are shared between user communities and are thus fairly low per community
  3. RAM is the biggest driver of cloud costs, because it is generally the thing that triggers node scale-up events.
  4. We have the ability to track the RAM requested for user pods for each hub by the hour. (I looked at our Prometheus metrics and I think this is possible)

Finally, because I'm not sure how to estimate cloud usage for Dask Gateway hubs yet, I suggest that we only run these hubs on dedicated clusters for the alpha. We can change this if we come up with a good way to estimate their usage on a shared cluster (but not for the alpha).

Value / benefit

This proposal is designed to balance what is possible to measure right now against what is reasonable given our costs. The main benefit is that it accurately reflects our actual costs and can be calculated in a sustainable way from the Prometheus metrics we're already generating.

Implementation details

No response

Tasks to complete

  • Take a look at the proposal and let me know if you think it should be amended in a significant way (feel free to suggest or make edits!)

Updates

No response

@choldgraf choldgraf added the Enhancement An improvement to something or creating something new. label Oct 28, 2021
@choldgraf choldgraf moved this to Needs discussion/decision 💬 in Sprint Board Oct 28, 2021
@choldgraf choldgraf self-assigned this Oct 28, 2021
@yuvipanda
Member

yuvipanda commented Oct 28, 2021

Finally, because I'm not sure how to estimate cloud usage for Dask Gateway hubs yet, I suggest that we only run these hubs on dedicated clusters for the alpha. We can change this if we come up with a good way to estimate their usage on a shared cluster (but not for the alpha).

If we're using prometheus to measure memory requested, we can easily count dask pods as well. I'd say we should just count all memory used in the hub's namespace, which will include most of the shared infra as well as jupyter / dask pods. So I don't think we need to exclude dask-gateway enabled hubs.
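
As a rough illustration of that idea, a query along the following lines could sum memory requests per namespace. This is a sketch only: the metric name comes from kube-state-metrics and may differ between versions, and the Prometheus endpoint is hypothetical.

```python
# Sketch: sum the memory requested by all pods in each namespace over the
# past hour, via the Prometheus HTTP API. The metric name below is the one
# exposed by older kube-state-metrics releases; verify it against the
# cluster's own metrics before relying on it.
import requests

PROMETHEUS_URL = "http://prometheus.example.org"  # hypothetical endpoint

QUERY = (
    "sum by (namespace) ("
    "avg_over_time(kube_pod_container_resource_requests_memory_bytes[1h])"
    ")"
)

resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": QUERY})
resp.raise_for_status()
for result in resp.json()["data"]["result"]:
    namespace = result["metric"]["namespace"]
    mem_gb = float(result["value"][1]) / 2**30
    print(f"{namespace}: ~{mem_gb:.1f} GB requested (1h average)")
```

Because the sum is taken over the whole namespace, Dask worker pods and per-hub support pods are counted automatically, which is what makes this approach workable for Dask Gateway hubs too.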

@choldgraf
Member Author

Sounds good - if we agree that it is possible to measure memory in this way, and that memory is a good proxy for cost according to the rough calculations above, then let's just go with the "total memory requested per hour in a namespace" suggestion of @yuvipanda . Any objections to that?

@choldgraf choldgraf moved this from Needs discussion 💬 to Needs review 👀 in Sprint Board Oct 28, 2021
@choldgraf
Member Author

choldgraf commented Oct 28, 2021

OK I've updated the language in there to reflect the "total amount of RAM requested by a hub per hour" approach.

If I don't hear any objections by Friday, I'll close this issue and we can go with this approach for now, pending feedback from community representatives!

Here's the relevant explanation:


When a user starts their interactive session on a hub, or when a collection of Dask Workers is requested by Dask Gateway, their hub requests space on a node for this work to happen. Nodes are like computers in the cloud, and “time on a node” is the thing we pay for. More nodes means higher costs.

The limiting factor that requires new nodes is Memory (RAM). When there is not enough free RAM on an existing node, a new node is requested (and the cloud costs go up). Thus, a good proxy for cloud costs is the total amount of RAM requested by a hub. So, we use the following steps to calculate monthly cloud costs per hub:

Begin with a fixed cost to cover “support infrastructure” that is shared across all hubs - this is a fixed monthly cost, divided by the number of hubs on the cluster. It is relatively small compared to other cloud costs.

Then, calculate the cloud costs requested by a hub in the following way:

  • Calculate the total amount of RAM requested in an hour by the hub.
  • Convert this into a % of a node’s total RAM capacity. So if a node has 20GB of RAM, and a hub requests 10GB of RAM (say, for 5 user sessions at 2GB of RAM each), then the hub has requested 50% (or 10GB/20GB) of a node for that hour.
  • Convert this into an hourly cost by multiplying the % by the hourly rate of the node. The hourly rate depends on the type of machine used for these nodes. In the case of Google Cloud, it is an n1-highmem-4 node. Here's the cloud pricing for these types of nodes.
  • Calculate the monthly cloud costs by summing hourly costs across the whole month and adding the flat "support infrastructure" cost to it.
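
To make the arithmetic in these steps concrete, here is a minimal sketch. Every constant in it (node size, hourly rate, support-infrastructure cost) is an illustrative placeholder, not actual 2i2c pricing.

```python
# Sketch of the per-hub cost calculation described above; all constants are
# illustrative placeholders, not real pricing.
NODE_RAM_GB = 26             # an n1-highmem-4 node has roughly 26 GB of RAM
NODE_HOURLY_RATE = 0.25      # hypothetical on-demand $/hour for that node
SUPPORT_INFRA_MONTHLY = 30   # hypothetical flat shared-infrastructure $/month per hub

def hourly_cost(ram_requested_gb: float) -> float:
    """Cost attributed to a hub for one hour, as a fraction of a node."""
    node_fraction = ram_requested_gb / NODE_RAM_GB
    return node_fraction * NODE_HOURLY_RATE

def monthly_cost(hourly_ram_requests_gb: list[float]) -> float:
    """Sum hourly costs over the month, then add the flat support-infrastructure cost."""
    return sum(hourly_cost(gb) for gb in hourly_ram_requests_gb) + SUPPORT_INFRA_MONTHLY

# Example: 5 users at 2 GB each (10 GB requested) for 8 hours a day, 22 days a month.
usage = [10.0] * (8 * 22)
print(f"Estimated monthly cost: ${monthly_cost(usage):.2f}")
```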

@yuvipanda
Member

The extra markup should also account for the fact that our nodes will never be 100% full. If we don't do that, we're guaranteed to always undercharge and never recoup 100% of cloud costs. So I'd suggest that whenever we are actually doing the math to figure out the formula / prometheus queries, we try to make it so that we'll get back 100% of what we're paying in aggregate.
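
One possible way to express that adjustment (purely a sketch; the utilization figure is made up):

```python
# Sketch: scale attributed costs by the cluster's observed average node
# utilization so the amounts billed across all hubs add back up to the
# actual cloud bill. The 70% figure is an illustrative assumption.
AVERAGE_NODE_UTILIZATION = 0.7

def adjusted_cost(attributed_cost: float) -> float:
    return attributed_cost / AVERAGE_NODE_UTILIZATION
```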

We should also charge for disk space used by user home directories. Here's a snapshot of costs for the UC Berkeley DataHubs:

[Screenshot: cost breakdown for the UC Berkeley DataHubs]

You'll notice that storage actually costs more money overall than RAM or CPU! That's because you're paying for home directory provisioned storage regardless of whether it is actively being used or not - unlike RAM, which is only paid for when instances are actually running.
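
A sketch of how provisioned storage could be folded into the bill; the per-GB price is a placeholder, not a quote from any provider:

```python
# Sketch: provisioned home-directory disks are billed for the full month
# whether or not they are actively used, so the cost is simply size * rate.
STORAGE_PRICE_PER_GB_MONTH = 0.04   # hypothetical $/GB/month

def monthly_storage_cost(provisioned_gb: float) -> float:
    return provisioned_gb * STORAGE_PRICE_PER_GB_MONTH
```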

The final thing to consider is what we mean when we say 'memory'. There are three things we can use here:

  1. memory guarantee - the minimum amount of RAM each user on a hub is guaranteed
  2. memory use - the actual amount of RAM each user on a hub uses
  3. memory limit - the maximum amount of RAM each user on a hub can use

I've some more info about the difference between these in 2i2c-org/infrastructure#666 (comment). For us, I suggest we count max(memory_guarantee, memory_use) as the memory we charge for?
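
A minimal sketch of that max(memory_guarantee, memory_use) rule (values are illustrative):

```python
# Sketch: bill each user for at least the RAM guaranteed to them, and for
# their actual usage whenever it exceeds that guarantee.
def billable_memory_gb(memory_guarantee_gb: float, memory_use_gb: float) -> float:
    return max(memory_guarantee_gb, memory_use_gb)

print(billable_memory_gb(2.0, 1.5))  # guaranteed 2 GB, used 1.5 GB -> billed for 2 GB
print(billable_memory_gb(2.0, 5.0))  # spiked to 5 GB -> billed for 5 GB
```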

@yuvipanda
Member

To our users, we should commit to making sure the formula and metrics we use to calculate this are open and transparent, but also that they will change as we tune them. I don't think we can or should lose money on the cloud in the long run.

There are also a lot of add-on charges that cloud providers bill us for, which we must somehow incorporate. Here's a list of what the UC Berkeley DataHubs pay for:

[Screenshot: list of cloud add-on charges paid by the UC Berkeley DataHubs]

I think ultimately our goal is to make sure that if we charge all users on a hub, we recover 100% of what we pay to the cloud vendors.

@yuvipanda
Member

So, to recap:

  1. We should add cost for disk use as well
  2. I made some suggestions on how memory should be priced
  3. We should add enough overhead to cover all the other things that cloud providers charge us for
  4. We should try to make sure we recover 100% of our cloud costs from our users

@choldgraf
Member Author

Hmm, can you help me understand which of these ideas should be blockers for the alpha, vs. what we should shoot for in general but not now? Should I just include something hand wavy in the document, with the confidence that we will figure it out by the time we need to send somebody an invoice?

@choldgraf choldgraf moved this from Needs input 🙌 to In Progress ⚡ in Sprint Board Oct 29, 2021
@yuvipanda
Member

Should I just include something hand wavy in the document, with the confidence that we will figure it out by the time we need to send somebody an invoice?

I think that's the way to go. State that we generally commit to just passing through cloud costs based on our calculation of resources your hub is using on a monthly basis, and handwave on the details.

I think the way to develop a cost model is to try to put a dollar number on a particular hub, and then generalize that. I think that will work better than going the other way around. So if you know someone we are working with who is in a position to start paying, we can work on figuring out cloud costs for that user and tweak the formula as we go. We can and should provide some guarantees that we aren't going to be wildly changing it each month, but it is going to change anyway as we make optimizations and rejig infrastructure.

@choldgraf
Member Author

@yuvipanda sounds good - this is what I'll do then. The thing I worry about is that other communities may want a specific number, but let's start with the more hand-wavy approach and we can cross that bridge when we get to it.

@yuvipanda
Member

@choldgraf yeah - I think picking a community and giving them a specific number is something we should prioritize.

@choldgraf
Member Author

To make sure we are on the same page, you're suggesting:

  • In the alpha service doc, we include language that says we'll estimate cloud costs based on hub usage, and give general ideas of what this means (RAM, storage, etc) but without adding specific numbers or equations.
  • When we actually work with a specific community, at the end of a month we'll go through the process of estimating their costs, and use this as a starting point for how to do the same for other communities.
  • Use this to iteratively build up a model for passing through cloud costs.

@yuvipanda
Member

@choldgraf correct!

@choldgraf
Member Author

OK, I've updated our alpha rationale etc with this new language, and we can track the improvements to tracking hub usage costs in 2i2c-org/infrastructure#730

will close this one, thanks all for the helpful explanations!

Repository owner moved this from In Progress ⚡ to Done 🎉 in Sprint Board Nov 1, 2021