[Proposal] Cloud cost estimation for JupyterHubs on shared clusters #277

Closed
1 task
Tracked by #262
choldgraf opened this issue Oct 28, 2021 · 13 comments

@choldgraf
Member

choldgraf commented Oct 28, 2021

Description

In our alpha service doc (#262) we need to be able to estimate the cloud resource usage for each community's hub. This is more difficult on shared clusters, because we do not have a 1-to-1 mapping of billing account to hub community.

Here is a proposal to estimate cloud costs for our alpha service, I'd love comments and ideas from others:

It makes the following assumptions, and I'd love feedback about whether these are correct:

  1. The largest driver of cost for most communities will be interactive user sessions
  2. All other cloud costs, like support infrastructure, are shared between user communities and are thus fairly low per community
  3. RAM is the biggest driver of cloud costs, because it is generally the thing that triggers node scale-up events.
  4. We have the ability to track the RAM requested for user pods for each hub by the hour. (I looked at our Prometheus metrics and I think this is possible)

Finally, because I'm not sure how to estimate cloud usage for Dask Gateway hubs yet, I suggest that we only run these hubs on dedicated clusters for the alpha. We can change this if we come up with a good way to estimate their usage on a shared cluster (but not for the alpha).

Value / benefit

This proposal is designed to balance what is possible to measure right now against what is reasonable given our costs. The main benefit is that it accurately reflects our actual costs and can be calculated in a sustainable way from the Prometheus metrics we're already generating.

Implementation details

No response

Tasks to complete

  • Take a look at the proposal and let me know if you think it should be amended in a significant way (feel free to suggest or make edits!)

Updates

No response

@choldgraf choldgraf added the Enhancement An improvement to something or creating something new. label Oct 28, 2021
@choldgraf choldgraf moved this to Needs discussion/decision 💬 in Sprint Board Oct 28, 2021
@choldgraf choldgraf self-assigned this Oct 28, 2021
@yuvipanda
Member

yuvipanda commented Oct 28, 2021

Finally, because I'm not sure how to estimate cloud usage for Dask Gateway hubs yet, I suggest that we only run these hubs on dedicated clusters for the alpha. We can change this if we come up with a good way to estimate their usage on a shared cluster (but not for the alpha).

If we're using prometheus to measure memory requested, we can easily count dask pods as well. I'd say we should just count all memory used in the hub's namespace, which will include most of the shared infra as well as jupyter / dask pods. So I don't think we need to exclude dask-gateway enabled hubs.
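
As a rough illustration of that idea, a query along the following lines could sum memory requests per namespace. This is a sketch only: the metric name comes from kube-state-metrics and may differ between versions, and the Prometheus endpoint is hypothetical.

```python
# Sketch: sum the memory requested by all pods in each namespace over the
# past hour, via the Prometheus HTTP API. The metric name below is the one
# exposed by older kube-state-metrics releases; verify it against the
# cluster's own metrics before relying on it.
import requests

PROMETHEUS_URL = "http://prometheus.example.org"  # hypothetical endpoint

QUERY = (
    "sum by (namespace) ("
    "avg_over_time(kube_pod_container_resource_requests_memory_bytes[1h])"
    ")"
)

resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": QUERY})
resp.raise_for_status()
for result in resp.json()["data"]["result"]:
    namespace = result["metric"]["namespace"]
    mem_gb = float(result["value"][1]) / 2**30
    print(f"{namespace}: ~{mem_gb:.1f} GB requested (1h average)")
```

Because the sum is taken over the whole namespace, Dask worker pods and per-hub support pods are counted automatically, which is what makes this approach workable for Dask Gateway hubs too.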

@choldgraf
Member Author

Sounds good - if we agree that it is possible to measure memory in this way, and that memory is a good proxy for cost according to the rough calculations above, then let's just go with the "total memory requested per hour in a namespace" suggestion of @yuvipanda . Any objections to that?

@choldgraf choldgraf moved this from Needs discussion 💬 to Needs review 👀 in Sprint Board Oct 28, 2021
@choldgraf
Member Author

choldgraf commented Oct 28, 2021

OK I've updated the language in there to reflect the "total amount of RAM requested by a hub per hour" approach.

If I don't hear any objections by Friday, I'll close this issue and we can go with this approach for now, pending feedback from community representatives!

Here's the relevant explanation:


When a user starts their interactive session on a hub, or when a collection of Dask Workers is requested by Dask Gateway, their hub requests space on a node for this work to happen. Nodes are like computers in the cloud, and “time on a node” is the thing we pay for. More nodes means higher costs.

The limiting factor that requires new nodes is Memory (RAM). When there is not enough free RAM on an existing node, a new node is requested (and the cloud costs go up). Thus, a good proxy for cloud costs is the total amount of RAM requested by a hub. So, we use the following steps to calculate monthly cloud costs per hub:

Begin with a fixed cost to cover “support infrastructure” that is shared across all hubs - this is a fixed monthly cost, divided by the number of hubs on the cluster. It is relatively small compared to other cloud costs.

Then, calculate the cloud costs requested by a hub in the following way:

  • Calculate the total amount of RAM requested in an hour by the hub.
  • Convert this into a % of a node’s total RAM capacity. So if a node has 20GB of RAM, and a hub requests 10GB of RAM (say, for 5 user sessions at 2GB of RAM each), then the hub has requested 50% (or 10GB/20GB) of a node for that hour.
  • Convert this into an hourly cost by multiplying the % by the hourly rate of the node. The hourly rate depends on the type of machine used for these nodes. In the case of Google Cloud, it is an n1-highmem-4 node. Here's the cloud pricing for these types of nodes.
  • Calculate the monthly cloud costs by summing hourly costs across the whole month and adding the flat "support infrastructure" cost to it.
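
To make the arithmetic in these steps concrete, here is a minimal sketch. Every constant in it (node size, hourly rate, support-infrastructure cost) is an illustrative placeholder, not actual 2i2c pricing.

```python
# Sketch of the per-hub cost calculation described above; all constants are
# illustrative placeholders, not real pricing.
NODE_RAM_GB = 26             # an n1-highmem-4 node has roughly 26 GB of RAM
NODE_HOURLY_RATE = 0.25      # hypothetical on-demand $/hour for that node
SUPPORT_INFRA_MONTHLY = 30   # hypothetical flat shared-infrastructure $/month per hub

def hourly_cost(ram_requested_gb: float) -> float:
    """Cost attributed to a hub for one hour, as a fraction of a node."""
    node_fraction = ram_requested_gb / NODE_RAM_GB
    return node_fraction * NODE_HOURLY_RATE

def monthly_cost(hourly_ram_requests_gb: list[float]) -> float:
    """Sum hourly costs over the month, then add the flat support-infrastructure cost."""
    return sum(hourly_cost(gb) for gb in hourly_ram_requests_gb) + SUPPORT_INFRA_MONTHLY

# Example: 5 users at 2 GB each (10 GB requested) for 8 hours a day, 22 days a month.
usage = [10.0] * (8 * 22)
print(f"Estimated monthly cost: ${monthly_cost(usage):.2f}")
```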

@yuvipanda
Member

The extra markup should also account for the fact that our nodes will never be 100% full. If we don't do that, we're guaranteed to always undercharge and never recoup 100% of cloud costs. So I'd suggest that whenever we are actually doing the math to figure out the formula / prometheus queries, we try to make it so that we'll get back 100% of what we're paying in aggregate.
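
One possible way to express that adjustment (purely a sketch; the utilization figure is made up):

```python
# Sketch: scale attributed costs by the cluster's observed average node
# utilization so the amounts billed across all hubs add back up to the
# actual cloud bill. The 70% figure is an illustrative assumption.
AVERAGE_NODE_UTILIZATION = 0.7

def adjusted_cost(attributed_cost: float) -> float:
    return attributed_cost / AVERAGE_NODE_UTILIZATION
```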

We should also charge for disk space used by user home directories. Here's a snapshot of costs for the UC Berkeley DataHubs:

[Screenshot: cost breakdown for the UC Berkeley DataHubs]

You'll notice that storage actually costs more money overall than RAM or CPU! That's because you're paying for home directory provisioned storage regardless of whether it is actively being used or not - unlike RAM, which is only paid for when instances are actually running.
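
A sketch of how provisioned storage could be folded into the bill; the per-GB price is a placeholder, not a quote from any provider:

```python
# Sketch: provisioned home-directory disks are billed for the full month
# whether or not they are actively used, so the cost is simply size * rate.
STORAGE_PRICE_PER_GB_MONTH = 0.04   # hypothetical $/GB/month

def monthly_storage_cost(provisioned_gb: float) -> float:
    return provisioned_gb * STORAGE_PRICE_PER_GB_MONTH
```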

The final thing to consider is what we mean when we say 'memory'. There are three things we can use here:

  1. memory guarantee - the minimum amount of RAM each user on a hub is guaranteed
  2. memory use - the actual amount of RAM each user on a hub uses
  3. memory limit - the maximum amount of RAM each user on a hub can use

I've some more info about the difference between these in 2i2c-org/infrastructure#666 (comment). For us, I suggest we count max(memory_guarantee, memory_use) as the memory we charge for?
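
A minimal sketch of that max(memory_guarantee, memory_use) rule (values are illustrative):

```python
# Sketch: bill each user for at least the RAM guaranteed to them, and for
# their actual usage whenever it exceeds that guarantee.
def billable_memory_gb(memory_guarantee_gb: float, memory_use_gb: float) -> float:
    return max(memory_guarantee_gb, memory_use_gb)

print(billable_memory_gb(2.0, 1.5))  # guaranteed 2 GB, used 1.5 GB -> billed for 2 GB
print(billable_memory_gb(2.0, 5.0))  # spiked to 5 GB -> billed for 5 GB
```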

@yuvipanda
Member

To our users, we should commit to making sure the formula and metrics we use to calculate this are open and transparent, but also that they will change as we tune them. I don't think we can or should lose money on the cloud in the long run.

There are also a lot of add-on charges that cloud providers bill us for, which we must somehow incorporate. Here's a list of what the UC Berkeley DataHubs pay for:

[Screenshot: list of cloud add-on charges paid by the UC Berkeley DataHubs]

I think ultimately our goal is to make sure that if we charge all users on a hub, we recover 100% of what we pay to the cloud vendors.

@yuvipanda
Member

So, to recap:

  1. We should add cost for disk use as well
  2. I made some suggestions on how memory should be priced
  3. We should add enough overhead to cover all the other things that cloud providers charge us for
  4. We should try to make sure we recover 100% of our cloud costs from our users

@choldgraf
Member Author

Hmm, can you help me understand which of these ideas should be blockers for the alpha, vs. what we should shoot for in general but not now? Should I just include something hand wavy in the document, with the confidence that we will figure it out by the time we need to send somebody an invoice?

@choldgraf choldgraf moved this from Needs input 🙌 to In Progress ⚡ in Sprint Board Oct 29, 2021
@yuvipanda
Member

Should I just include something hand wavy in the document, with the confidence that we will figure it out by the time we need to send somebody an invoice?

I think that's the way to go. State that we generally commit to just passing through cloud costs based on our calculation of resources your hub is using on a monthly basis, and handwave on the details.

I think the way to develop a cost model is to try to put a dollar number on a particular hub, and then generalize that. I think that will work better than going the other way around. So if you know someone we are working with who is in a position to start paying, we can work on figuring out cloud costs for that user and tweak the formula as we go. We can and should provide some guarantees that we aren't going to be wildly changing it each month, but it is going to change anyway as we make optimizations and rejig infrastructure.

@choldgraf
Member Author

@yuvipanda sounds good - this is what I'll do then. The thing I worry about is that other communities may want a specific number, but let's start with the more hand-wavy approach and we can cross that bridge when we get to it.

@yuvipanda
Member

@choldgraf yeah - I think picking a community and giving them a specific number is something we should prioritize.

@choldgraf
Member Author

To make sure we are on the same page, you're suggesting:

  • In the alpha service doc, we include language that says we'll estimate cloud costs based on hub usage, and give general ideas of what this means (RAM, storage, etc) but without adding specific numbers or equations.
  • When we actually work with a specific community, at the end of a month we'll go through the process of estimating their costs, and use this as a starting point for how to do the same for other communities.
  • Use this to iteratively build up a model for passing through cloud costs.

@yuvipanda
Member

@choldgraf correct!

@choldgraf
Member Author

OK, I've updated our alpha rationale etc with this new language, and we can track the improvements to tracking hub usage costs in 2i2c-org/infrastructure#730

will close this one, thanks all for the helpful explanations!

Repository owner moved this from In Progress ⚡ to Done 🎉 in Sprint Board Nov 1, 2021