[Proposal] Cloud cost estimation for JupyterHubs on shared clusters #277
Comments
If we're using Prometheus to measure memory requested, we can easily count Dask pods as well. I'd say we should just count all memory used in the hub's namespace, which will include most of the shared infra as well as Jupyter / Dask pods. So I don't think we need to exclude dask-gateway-enabled hubs.
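As a sketch of the aggregation being proposed here: sum the memory requests of every pod in the hub's namespace. The pod names and sizes below are purely hypothetical; in practice the per-pod figures would come from a Prometheus query (e.g. summing `kube_pod_container_resource_requests` filtered to the namespace).

```python
# Hypothetical per-pod memory requests (bytes) for everything in a
# hub's namespace: user servers, dask workers, and shared hub infra.
pod_memory_requests = {
    "jupyter-alice": 2 * 2**30,    # 2 GiB user server
    "dask-worker-abc": 4 * 2**30,  # 4 GiB dask worker
    "hub-7d9f": 1 * 2**30,         # 1 GiB shared hub infrastructure
}

def total_namespace_memory(requests_by_pod):
    """Total memory requested in the namespace, in GiB."""
    return sum(requests_by_pod.values()) / 2**30

print(total_namespace_memory(pod_memory_requests))  # 7.0
```

Because the sum is over the whole namespace, Dask worker pods are charged the same way as user server pods, which is why dask-gateway hubs need no special casing.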
Sounds good. If we agree that it is possible to measure memory in this way, and that memory is a good proxy for cost according to the rough calculations above, then let's just go with @yuvipanda's "total memory requested per hour in a namespace" suggestion. Any objections to that?
OK, I've updated the language in there to reflect the "total amount of RAM requested by a hub per hour" approach. If I don't hear any objections by Friday, I'll close this issue and we can go with this approach for now, pending feedback from community representatives! Here's the relevant explanation:

When a user starts their interactive session on a hub, or when a collection of Dask workers is requested by Dask Gateway, their hub requests space on a node for this work to happen. Nodes are like computers in the cloud, and "time on a node" is the thing we pay for. More nodes means higher costs. The limiting factor that requires new nodes is memory (RAM): when there is not enough free RAM on an existing node, a new node is requested (and the cloud costs go up). Thus, a good proxy for cloud costs is the total amount of RAM requested by a hub.

So, we use the following steps to calculate monthly cloud costs per hub:

1. Begin with a fixed cost to cover "support infrastructure" that is shared across all hubs. This is a fixed monthly cost, divided by the number of hubs on the cluster, and is relatively small compared to other cloud costs.
2. Then, calculate the cloud costs requested by a hub in the following way:
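The per-hub calculation described in that comment can be sketched as follows. All dollar figures and rates here are placeholder assumptions for illustration, not real 2i2c prices.

```python
def monthly_hub_cost(shared_infra_cost, n_hubs, ram_gib_hours, rate_per_gib_hour):
    """Estimate one hub's monthly bill: its share of the fixed shared
    infrastructure cost, plus the RAM it requested over the month."""
    fixed_share = shared_infra_cost / n_hubs
    usage_cost = ram_gib_hours * rate_per_gib_hour
    return fixed_share + usage_cost

# Hypothetical: $200/month of shared infra split across 10 hubs, plus
# 5000 GiB-hours of RAM requested at $0.005 per GiB-hour.
print(monthly_hub_cost(200, 10, 5000, 0.005))  # 45.0
```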
The extra markup should also account for the fact that our nodes will never be 100% full. If we don't do that, we're guaranteed to always undercharge and never recoup 100% of cloud costs. So I'd suggest that whenever we actually do the math to figure out the formula / Prometheus queries, we try to make it so that we get back 100% of what we're paying in aggregate.

We should also charge for disk space used by user home directories. Here's a snapshot of costs for the UC Berkeley DataHubs: you'll notice that storage overall actually costs more money than RAM or CPU! That's because you pay for provisioned home-directory storage regardless of whether it is actively being used or not, unlike RAM, which is only paid for when instances are actually running.

The final thing to consider is what we mean when we say 'memory'. There are three things we can use here:
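One way to fold in both of those points, the utilization markup and the provisioned storage, is sketched below. The 80% average node utilization and the storage price are made-up numbers used only to show the shape of the formula.

```python
def charged_cost(ram_cost, node_utilization, storage_gib, storage_price_per_gib):
    """Mark up RAM cost by average node utilization so partially-empty
    nodes are still paid for, then add provisioned home-directory
    storage (charged whether or not it is actively used)."""
    return ram_cost / node_utilization + storage_gib * storage_price_per_gib

# Hypothetical: $25 of requested RAM on nodes that are 80% full on
# average, plus 100 GiB of provisioned storage at $0.17/GiB-month.
print(charged_cost(25.0, 0.8, 100, 0.17))
```

Dividing by utilization rather than multiplying by a markup keeps the intent explicit: if nodes average 80% full, charging requested RAM at face value recovers only 80% of the node bill.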
I have some more info about the difference between these in 2i2c-org/infrastructure#666 (comment). For us, I suggest we count max(memory_guarantee, memory_use) as the memory we charge for?
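A minimal illustration of that charging rule, with made-up per-user numbers:

```python
def charged_memory(memory_guarantee, memory_use):
    """Memory (GiB) to charge for: the guarantee reserves node space
    even when idle, but usage above the guarantee also costs us."""
    return max(memory_guarantee, memory_use)

# Idle user: charged their 2 GiB guarantee even though they use 1.2 GiB.
print(charged_memory(2.0, 1.2))  # 2.0
# Heavy user: charged their 3.5 GiB actual use, above the guarantee.
print(charged_memory(2.0, 3.5))  # 3.5
```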
To our users, we should commit to making sure the formula and metrics we use to calculate this are open and transparent, but note that they will change as we tune them. I don't think we can or should lose money on the cloud in the long run. There are also a lot of add-on charges from cloud vendors that we must somehow incorporate. Here's a list that the UC Berkeley DataHubs pay for: I think ultimately our goal is to make sure that if we charge all users on a hub, it brings back 100% of what we pay to the cloud vendors.
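One way to guarantee that the charges recoup 100% of the bill in aggregate is to scale the per-hub estimates so they sum to what the cloud vendor actually charged, which also folds in add-on charges the raw formula misses. The hub names and dollar amounts below are hypothetical.

```python
def normalized_charges(estimated_costs, actual_cloud_bill):
    """Scale per-hub cost estimates so the total charged equals the
    actual cloud bill, spreading unmodeled add-on charges pro rata."""
    total_estimated = sum(estimated_costs.values())
    factor = actual_cloud_bill / total_estimated
    return {hub: cost * factor for hub, cost in estimated_costs.items()}

# Hypothetical: our formula estimates $80 across two hubs, but the
# real bill (including add-on charges) came to $120.
print(normalized_charges({"hub-a": 50, "hub-b": 30}, 120))
# {'hub-a': 75.0, 'hub-b': 45.0}
```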
So, to recap:
Hmm, can you help me understand which of these ideas should be blockers for the alpha, vs. what we should aim for in general but not now? Should I just include something hand-wavy in the document, with the confidence that we will figure it out by the time we need to send somebody an invoice?
I think that's the way to go. State that we generally commit to just passing through cloud costs based on our monthly calculation of the resources your hub is using, and hand-wave on the details.

I think the way to develop a cost model is to try to put a dollar number on a particular hub, and then generalize from that; I think that will work better than going the other way around. So if you know someone we are working with who is in a position to start paying, we can work on figuring out cloud costs for that user and tweak the formula as we go. We can and should provide some guarantees that we aren't going to change it wildly each month, but it is going to change anyway as we make optimizations and rejig infrastructure.
@yuvipanda sounds good - this is what I'll do, then. The thing I worry about is that other communities may want a specific number, but let's start with the more hand-wavy approach and cross that bridge when we get to it.
@choldgraf yeah - I think picking a community and giving them a specific number is something we should prioritize.
To make sure we are on the same page, you're suggesting:
@choldgraf correct!
OK, I've updated our alpha rationale etc. with this new language, and we can track improvements to hub usage cost tracking in 2i2c-org/infrastructure#730. I'll close this one; thanks, all, for the helpful explanations!
Description
In our alpha service doc (#262) we need to be able to estimate the cloud resource usage for each community's hub. This is more difficult on shared clusters, because we do not have a 1-to-1 mapping of billing account to hub community.
Here is a proposal to estimate cloud costs for our alpha service; I'd love comments and ideas from others:
It makes the following assumptions, and I'd love feedback about whether these are correct:
Finally, because I'm not sure how to estimate cloud usage for Dask Gateway hubs yet, I suggest that we only run these hubs on dedicated clusters for the alpha. We can change this if we come up with a good way to estimate their usage on a shared cluster (but not for the alpha).
Value / benefit
This is designed to be a balance between what is possible right now and what is reasonable given our costs. The main benefit is that it reasonably reflects our actual costs, and can be calculated sustainably from the Prometheus metrics we're already generating.
Implementation details
No response
Tasks to complete
Updates
No response