[EPIC] Support attributing costs to individual hubs automatically on Openscapes #4453

Closed
51 of 54 tasks
Tracked by #4872
yuvipanda opened this issue Jul 19, 2024 · 9 comments

Comments

@yuvipanda
Member

yuvipanda commented Jul 19, 2024

As part of [Initiative] Hub Scale Cost Monitoring #4384, we want to support attributing costs to individual hubs on AWS.

We don't want to do this on all hubs on all clusters, but need to pick a cluster that has multiple hubs already in it to attribute costs. Let's pick openscapes - it has a staging and prod hub, but also a workshop hub!

While this EPIC is focused on openscapes, at the end of it, it would allow us to know exactly what we would need to do to do the same on any other cluster.

Tasks

  1. GeorgianaElena
  2. GeorgianaElena
  3. GeorgianaElena
  4. yuvipanda
  5. sgibson91
  6. sgibson91
  7. sgibson91
  8. consideRatio
  9. consideRatio
  10. consideRatio
  11. 4 of 4
  12. consideRatio
  13. consideRatio
  14. consideRatio
  15. consideRatio
  16. consideRatio
  17. consideRatio
  18. consideRatio
  19. consideRatio
  20. consideRatio
  21. consideRatio
  22. consideRatio
  23. consideRatio
  24. consideRatio
  25. consideRatio
  26. consideRatio
  27. consideRatio
  28. consideRatio
  29. consideRatio
  30. consideRatio
  31. consideRatio
  32. consideRatio
  33. consideRatio
  34. consideRatio
  35. consideRatio
  36. consideRatio
  37. consideRatio
  38. consideRatio
  39. consideRatio

To meet the definition of done

Definition of done

  • A cost attribution system that works for openscapes specifically, and is sufficiently robust
  • Insights into what is required to scale this to other hubs
@yuvipanda yuvipanda changed the title [EPIC] Support attributing costs to individual hubs automatically on AWS [EPIC] Support attributing costs to individual hubs automatically on Openscapes Jul 20, 2024
@yuvipanda
Member Author

For storage costs, we will switch to one EFS per hub. This doesn't particularly have cost implications, because AWS EFS is billed per use.

I was going to suggest we move to multiple nodepools for cost monitoring, but turns out AWS actually has done a pretty decent job of 'splitting costs' per namespace! https://aws.amazon.com/blogs/aws-cloud-financial-management/improve-cost-visibility-of-amazon-eks-with-aws-split-cost-allocation-data/. I'll have a spike specc'd out soon to determine how to do this.

@yuvipanda
Member Author

yuvipanda commented Jul 24, 2024

The spike was completed in #4453, with the outcome that:

  1. We can use AWS Athena for these queries, so yay.
  2. We cannot use the split cost allocation feature, because it doesn't cover a couple of resources important to us (disk, primarily).
  3. For clusters where we want to offer 'per hub cost tracking', this means each hub must be on its own tagged nodepool.

I've refined and added tasks to move each hub to its own dedicated nodepool.

@ateucher

This is great @yuvipanda - let me know how I can help!

@yuvipanda
Member Author

yuvipanda commented Aug 20, 2024

Instead of drilling down into this further, I have written out a more detailed definition of done, and will work with @consideRatio on having him do just enough refinement to complete the tasks.

Definition of done

There exists a grafana dashboard that looks like this:

image

Details

Numbers in purple indicate priority ordering, helpful for scoping conversations.

Fixed costs include the core nodepool, any PV needed for the support chart or hub databases, Kubernetes master API costs, and the cost of any load balancer services (if they cost money). Note that tagging the EKS cluster itself requires recreating it, which we don’t wanna do. Other active tags can be used to include that information though.

Object storage is all S3 related cost from the scratch and persistent buckets, not counting requestor pays.

"Compute" is all ec2 cost, including root disks, networking and gpu.

Home directory should include home directory and backup costs.

Total cost should include all 2i2c managed infrastructure.

Validation

Each of these graphs needs to be validated so we can trust them and find pieces we have missed, as well as spot bugs in the Athena query.

  1. Sum of time series in graphs 1 and 2 should equal graph 4, since summing the cost of each hub + fixed cost, or of each component, should yield the total cost of 2i2c managed infrastructure
  2. Sum of time series in graph 3 for each hub should equal the hub’s value in graph 1.
  3. Each graph should have a written description of how the AWS cost reporting UI can be used to get the same values we have here
  4. For openscapes, graph 4 should mostly match total cloud spend, although they do have some coiled usage.
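
The first two checks could be automated with something like the sketch below. It assumes each graph's data for a given day is available as a mapping from series name to cost; the function name, the sample values, and the 1% tolerance are all made up for illustration.

```python
import math


def series_sum_matches(parts: dict, expected_total: float, rel_tol: float = 0.01) -> bool:
    """Return True if the summed series agree with the expected total.

    `parts` maps a series name (a hub, "fixed", or a component) to its cost.
    `rel_tol` is a hypothetical tolerance for rounding differences.
    """
    return math.isclose(sum(parts.values()), expected_total, rel_tol=rel_tol)


# Made-up numbers: per-hub costs plus fixed costs compared with a total.
per_hub = {"prod": 10.0, "staging": 2.5, "fixed": 7.5}
print(series_sum_matches(per_hub, 20.0))
```

The same helper could be applied per hub for check 2, with the hub's components as `parts` and the hub's value in graph 1 as `expected_total`.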

Timeline

I would like this to be done within the next 3 sprints (so 2 full sprints with Erik available). We can cut scope as needed.

Next steps

  • Yuvi and Erik meet to discuss this plan. Timing tbd.
  • Erik splits this out into tasks with just enough detail so others can monitor progress by looking at them without having to intentionally ask him. Existing issues can be closed or edited as needed.
  • Yuvi unblocks Erik with clarification and prioritization questions as quickly as possible
  • We check in to see where we are at the end of this sprint

@yuvipanda
Member Author

yuvipanda commented Aug 21, 2024

@ateucher today pointed me to https://docs.aws.amazon.com/cost-management/latest/userguide/ce-api.html, which I had totally missed while doing #4465. I think the lesson for me is that I should hand off at the level in #4453 (comment) earlier, and rely on others to do such spikes.

Regardless, I think it's early enough that we should investigate this alternative to Athena.

It would involve:

  1. https://docs.aws.amazon.com/cost-management/latest/userguide/ce-api.html as the source of data.
  2. An intermediate python web server, that talks to the Cost Explorer API
  3. https://grafana.com/grafana/plugins/yesoreyeram-infinity-datasource/ for connecting this from Grafana. This is recommended by grafana as the replacement for https://github.com/grafana/grafana-json-datasource
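
As a rough sketch of step 2, the intermediate Python layer could wrap a single Cost Explorer call like this. The function names are hypothetical, `ce_client` is assumed to behave like `boto3.client("ce")`, and pagination via `NextPageToken` is left out for brevity.

```python
from datetime import date


def build_cost_request(start: date, end: date, group_by_tag: str) -> dict:
    """Build keyword arguments for Cost Explorer's GetCostAndUsage call."""
    return {
        # The End date is exclusive in the Cost Explorer API.
        "TimePeriod": {"Start": start.isoformat(), "End": end.isoformat()},
        "Granularity": "DAILY",
        "Metrics": ["UnblendedCost"],
        "GroupBy": [{"Type": "TAG", "Key": group_by_tag}],
    }


def fetch_daily_costs(ce_client, start: date, end: date, group_by_tag: str) -> list:
    """Return the ResultsByTime entries for the given period.

    `ce_client` is assumed to expose get_cost_and_usage like boto3's "ce" client.
    """
    request = build_cost_request(start, end, group_by_tag)
    response = ce_client.get_cost_and_usage(**request)
    return response["ResultsByTime"]
```

A web handler would then turn the returned entries into JSON for the Grafana Infinity data source to read.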

There are a few major advantages over using Athena:

  1. Much easier to validate, as we aren't writing complex SQL queries but translating what we can visually do in the cost explorer into API calls.
  2. Athena is not per AWS account but at the AWS organization level, so we would have needed an intermediate layer anyway for cases when we use the 2i2c AWS organization. We wouldn't have needed this for Openscapes, but trying to use it for any of our other AWS accounts would've required an intermediate python layer for access control (so different communities can't see each other's data).

So if possible, we should prefer this method.

We can reuse all the work we had done, except for some parts of #4546.

Next step here is to design a spike to validate this (instead of #4544). The athena specific issues that are subtasks of this can be closed if we are going to take this approach.

Instead of doing the refinement work myself, I'm going to take a slightly different approach here, and not write out the spike myself. Instead I'll work with @consideRatio in helping him both scope out and accomplish this work.

@yuvipanda
Member Author

It does have this limitation:

The Cost Explorer API can access up to 13 months of historical data and data for the current month. It can also provide 3 months of cost forecast data at the daily level of granularity and 12 months of cost forecast data at the monthly level of granularity.

Athena has no such limitation.

@consideRatio
Contributor

While working on #4713 and #4712, I've taken these notes:

Summary

  • I've come up with a clear sense of how to capture costs using combinations of tags, see the heading Accounting for known 2i2c infra total below for that.
  • I've decided that it's reasonable to split costs further between hubs solely by the use of the 2i2c:hub-name tag.
  • I've formulated a strategy on splitting costs further
  • I've observed a few costs related to some Public IPs and EFS backups, and concluded it requires further investigation on how to get them accounted and considered it not important enough to do at the moment.

Notes

Wanted accounting details

  • AWS account total
    • everything not being captured by 2i2c total
  • Known 2i2c infra total
    • only includes what is known to be pure 2i2c infra costs
    • divides into hub specific costs
  • Known 2i2c infra total divided into hubs
  • Known 2i2c infra total divided into hubs and service types, with service types
    combined into user friendly labels

Overview of tags

Use of the AWS tag editor helped figure these things out:
https://us-west-2.console.aws.amazon.com/resource-groups/tag-editor/find-resources.

aws:eks:cluster-name=<cluster-name>

  • less useful than kubernetes.io/cluster/<cluster-name>=owned, because that
    includes all costs captured by this tag as well.
  • doesn't include k8s created storage
  • doesn't include k8s created load balancer services

kubernetes.io/cluster/<cluster-name>=owned

  • This is a critical tag, because we won't have other tags for dynamically
    created resources such as EBS storage volumes, ELB load balancers, and
    potentially other things.

    If we aren't to use this, it would make sense to try to configure the
    aws-ebs-csi-driver addon to provide extra tags for the volumes
    (https://github.com/kubernetes-sigs/aws-ebs-csi-driver/tree/master), but
    this fails to capture the load balancers for example.

    It seems like a good call to instead lean on this tag to capture dynamically
    created resources by various AWS specific k8s controllers.

  • A subset of resources tagged by this tag, is tagged by:

    • kubernetes.io/created-for/pvc/name
    • kubernetes.io/created-for/pvc/namespace

alpha.eksctl.io/cluster-name=<cluster-name>

  • Includes, for example, the EKS cluster itself and the associated VPC
    network. This lets us capture costs for the EKS service itself and for VPC
    networking things like the NAT Gateway.

2i2c:hub-name=<namespace>

  • Overlaps quite well with kubernetes.io/cluster/<cluster-name>=owned, but
    does not fully cover it.
  • 2i2c:hub-name tags cost incurring resources entirely untagged by
    kubernetes.io/cluster/<cluster-name>=owned, such as:
    • EFS FileSystems
    • S3 Buckets

2i2c:node-purpose=<any value>

  • It appears that node groups' "elastic network interfaces" (ENI) incurring
    costs via their public IPs, or the node groups storage volumes aren't tagged
    with alpha.eksctl.io/cluster-name=<cluster-name> for example, so we only
    capture them via tags applied to our node groups. Due to that, we need to
    include 2i2c:node-purpose as well for now to capture 2i2c infra costs.

2i2c.org/cluster-name=<cluster-name>

ManagedBy=2i2c

  • Not initially setup in openscapes, but like 2i2c.org/cluster-name will be
    used for new hubs.

Accounting for known 2i2c infra total

Based on a given cluster name, such as openscapeshub, the known 2i2c infra
total can be calculated using the tag filter:

  • alpha.eksctl.io/cluster-name=<cluster-name>
  • kubernetes.io/cluster/<cluster-name>=owned
  • 2i2c.org/cluster-name (for openscapes this needs to be 2i2c:node-purpose=<any value> until k8s upgrades re-create all nodes)
  • 2i2c:hub-name=<any value>
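
If the Cost Explorer API is the data source, the tag filter above could be expressed roughly as follows. This is a sketch under two assumptions: that the four tags are meant to be combined with OR, and that "any value" can be written as "not ABSENT". The helper names are hypothetical, and the openscapes variant with 2i2c:node-purpose is used.

```python
def tag_equals(key: str, value: str) -> dict:
    """Match resources where tag `key` has exactly `value`."""
    return {"Tags": {"Key": key, "Values": [value], "MatchOptions": ["EQUALS"]}}


def tag_present(key: str) -> dict:
    """Match resources that carry tag `key` with any value."""
    return {"Not": {"Tags": {"Key": key, "MatchOptions": ["ABSENT"]}}}


def known_2i2c_infra_filter(cluster_name: str) -> dict:
    """Build a Cost Explorer Filter expression for known 2i2c infra costs."""
    return {
        "Or": [
            tag_equals("alpha.eksctl.io/cluster-name", cluster_name),
            tag_equals(f"kubernetes.io/cluster/{cluster_name}", "owned"),
            # Stand-in for 2i2c.org/cluster-name on openscapes for now.
            tag_present("2i2c:node-purpose"),
            tag_present("2i2c:hub-name"),
        ]
    }


print(known_2i2c_infra_filter("openscapeshub"))
```

The resulting dictionary would be passed as the `Filter` argument of a `get_cost_and_usage` call.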

Still not accounted costs

These openscapes costs for August 2024, each greater than 1 USD, aren't
accounted for yet:

USW2-PublicIPv4:InUseAddress: $11.68

We have public IPs from three sources:

  • A node's public IP. Everything related to these IPs is tagged, so we should
    account for cost correctly.
  • A NAT gateway's public IP. The Network Interface is not tagged, but it's
    associated with an "Elastic IP" that is tagged with
    alpha.eksctl.io/cluster-name.
  • An AWS specific k8s controller has created a LoadBalancer that is tagged with
    kubernetes.io/cluster/<cluster-name>=owned, but network interfaces of that
    LB aren't tagged. I expect this to incur cost we fail to track.

Public IPs cost $0.005/hour, so this becomes 24*0.005 == 0.12 per public IP
constantly used during a day. I saw that the unaccounted cost for a recent
Sunday was 0.36, so it seems three IPs aren't accounted for.
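
The arithmetic, spelled out as a small sketch (using `Decimal` to avoid float rounding; the variable names are only for illustration):

```python
from decimal import Decimal

HOURLY_RATE = Decimal("0.005")  # USD per public IPv4 address per hour
daily_cost_per_ip = HOURLY_RATE * 24  # USD per IP in constant use for a day

observed_daily_cost = Decimal("0.36")  # USD observed for a recent Sunday
unaccounted_ips = observed_daily_cost / daily_cost_per_ip

print(daily_cost_per_ip, unaccounted_ips)
```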

My guess is that we aren't attributing costs for the NAT Gateway IP, or for the
public IPs associated with k8s Service resources of type LoadBalancer.

eksctl config doesn't help us get the network interface tagged for the NAT
gateway, and I'm not sure how to make the AWS specific k8s controller running in
EKS managed control plane provide a tag for the Public IP associated network
interfaces either.

USW2-WarmStorage-ByteHrs-EFS: $3.76

This seems associated with backup, because there is a warm / cold storage
concept there.

We have an automated backup vault, but it isn't tagged by anything. At the same
time, we didn't create this vault and it can be used by other people. We did
create a job to schedule backups to get done etc. The "restore point" resources
in the vault are tagged.

Anyhow, I think this isn't worth further investigation.

Accounting for hub attributed costs

  1. Filter by known 2i2c infra costs, and group by 2i2c:hub-name tag

    NOTE: We could also try to group by kubernetes.io/created-for/pvc/namespace,
    but for now we avoid this complexity and treat all storage volumes as
    shared costs. Almost all storage costs stem from the prometheus server
    though, which lives in the support namespace anyhow and not a hub
    specific namespace.

  2. Track the remaining hub unattributed costs separately
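
A minimal sketch of these two steps, assuming the data comes from a Cost Explorer response grouped by the 2i2c:hub-name tag, where group keys look like `<tag-key>$<tag-value>` and the value is empty for resources without the tag. The `shared` label and the sample data in the test are made up for illustration.

```python
from collections import defaultdict
from decimal import Decimal

SHARED = "shared"  # hypothetical label for costs not attributed to a hub


def costs_per_hub(results_by_time: list) -> dict:
    """Sum UnblendedCost per hub across all time periods."""
    totals = defaultdict(Decimal)
    for period in results_by_time:
        for group in period["Groups"]:
            # Keys for a tag grouping are assumed to be "<tag-key>$<tag-value>".
            _, _, hub = group["Keys"][0].partition("$")
            amount = Decimal(group["Metrics"]["UnblendedCost"]["Amount"])
            # An empty tag value means the cost is not attributed to any hub.
            totals[hub or SHARED] += amount
    return dict(totals)
```

Everything that lands under `shared` corresponds to step 2: costs that are known 2i2c infra but carry no hub tag.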

Accounting for hub and service type attributed costs

Like for hub attributed costs, but also grouping by service types and then
combining various service types into user friendlier categories.
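
Combining service types into friendlier categories could look something like this sketch. The service names and category labels in the mapping are illustrative assumptions, not the mapping actually used.

```python
# Hypothetical mapping from AWS service names to user friendly categories.
SERVICE_CATEGORIES = {
    "Amazon Elastic Compute Cloud - Compute": "compute",
    "EC2 - Other": "compute",
    "Amazon Elastic File System": "home storage",
    "AWS Backup": "home storage",
    "Amazon Simple Storage Service": "object storage",
}


def categorize(service_costs: dict) -> dict:
    """Sum per-service costs into categories; unknown services become "other"."""
    totals = {}
    for service, amount in service_costs.items():
        category = SERVICE_CATEGORIES.get(service, "other")
        totals[category] = totals.get(category, 0) + amount
    return totals
```

Keeping an explicit "other" bucket means nothing silently drops out of the total when AWS reports a service that isn't in the mapping.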

@consideRatio
Contributor

consideRatio commented Sep 20, 2024

This is now in a sufficiently functional state for openscapes people to start looking at I think. It can be viewed at https://grafana.openscapes.2i2c.cloud/d/edw06h7udjwg0b/cloud-cost-attribution?orgId=1.

openscapes-cost-attribution-is-up

@consideRatio
Contributor

Closing as completed: this is functional for openscapes. Documentation on scaling this to other hubs was something I considered not to be part of the openscapes focused epic when I was asked to provide a definition of done for it. Such future steps are now tracked in #4872.
