
Cloud usage monitoring and alerting infrastructure and process #328

Closed
9 of 10 tasks
Tracked by #919
yuvipanda opened this issue Mar 27, 2021 · 20 comments
Labels: Enhancement (An improvement to something or creating something new.)

@yuvipanda (Member) commented Mar 27, 2021

Description of problem and opportunity to address it

Problem description
In #908 we ran into a case where a user was abusing the JupyterHub for crypto mining. This resulted in a lot of stress and high costs for the hub's community. Part of the problem was that we did not detect the mining activity for several weeks. This activity was basically:

  • The steady creation of new users on the hub
  • Each user maxing out their CPU and never shutting down their session

Proposed solution
We should create a mechanism for automatically monitoring statistics around hub usage, and triggering notifications that suggest something nefarious is happening. Ideally, this would be a single process for all of our clusters, not one process for each cluster.

We need a quick way to:

  1. Keep an eye on all these projects in one place
  2. Have automated alerts for abnormal costs
  3. Do rounds of cost optimizations

What's the value and who would benefit
This would allow us to minimize the risk of abuse if somebody did try to use a hub for the wrong purposes. It would give our team more confidence that nothing is happening without our knowledge, and would give communities more confidence that they won't see an unexpected spike in their cloud bill.

Implementation guide and constraints

A rough idea of what to try:

  • Set up a Grafana dashboard that aggregates activity across all of our clusters (this will be tricky because our clusters' Prometheus instances are private, not public like the Binder ones).

  • Define a few metrics that are particularly useful for identifying abuse and problematic abnormal behavior. For example, here are two panels from the openscapes Grafana that were particularly useful:

    • Users over time
    • CPU usage histogram over time
    • More generally, 5xx errors from user pods are a good indication that something is wrong.
  • Define some thresholds for these metrics, and create a reporting mechanism to ping support@2i2c.org when it thinks something problematic is going on (a rough sketch of such an alert rule follows this list).
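
For illustration, here is a rough sketch of what one such threshold could look like as a Prometheus alerting rule. The metric name, label matcher, and threshold are placeholder assumptions and would need tuning against what our Prometheus instances actually scrape; routing the alert to support@2i2c.org would be separate Alertmanager configuration on top of this.

```yaml
# Illustrative alerting rule only; metric name, labels, and threshold are assumptions.
groups:
  - name: abuse-detection
    rules:
      - alert: SustainedHighUserCPU
        # Fires when user pods have been using a lot of CPU continuously for 6 hours,
        # matching the "maxed-out CPU, never shut down" pattern from #908.
        expr: sum(rate(container_cpu_usage_seconds_total{pod=~"jupyter-.*"}[5m])) > 50
        for: 6h
        labels:
          severity: warning
        annotations:
          summary: "User pods have used more than 50 CPU cores continuously for 6 hours"
```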

Issues where we have been bitten by this

Updates and ongoing work

2022-01-06

@GeorgianaElena is going to work on these things for one week:

See #328 (comment) for more details!

2022-01-19

Some meeting notes around here: #328 (comment)

We agreed that the best way forward is to start by implementing option 1 from the HackMD above, which is to follow the mybinder.org model of one Grafana with multiple data sources.

Our next steps here are to:

2022-03-30

From #328 (comment)

yuvipanda added the goal label Mar 27, 2021
@yuvipanda (Member, Author) commented:

We could have a centralized organizational Grafana board that can pull in data from different sources. Since GCP exports billing data to BigQuery, we can use a BigQuery data source in this Grafana to display these graphs.
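
For illustration only (the data source name and plugin id below are assumptions, not something we have set up), a provisioned BigQuery data source in such a Grafana might look roughly like:

```yaml
# Hypothetical provisioned data source for the GCP billing export in BigQuery.
# The `type` must match whichever BigQuery plugin is installed in Grafana
# (grafana-bigquery-datasource is used here only as an example); authentication
# details (service account, default project) would go under jsonData /
# secureJsonData as documented by that plugin.
apiVersion: 1
datasources:
  - name: gcp-billing-bigquery
    type: grafana-bigquery-datasource
```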

@choldgraf (Member) commented Jan 3, 2022

We had an incident that is related to this issue: #908

I think that we should prioritize this one at least at an MVP level, so that we can give communities some assurance that they won't incur huge cloud costs.

choldgraf moved this to Needs Shaping / Refinement in DEPRECATED Engineering and Product Backlog Jan 3, 2022
choldgraf changed the title Establish process for keeping an eye on cloud costs → Cloud cost monitoring infrastructure and process Jan 3, 2022
choldgraf changed the title Cloud cost monitoring infrastructure and process → Cloud cost monitoring and notification infrastructure and process Jan 4, 2022
@choldgraf (Member) commented:

Update

@GeorgianaElena and I discussed this one a bit today, and she'd be interested in giving it a shot to build out an MVP. We discussed two options:

  1. Build a simple reporting mechanism for each of our cluster grafana boards (e.g. send an email to support@ when a specific metric hits some threshold)
  2. Aggregate the prometheus feeds into a single Grafana, and use one or two graphs there to do the emailing from a single place, rather than from each cluster-specific Grafana.

We agreed that number 2 would be preferred, as long as there wasn't too much complexity in aggregating the prometheus feeds from each cluster.

Plan

@GeorgianaElena would like to spend a week answering these questions:

  1. How complex will it be to aggregate feeds from each cluster's Prometheus?
  2. What are 1 or 2 graphs / metrics to use for our reporting?

In a week, we can re-convene and decide whether to take approach 1 or approach 2 for now.

@GeorgianaElena (Member) commented:

How complex will it be to aggregate feeds from each cluster's Prometheus?

I'm not ready yet to provide a super clear path forward for this, but I'll leave a few ideas here that I'm planning to revisit on Monday, adding pros and cons:

Decide between:

  1. Using a Prometheus federation setup, with an aggregator Prometheus instance plus a central Grafana deployment in a new cluster, or using something like Thanos.

  2. The private Prometheus instances could be authenticated either by

I noticed that the mybinder Grafana pulls data from a private GESIS Prometheus instance, or so I think. This line of code, https://github.com/jupyterhub/mybinder.org-deploy/blob/master/grafana-data/datasources.json#L4, implies that the Prometheus instance uses basic auth.

However, I don't think it works, as I don't see any data in the mybinder dashboard for the gesis cluster 😕

@damianavila (Contributor) commented Jan 18, 2022

@GeorgianaElena, what would be the pros and cons for those 2 options?

I can guess but you surely have more context to perform that comparison.

@choldgraf (Member) commented:

I had a quick conversation with @GeorgianaElena today about this. I think her plan is to share a short write-up about these options and the research she's done, with the goal of discussing as a team tomorrow what would be a good step forward.

Some major things to include:

  • Pros / cons (as best we understand it) of the options
  • Any unanswered questions we think are important and should discuss

I think tomorrow we should decide if we have enough information to just move forward and try implementing something.

@GeorgianaElena (Member) commented:

More info about the reading I did here ➡️ https://hackmd.io/HqE3RgjtTBq1MuofvAiLlQ?view

@yuvipanda (Member, Author) commented:

Wow, thank you so much for doing this research, @GeorgianaElena.

I love idea 1, which would be to use a central grafana that can talk to all the prometheuses. Prometheus supports basic auth (https://prometheus.io/docs/guides/basic-auth/) and grafana supports using that (https://grafana.com/docs/grafana/latest/datasources/prometheus/). So perhaps in our prometheus helm chart config in our support chart, we can set up an ingress (to allow traffic in) as well as basic authentication, and use a central grafana to access that via individual prometheus data sources. All alerts could also live in this central grafana.

Thank you for doing all this research! TIL about Thanos and Cortex :)
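
To make that idea concrete, here is a rough sketch of the ingress + basic auth side. It assumes the support chart wraps the prometheus-community chart under a `prometheus:` key; the hostname and secret name are placeholders, not values from our actual config. A central Grafana would then point at each such endpoint with the matching username / password.

```yaml
# Sketch of support chart values exposing prometheus behind nginx-ingress with
# basic auth. The chart structure, hostname, and secret name are assumptions.
prometheus:
  server:
    ingress:
      enabled: true
      annotations:
        nginx.ingress.kubernetes.io/auth-type: basic
        # Kubernetes Secret holding an htpasswd-style `auth` entry for this cluster
        nginx.ingress.kubernetes.io/auth-secret: prometheus-basic-auth
        nginx.ingress.kubernetes.io/auth-realm: "Authentication Required"
      hosts:
        - prometheus.example-cluster.2i2c.cloud   # placeholder hostname
```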

@choldgraf (Member) commented:

Planning Meeting / Next Steps

  • We noticed that there's no Grafana reporting for GESIS; did this used to work? GESIS was authenticated, so maybe we could follow that pattern.
  • The "aims of the project" for Thanos seem to support our general use case of having many distributed prometheus instances.
  • We should keep the per-cluster Grafana reporting for communities, so that they have the ability to re-use the same infrastructure if they wanted to move away from 2i2c (from a right to replicate perspective)

Next Steps

  • Try implementing Georgiana's proposal number 1 in the hackmd (centralized grafana that pulls in many prometheus sources)
  • Decide if this is the right approach when we've implemented it and can understand its complexity a bit better.

@choldgraf (Member) commented:

A quick note here - I believe @yuvipanda is planning to work on #730 soon, and we thought it'd be good for him and @GeorgianaElena to coordinate a bit, since it's related to this one too. For example, we might want to use the same centralized Grafana dashboard to do reporting both for "usage alerts" and "cost reports".

choldgraf changed the title Cloud cost monitoring and notification infrastructure and process → Cloud usage monitoring and alerting infrastructure and process Mar 12, 2022
yuvipanda added a commit to yuvipanda/pilot-hubs that referenced this issue Mar 12, 2022
This is the beginning of implementing idea 1 from the list @GeorgianaElena made in 2i2c-org#328 (comment).

We have one prometheus running per cluster, but manage many clusters.
A single grafana that can connect to all these prometheus instances
will help with monitoring as well as reporting. So we need to expose
each prometheus as securely as possible to the external world, as it
can contain private information.

In this case, we're using https + basic auth provided by
nginx-ingress
(https://kubernetes.github.io/ingress-nginx/examples/auth/basic/)
to safely expose prometheus to the outside world. We can then
use a grafana that knows these username / passwords to access this
prometheus instance. Each cluster needs its own username / password
(generated with pwgen 64 1), so users in one cluster can not access
prometheus for another cluster.

Ref 2i2c-org#328
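
For reference, the nginx-ingress basic-auth annotation points at a Kubernetes Secret shaped roughly like the sketch below (illustrative, not the manifest from this commit); its `auth` key holds an htpasswd entry built from the pwgen-generated password.

```yaml
# Hypothetical basic-auth Secret; the `auth` file can be generated with
#   htpasswd -c auth <cluster-name>     (entering the pwgen-generated password)
# and loaded with `kubectl create secret generic prometheus-basic-auth --from-file=auth`.
apiVersion: v1
kind: Secret
metadata:
  name: prometheus-basic-auth
type: Opaque
data:
  auth: <base64-encoded htpasswd entry>
```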
@GeorgianaElena (Member) commented:

@yuvipanda, now that #1091 has been deployed, the next step would be to list those prometheus instances as datasources for a central grafana, right? A few questions/thoughts about this:

  • Should we reuse an existing Grafana, for example the 2i2c one, as @consideRatio proposed/assumed in Expose prometheus with basic auth #1091 (review)?
  • Setting up prometheus datasources would require (see the sketch of a provisioned data source entry below this list):
    • Using the Grafana UI to list each cluster's prometheus instances as datasources
    • Exporting that config and storing it in our repo
    • Making the config reproducible and persistent between deploys by creating a script that allows exporting/importing that config into that central grafana (maybe similar to the mybinder grafana-export), either here or upstream in jupyterhub/grafana-dashboards
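
To make the provisioning step concrete, a single provisioned Prometheus data source entry (one per cluster) might look roughly like this; the cluster name, URL, and password below are placeholders.

```yaml
# Sketch of one provisioned Prometheus data source using the per-cluster basic
# auth credentials; name, URL, and password are placeholders.
apiVersion: 1
datasources:
  - name: example-cluster
    type: prometheus
    access: proxy
    url: https://prometheus.example-cluster.2i2c.cloud   # placeholder URL
    basicAuth: true
    basicAuthUser: example-cluster
    secureJsonData:
      basicAuthPassword: "<per-cluster password from the encrypted secrets>"
```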

@choldgraf (Member) commented:

Update: we'd like to prioritize this!

We discussed this topic in our team meeting today, and there was general agreement that improving our reporting and alerting infrastructure would be a good investment of our time. Essentially the argument boiled down to this:

  • The most stressful time for our team is when there are major incidents that require immediate action.
  • The best way to deal with this is to prevent incidents from happening in general
  • A "problem" only becomes an "incident" when a user is actually affected by it.
  • We can potentially resolve "problems" before they become "incidents" by catching them ahead of time.
  • If we improve our reporting infrastructure, we can be alerted to "problems" before they become "incidents"
  • This would hopefully significantly reduce the stress associated with support/operations and major incidents.

@yuvipanda (Member, Author) commented:

@GeorgianaElena yeah, designating the existing grafana as a 'central grafana' seems like the way to go.

I think next steps here are:

  1. Write a script that'll read all the encrypted grafana secrets, and put them in the centralized grafana as data sources via the grafana API
  2. Update the upstream jupyterhub/grafana-dashboard repo to support multiple datasources, via a datasource template variable. I removed that as part of commit 763c28acad89c9d7e95a860c54f004b6bf738240 in that repo - it just needs to be put back.
  3. Deploy support charts in the few clusters where we don't currently have them deployed! I think that's meom-ige and farallon? We will need to tune their resource requests to match the smaller clusters.

@GeorgianaElena (Member) commented:

@yuvipanda thanks a lot for the details 🚀 ! I think I have bandwidth to start working on this, using the steps you provided. But I will probably need some help/input from time to time. Do you think you have bandwidth to help out with this one or split the work somehow?

cc @damianavila

@yuvipanda (Member, Author) commented:

@GeorgianaElena absolutely have the bandwidth to help out :)

@GeorgianaElena (Member) commented:

This issue has become quite big, so I'm going to close it now, since the monitoring infra is mostly in place, and track the alerting part in separate issues.

Get context and track progress

The Updates and ongoing work section in the initial comment has info about what has been achieved and links to still-open issues. There's also a project board for this project.

Repository owner moved this from Todo to Done in Cloud usage monitoring and alerting infrastructure and process Jul 7, 2022
Repository owner moved this from In progress to Complete in DEPRECATED Engineering and Product Backlog Jul 7, 2022
@damianavila (Contributor) commented:

Thank you for all the hard work you have done on this one, @GeorgianaElena!
