Q4 2022 Goal: Monitoring & alerting for our infrastructure #1804

Closed
3 of 6 tasks
yuvipanda opened this issue Oct 20, 2022 · 3 comments

@yuvipanda
Member

yuvipanda commented Oct 20, 2022

I want us to feel confident that our infrastructure is not on fire. This means we need monitoring and alerting in place so that we have evidence to believe that 'everything is ok'.

For this, we need good monitoring and alerting that we trust. We should be able to realistically say 'there are currently no alerts, so we believe there are no outages right now', and have enough trust in the process to believe it.

Here's a bunch of things that will help get us there:

Alerts are of two kinds: an immediate outage (delivered via PagerDuty) and a 'cliff alert' (something will go bad in a few days if you do not deal with it now) (delivered via Freshdesk). Each alert should be clearly actionable. It's better to not have an alert at all than to have one we ignore.
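The two alert kinds and the 'must be actionable' rule could be sketched roughly as below. This is only an illustration: `AlertKind`, `Alert`, `CHANNELS`, and `route` are hypothetical names, and the channel strings are placeholders rather than real PagerDuty or Freshdesk API calls.

```python
from dataclasses import dataclass
from enum import Enum


class AlertKind(Enum):
    OUTAGE = "outage"  # something is broken right now
    CLIFF = "cliff"    # something will break in a few days if not dealt with


# Assumption: plain channel labels stand in for the real integrations.
CHANNELS = {
    AlertKind.OUTAGE: "pagerduty",
    AlertKind.CLIFF: "freshdesk",
}


@dataclass
class Alert:
    name: str
    kind: AlertKind
    action: str  # what the responder should do; empty means not actionable


def route(alert: Alert) -> str:
    """Return the delivery channel for an alert, rejecting unactionable ones."""
    if not alert.action.strip():
        # Better to have no alert at all than one we would ignore.
        raise ValueError(f"alert {alert.name!r} has no action attached")
    return CHANNELS[alert.kind]
```

For example, `route(Alert("hub-down", AlertKind.OUTAGE, "restart the hub pod"))` would pick the paging channel, while an alert with an empty `action` is refused outright.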

@jmunroe
Contributor

jmunroe commented Oct 24, 2022

(I have a clear memory of already commenting here last week... but perhaps I had neglected to click 'Comment' and lost my thought! Apologies if I already typed this up somewhere else.)

We've received a request to provide more analytics to our community partners. Specifically:

It would be great if 2i2c could deliver monthly usage reports in the form of an email summarizing:

  • Number of active users in the month
  • Average number of hours per user session in the month
  • Breakdown of machine types used
  • Total cost of the hub in the month

This information will help our project decide whether usage of our hub is growing and will help us justify whether to continue with this service.
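As a rough illustration of the four requested numbers, a summary could be computed from per-session records along these lines. The `Session` record, its fields, and `monthly_summary` are hypothetical; where such records or the monthly cost figure would actually come from is not specified in this thread.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Session:
    user: str
    hours: float       # length of one user session, in hours
    machine_type: str  # label of the machine type the session ran on


def monthly_summary(sessions, total_cost):
    """Summarize one month of sessions into the four requested figures."""
    active_users = len({s.user for s in sessions})
    # Average over sessions, not over users; 0.0 when there were no sessions.
    avg_hours = sum(s.hours for s in sessions) / len(sessions) if sessions else 0.0
    machine_types = dict(Counter(s.machine_type for s in sessions))
    return {
        "active_users": active_users,
        "avg_session_hours": avg_hours,
        "machine_types": machine_types,
        "total_cost": total_cost,  # assumed to be supplied by billing data
    }
```

The returned dictionary could then be rendered into the body of a monthly email.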

Since this Q4 2022 Goal is tackling 'monitoring', could it be expanded in scope to include analytics as well?

I don't want to download extra work to @2i2c-org/tech-team, so my request is to understand more about the scope of work that is planned here and to offer my assistance in the development work, if appropriate.

@damianavila
Contributor

Since this Q4 2022 Goal is tackling 'monitoring', could it be expanded in scope to include analytics as well?

I would avoid expanding the scope.

I don't want to download extra work to https://github.com/orgs/2i2c-org/teams/tech-team, so my request is to understand more about the scope of work that is planned here and to offer my assistance in the development work, if appropriate.

Your help will always be welcome, @jmunroe.
I would suggest opening a new goal issue for "Analytics and Reporting" and keeping it separated from this one.

@damianavila
Contributor

damianavila commented Feb 27, 2023

During Q4, health checks were deployed.
I have captured the remaining items in the top message as dedicated child issues to deal with in the next monitoring theme iteration (realistically, proto-Q2 content).
