Research O&M Resource Tracking Paradigms #4240
Comments
From my (initial?) findings, the main things that I would like to see implemented relate to:
I'm going to start with SRE (site reliability engineering) and our state dashboards.
Okay, I think I was putting too much effort into trying to actively apply the things I was researching. Refocusing to answer the acceptance criteria, I'll just focus on SRE as a framework to improve O&M.

As an explanation of how the rest of this comment is organized, there are two sections: (1) Statements to ponder and (2) Statements that we already imbibe. Statements to ponder are a collection of concepts that we can consider implementing and/or organizing to help rate the health of our systems on a daily basis. Statements that we already imbibe are mostly things that we've implemented but have not accepted as truths for how our systems are performing. We just need to calibrate them to make them useful for the purpose of O&M assessment.

After reviews from the team, I think we can start to dig into the actual implementation of some of these ideas. I think the point of everything I've laid out here is that we want to say, "Here is the dashboard that shows that our applications have been healthy" as opposed to us loosely saying that we've checked a bunch of different items and have not witnessed any incidents.

Statements to ponder:
Statements that we already imbibe (but need to continuously implement):
A list of references for different perspectives on SRE:
I think that the SLOs (Service Level Objectives) we want to focus on are the following (by process):
We have accepted the details above as the initial starting point to improve O&M processes. This will be an ongoing effort as part of the O&M shift over the next few weeks. I'll make sure to tag it in the new issues, so that we don't lose it.
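For anyone picking this up later, here is a rough sketch of what "calibrating" an SLO into a daily health statement could look like. It is only an illustration, not something we have implemented: the `SloWindow` class, its field names, the `example-app` service name, and the 99% target are all hypothetical, and the real inputs would come from whatever checks feed our dashboards.

```python
from dataclasses import dataclass


@dataclass
class SloWindow:
    """Hypothetical record of one service's checks over a reporting window."""
    name: str
    target: float        # e.g. 0.99 means 99% of checks should succeed
    total_checks: int
    failed_checks: int

    def availability(self) -> float:
        """Fraction of checks that succeeded in this window."""
        if self.total_checks == 0:
            return 1.0  # assumption: no data is treated as "no failures seen"
        return 1 - self.failed_checks / self.total_checks

    def error_budget_remaining(self) -> float:
        """Fraction of the allowed failures that has not been spent yet."""
        allowed = (1 - self.target) * self.total_checks
        if allowed == 0:
            return 1.0 if self.failed_checks == 0 else 0.0
        return max(0.0, 1 - self.failed_checks / allowed)

    def is_healthy(self) -> bool:
        """True when measured availability meets or beats the target."""
        return self.availability() >= self.target


# Made-up numbers for illustration only.
window = SloWindow(name="example-app", target=0.99, total_checks=1000, failed_checks=5)
print(window.name, window.is_healthy(), round(window.error_budget_remaining(), 2))
```

The idea is that a dashboard row per service could report `is_healthy()` and the remaining error budget, which gets us closer to pointing at one place and saying the applications have been healthy, rather than listing the individual items we checked.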
User Story
In order to make the O&M Role better focused, the Data.gov O&M Team wants to research auditing/reporting/metric/managing frameworks that capture the nature of the serverless cloud environment that we support.
Acceptance Criteria
WHEN I look at this ticket
THEN I see a list of frameworks (or maybe just one) that we can use to track O&M tasks/priorities/metrics.
Background
O&M tasking is a bit of a mess. We don't have a baseline for normal operations and we don't have a list of resources that need a baseline.
Security Considerations (required)
Improving team mental health helps make us more prepared to be proactive and reactive in future incidents.
Sketch
TBD.