Research O&M Resource Tracking Paradigms #4240

nickumia-reisys · 2023-03-17T21:49:39Z

User Story

In order to make the O&M Role better focused, the Data.gov O&M Team wants to research auditing/reporting/metric/managing frameworks that capture the nature of the serverless cloud environment that we support.

Acceptance Criteria

GIVEN research has been completed
WHEN I look at this ticket
THEN I see a list of (or maybe just one) framework that we can use to track O&M tasks/priorities/metrics.
There are supporting details and/or a comparison between each framework.

Background

O&M tasking is a bit of a mess. We don't have a baseline for normal operations and we don't have a list of resources that need a baseline.

Security Considerations (required)

Improving team mental health helps make us more prepared to be proactive and reactive in future incidents.

Sketch

TBD.

nickumia-reisys · 2023-03-31T13:22:41Z

From my (initial?) findings, the main things that I would like to see implemented relate to:

Abstraction: Hardware vs. Software vs. Networking
- For Hardware: Better inventory tracking processes
- For Software/Networking: Better site reliability processes
- For Software: Better state dashboards
Resiliency: If our technology stack changes, updating these processes and policies should take minimal effort.
Prioritization:
- A list of our critical failure indicators
  - Downtime
  - Compromised data/system
  - ... what else?
- Fostering a proactive vs. reactive mindset
  - Because we don't know what we're looking at, we are inherently reactive. By organizing our state and resource tracking, we'll be able to be more proactive than reactive.
Continuous Baseline Analysis: @FuhuXia has the best idea of what the baseline is for our systems (having been on O&M the longest)
- What tools can we use to gather our baseline?
  - New Relic (NR) has its AI-based analysis; maybe investigate it more. If it's missing data, that should be a different issue.
  - How can we use cloud.gov based software/technology to gain insight into application performance?
- When assessing these tools, be sure to see how baselines can be tracked and updated when a new variable is introduced in the future (e.g. a baseline may consist of 50 parameters/metrics; how would an additional parameter affect the overall computation of the baseline?).
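One way to reason about the "additional parameter" question is sketched below. It assumes a baseline is stored as an independent mean and standard deviation per metric, which is only one possible design; the metric names and sample values are made up for illustration. Under that assumption, adding a metric leaves every existing baseline entry unchanged. A single composite score across all metrics would not have this property and would need re-weighting.

```python
from statistics import mean, pstdev


def build_baseline(samples):
    """Return {metric: (mean, std)}; each metric is computed independently."""
    return {name: (mean(values), pstdev(values)) for name, values in samples.items()}


def deviations(baseline, observation):
    """Z-score for each observed metric that already has a baseline entry."""
    scores = {}
    for name, value in observation.items():
        if name not in baseline:
            continue  # no baseline yet for this metric
        mu, sigma = baseline[name]
        scores[name] = 0.0 if sigma == 0 else (value - mu) / sigma
    return scores


# Hypothetical history; real values would come from the monitoring tool.
history = {
    "latency_ms": [100, 110, 90, 100],
    "requests_per_min": [50, 60, 40, 50],
}
baseline = build_baseline(history)

# Introduce a new parameter later on and rebuild.
history["error_rate"] = [0.01, 0.02, 0.01, 0.02]
extended = build_baseline(history)

# Existing entries are untouched by the new metric.
print(extended["latency_ms"] == baseline["latency_ms"])
```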

I'm going to start with SRE (site reliability engineering) and our state dashboards.

@btylerburton already did some work on a NR dashboard during O+M 30-03-2023 #4238

nickumia-reisys · 2023-04-03T15:00:02Z

Okay, I think I was putting too much effort into trying to actively apply the things I was researching. Refocusing to answer the acceptance criteria, I'll just focus on SRE as a framework to improve O&M. As an explanation of how the rest of this comment is organized, there are two sections: (1) Statements to ponder and (2) Statements that we already imbibe. Statements to ponder are a collection of concepts that we can consider implementing and/or organizing to help rate the health of our systems on a daily basis. Statements that we already imbibe are mostly things that we've implemented but have not accepted as truths for how our systems are performing. We just need to calibrate them to make them useful for the purpose of O&M assessment. After reviews from the team, I think we can start to dig into the actual implementation of some of these ideas.

I think the point of everything I've laid out here is that we want to say, "Here is the dashboard that shows that our applications have been healthy" as opposed to us loosely saying that we've checked a bunch of different items and have not witnessed any incidents.

Statements to ponder:

Operations is a software problem
"SREs have skillsets of both Dev and Ops and share “Wisdom of Production” to the development team"
What are some SLOs that we care about?

Catchpoint SRE Survey Report 2019
Availability — 72%
Response Time — 47%
Latency — 46%
We do not have SLOs — 27%
- Important key points about SLOs
  - Have as few SLOs as possible
  - Perfection can wait
- To implement SLOs, we need to define what a good classification of our system is.
  - Catalog is a metadata repository + (soon to be) metadata reporting system
    - What qualities do we want to track that will assess success as this type of system?
    - Availability (uptime, latency, error rates), Data Integrity (known problem, but we need to track it somehow) and Searchability (how many clicks between entering the site and finding a desired dataset or how long do users spend on the site) can be a few. Starting out, Availability + Data Integrity are more important than Searchability.
  - Inventory is a metadata creation system
  - I believe both can be classified more broadly as "User-facing serving systems"
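As a rough illustration of what "defining a classification" could look like in practice, the sketch below records the Catalog qualities named above as data. The `SLO` class, its fields, and the `starting_set` helper are hypothetical. Targets are left as `None` because none have been agreed on.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SLO:
    system: str
    quality: str
    priority: int                    # 1 = track first, 2 = track later
    target: Optional[float] = None   # None = target not decided yet


# Qualities taken from the Catalog discussion above; priorities reflect
# "Availability + Data Integrity are more important than Searchability".
CATALOG_SLOS = [
    SLO("catalog", "Availability", priority=1),
    SLO("catalog", "Data Integrity", priority=1),
    SLO("catalog", "Searchability", priority=2),
]


def starting_set(slos):
    """Keep only the highest-priority SLOs ("have as few SLOs as possible")."""
    return [slo for slo in slos if slo.priority == 1]


print([slo.quality for slo in starting_set(CATALOG_SLOS)])
```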
DevOps focuses on moving through the development pipeline efficiently, while SRE focuses on balancing site reliability with creating new features. O&M is more SRE than DevOps.
Implementation of SRE requires planning and consensus.
- This ticket is just an initial planning one; more is needed to make this a reality. I would encourage us to foster an SRE mindset during O&M, so that the O&M role builds the initial foundation of this framework.
Gain system observability
- SRE teams use metrics to determine if the software consumes excessive resources or behaves abnormally.
- SRE software generates detailed, timestamped information called logs in response to specific events.
- Traces are observations of the code path of a specific function in a distributed system (e.g. how long does a specific function call/process take?).
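To make the three observability terms concrete, here is a minimal, standard-library-only sketch. The decorator records how long a call takes (a trace-like timing), stores the duration (a metric), and emits a timestamped log line. The `traced` decorator, the `durations` store, and `search_datasets` are hypothetical names; a real setup would more likely rely on the monitoring agent than on hand-rolled code.

```python
import functools
import logging
import time

logging.basicConfig(format="%(asctime)s %(levelname)s %(message)s", level=logging.INFO)
log = logging.getLogger("om.trace")

durations = {}  # metric store: function name -> list of durations in seconds


def traced(func):
    """Time each call, keep the duration as a metric, and log it."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed = time.perf_counter() - start
            durations.setdefault(func.__name__, []).append(elapsed)
            log.info("trace %s took %.6fs", func.__name__, elapsed)
    return wrapper


@traced
def search_datasets(query):
    """Hypothetical stand-in for a real request handler."""
    time.sleep(0.01)
    return [query]


search_datasets("climate")
```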
Implement system monitoring
- Latency
- Traffic (number of requests)
- Errors
- Saturation (realtime load of application)
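The four signals above can be derived from a window of request records. The sketch below assumes hypothetical inputs: a list of requests with a duration and HTTP status, the window length, and an in-flight/capacity pair for saturation. It also treats only 5xx responses as errors, which is a simplification.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Request:
    duration_ms: float
    status: int


def golden_signals(requests, window_seconds, in_flight, capacity):
    """Summarize one time window into latency, traffic, errors, saturation."""
    total = len(requests)
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_ms": mean(r.duration_ms for r in requests) if total else 0.0,
        "traffic_rps": total / window_seconds,
        "error_rate": errors / total if total else 0.0,
        "saturation": in_flight / capacity,
    }


# Made-up sample window of 2 seconds.
window = [
    Request(120.0, 200),
    Request(200.0, 200),
    Request(80.0, 500),
    Request(400.0, 200),
]
signals = golden_signals(window, window_seconds=2, in_flight=3, capacity=12)
print(signals)
```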

Statements that we already imbibe (but need to continuously implement):

Automate as much as possible
NR tracks Latency, Traffic and Errors, but let's define a metric so that we can say
- "Application Latency has been average for this time period" or
- "We've seen 3x as much traffic as our average for this time period"
- If our systems are performing well in average AND edge cases, then O&M can feel like the job is getting done.
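A statement like the ones quoted above could be generated by comparing the current value against a stored average. This is only a sketch: the `describe` function and the 25% tolerance band are assumptions, and what counts as "average" is exactly what the team would need to calibrate.

```python
def describe(metric, current, baseline_avg, tolerance=0.25):
    """Turn a current value and its baseline average into a status sentence."""
    if baseline_avg <= 0:
        raise ValueError("baseline average must be positive")
    ratio = current / baseline_avg
    if abs(ratio - 1) <= tolerance:
        return f"{metric} has been average for this time period"
    return f"{metric} is {ratio:.1f}x our average for this time period"


print(describe("Application Latency", 210, 200))
print(describe("Traffic", 3000, 1000))
```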
I'm not sure which uptime we actually want to track: application status on cloud.gov or NR synthetic monitoring on CloudFront distributions. It could be a combination of them; but to one of the points above, it's not about perfection.

A list of references for different perspectives on SRE:

jbrown-xentity · 2023-04-07T19:38:38Z

I think that the SLO (Service Level Objectives) we want to focus on are the following (by process):

Catalog Web: Uptime (this should factor in downtime if a Solr server is unresponsive and a % of our requests are failing)
Catalog Harvest: Count of jobs that are "clean" (no errors, only additions, updates, and removals) vs count of jobs that contain errors
- Ideally this is rolled up by harvest source/organization, but if a job runs daily and always fails this should be a signal for our attention
- Not sure what the target value (Service Level Agreement) is at this time, but let's start capturing and figure that out later
- Will be helpful information to have on hand, when we start harvesting 2.0
Inventory: Count of user reported issues/bugs
- Would like to consider automated testing so we aren't reliant on users, but not possible at this time
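For the Catalog Harvest objective, the clean-versus-error counts and the "always fails" signal could be rolled up per harvest source roughly as follows. This is a sketch under assumptions: the `(source, error_count)` input shape, the source names, and the streak threshold of 3 are all hypothetical, and the real job records would come from the harvester.

```python
from collections import defaultdict


def summarize_jobs(jobs):
    """Roll up jobs per source. `jobs` is (source, error_count) in run order."""
    summary = defaultdict(lambda: {"clean": 0, "with_errors": 0, "failing_streak": 0})
    for source, error_count in jobs:
        entry = summary[source]
        if error_count == 0:
            entry["clean"] += 1
            entry["failing_streak"] = 0  # a clean run resets the streak
        else:
            entry["with_errors"] += 1
            entry["failing_streak"] += 1
    return dict(summary)


def needs_attention(summary, streak_threshold=3):
    """Sources whose most recent runs have all contained errors."""
    return sorted(s for s, e in summary.items() if e["failing_streak"] >= streak_threshold)


# Made-up job history.
jobs = [
    ("source-a", 0), ("source-b", 2), ("source-a", 0),
    ("source-b", 1), ("source-b", 4), ("source-a", 1),
]
summary = summarize_jobs(jobs)
print(summary)
print(needs_attention(summary))
```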

nickumia-reisys · 2023-04-13T13:40:31Z

We have accepted the details above as the initial starting point to improve O&M processes. This will be an ongoing effort as part of the O&M shift over the next few weeks. I'll make sure to tag it in the new issues, so that we don't lose it.

nickumia-reisys added this to data.gov team board Mar 17, 2023

hkdctol moved this to 📔 Product Backlog in data.gov team board Mar 23, 2023

hkdctol moved this from 📔 Product Backlog to 📟 Sprint Backlog [7] in data.gov team board Mar 23, 2023

hkdctol moved this from 📟 Sprint Backlog [7] to 📔 Product Backlog in data.gov team board Mar 23, 2023

nickumia-reisys self-assigned this Mar 27, 2023

nickumia-reisys moved this from 📔 Product Backlog to 🏗 In Progress [8] in data.gov team board Mar 27, 2023

nickumia-reisys mentioned this issue Mar 31, 2023

O&M Status Dashboard #4269

Merged

nickumia-reisys moved this from 🏗 In Progress [8] to 👀 Needs Review [2] in data.gov team board Apr 3, 2023

nickumia-reisys closed this as completed Apr 13, 2023

github-project-automation bot moved this from 👀 Needs Review [2] to ✔ Done in data.gov team board Apr 13, 2023

nickumia-reisys mentioned this issue Apr 16, 2023

Implement Solr Uptime/Request Satisfaction Metric #4283

Closed

7 tasks

nickumia-reisys added O&M Operations and maintenance tasks for the Data.gov platform Mission & Vision labels Oct 9, 2023

nickumia-reisys moved this from ✔ Done to 🗄 Closed in data.gov team board Oct 9, 2023

nickumia-reisys added the Explore label Oct 9, 2023
