O+M 2024-10-14 #4933

Closed
5 of 14 tasks
btylerburton opened this issue Oct 15, 2024 · 1 comment
Labels: O&M (Operations and maintenance tasks for the Data.gov platform)

Comments

@btylerburton
Contributor

btylerburton commented Oct 15, 2024

As part of the day-to-day operation of Data.gov, there are many Operations and Maintenance (O&M) responsibilities. Instead of having the entire team watch notifications and risk some slipping through the cracks, we have created an O&M Triage role. One person on the team is assigned the Triage role, which rotates each sprint. This is not meant to be a 24/7 responsibility; it covers East Coast business hours only. If you are unavailable, please note when in Slack and ask someone to take on the role for that time.

Check the O&M Rotation Schedule for future planning.

Acceptance criteria

You are responsible for all O&M responsibilities this week. We've highlighted a few so they're not forgotten. You can copy each checklist into your daily report.

Daily Checklist

Note: Catalog Auto Tasks
You will need to update the chart values manually. Click the Action link in each issue, take the values from the monitor task output, and check the runtime.

Weekly Checklist

Monthly Checklist

Ad-hoc Checklist

  • Audit/review applications on Cloud Foundry and determine what can be stopped and/or deleted.

Reference

@btylerburton btylerburton self-assigned this Oct 15, 2024
@btylerburton btylerburton added the O&M Operations and maintenance tasks for the Data.gov platform label Oct 15, 2024
@btylerburton btylerburton moved this from 📥 Queue to 🏗 In Progress [8] in data.gov team board Oct 15, 2024
@FuhuXia
Member

FuhuXia commented Oct 18, 2024

From previous observations, we suspected that Solr memory leaks and restarts are related to harvesting activity. Now we have more evidence.

Comparing a week of data on Solr memory usage and harvest activity (from the catalog-fetch log) shows that a spike in harvesting activity is always followed by an increase in Solr memory usage.
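For anyone who wants to repeat this comparison on fresh data, here is a minimal sketch of the check. It assumes both series have already been reduced to aligned hourly samples; the function name, the spike threshold, and the lag window are all hypothetical and are not taken from the actual catalog-fetch log or Solr metrics.

```python
def spikes_followed_by_memory_rise(harvest_counts, solr_memory_mb,
                                   spike_threshold, lag_hours=3):
    """Return (index, rose) for every hour whose harvest count is a spike.

    harvest_counts and solr_memory_mb are lists of hourly samples aligned
    by index (an assumption about how the logs were pre-processed).
    `rose` is True when memory at any point in the following `lag_hours`
    is higher than it was at the hour of the spike.
    """
    results = []
    for i, count in enumerate(harvest_counts):
        if count < spike_threshold:
            continue  # not a spike
        window = solr_memory_mb[i:i + lag_hours + 1]
        if len(window) < 2:
            continue  # no samples after the spike to compare against
        rose = max(window[1:]) > window[0]
        results.append((i, rose))
    return results


# Made-up numbers purely for illustration.
harvest = [0, 0, 50, 0, 0, 60, 0, 0]
memory = [100, 100, 100, 140, 150, 150, 190, 200]
print(spikes_followed_by_memory_rise(harvest, memory, spike_threshold=40, lag_hours=2))
```

If every spike comes back with `rose` set to True over a longer period, that would support the pattern described above; it would not by itself prove that harvesting is the cause.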

So if we can control harvesting activity and make it happen only during off hours, we can control when Solr restarts happen and therefore minimize catalog downtime during business hours.
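One possible way to express the off-hours rule is a small gate that a harvest scheduler could consult before starting a job. This is only a sketch: the 20:00 to 06:00 window, the decision to allow weekends, and the function name are assumptions for illustration, not existing Data.gov settings.

```python
from datetime import datetime, time


def in_harvest_window(local_now, start=time(20, 0), end=time(6, 0)):
    """Return True if harvesting would be allowed at `local_now`.

    `local_now` is expected to already be in US Eastern time.
    The default window (20:00 to 06:00, plus all day on weekends)
    is a hypothetical choice, not a confirmed policy.
    """
    if local_now.weekday() >= 5:  # Saturday or Sunday
        return True
    t = local_now.time()
    if start <= end:
        # Window that does not cross midnight.
        return start <= t < end
    # Window that wraps past midnight, e.g. 20:00 to 06:00.
    return t >= start or t < end


print(in_harvest_window(datetime(2024, 10, 16, 14, 0)))  # Wednesday afternoon
print(in_harvest_window(datetime(2024, 10, 16, 22, 0)))  # Wednesday night
```

With a gate like this, a harvest that arrives during business hours would be deferred rather than dropped, so any resulting Solr restart would land outside the hours when the catalog is most used.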

image
image

@Bagesary Bagesary closed this as completed by moving to ✔ Done in data.gov team board Oct 21, 2024
@Bagesary Bagesary moved this from ✔ Done to 🗄 Closed in data.gov team board Oct 23, 2024