Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deleted PagerDuty services causes maintenance windows to not work #89547

Open
3 of 6 tasks
rmtolmach opened this issue Jul 29, 2024 · 1 comment
Open
3 of 6 tasks

Deleted PagerDuty services causes maintenance windows to not work #89547

rmtolmach opened this issue Jul 29, 2024 · 1 comment
Assignees
Labels
2024 backend bug Something isn't working engineering Engineering topics platform-product-team zero-silent-failures Work related to eliminating silent failures

Comments

@rmtolmach
Copy link
Contributor

rmtolmach commented Jul 29, 2024

User Story

As a developer on VA.gov, I want to delete a PagerDuty service without breaking maintenance windows for everyone.

Issue Description

The Platform was notified via a support request that maintenance windows in Staging weren't working anymore. After some research, we determined that if there was a service ID listed that was bad (i.e. didn't exist, went to a 404 page), PollMaintenanceWindows would fail and no maintenance windows would be set for any service, effectively breaking maintenance windows in the environment it was removed from.

PR https://github.com/department-of-veterans-affairs/vsp-infra-application-manifests/pull/3027 resolved the problem by removing the two bad IDs, but there is nothing stopping this from happening again. If someone deletes a PagerDuty Service that is also listed in values.yml (search for maintenance: and the list of services is nested under there), ALL maintenance windows in that env will stop working (this would be really bad for prod).

It took a while to figure out the issue because errors were obfuscated.

irb(main):002> PagerDuty::PollMaintenanceWindows.new.perform
{"host":"vets-api-web-5bb6ddc48b-md425","application":"vets-api-server","environment":"production","timestamp":"2024-07-26T15:44:45.108318Z","level":"error","level_index":4,"pid":21323,"thread":"12320","file":"/app/lib/common/client/base.rb","line":110,"named_tags":{"dd":{"env":"eks-staging","service":"vets-api","version":"0e7023936b369513a1ac69b3808b1d8cf06ce531","trace_id":"0","span_id":"0"},"ddsource":"ruby"},"name":"Rails","message":"BackendServiceException: {:status=>400, :detail=>nil, :code=>\"VA900\", :source=>nil}","payload":{"title":"Operation failed","detail":"Operation failed","code":"VA900","status":"400"}}

VA900 is a generic error that doesn't mean anything. In order to debug we changed conn.response :raise_custom_error, error_prefix: service_name to conn.response :raise_error, error_prefix: service_name, include_request: true in lib/pagerduty/configuration.rb which uncovered the real error: :body=>{"error"=>{"message"=>"Service Not Found", "code"=>5002}

Note: a service id correlates to the id in the PagerDuty URL (PY7573H and https://dsva.pagerduty.com/service-directory/PY7573H, for example).

Tasks

  • Add error handling so that if a maintenance service id is not valid, it doesn't ruin it for everyone else.
  • Write unit tests
  • Bonus: make the error handling more transparent so you don't have to hunt so hard for the error and/or can easily search for it in Datadog or Sentry. BUT, keep in mind whether any PII could be in the response body.
  • Make Datadog Monitor to watch for the specific log.

Acceptance Criteria

  • All tasks complete
  • Pass validation on to BE eng.

Validation

Assignee to add steps to this section. List the actions that need to be taken to confirm this issue is complete. Include any necessary links or context. State the expected outcome.

  1. After PR is merged, pull in latest code from master.
  2. in the services section nested under maintenance in config/settings.yml, change an id to make it invalid (PXXXXXX for example).
  3. in the rails console, run PagerDuty::PollMaintenanceWindows.new.perform

Expected outcome: No error in the console.

@rmtolmach rmtolmach added platform-product-team bug Something isn't working labels Jul 29, 2024
@jennb33 jennb33 added backend needs-grooming Use this to designate any issues that need grooming from the team engineering Engineering topics needs-refinement Identifies tickets that need to be refined 2024 labels Aug 14, 2024
@jennb33 jennb33 removed needs-grooming Use this to designate any issues that need grooming from the team needs-refinement Identifies tickets that need to be refined labels Aug 22, 2024
@jennb33 jennb33 assigned rmtolmach and unassigned rmtolmach Aug 22, 2024
@rmtolmach rmtolmach added the zero-silent-failures Work related to eliminating silent failures label Oct 14, 2024
@rmtolmach
Copy link
Contributor Author

PR is up. I added a task to the description for making a Log Monitor in Datadog. My initial assumption that we could ignore bad service ids won't work because the pd API receives the ids in bulk. I talked over a few possible solutions with Eric and I landed on this one. Catch the error and alert on the error. I need to clean up the monitor a bit on Monday and get a review of the PR + monitor.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
2024 backend bug Something isn't working engineering Engineering topics platform-product-team zero-silent-failures Work related to eliminating silent failures
Projects
None yet
Development

No branches or pull requests

2 participants