Deleted PagerDuty services cause maintenance windows to stop working #89547
Labels: 2024, backend, bug, engineering, platform-product-team, zero-silent-failures
User Story
As a developer on VA.gov, I want to delete a PagerDuty service without breaking maintenance windows for everyone.
Issue Description
The Platform was notified via a support request that maintenance windows in Staging were no longer working. After some research, we determined that if any listed service ID was bad (i.e. it didn't exist and returned a 404), `PollMaintenanceWindows` would fail and no maintenance windows would be set for any service, effectively breaking maintenance windows in the environment the service was removed from.
PR https://github.com/department-of-veterans-affairs/vsp-infra-application-manifests/pull/3027 resolved the problem by removing the two bad IDs, but nothing stops this from happening again. If someone deletes a PagerDuty service that is also listed in values.yml (search for `maintenance:`; the list of `services` is nested under it), ALL maintenance windows in that env will stop working (this would be really bad for prod).
It took a while to figure out the issue because the errors were obfuscated. `VA900` is a generic error that doesn't mean anything. To debug, we changed `conn.response :raise_custom_error, error_prefix: service_name` to `conn.response :raise_error, error_prefix: service_name, include_request: true` in `lib/pagerduty/configuration.rb`, which uncovered the real error: `:body=>{"error"=>{"message"=>"Service Not Found", "code"=>5002}`
Note: a service id corresponds to the id in the PagerDuty URL (`PY7573H` and https://dsva.pagerduty.com/service-directory/PY7573H, for example).
Tasks
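One possible direction for a fix is to isolate failures per service, so that a single deleted service is logged and skipped rather than aborting the whole poll. The sketch below is illustrative only: `FakePagerDutyClient`, `ServiceNotFound`, and `poll_maintenance_windows` are hypothetical names, and it assumes windows are fetched one service at a time, which may not match how the real job queries the PagerDuty API.

```ruby
require 'logger'

# Hypothetical stand-in for the error raised when PagerDuty returns a 404.
class ServiceNotFound < StandardError; end

# Hypothetical stand-in for a PagerDuty API client. It knows a fixed set of
# service IDs and raises for anything else, mimicking "Service Not Found".
class FakePagerDutyClient
  def initialize(windows_by_service)
    @windows_by_service = windows_by_service
  end

  def maintenance_windows(service_id)
    @windows_by_service.fetch(service_id) do
      raise ServiceNotFound, 'Service Not Found (code 5002)'
    end
  end
end

# Polls each configured service on its own, so one bad ID is logged with its
# real cause and skipped instead of aborting the whole run.
def poll_maintenance_windows(client, service_ids, logger)
  windows = []
  failed_ids = []

  service_ids.each do |service_id|
    windows.concat(client.maintenance_windows(service_id))
  rescue ServiceNotFound => e
    logger.warn("Skipping PagerDuty service #{service_id}: #{e.message}")
    failed_ids << service_id
  end

  { windows: windows, failed_ids: failed_ids }
end

client = FakePagerDutyClient.new(
  'PY7573H' => [{ service_id: 'PY7573H', description: 'Planned upgrade' }]
)
result = poll_maintenance_windows(client, %w[PY7573H PXXXXXX], Logger.new($stderr))
```

Returning the failed IDs alongside the windows keeps the failure visible to the caller (for alerting or metrics) rather than silently dropping it, which seems in keeping with the zero-silent-failures label.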
Acceptance Criteria
Validation
Assignee to add steps to this section. List the actions that need to be taken to confirm this issue is complete. Include any necessary links or context. State the expected outcome.
On `master`, in the `.services` section nested under `maintenance` in `config/settings.yml`, change an id to make it invalid (`PXXXXXX` for example).
Run `PagerDuty::PollMaintenanceWindows.new.perform` in the console.
Expected outcome: No error in the console.
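When reproducing this, it can also help to see which configured IDs fail to resolve. The following is a minimal sketch of such a check, not part of the existing codebase: the YAML excerpt is a hypothetical stand-in for the real settings (whose exact shape may differ), and `known_ids` stands in for a lookup against the PagerDuty API.

```ruby
require 'yaml'

# Hypothetical excerpt of the settings; the real file's keys and shape may differ.
SETTINGS_YAML = <<~YAML
  maintenance:
    services:
      - PY7573H
      - PXXXXXX
YAML

# Returns the configured service IDs that are not in `known_ids`.
# `known_ids` is an assumption standing in for a PagerDuty API lookup.
def unknown_service_ids(settings, known_ids)
  configured = settings.fetch('maintenance').fetch('services')
  configured.reject { |id| known_ids.include?(id) }
end

settings = YAML.safe_load(SETTINGS_YAML)
missing = unknown_service_ids(settings, ['PY7573H'])
puts "Unresolvable service ids: #{missing.join(', ')}" unless missing.empty?
```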