Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor ExportBreakerStatus Sidekiq job #99419

Open
12 tasks
rmtolmach opened this issue Dec 18, 2024 · 0 comments
Open
12 tasks

Refactor ExportBreakerStatus Sidekiq job #99419

rmtolmach opened this issue Dec 18, 2024 · 0 comments

Comments

@rmtolmach
Copy link
Contributor

rmtolmach commented Dec 18, 2024

Refinement note: this feels big and could be split into two tickets. One for discovery and one for the work. Or not. 😄

Initial investigation

What does the job do?
This runs every minute. It reports a status of "up" or "down" to statsd. Originally added in department-of-veterans-affairs/vets-api#1358.

Is the job still needed?
Maybe. In this Job, the status of the service is being reported with the key: api.external_service.#{service.name}.up. But in Datadog, we access the status of an external service via vets_api.statsd.api_external_service.up. So it seems like we don't use it anymore. I don't see any metrics starting with api in Datadog. We have this config in statsd-exporter-mapping.conf in the devops repo but... not sure if that's in use (file hasn't been updated in 3 years).

Does the job need exception handling added?
Yes.

Any other modifications needed?

  • The line StatsD.gauge("api.external_service.#{service.name}.up", up) in the job might need to be updated to the newer, vets_api.statsd....
  • Add a comment to ExportBreakerStatus line in periodic_jobs.rb to document what it does ("Send breaker status to statsd").
  • Add sidekiq_options(retry: false). This job runs every minute, so it doesn't make sense to have it retry. (This job doesn't specify any retries meaning it hits the sidekiq retry default of 25.)

User Story

As a vets-api owner, I want to understand the purpose and functionality of the ExportBreakerStatus job, so that it can be refactored to add error handling and make it useful.

Issue Description

The ExportBreakerStatus Sidekiq job runs every minute via cron and reports an external service's status ("up" or "down") to StatsD under the key api.external_service.<service_name>.up. However, after investigation (detailed above and in this Slack thread), it is unclear if this job is still used:

  1. Metrics starting with api are not visible in Datadog.
  2. Current external service statuses are accessed in Datadog using vets_api.statsd.api_external_service.up, which suggests a different mechanism is in use.
  3. statsd-exporter-mapping.conf references api.external_service, but is that file still relevant?

This raises several questions:

  • Is the job deprecated, or is there a misconfiguration causing metrics to be dropped?
  • Should the metric key in the job be updated to align with the newer vets_api.statsd.api_external_service.up?
  • Is this the cause of breakers acting funny?

Additionally, the job lacks exception handling and retries are enabled by default, which is unnecessary for a job running every minute.

Tasks

  • Investigate whether metrics from ExportBreakerStatus (api.external_service.<service_name>.up) are being forwarded to StatsD and Datadog or are being dropped.
  • Determine if the statsd-exporter-mapping.conf file is still in use and relevant for metric key translation.
  • Confirm whether vets_api.statsd.api_external_service.up metrics are being generated elsewhere or if this job needs to be updated to use this key format.
  • Add exception handling to the job.
  • Update the job configuration to add sidekiq_options retry: false.
  • Add a comment to the ExportBreakerStatus line in periodic_jobs.rb to document its purpose.
  • Based on findings:
    • Update the job to use vets_api.statsd... if necessary.
    • If the job is redundant, deprecate it.

Acceptance Criteria

  • The connection between Breakers, this job, and external service monitoring metrics is understood and validated.
  • If the job is still relevant, metrics from it are being forwarded to Datadog and monitored effectively.
  • If the job is redundant, it is safely deprecated, and its removal does not impact existing monitoring functionality.
  • Exception handling is added to the job.
  • Retries are disabled for the job.

Validation

Assignee to add steps to this section. List the actions that need to be taken to confirm this issue is complete. Include any necessary links or context. State the expected outcome.

@rmtolmach rmtolmach changed the title ExportBreakerStatus Refactor ExportBreakerStatus Sidekiq job Jan 3, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant