
[LH Doc Upload Migration] Fix Lighthouse Upload Failure Metrics Logging #19466

Conversation


@NB28VT NB28VT commented Nov 14, 2024

WORKING PR: I am out for a week and wasn't able to reach the level of test coverage I was hoping for, so I am handing this off to @ajones446 in case we end up completing it this week. It is lower priority than some of the other to-do items remaining in the migration.

Summary

  • This work is behind a feature toggle (flipper):

No. Errors for the upload jobs that are turned on for Lighthouse in production are not being logged the way we want for our team's internal stats keeping. There are dashboards in place that log these errors, so we are aware of them; they just don't neatly increment the StatsD metric we will be using to quickly reference the breakdown of attempts, successes, and failures when uploading documents to Lighthouse.

  • (Summarize the changes that have been made to the platform)

Documents submitted via our Lighthouse API client that fail to return a success response raise custom exceptions defined in the Lighthouse::ServiceException class. As such, the previous strategy of capturing and logging the Lighthouse API response directly in LighthouseSupplementalDocumentUploadProvider will never work - these exceptions are raised by our existing Lighthouse API client before we have a chance to inspect the response here.

  • (What is the solution, why is this the solution?)

Updates the provider to rescue any of the possible API exceptions raised in this case, properly log the upload as a failure, and re-raise the exception to maintain the current behavior.

  • (Which team do you work for, does your team own the maintenance of this component?)

Disability benefits team 2; we own the maintenance.
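The rescue-and-re-raise strategy described above can be sketched roughly as follows. This is a minimal, hypothetical sketch: the StatsD wrapper, the exception classes, and the provider internals are stand-ins inferred from the PR description, not the actual vets-api code.

```ruby
require 'logger'

# Stand-in exception hierarchy; the real classes live in Lighthouse::ServiceException.
module Lighthouse
  class ServiceException < StandardError; end
  class Timeout < ServiceException; end
end

class LighthouseSupplementalDocumentUploadProvider
  STATSD_PREFIX = 'my_stats_metric_prefix.lighthouse_supplemental_document_upload_provider'

  # Exceptions the Lighthouse service exception code can raise. Kept in one
  # place so the rescue below stays in sync with that code (the maintenance
  # concern raised under "Requested Feedback").
  LIGHTHOUSE_RESPONSE_EXCEPTION_CLASSES = [
    Lighthouse::ServiceException,
    Lighthouse::Timeout
  ].freeze

  def initialize(client, statsd, logger = Logger.new($stdout))
    @client = client
    @statsd = statsd
    @logger = logger
  end

  def submit_upload_document(document)
    @statsd.increment("#{STATSD_PREFIX}.upload_attempt")
    api_response = @client.upload(document)
    handle_success_response(api_response)
  rescue *LIGHTHOUSE_RESPONSE_EXCEPTION_CLASSES => e
    log_upload_failure(e)
    raise e # re-raise so existing behavior (job retries, error reporting) is unchanged
  end

  private

  def handle_success_response(api_response)
    @statsd.increment("#{STATSD_PREFIX}.upload_success")
    api_response
  end

  def log_upload_failure(exception)
    @statsd.increment("#{STATSD_PREFIX}.upload_failure")
    @logger.error("Lighthouse upload failed: #{exception.class}")
  end
end
```

Because the exceptions are raised by the API client itself, the rescue around the client call is where the provider can still observe a failed upload before the exception propagates.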

Related issue(s)

  • More information on the issue is described in this ticket

Testing done

NOTE: tests are not yet passing, further work is required

  • New code is covered by unit tests
  • Describe what the old behavior was prior to the change
  • Describe the steps required to verify your changes are working as expected. Exclusively stating 'Specs run' is NOT acceptable as appropriate testing
  • If this work is behind a flipper:
    • Tests need to be written for both the flipper on and flipper off scenarios. Docs.
    • What is the testing plan for rolling out the feature?

Acceptance criteria

  • I fixed|updated|added unit tests and integration tests for each feature (if applicable).
  • No error nor warning in the console.
  • Events are being sent to the appropriate logging solution
  • Documentation has been updated (link to documentation)
  • No sensitive information (i.e. PII/credentials/internal URLs/etc.) is captured in logging, hardcoded, or specs
  • Feature/bug has a monitor built into Datadog (if applicable)
  • If app impacted requires authentication, did you login to a local build and verify all authenticated routes work as expected
  • I added a screenshot of the developed feature

Requested Feedback

This was just the first approach I thought of for this problem. It's not ideal that we have layers and layers of redundant logging and exception handling across our API client codebase, but this is the cleanest way I can think of to increment our failure metrics for the Lighthouse migration, given the custom client exceptions paradigm defined in this exception handler class.

My main reservations about the approach I took are:

  1. Needing to maintain an array of exception classes in LIGHTHOUSE_RESPONSE_EXCEPTION_CLASSES that matches the potential exceptions raised by our service exception code. I don't think we should rescue and log just any exception here, as this metric is meant to indicate a failure response from Lighthouse specifically. So we may not have much choice; it just feels dirty, and the provider knows too much about the exception classes.

  2. The associated testing approach, which loops through these exceptions and uses RSpec's shared examples. I hate the shared example DSL; it's confusing to read and kind of clunky. But it probably makes sense here, given we have to test explicit exception handling for a whole list of specific exceptions.
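In the real suite this would be RSpec `shared_examples` iterated with `it_behaves_like`, but the looping idea reduces to the following plain-Ruby sketch (all names here are hypothetical stand-ins, not the actual vets-api code):

```ruby
# Plain-Ruby illustration of what the shared examples assert for each
# exception class: the failure metric is incremented and the error re-raised.
module Lighthouse
  class ServiceException < StandardError; end
  class Timeout < ServiceException; end
  class RateLimited < ServiceException; end
end

# Hypothetical list mirroring LIGHTHOUSE_RESPONSE_EXCEPTION_CLASSES.
EXCEPTION_CLASSES = [Lighthouse::Timeout, Lighthouse::RateLimited].freeze

# Toy stand-in for the provider's rescue block.
def upload_with_metrics(metrics)
  yield
rescue *EXCEPTION_CLASSES => e
  metrics << 'upload_failure'
  raise e
end

# The "shared example": identical assertions, run once per exception class.
EXCEPTION_CLASSES.each do |exception_class|
  metrics = []
  re_raised = false
  begin
    upload_with_metrics(metrics) { raise exception_class }
  rescue exception_class
    re_raised = true
  end
  raise "#{exception_class} was not re-raised" unless re_raised
  raise "#{exception_class} did not increment the metric" unless metrics == ['upload_failure']
end
```

The shared-example DSL adds indirection, but it keeps the per-class assertions in one place, which matches the concern above about testing a whole list of exceptions.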

expect(StatsD).to receive(:increment).with(
'my_stats_metric_prefix.lighthouse_supplemental_document_upload_provider.upload_failure'
)
describe 'service exceptions' do

I haven't yet been able to get all of these tests passing with the Rails logger logging; some work and some do not, which is confusing since they are all just custom exception classes. The metrics increment and the re-raising behavior do seem to work for all of them.

We may not want to use the Rails logger for capturing exceptions, as this information is already logged elsewhere (you can see the custom exceptions showing up in the "catch all" widget on our migration dashboard such as this one)

The purpose of logging the exception here would just be to have a unified system of logging events in the upload providers: this logging matches how we log attempts and successes, in addition to the metrics, which are more helpful for aggregating data.


One option to get this working is to just increment the metric and not worry about a redundant call to the Rails logger.

log_upload_failure(e)
raise e
end

handle_lighthouse_response(api_response)

Rename to reflect that this handles the success response


NB28VT commented Nov 25, 2024

Closing this as we're going to take a slightly different approach, and I'd rather just start a clean branch.
