
Form 526 Document Upload - Lighthouse status polling system #15964

Closed
wants to merge 19 commits into from

Conversation


@NB28VT NB28VT commented Mar 15, 2024

Summary

Implements a system for periodically polling the Lighthouse uploads/status endpoint to check on the status of any supplemental documents that were submitted to Lighthouse as part of a Form 526 Submission.

The purpose of this system is to surface successes and failures that happen in the Form 526 process after we hand them off to Lighthouse.

NOTE: THIS PR IS MEANT AS AN EARLY-STAGE CODE REVIEW.

Remaining items:

  • Full e2e testing and VCR cassette recording using documents in Lighthouse's QA environment; I still need to get the right combination of environment/credentials working for this.
  • Error handling/logging for non-200 responses from Lighthouse
  • Lighthouse has added a 'requestIdsNotFound' array to the metadata in case we request a document that doesn't exist/was deleted. Unsure yet how we want to handle that situation but worth discussing.
  • Adding the cron job to the job scheduler
  • Adding new file paths to the CODEOWNERS file
  • Fix db/schema diff

Overview

Following the Lighthouse Document Upload migration (paused pending this polling implementation), we will begin uploading supplemental forms and documents for Form 526 submissions to the new Lighthouse /uploads POST endpoint.

That endpoint returns a requestId which corresponds to the document in Lighthouse.

At the request of OCTO, Lighthouse has provided us an additional endpoint, POST /uploads/status, which we can use to check on the progress of these documents after we upload them to Lighthouse. After Lighthouse has received them, they do the following:

  1. Forward the document on to VBMS
  2. If that succeeds, send the document to BGS as well.

The uploads/status endpoint takes an array of requestIds and gives us the current status of all of those documents as they move through Lighthouse > VBMS/Lighthouse > BGS. The status response looks like this:

{
  "data": {
    "statuses": [
      {
        "requestId": 600000001,
        "time": {
          "startTime": 1502199000,
          "endTime": 1502199000
        },
        "status": "IN_PROGRESS",
        "steps": [
          {
            "name": "BENEFITS_GATEWAY_SERVICE",
            "nextStepName": "BENEFITS_GATEWAY_SERVICE",
            "description": "string",
            "status": "NOT_STARTED"
          }
        ],
        "error": {
          "detail": "string",
          "step": "BENEFITS_GATEWAY_SERVICE"
        }
      }
    ]
  }
}
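As a rough illustration (not the PR's actual parsing code), an entry in that statuses array could be classified with a small helper. Note that only IN_PROGRESS appears in the sample payload; SUCCESS and FAILED are assumed terminal status values for the sake of the sketch:

```ruby
require 'json'

# Sketch: classify a single entry from the uploads/status response.
# 'SUCCESS' and 'FAILED' are assumed terminal values; only
# 'IN_PROGRESS' appears in the sample payload above.
def classify_status(entry)
  case entry['status']
  when 'SUCCESS' then :complete
  when 'FAILED'  then :failed
  else                :in_progress
  end
end

response = JSON.parse(<<~PAYLOAD)
  {
    "data": {
      "statuses": [
        { "requestId": 600000001, "status": "IN_PROGRESS" }
      ]
    }
  }
PAYLOAD

response.dig('data', 'statuses').each do |entry|
  puts "#{entry['requestId']}: #{classify_status(entry)}"
end
```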

This polling architecture is designed to query that endpoint, track the successes and failures of documents we upload, and save their state in our database for visibility purposes. It is also designed to drive DataDog dashboards and alerts that reflect this information.

(both endpoints are documented here)

Design

I've split this polling behavior across a handful of components, each of which I will summarize here:

Lighthouse526DocumentUpload

  • Database model for representing a document we've uploaded to Lighthouse
  • Includes a number of validations/associations I thought made sense for this class but that may be a little defensive/overengineered for V1.
  • For example, all Veteran Upload document types must have an associated FormAttachment record, since this is where we store the S3 pointer for this document and we may need to retrieve it if it has issues.
  • This is where we store the requestId Lighthouse returns, as lighthouse_document_request_id

BenefitsDocuments::Form526::UploadStatusUpdater

  • This is the primary service class for parsing Lighthouse metadata from the status endpoint and updating the status of Lighthouse526DocumentUpload records in our database.
  • Because the metadata is relatively intricate/nested, I wanted to have a single class that handles the parsing and reconciling updates with our records, to hide these nitty gritty details from the wider polling flow.
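One way an updater like this can decide whether anything actually changed is to compare the incoming payload against the previously stored one. A plain-Ruby sketch (the real class is Rails-backed; the class and method names here are hypothetical):

```ruby
# Hypothetical sketch of the "did the status change?" check an updater
# like UploadStatusUpdater might perform before logging and persisting.
# last_status_response mirrors the field of the same name on the model.
class StatusChangeDetector
  def initialize(last_status_response)
    @last = last_status_response
  end

  # True when the incoming Lighthouse status differs from the one we
  # saved the last time we polled, or when we have never polled.
  def status_changed?(new_status_response)
    @last.nil? || @last != new_status_response
  end
end
```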

BenefitsDocuments::Form526::UpdateDocumentsStatusService

  • This is the service class that orchestrates the polling behavior
  • This interacts at a high level with BenefitsDocuments::Form526::UploadStatusUpdater to log status updates on documents to DataDog via StatsD
  • This data will be used to drive dashboards on DataDog.

Lighthouse::Form526DocumentUploadPollingJob

  • The actual cron job that runs
  • Includes our obligatory retry exhaustion handling

BenefitsDocuments::Form526::DocumentsStatusPollingService

  • Configuration for the status endpoint polling, modeled on the mobile team's approach
  • Note this includes changes to BenefitsDocuments::Configuration, which is mobile team's client wrapper for the Lighthouse Benefits Documents API (so we don't have to reinvent the wheel with their complex auth flow)

Behavior Summary

  1. When we upload a document, create a Lighthouse526DocumentUpload record and save the lighthouse_document_request_id they give us.
  2. Mark this document's state as pending
  3. Every day, run the Lighthouse::Form526DocumentUploadPollingJob. This will poll all pending documents that haven't been polled in the past 24 hours
  4. For each document Lighthouse gives us a status update on:

If the status has changed since the last time we polled this document:

  • Log the status metadata from Lighthouse to the Rails logger
  • Save the same data in the last_status_response field on the Lighthouse526DocumentUpload

If Lighthouse tells us the Document is complete (meaning it was sent to both VBMS and BGS):

  • Change the Lighthouse526DocumentUpload state to completed
  • Save the time Lighthouse told us processing ended
  • Increment a StatsD complete metric with the type of document (example for BDD Instructions: api.form_526.lighthouse_document_upload_processing_status.bdd_instructions.complete)

If Lighthouse tells us the Document failed somewhere:

  • Change the Lighthouse526DocumentUpload state to failed
  • Save the time Lighthouse told us processing ended
  • Save the error message to the Lighthouse526DocumentUpload record
  • Increment a failed StatsD metric with the type of document AND the ["error"]["step"] in the Lighthouse status (see above), e.g. api.form_526.lighthouse_document_upload_processing_status.bdd_instructions.failed.claims_evidence
  • Since Lighthouse could always change the step names, we have made the end of that metric key dynamic
  • We will likely create a failure alert for this in DataDog
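Building that key is a one-liner; a sketch of the dynamic suffix, assuming the step string comes straight from Lighthouse's ["error"]["step"]:

```ruby
METRIC_PREFIX = 'api.form_526.lighthouse_document_upload_processing_status'

# Builds a failure StatsD key like the example above. The error step is
# downcased verbatim, so any new step names Lighthouse introduces flow
# through without a code change.
def failed_metric_key(document_type, error_step)
  "#{METRIC_PREFIX}.#{document_type}.failed.#{error_step.downcase}"
end
```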

Regardless of whether the document completed, failed or is still in progress:

  • Update the document's status_last_polled_at timestamp to reflect the fact we just polled this.

  • This will ensure whenever the job runs it only polls documents that haven't been polled in the last 24 hours.
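The "only poll documents not polled in the last 24 hours" selection can be sketched in plain Ruby; the real job presumably uses an ActiveRecord scope, and the hash keys here simply follow the field names described above:

```ruby
POLLING_WINDOW = 24 * 60 * 60 # seconds

# Sketch: select pending documents whose status_last_polled_at is nil
# (never polled) or older than 24 hours. Documents are plain hashes
# here; in the PR they are Lighthouse526DocumentUpload records.
def pollable_documents(documents, now: Time.now)
  documents.select do |doc|
    doc[:status] == 'pending' &&
      (doc[:status_last_polled_at].nil? ||
       doc[:status_last_polled_at] < now - POLLING_WINDOW)
  end
end
```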

  • (Is this work behind a feature toggle (flipper)?) No - this code is isolated until we start creating Lighthouse526DocumentUpload records in the Lighthouse migration, which will be behind a feature flag

  • (Summarize the changes that have been made to the platform) See above

  • (If bug, how to reproduce) N/A

  • (What is the solution, why is this the solution?) See above

  • (Which team do you work for, does your team own the maintenance of this component?) I'm on the Benefits and Disabilities team

  • (If introducing a flipper, what is the success criteria being targeted?) N/A

Testing done

Extensive unit tests here; working on end to end tests using the QA environment Lighthouse has provided.

Acceptance criteria

  • I fixed|updated|added unit tests and integration tests for each feature (if applicable).
  • No error nor warning in the console.
  • Events are being sent to the appropriate logging solution
  • Documentation has been updated (link to documentation)
  • No sensitive information (i.e. PII/credentials/internal URLs/etc.) is captured in logging, hardcoded, or specs
  • Feature/bug has a monitor built into Datadog or Grafana (if applicable)
  • If app impacted requires authentication, did you login to a local build and verify all authenticated routes work as expected
  • I added a screenshot of the developed feature

Requested Feedback

I'd recommend reviewing the design and behavior summary above before looking at the code; there are a lot of unit tests which should give an indication of all of the edge cases I anticipate, so they may help guide reviewing the structure overall.

Questions I would have for a reviewer:

  1. Is there anything missing from the "remaining items" list above you can think of?
  2. Do you think the validations and callbacks around Lighthouse526DocumentUpload are a little over engineered for a V1 release?
  3. Something weird happened with my db/schema diff when I migrated recent schema changes - perhaps my Postgres version is different from the one in master? It made several unrelated changes that appear to relate to database configs. Will try and figure this out on my own unless you have some insight.
  4. Note my use of update_all to bulk update the status_last_polled_at field on a batch of Lighthouse526DocumentUpload records in Form526DocumentUploadPollingJob is a no-no - update_all skips Rails validations so it's generally frowned upon. As this is just updating a timestamp field with no other ramifications, I'm making this tradeoff for performance gains. Lmk if you disagree.

Creates a new model, LighthouseDocumentUpload, responsible for persisting a record of a document that was submitted to Lighthouse's Benefits Document API. We will use this to track the status of a document as it makes its way through the VA's systems.

Document Uploads must be attached to a Form 526 submission, and if they reference a Veteran-submitted document that was forwarded on to Lighthouse, they must be attached to the FormAttachment record created for that upload as well.

A lighthouse_document_request_id is also required; this is returned from the Lighthouse Benefits Documents API as requestId in their response body. We will need this identifier to track the upload as it passes through Lighthouse's processing, since their status endpoint requires we pass it to look up the status of a document.
Adds an AASM state machine to LighthouseDocumentUpload that accounts for the various potential lifecycle events once a document reaches Lighthouse. Adds transition guards to ensure the proper metadata is stored before making certain transitions (e.g. if there is a failure makes sure we have saved an error_message)

Updates factories for LighthouseDocumentUpload and FormAttachment to support tests
Capture events metadata on LighthouseDocumentUpload state transitions and persist the actual response metadata from the Lighthouse /uploads/status API endpoint

These logs will drive DataDog alert events when there are submission failures
Creates a Plain Old Ruby Object to abstract out the parsing of the complex nested response from the uploads/status endpoint of the Lighthouse Benefits Documents API.

Provides convenience methods for checking whether a document has completed all processing, has failed processing, or has progressed to another step in the process
Begin Update documents status service
Implements a service class that encapsulates all of the parsing of a Lighthouse document status data structure, updates a supplied LighthouseDocumentUpload accordingly
Create table for tracking the status of a Form 526 supplemental document we have uploaded to Lighthouse
Simplifies state tracking and validations in the model we use to track Form526 document submissions to Lighthouse
Overhauls the service class for parsing a Lighthouse document status response and updating a Lighthouse526DocumentUpload record accordingly.

Simplifies the API for this class to reflect the simplified infrastructure of the service object that will be using this
Simplifies the UpdateDocumentsStatusService to use the simplified API of the UploadStatusUpdater. Makes the service responsible for passing the Lighthouse response metadata to the updater class, and logging to statsd metrics based on the outcome
Implements the actual job that orchestrates polling of the Lighthouse uploads/status endpoint to get the current status of a document we have uploaded to Lighthouse. Includes Sidekiq retry exhaustion handling
Update service and tests to take Lighthouse benefits response from the job
Freezing development for early draft PR feedback; latest changes:
- Account for documents that have not been polled for updates yet (status_last_polled_at is nil)
- Begin end to end testing using the Lighthouse QA domain (still debugging with Lighthouse)
Couple loose ends to clean up ahead of draft review
@NB28VT NB28VT requested review from a team as code owners March 15, 2024 18:35
@NB28VT NB28VT closed this Mar 15, 2024
@NB28VT NB28VT reopened this Mar 15, 2024
@NB28VT NB28VT marked this pull request as draft March 15, 2024 18:38
@NB28VT NB28VT requested review from shaunburdick and tblackwe March 15, 2024 18:41
@va-vfs-bot va-vfs-bot temporarily deployed to dbex-lighthouse-document-polling/main/main March 15, 2024 18:59 Inactive
Adds live VCR tests with the status polling endpoint using documents provided by Lighthouse in their QA environment.

Makes necessary changes across service components and in tests to reflect bugs that were found when testing with the actual endpoint
@va-vsp-bot va-vsp-bot requested a deployment to dbex-lighthouse-document-polling/main/main June 25, 2024 16:49 In progress
@tblackwe tblackwe reopened this Jun 25, 2024
@NB28VT
Copy link
Contributor Author

NB28VT commented Jun 25, 2024

@tblackwe @freeheeling ah, I forgot there were still some follow-on items here, some of which may have been addressed and some not


Move to new PR


currently in conflict, needs to be moved to a fresh PR


Moved to second PR, create and test the Model.




github-actions bot commented Aug 2, 2024

This PR is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale label Aug 2, 2024

github-actions bot commented Aug 9, 2024

This PR has been closed because it has had no activity for 37 days

@github-actions github-actions bot closed this Aug 9, 2024
@github-actions github-actions bot deleted the dbex-lighthouse-document-polling branch August 9, 2024 17:32