API-26689: VBADocuments Data Migration to Remove doc_type from UploadSubmission Records #12843

Merged on May 31, 2023

@kristen-brown kristen-brown commented May 31, 2023

Summary

The doc_type stored in UploadSubmission records is a consumer-provided string. We previously stopped storing this key-value pair due to PII/PHI concerns, since we can't control what the consumer sends us in this field. To build upon that work, this PR scrubs the doc_type key and value from the uploaded_pdf jsonb column of all UploadSubmission records.
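To illustrate the intended effect on a single record, here is a minimal Ruby sketch of what removing one top-level key from the column's JSON looks like. The payload is hypothetical: every key other than doc_type is an assumption made for illustration, and `scrub_doc_type` is an illustrative helper, not code from this PR.

```ruby
require 'json'

# Hypothetical uploaded_pdf payload. Only doc_type comes from the PR
# description; the other keys are assumptions for illustration.
uploaded_pdf = JSON.parse(<<~PAYLOAD)
  {
    "total_documents": 1,
    "total_pages": 3,
    "doc_type": "consumer supplied text"
  }
PAYLOAD

# Ruby analogue of PostgreSQL's `jsonb - 'doc_type'`: drop one top-level key
# and return every other key/value pair unchanged.
def scrub_doc_type(metadata)
  metadata.reject { |key, _value| key == 'doc_type' }
end

scrubbed = scrub_doc_type(uploaded_pdf)
puts scrubbed.to_json
```

Running the scrub on an already-scrubbed hash returns it unchanged, which matches the behavior of the jsonb `-` operator when the key is absent.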

Related issue(s)

API-26689

Testing done

The rake task was run locally against my database, which contains a variety of UploadSubmission records. The contents of the rake task were also run on the development server, where I confirmed that they remove the doc_type key from the uploaded_pdf jsonb column and leave the rest of the column's contents intact.

The task will be run in the following order on the vets-api environments:

  1. Development (6,415 records) ✅
  2. Staging (21,715 records)
  3. Sandbox (166,619 records)
  4. Production (2,577,864 records)

Screenshots

None

What areas of the site does it impact?

This PR impacts the data stored in UploadSubmission records (VBADocuments module). It also renames a couple of existing VBADocuments rake task files to move the date stamp to the front of the file name (for file sorting) and updates the namespace of the data migration tasks so that they're not under a generic temp namespace.

Acceptance criteria

  • I fixed|updated|added unit tests and integration tests for each feature (if applicable). – N/A
  • No error nor warning in the console.
  • Events are being sent to the appropriate logging solution – N/A
  • Documentation has been updated (link to documentation) – N/A
  • No sensitive information (i.e. PII/credentials/internal URLs/etc.) is captured in logging, hardcoded, or specs
  • Feature/bug has a monitor built into Datadog or Grafana (if applicable) – N/A
  • If app impacted requires authentication, did you login to a local build and verify all authenticated routes work as expected – N/A
  • I added a screenshot of the developed feature – N/A

Requested Feedback

Given the large difference in UploadSubmission record count between the lower environments and Production, I request that this PR be reviewed through the lens of performance. I also hope the reviewer can help identify whether running the migration as written on the Production server carries any data loss risks beyond the intended data loss. From my research, it appeared to be a performant query (and certainly more performant than updating the records via ActiveRecord), but a second set of eyes/expertise would be appreciated. Thanks!

@kristen-brown kristen-brown added the Lighthouse and banana-peels (Lighthouse Banana Peels Team) labels May 31, 2023
@kristen-brown kristen-brown requested review from a team as code owners May 31, 2023 13:58
@va-vfs-bot va-vfs-bot temporarily deployed to API-26689-vba-documents-doc-type-data-migration/main/main May 31, 2023 13:59 Inactive
@va-vsp-bot va-vsp-bot requested a deployment to API-26689-vba-documents-doc-type-data-migration/main/main May 31, 2023 14:10 In progress
@kristen-brown kristen-brown merged commit 77207bb into master May 31, 2023
@kristen-brown kristen-brown deleted the API-26689-vba-documents-doc-type-data-migration branch May 31, 2023 14:33
task scrub_doc_type_from_metadata: :environment do
  ActiveRecord::Base.connection.execute("
    UPDATE vba_documents_upload_submissions
    SET uploaded_pdf = uploaded_pdf::jsonb - 'doc_type';

I think this is a great approach. The jsonb - operator is perfect. The only thing I am worried about is the number of records on Sandbox and Prod. 2+ million records will surely give the VM something to think about, especially since it has to dig into the complex data structure for each row. What I have done in the past is to batch over the IDs in groups of a thousand or so.
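The batching idea could be sketched roughly as follows. This is not the merged implementation: it assumes the table has an integer id primary key, and the helper names `id_batches` and `scrub_sql` are hypothetical. The database calls are left as comments because they need a Rails environment.

```ruby
# Hypothetical helper: split an inclusive id range into [first, last] pairs
# of at most batch_size ids each. Returns [] for an empty table (nil bounds).
def id_batches(min_id, max_id, batch_size = 1_000)
  return [] if min_id.nil? || max_id.nil?

  (min_id..max_id).step(batch_size).map do |first|
    [first, [first + batch_size - 1, max_id].min]
  end
end

# Hypothetical helper: the same UPDATE as in the rake task, restricted to one
# id range. Integer() raises on non-numeric input, so nothing else can be
# interpolated into the statement.
def scrub_sql(first_id, last_id)
  <<~SQL
    UPDATE vba_documents_upload_submissions
    SET uploaded_pdf = uploaded_pdf::jsonb - 'doc_type'
    WHERE id BETWEEN #{Integer(first_id)} AND #{Integer(last_id)};
  SQL
end

# Inside the rake task this might be driven like so (requires Rails):
#   conn = ActiveRecord::Base.connection
#   min_id, max_id = conn.select_rows(
#     'SELECT MIN(id), MAX(id) FROM vba_documents_upload_submissions'
#   ).first
#   id_batches(min_id, max_id).each { |first, last| conn.execute(scrub_sql(first, last)) }

# Standalone demonstration with a small, made-up id range.
id_batches(1, 2_500).each { |first, last| puts scrub_sql(first, last) }
```

Each statement touches a bounded slice of rows and commits on its own, so an interrupted run can be restarted safely: removing a key that is already gone is a no-op.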

ryan-mcneil pushed a commit that referenced this pull request Dec 11, 2023
…ploadSubmission` Records (#12843)

* API-26689: Add rake task for removing doc_type from UploadSubmission metadata

* API-26689: Updating namespacing of rake tasks

* API-26689: Fix indentation