Adding tools for workbook generation and testing #1723

Merged: jadudm merged 5 commits into main from jadudm/workbook-testing-tools on Aug 4, 2023

Conversation

jadudm (Contributor) commented Aug 4, 2023

This brings two tools into the tree that have been used for exploring workbook generation.

`generate-sqlite-files` takes public, pipe-delimited Census data and turns it into an SQLite3 database. It improves on the previous tools that did this conversion.
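
The conversion itself is conceptually simple. Here is a minimal sketch of the idea, using a hypothetical export file and table name; the real tool handles many Census tables and is more careful about column types and encodings:

```python
import csv
import sqlite3

# Hypothetical file and table names; the real Census export files
# have their own names, which are not shown here.
SOURCE = "census_gen.txt"
TABLE = "census_gen"
DB = "census.sqlite3"

# Read the pipe-delimited export: first row is the header.
with open(SOURCE, newline="", encoding="utf-8") as fh:
    reader = csv.reader(fh, delimiter="|")
    header = next(reader)
    rows = list(reader)

# Create a table with one (untyped) column per header field and load the rows.
conn = sqlite3.connect(DB)
columns = ", ".join(f'"{name}"' for name in header)
placeholders = ", ".join("?" for _ in header)
conn.execute(f"CREATE TABLE IF NOT EXISTS {TABLE} ({columns})")
conn.executemany(f"INSERT INTO {TABLE} VALUES ({placeholders})", rows)
conn.commit()
conn.close()
```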

`workbook-generator` is a script that takes a DBKEY and an SQLite3 database containing public Census data, and outputs a set of populated, GFAC-style XLSX workbooks.

To do this, `workbook-generator`:

  1. Loads one of our templates using `openpyxl`
  2. Loads data from the SQLite3 database into the named ranges in the workbook template
  3. Saves the template out to a new filename, populated with data (a rough sketch of these steps follows below).
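
As a rough sketch of those three steps: the template path, database name, query, and named range below are illustrative only (the real templates and ranges differ), and the dict-style `defined_names` lookup assumes openpyxl 3.1 or later:

```python
import sqlite3
from openpyxl import load_workbook
from openpyxl.utils import range_boundaries

DBKEY = "123456"  # example DBKEY, not a real submission

wb = load_workbook("federal-awards-template.xlsx")   # 1. load a template
conn = sqlite3.connect("census.sqlite3")
amounts = [row[0] for row in conn.execute(
    "SELECT AMOUNT FROM cfda WHERE DBKEY = ?", (DBKEY,)
)]

# 2. pour the query results into a named range, one value per row
defn = wb.defined_names["amount_expended"]           # dict lookup, openpyxl >= 3.1
sheet_name, coord = next(defn.destinations)          # e.g. ("Form", "$B$2:$B$500")
ws = wb[sheet_name]
min_col, min_row, _, _ = range_boundaries(coord)
for offset, value in enumerate(amounts):
    ws.cell(row=min_row + offset, column=min_col, value=value)

# 3. save the populated template under a new name
wb.save(f"federal-awards-{DBKEY}.xlsx")
```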

The generator attempts to get the details right. For example, it follows old linking IDs (e.g. `ELECAUDITID`) and replaces them with AWARD-####-style references. Similarly, it unpacks the odd design of the Notes to SEFA tables to convert the Census data back into a functional workbook.
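
The linking-ID rewrite is essentially a renumbering. With made-up ELECAUDITID values, and assuming the references are zero-padded sequence numbers, it amounts to something like:

```python
# Made-up ELECAUDITID values; the real ones come from the Census data.
elec_audit_ids = [4487123, 4487124, 4487125]
award_refs = {
    eid: f"AWARD-{n:04}" for n, eid in enumerate(elec_audit_ids, start=1)
}
# {4487123: "AWARD-0001", 4487124: "AWARD-0002", 4487125: "AWARD-0003"}
```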

This is not a fully validated tool. However, it does generate workbooks with authentic data, and they have been used to drive upload processes into our system. Using them, we have found errors in our validations that we did not know about previously.

If we continue using this tool, we will likely want to build it into our testing automation process. In theory, we could have hundreds (or thousands) of workbooks generated and ready for testing. We can also generate workbooks that explicitly exercise specific properties, for example the presence (or absence) of secondary auditors, all from existing, previously-validated data.

The generator also spits out a JSON document. That document records:

  1. The table the data was pulled from
  2. The fields pulled
  3. The values in those fields (an example record is sketched below)
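
A record in that document might look something like the following; the shape, field names, and output filename here are assumptions for illustration, not the tool's actual format:

```python
import json

# Assumed shape; the generator's actual JSON layout may differ.
record = {
    "table": "cfda",
    "fields": ["CFDA", "AMOUNT"],
    "values": [["84.010", 250000], ["10.553", 125000]],
}
with open("workbook-source-data.json", "w") as fh:
    json.dump([record], fh, indent=2)
```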

The purpose of that document is to be able to do something like:

  1. Load the workbooks through our pipeline
  2. Have them undergo validation, cross-val, and ETL
  3. Use the JSON document to compare what we pulled from the SQLite DB to what ended up in our dissemination DB.

In other words, the JSON document is to enable end-to-end testing of the dissemination pipeline. Ideally, we would do that final check using the API. This would let us use the JSON document to generate API calls that query the DB (from the "outside"), and verify that the API produces data we expect.
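
As a sketch of what that final check could look like, assuming the JSON document shaped as above and a hypothetical API endpoint and query parameters (the real dissemination API and its field names are not specified here):

```python
import json
import requests

# Hypothetical endpoint, report id, and field names, for illustration only.
API = "https://fac.example.gov/api/federal_awards"
REPORT_ID = "2023-TEST-0001"

with open("workbook-source-data.json") as fh:
    records = json.load(fh)

# Query the dissemination DB from the "outside", via the API.
resp = requests.get(API, params={"report_id": f"eq.{REPORT_ID}"})
resp.raise_for_status()
disseminated = {row["amount_expended"] for row in resp.json()}

# Compare what we pulled from SQLite to what the API returns.
for rec in records:
    if rec["table"] != "cfda":
        continue
    amount_idx = rec["fields"].index("AMOUNT")
    expected = {row[amount_idx] for row in rec["values"]}
    missing = expected - disseminated
    assert not missing, f"values missing from dissemination: {missing}"
```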

github-actions bot commented Aug 4, 2023

Terraform plan for dev

No changes. Your infrastructure matches the configuration.

Terraform has compared your real infrastructure against your configuration
and found no differences, so no changes are needed.

✅ Plan applied in Deploy to Development and Management Environment #74

github-actions bot commented Aug 4, 2023

Terraform plan for management

No changes. Your infrastructure matches the configuration.

Terraform has compared your real infrastructure against your configuration
and found no differences, so no changes are needed.

📝 Plan generated in Pull Request Checks #288

Comment on lines +24 to +25
# Do everything in a temp dir.
# It will disappear when we hit the end of the with block.
asteel-gsa (Contributor) commented:

This is a non-blocker, but what I would personally like is to have these comments removed from the file and put in tools/generate-sqlite-files/readme.md as code snippets with the comments attached.

So, in readme.md: `with tempfile.TemporaryDirectory('_fac') as tdir:` will create a temp dir for running this, etc.
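
For reference, a minimal form of the pattern being discussed (the `'_fac'` suffix and the quoted comments come from the code above; the rest is illustrative):

```python
import tempfile
from pathlib import Path

# Do everything in a temp dir.
# It will disappear when we hit the end of the with block.
with tempfile.TemporaryDirectory("_fac") as tdir:
    scratch = Path(tdir) / "extracted-data.txt"
    scratch.write_text("...")   # downloads, extraction, etc. happen here
# At this point tdir and everything under it has been removed.
```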

jadudm (Contributor, Author) replied:

Let's pick this up in an iteration. I have a bunch of questions, actually, as to how we would even work some of this into a GH workflow. That would necessarily drive changes to this. For example, we might want to make it a Django command, so it is part of the FAC app.

The number of changes coming for this script could be many. Or, we might dump it. So, at that point, we have opportunities to iterate/reshape the whole thing. Which we might have to.

asteel-gsa (Contributor) commented:

The move of comments into the readme is very much a personal thing; it is not at all a necessity and can be ignored. I like the idea of dropping Stack Overflow links for references, and it's really up to you whether you feel those comments can be documented in a readme vs. inside the code. Whoever you ask will tell you yes or no, so I think what I'm trying to say is: if you feel those are worth expanding upon in a readme, great; if not, leave them in the code if you feel they are necessary.

Otherwise, LGTM @jadudm

@jadudm jadudm merged commit 1bda9c6 into main Aug 4, 2023
@jadudm jadudm deleted the jadudm/workbook-testing-tools branch August 4, 2023 21:24