Adding tools for workbook generation and testing #1723
Conversation
This brings two tools into the tree that have been used for exploring workbook generation.

`generate-sqlite-files` takes public, pipe-delimited Census data and turns it into an SQLite3 database, improving on earlier tools that did the same job. `workbook-generator` is a script that takes a DBKEY and an SQLite3 database containing public Census data, and outputs a set of GFAC-style XLSX workbooks containing data. To do this, `workbook-generator`:

1. Loads one of our templates using `openpyxl`
2. Loads data from the SQLite3 database into the named ranges in the workbook template
3. Saves the template out to a new filename, populated with data

The generator attempts to do everything right. For example, it attempts to follow old linking IDs (e.g. ELECAUDITID) and replace them with AWARD-####-style references. Similarly, it unpacks the odd design of the Notes to SEFA tables to convert the Census data back into a functional workbook.

This is not a fully validated tool. However, it does generate workbooks with authentic data, and they have been used to drive upload processes into our system. Using them, we have found errors in our validations that we did not know about previously.

If we continue using this tool, we will likely want to build it into our testing automation process. In theory, we could have hundreds (or thousands) of workbooks generated and ready for testing. We can also generate workbooks that explicitly test (e.g.) having secondary auditors (or not), or that exhibit other specific properties... all from existing, previously-validated data.

The generator also spits out a JSON document. That document records:

1. The table the data was pulled from
2. The fields pulled
3. The values in those fields

The purpose of that document is to be able to do something like:

1. Load the workbooks through our pipeline
2. Have them undergo validation, cross-val, and ETL
3. Use the JSON document to compare what we pulled from the SQLite DB to what ended up in our dissemination DB

In other words, the JSON document is there to enable end-to-end testing of the dissemination pipeline. Ideally, we would do that final check using the API. This would let us use the JSON document to generate API calls that query the DB (from the "outside"), and verify that the API produces the data we expect.
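To make the flow concrete, here is a minimal stdlib sketch of the `generate-sqlite-files` step (pipe-delimited text into SQLite3) plus a record in the spirit of the JSON side-document. The sample data, table name, and column names are hypothetical, not the actual Census schema:

```python
import csv
import io
import json
import sqlite3

# Hypothetical pipe-delimited sample; the real inputs are the public
# Census export files described above, and these column names are
# illustrative only.
RAW = """DBKEY|AUDITYEAR|AUDITEENAME
100001|2022|EXAMPLE CITY
100002|2022|EXAMPLE COUNTY
"""

def load_pipe_delimited(conn, table, text):
    """Create a table from pipe-delimited text and insert its rows."""
    reader = csv.reader(io.StringIO(text), delimiter="|")
    header = next(reader)
    cols = ", ".join(f'"{c}" TEXT' for c in header)
    conn.execute(f'CREATE TABLE "{table}" ({cols})')
    marks = ", ".join("?" for _ in header)
    conn.executemany(f'INSERT INTO "{table}" VALUES ({marks})', reader)
    conn.commit()
    return header

conn = sqlite3.connect(":memory:")
fields = load_pipe_delimited(conn, "general", RAW)

# A record in the spirit of the JSON side-document: which table and
# fields the values came from, and the values themselves.
record = {
    "table": "general",
    "fields": fields,
    "values": conn.execute('SELECT * FROM "general"').fetchall(),
}
print(json.dumps(record))
```

A record like this is what would later be compared against the dissemination DB (or the API) after the workbooks round-trip through the pipeline.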
Terraform plan for dev: No changes. Your infrastructure matches the configuration.
✅ Plan applied in Deploy to Development and Management Environment #74
Terraform plan for management: No changes. Your infrastructure matches the configuration.
📝 Plan generated in Pull Request Checks #288
```python
# Do everything in a temp dir.
# It will disappear when we hit the end of the with block.
```
This is a non-blocker, but what I would personally like is to have the comments removed from the file and put into tools/generate-sqlite-files/readme.md as code snippets with these comments. So, in readme.md:

```python
with tempfile.TemporaryDirectory('_fac') as tdir:
```

will create a temp dir for running this... etc. etc.
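Expanding that snippet into a runnable sketch of the pattern under discussion (stdlib only; the scratch file name is purely illustrative):

```python
import os
import tempfile

# '_fac' is the directory-name suffix; the directory and everything in
# it are deleted automatically when the with block ends.
with tempfile.TemporaryDirectory("_fac") as tdir:
    scratch = os.path.join(tdir, "scratch.txt")
    with open(scratch, "w") as f:
        f.write("intermediate output")
    existed_inside = os.path.exists(scratch)

# Once the block exits, the temp dir (and the scratch file) are gone.
existed_after = os.path.exists(tdir)
print(existed_inside, existed_after)
```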
Let's pick this up in an iteration. I have a bunch of questions, actually, as to how we would even work some of this into a GH workflow, and that would necessarily drive changes here. For example, we might want to make it a Django command so it is part of the FAC app.
This script could see many changes. Or we might dump it. Either way, at that point we have opportunities to iterate on and reshape the whole thing. Which we might have to.
Co-authored-by: Alex Steel <130377221+asteel-gsa@users.noreply.github.com>
The move of comments into the readme is very much a personal thing; it is not at all a necessity and can be ignored. I like the idea of dropping Stack Overflow links in for reference, and it's really up to you whether you feel those comments can be documented in a readme vs. inside the code. Whoever you ask will tell you yes/no, so I think what I'm trying to say is: if you feel those are worth expanding upon in a readme, great; if not, leave them in if you feel they are necessary to the code. Otherwise, LGTM @jadudm