Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Output Parquet files as well as SQLite in PUDL ETL #3296
Output Parquet files as well as SQLite in PUDL ETL #3296
Changes from 37 commits
8b4fed4
b1ac18a
0a78a94
c9aa6f0
b6136c7
83492a0
83650d5
39b72e5
9043255
4ab92f2
c49d129
afed63d
d88c1ef
4ab01d5
cd98379
a400446
df5e12d
c3d4bb3
9081388
273741a
06571ac
cd2dcd0
c7ce623
ab66c2a
013f4a5
3408333
91b4ff9
01ae240
21e0dbf
ee2a533
2fceea6
1cf1fbd
dd6e395
192a569
a27c5f1
f47a9d1
a961480
46dd5a2
e8ebdd7
06aa9fe
5b82d91
aaa46cb
0492659
e5825fb
6fd541e
cc3644a
7269bf7
cf1d5c5
f9ed62b
f2c9f25
41eb6d3
27555ae
827dc8b
16b4585
b54e915
5b1a6f5
ef74271
4ed6f90
53a3fd4
40e58a6
b922919
d59f795
5fffc65
16fb36d
6c23e79
1aa55b9
6ca051b
0883831
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pulled the foreign key check logic out of the io-manager and put it here directly, which seemed as good a place as anywhere to put it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that was my original idea as well. I do not like the name clash in that you end up with
pudl.etl.check_foreign_keys.check_foreign_keys()
.In the end, I moved this to
helpers.py
. That said, there doesn't really seem to be a good place in the codebase for miscellaneous stuff and that might be useful, e.g.pudl.misc
where you can have assortment of small modules that don't naturally fit elsewhere. Putting this intocheck_foreign_keys.py
is a bit awkward IMO, because of the naming as well as the fact that this module is CLI and I think that mixing CLI/binary capabilities and library capabilities (something you'd be importing elsewhere) seems a little dirty.This is just a thought and I would not block this PR for this, as this can be easily refactored in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I fully agree, and had some of the same thoughts, but I checked and right now
check_foreign_keys
is only being called from the CLI, and in testing, so I decided it would be ok to leave here for now. Maybe it would make sense to change the function name to_check_foreign_keys
to make it clear it's not meant to be public library code right now?Part of the reason I put it here is because I think there's a much bigger conversation about code organization that I didn't want to tackle here (should probably try to reignite the github discussion on code org/structure).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is fine for now. I think I added
check_foreign_keys
to theSQLiteIOManager
class because I imaged we might have different methods for checking referential integrity for each IO Manager. For example, if we decided to drop sqlite and rely on a combination of parquet files and duckdb we'd need to create a new method of enforcing referential integrity.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh, that makes sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's also some actual data transformation creeping into the
etl
subpackage which I thought was just supposed to be laying out the Dagster assets / config stuff (seeglue_assets.py
)