Add ability to write data to both Parquet and SQLite during ETL #3232
Conversation
Use pyarrow.parquet for writing/reading, add BaseSettings to configure input/output behaviors. By default, let both modes be disabled but allow overrides via PUDL_WRITE_TO_PARQUET and PUDL_READ_FROM_PARQUET env variables.
@zaneselvans once we actually support dual output formats (sqlite+parquet) it might be a good time to do a little cleanup in the io manager codebase, as this has clearly been designed with sqlite only in mind, and the class structure seems a little clunky here, especially when it comes to the overlap/differences between the pudl/ferc io managers (the latter being only partial and fixed to sqlite formats) and pudl/epacems (which might now share more due to both writing to parquet). I will think about how we could improve the situation, but this could get us going in the meantime.
Conversion to pyarrow table was necessary before writing to parquet.
@@ -66,6 +66,10 @@ def sqlite_db_uri(self, name: str) -> str:
        # sqlite://{credentials}/{db_path}
        return f"sqlite:///{self.sqlite_db_path(name)}"

    def parquet_path(self, db_name: str, table_name: str) -> Path:
        """Return path to parquet file for given database and table."""
        return self.output_dir / "parquet" / f"{table_name}.parquet"
I thought this could be a natural place to abstract the naming away from specific sites like io managers, and it might allow for some degree of flexibility if needed. For now, db_name may not be necessary here.
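For illustration, a minimal sketch of how such a helper might sit on a settings object. PudlPaths here is a simplified stand-in, and the table name is a made-up example; only the parquet_path body mirrors the diff above.

```python
from pathlib import Path


class PudlPaths:
    """Simplified stand-in for the settings object that owns output paths."""

    def __init__(self, output_dir: Path):
        self.output_dir = output_dir

    def parquet_path(self, db_name: str, table_name: str) -> Path:
        """Return path to the parquet file for a given database and table."""
        # db_name is accepted for future flexibility but unused for now,
        # mirroring the diff above.
        return self.output_dir / "parquet" / f"{table_name}.parquet"


paths = PudlPaths(Path("/tmp/pudl_out"))
print(paths.parquet_path("pudl", "plants_eia860"))
# /tmp/pudl_out/parquet/plants_eia860.parquet
```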
Codecov Report

Attention: additional details and impacted files below.

@@ Coverage Diff @@
##            main   #3232     +/-   ##
=======================================
- Coverage   92.6%   92.5%    -0.1%
=======================================
  Files        143     143
  Lines      12979   13090    +111
=======================================
+ Hits       12025   12114     +89
- Misses       954     976     +22

View full report in Codecov by Sentry.
I'm fuzzy on where the configuration should go, but I'm pretty sure that Dagster already has a place where it expects to receive this kind of resource configuration, including via environment variables. But maybe @bendnorman knows off the top of his head.
It feels like it would be cleaner to keep the SQLite IO Manager dedicated to SQLite, and have another IO Manager that is dedicated to the Parquet logic, and then compose them together into a hybrid that can do both / either depending on how it's configured. Does that seem reasonable?
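A rough sketch of what that composition could look like. handle_output and load_input are the actual Dagster IOManager hook names; the classes themselves are illustrative stubs, not the real PUDL implementations.

```python
class PudlSQLiteIOManager:
    """Stub standing in for a pure SQLite IO manager."""

    def handle_output(self, context, df):
        print(f"sqlite write: {context}")

    def load_input(self, context):
        return f"sqlite read: {context}"


class PudlParquetIOManager:
    """Stub standing in for a pure Parquet IO manager."""

    def handle_output(self, context, df):
        print(f"parquet write: {context}")

    def load_input(self, context):
        return f"parquet read: {context}"


class PudlSQLiteAndParquetIOManager:
    """Hybrid composing the two pure IO managers: always writes to sqlite,
    optionally also to parquet, and reads from the configured backend."""

    def __init__(self, write_to_parquet: bool = False, read_from_parquet: bool = False):
        self._sqlite = PudlSQLiteIOManager()
        self._parquet = PudlParquetIOManager()
        self.write_to_parquet = write_to_parquet
        self.read_from_parquet = read_from_parquet

    def handle_output(self, context, df):
        self._sqlite.handle_output(context, df)
        if self.write_to_parquet:
            self._parquet.handle_output(context, df)

    def load_input(self, context):
        backend = self._parquet if self.read_from_parquet else self._sqlite
        return backend.load_input(context)


hybrid = PudlSQLiteAndParquetIOManager(write_to_parquet=True, read_from_parquet=True)
print(hybrid.load_input("plants_eia860"))  # parquet read: plants_eia860
```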
src/pudl/io_managers.py
Outdated

# TODO(rousik): now that this experimentally supports also writing to parquet
# as an alternative storage format, we should probably rename this class
# to be less sqlite-centric.
My understanding is that in Dagster the IO Managers are primarily designed to allow switching the whole system between say, writing the outputs to SQLite for testing vs. writing them to BigQuery in production, and that this dual-output use case is a little weird.
Could we stick with a pure PudlSQLiteIOManager, and a pure PudlParquetIOManager, and then create a dual PudlSQLiteAndParquetIOManager that is a composition of the two simple ones? It seems wrong to be mixing this entire other kind of output into the most generic SQLiteIOManager that we've defined.
In pudl.etl we define the default resources which are used to construct the jobs.
default_resources = {
"datastore": datastore,
"pudl_sqlite_io_manager": pudl_sqlite_io_manager,
"ferc1_dbf_sqlite_io_manager": ferc1_dbf_sqlite_io_manager,
"ferc1_xbrl_sqlite_io_manager": ferc1_xbrl_sqlite_io_manager,
"dataset_settings": dataset_settings,
"ferc_to_sqlite_settings": ferc_to_sqlite_settings,
"epacems_io_manager": epacems_io_manager,
}
And I think we can just swap in whichever IOManager we want to use here by changing pudl_sqlite_io_manager and have it affect the whole ETL. Right now these jobs and the resources associated with them are hard-coded, but @bendnorman recently suggested that we migrate away from using pudl_etl as a wrapper and switch to using the Dagster CLI directly. Would that make it easier to switch between an SQLite-only and an SQLite + Parquet IOManager just using the Dagster resource configurations?
I dug deeper into how we deal with io_managers and I'm a bit perplexed by the apparent complexity here. There are several partial implementations, some of which support reads, some only writes, and there's more duplicated code than there ideally should be, but I think this is a problem for a little later and shouldn't necessarily be solved as part of this PR.
After some more tinkering, I think that the idea of a composite IO manager that simply hands off reads/writes to either sqlite or parquet should work well and should be relatively clean too.
The only remaining snag is that dagster support for loading configuration from env variables seems kind of... incomplete... in that there doesn't seem to be support for booleans or for supplying a default value for when the env variable is not set.
I was expecting something along the lines of: 1. use the built-in default, 2. replace this with the env variable if set, and 3. replace the env default with a custom value in case it is supplied via the dagster ui/job configuration, but either I'm misreading or this is not actually supported?!
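The layering described here can be pinned down in plain Python, outside dagster. The flag name is the one from this PR; the helper and the job_config dict are hypothetical stand-ins for whatever dagster would supply.

```python
import os


def resolve_flag(name: str, builtin_default: bool, job_config: dict) -> bool:
    """Resolve a boolean flag: start from the built-in default, override
    with the env variable if set, and let explicit job/UI config win."""
    value = builtin_default
    raw = os.environ.get(name)
    if raw is not None:
        value = raw.strip().lower() in ("1", "true", "yes", "on")
    if name in job_config:
        value = job_config[name]
    return value


os.environ["PUDL_WRITE_TO_PARQUET"] = "true"
print(resolve_flag("PUDL_WRITE_TO_PARQUET", False, {}))  # True, env overrides default
print(resolve_flag("PUDL_WRITE_TO_PARQUET", False, {"PUDL_WRITE_TO_PARQUET": False}))  # False, explicit config wins
```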
Among us I think @bendnorman knows the most about how Dagster resource configuration is supposed to work. I only have a vague understanding, and am just hoping that we can avoid a proliferation of configuration systems, and ideally use whatever the canonical system is for the framework we've adopted.
Do we have to use an env var to configure the PudlMixedFormatIOManager? I think just using vanilla boolean attributes would be fine:
write_to_parquet: bool = False
"""If true, data will be written to parquet files."""
read_from_parquet: bool = False
"""If true, data will be read from parquet files instead of sqlite."""
We could create an etl_full_gcp_build that configures the PudlMixedFormatIOManager to write and read from parquet.
You can also set a default value for an EnvVar using EnvVar.get_value(). We could do something like this:
write_to_parquet: bool = bool(EnvVar("PUDL_WRITE_TO_PARQUET").get_value(default=True))
Not the most elegant thing ever but I tested it out and it allows you to have a default, read from an env var and overwrite the value in the UI.
There is an IntEnvVar that casts env var values to ints. Maybe we could make one for booleans?
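One caveat with the bool(...get_value(...)) pattern above: bool() on any non-empty string is True, so an env var set to "false" would still come out truthy. A boolean counterpart to IntEnvVar would need an explicit parse; a plain-Python sketch (the helper name is hypothetical):

```python
import os


def bool_env_var(name: str, default: bool = False) -> bool:
    """Parse an env variable into a boolean with an explicit truthy-string
    check, since a bare bool() cast treats any non-empty string as True."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in ("1", "true", "yes", "on")


assert bool("false") is True  # the pitfall: any non-empty string is truthy
os.environ["PUDL_READ_FROM_PARQUET"] = "false"
print(bool_env_var("PUDL_READ_FROM_PARQUET"))  # False
print(bool_env_var("PUDL_SOME_UNSET_FLAG", default=True))  # True
```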
I think my preference is for PudlMixedFormatIOManager to be a subclass of IOManager and to configure it using the @io_manager(config_schema=...) decorator. These are the legacy concepts, but I don't think it's great to have our IO managers use a mix of the legacy and current resource concepts.
Eventually we should move to using the new pydantic configurable resources. We explored migrating our settings configurations in #2842 but decided not to because it was a bigger change than we expected. We could create separate issues for migrating the resources in pudl.resources and pudl.io_managers.
Good to know ConfigurableIOManager works with the @io_manager decorator; however, I think it might be duplicative given that @io_manager is also used to make the IO Manager configurable.
To be consistent with our other IO managers, PudlMixedFormatIOManager would look something like this:
class PudlMixedFormatIOManager(IOManager):
    def __init__(self, write_to_parquet: bool, read_from_parquet: bool):
        self.write_to_parquet: bool = write_to_parquet
        self.read_from_parquet: bool = read_from_parquet


@io_manager(
    config_schema={
        "write_to_parquet": Field(
            bool,
            default_value=False,
        ),
        "read_from_parquet": Field(
            bool,
            default_value=False,
        ),
    },
)
def pudl_io_manager(init_context) -> IOManager:
    """Create a mixed-format IO manager dagster resource for the pudl database."""
    write_to_parquet = init_context.resource_config["write_to_parquet"]
    read_from_parquet = init_context.resource_config["read_from_parquet"]
    return PudlMixedFormatIOManager(write_to_parquet, read_from_parquet)
Eventually everything should be ConfigurableResources that are instantiated without an @io_manager decorator, like in this example. I think it will be easier to switch all of the IO managers together instead of partially in this PR and partially in another.
This is a fair point; the duplication of the config in both the ConfigurableIOManager and the config_schema here would be tedious.
@bendnorman, fair point re/ the use of config_schema here. I will make the necessary changes.
Though it's tempting to keep using ConfigurableIOManager, as that takes care of building the constructor for me that would otherwise be identical to what I'm about to write :-)
Moving the conversation ahead a notch. The modified code that addresses some of the issues outlined here is still on my local machine and needs to be cleaned up before publishing; this is pending an open issue with dagster.
I've refactored the IO managers so that there's a sqlite one, a parquet one, and a combined one. Reading from env variables is kind of crummy, because dagster has much worse support for this than pydantic. I feel like there's still some work to be done to clean things up, and, ideally, we would also use the same parquet io manager for epacems, but the presence of sharding will require some custom logic. The pyarrow parquet writer does support sharding, but the naming schema is usually
@rousik It looks like it's not actually able to write to the DB because it has no connection engine?
Oh, that would be the tests reaching into the IO manager and assuming its internal structure. I'll fix it later today when at the keyboard. The ETL itself should be working.
src/pudl/io_managers.py
Outdated

@@ -536,9 +638,19 @@ def load_input(self, context: InputContext) -> pd.DataFrame:

 @io_manager
-def pudl_sqlite_io_manager(init_context) -> PudlSQLiteIOManager:
+def pudl_sqlite_io_manager(init_context) -> IOManager:
@bendnorman and @zaneselvans - given that this now returns a mixed format IO manager, could we rename this resource/io-manager from pudl_sqlite_io_manager to plain pudl_io_manager? I think that would be more appropriate. There's a bit of a risk that the very widely used word pudl refers both to the primary output database (pudl.sqlite up to this point) as well as the overall project/ETL/codebase. This ambiguity may be okay, because I'm not quite sure what other word we could substitute here. Maybe main_output or something along those lines?
I think pudl_io_manager is appropriate for this IO Manager now that it loads data to sqlite and parquet files. We're currently ok with the naming ambiguity of the code and data outputs. We explain the different parts of pudl in the readme.
I think pudl_io_manager makes sense.
This is probably beyond the scope of this PR and maybe more of a @bendnorman thing, but I think we should also change the way the default_resources are defined in pudl.etl.__init__.py:
default_resources = {
"datastore": datastore,
"pudl_sqlite_io_manager": pudl_sqlite_io_manager,
"ferc1_dbf_sqlite_io_manager": ferc1_dbf_sqlite_io_manager,
"ferc1_xbrl_sqlite_io_manager": ferc1_xbrl_sqlite_io_manager,
"dataset_settings": dataset_settings,
"ferc_to_sqlite_settings": ferc_to_sqlite_settings,
"epacems_io_manager": epacems_io_manager,
}
It seems like the keys in this dictionary should not specify anything about the format / destination of the data, while the values (the actual IO managers) should have that specificity. So pudl_io_manager could be any kind of IO manager... but regardless, it's the one that's going to be used to output PUDL data, whether it's going to SQLite, BigQuery, both SQLite + Parquet, etc. And the value associated with it determines the details.
I found this very confusing in thinking about how we were going to switch over to writing the data to more than just SQLite, since we've hard-coded pudl_sqlite_io_manager all throughout the system. But it's not really an SQLite IO Manager. It could be any kind of IO Manager. But I didn't realize they were separable until reviewing this PR.
The FERC IO managers seem like they have a similar issue, and also, what IO Managers are used to output all of the FERC Forms that aren't Form 1?
I agree with Zane, the keys shouldn't specify anything about format / destination.
The FERC IO managers are just for reading data out of the dbf and xbrl sqlite databases. We don't process any of the other ferc forms in the main PUDL ETL, so we don't have IO managers for them.
Also, don't we need to tell it to use the new IO Manager here? Otherwise it won't actually generate any Parquet files, will it?
We are instantiating PudlMixedFormatIOManager here, which uses either of the formats depending on the configuration.
Just one minor issue with what we're logging for the read/write formats.
src/pudl/io_managers.py
Outdated

write_to_parquet: bool = bool(EnvVar("PUDL_WRITE_TO_PARQUET").get_value(False))
"""If true, data will be written to parquet files."""

read_from_parquet: bool = bool(EnvVar("PUDL_READ_FROM_PARQUET").get_value(False))
"""If true, data will be read from parquet files instead of sqlite."""
What are the benefits of using environment variables to configure this IO manager? We currently aren't using them to configure our other resources.
In short, practical concerns.
I'm treating this as a "feature flag", where we start by having the new functionality (writing to parquet files) off by default, but toggle-able on demand. While toggling this in the dagster UI is useful for local development, in many cases where we might want to test this (e.g. in the CI, nightly builds, and other dockerized/remote scenarios), it's very easy to set an env variable and pass it to the ETL, and it's much more difficult to tweak dagster configuration from the outside IMO.
This could be hooked into a command-line flag, but that is also somewhat clunky and would require ad-hoc wiring to be piped through, while env variables are easy and widely supported (both by github actions as well as when running the docker image on a vm/batch), so it should be fairly easy to toggle this new functionality just where we need it.
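For example, in a shell-driven context (a CI step or docker entrypoint) the toggle is a one-liner. The flag name is the one from this PR; the echo is just to show the flag propagating to a child process, and the docker line is an illustrative pattern rather than the project's actual invocation.

```shell
# Enable the experimental parquet output for this run; the exported
# variable is inherited by any child process (the ETL, a container, etc.)
export PUDL_WRITE_TO_PARQUET=1

# Child processes see the flag:
sh -c 'echo "write_to_parquet=${PUDL_WRITE_TO_PARQUET:-0}"'
# write_to_parquet=1

# In a dockerized run, the same flag can be forwarded, e.g.:
#   docker run -e PUDL_WRITE_TO_PARQUET=1 <image> ...
```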
If all goes well and this feature ends up being enabled everywhere and always, we can drop this toggle and hardcode the new behavior (always on). I'm unsure what the timeline for this would be, given the early stage this is in.
If reading from Parquet ends up significantly speeding up the ETL it's easy to imagine this getting turned on immediately.
Once it's there, I'd experiment with using it to speed up the integration tests and data validations (which would mean relying on SQLite primarily for bulk distribution, validation of referential integrity, and maybe other schema checks / constraints that are only implemented by the to_sqlite() schema). Since we want to remove PudlTabl (probably in the Spring) and that will mean reorganizing the tests, that would probably be a good time to play around with it. I don't know exactly what fraction of the time in the integration tests and data validations is spent reading data, but I suspect it's significant given the way we've wired up PudlTabl to pass the read requests through to the database and not cache anything in memory.
I can't remember where we were discussing it, but I think @bendnorman suggested that since the PUDL ETL script is now a relatively thin wrapper around some slightly nonstandard Dagster structures, we might want to switch to using the Dagster CLI directly, which I think would make these job configurations more straightforward with the Dagster configs.
Also, don't we need to update the resource configuration? Like if I check out this branch and run the ETL, it won't actually make any Parquet files, will it?
This makes more sense, as the io manager should be format independent.
Okay. I have made some more changes, specifically renamed I have left
We're already using
This encapsulation makes it much nicer than passing around fixtures.
@@ -33,7 +33,7 @@ def test_pudl_engine(

     if check_foreign_keys:
         # Raises ForeignKeyErrors if there are any
-        pudl_sql_io_manager.check_foreign_keys()
+        pudl_sql_io_manager._sqlite_io_manager.check_foreign_keys()
This is dirty, but I'm not sure what would be the best way around it. Maybe implement a MixedFormatIOManager.get_sqlite_io_manager() -> PudlSqliteIOManager method that we could call when we really need to access sqlite-only features?
I think implementing a getter that throws a helpful error if the MixedFormatIOManager doesn't have a sqlite_io_manager would work.
It might not be worth the complexity, but we could make the MixedFormatIOManager more generalizable and accept an arbitrary number of IO managers. Then we could have some methods for viewing and grabbing specific io managers:
>>> MixedFormatIOManager.list_io_manager_names()
["sqlite_io_manager", "parquet_io_manager", "bq_io_manager", ...]
>>> MixedFormatIOManager.get_io_manager("sqlite_io_manager")
I think for now a simple MixedFormatIOManager.get_sqlite_io_manager() -> PudlSqliteIOManager method seems fine.
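A sketch of that accessor pattern, with strings standing in for the actual backend IO managers and all names illustrative:

```python
class MixedFormatIOManager:
    """Holds named backend IO managers and exposes them by name, raising a
    helpful error when a requested backend is not configured."""

    def __init__(self, **io_managers):
        # e.g. MixedFormatIOManager(sqlite_io_manager=..., parquet_io_manager=...)
        self._io_managers = io_managers

    def list_io_manager_names(self) -> list[str]:
        return sorted(self._io_managers)

    def get_io_manager(self, name: str):
        try:
            return self._io_managers[name]
        except KeyError:
            raise ValueError(
                f"No {name!r} configured; available: {self.list_io_manager_names()}"
            ) from None


mgr = MixedFormatIOManager(sqlite_io_manager="sqlite backend stub")
print(mgr.list_io_manager_names())  # ['sqlite_io_manager']
print(mgr.get_io_manager("sqlite_io_manager"))  # sqlite backend stub
```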
I think the generalization here is probably not warranted. I would only add that if/when we know we will need it; otherwise it may just be future-proofing for something that will never happen, and I agree that the complexity is not worth it.
What this does show, however, is that the sqlite and pudl-sqlite io managers are special in that their APIs and functionality are much broader than those of the typical, format-agnostic io manager, and this API is a bit vague.
It might be valuable to separate these concerns somewhat, perhaps by defining the interface that such a sqlite thingy should provide and having a method on the mixed io manager that can return implementations of this clearly-defined and separate interface. It could return the io_manager that implements this interface, but in most cases I see, we don't actually need the io_manager but the specific sqlite validation/schema-management functionality.
It might even make sense to separate this sqlite/schema functionality into a standalone class that is embedded in the IO manager but with which the IO manager communicates through this well-defined API.
Note that the above may very well go past the scope of this PR and should likely be reserved for subsequent cleanup/refactoring. Providing a get_sqlite_io_manager() function is simple and deals with this dirtiness reasonably well.
Refactored.
Looks like failing validation when there are unknown columns is not the right thing to do, as demonstrated by the failing
@rousik whenever you introduce additional frameworks or systems that we're not using (like
io_manager: PudlSQLiteIOManager = pudl_io_manager(context)._sqlite_io_manager
# TODO(rousik): This reaches into the io_manager and assumes
# PudlSQLiteIOManager and the corresponding foreign key
# functionality. It's strange to be constructing dagster
# resources to achieve this, but alas, that's where the logic
# lies.
Ideally we'd have a storage-agnostic method for referential integrity checks? Maybe we can pull the logic out of the IO managers?
We have the foreign key check on the schemas/Packages (do the foreign keys make sense at all), and right now we have the validation on the actual data, which is very much format specific (it relies on sqlite pragmas). Building something format agnostic may well be out of scope for this PR, but nevertheless, extracting the functionality out of the io_manager does actually make sense and I will do so. That will simplify things.
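The sqlite pragma in question can be exercised standalone with the stdlib sqlite3 module, which is roughly the functionality that could be extracted. The miniature schema below is made up for illustration:

```python
import sqlite3

# An in-memory database with one deliberately dangling foreign key.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE utilities (id INTEGER PRIMARY KEY);
    CREATE TABLE plants (
        id INTEGER PRIMARY KEY,
        utility_id INTEGER REFERENCES utilities (id)
    );
    -- No utility with id 99 exists, so this row violates the FK:
    INSERT INTO plants VALUES (1, 99);
    """
)

# Each row of the pragma output is (table, rowid, referred_table, fk_index).
violations = conn.execute("PRAGMA foreign_key_check").fetchall()
print(violations)  # [('plants', 1, 'utilities', 0)]
```

Note that foreign_key_check works even when foreign key enforcement (PRAGMA foreign_keys) is off, which is sqlite's default.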
settings: controls how parquet files are used by the PUDL ETL, in particular,
whether they should be used for input/output of dataframes.
Is this still relevant? I don't see any references to settings.
fake_pudl_sqlite_io_manager_fixture.load_input(input_context)
# TODO(rousik): is there difference between fake_sqlite_io_manager
# and real sqlite_io_manager in terms of functionality?!!

class PudlSQLiteIOManagerTest(unittest.TestCase):
Why are you using unittest instead of pytest here?
See other threads on the PR. I thought that unittest offers generally cleaner tests with less verbosity. However, if this breaks the convention (most of the code does use pytest, even though there are some unittest-based tests too), I can revert these changes.
I do like the ability to put all of the assumptions/expectations about a single class-under-test into a parallel unittest test-class, as this provides nice encapsulation and visual separation compared to the flat, undifferentiated pytest-style approach, but if we think consistency is more valuable (pytest vs unittest) I can revert these last changes.
This is a fair point. I agree that adding new technologies to the mix has non-trivial barriers and we shouldn't do this haphazardly, which is, perhaps, how it looks from the outside. Let me try to provide a bit more detailed reasoning.
It is, in particular, point (3) where I think that departing from the convention is warranted. I can roll these changes back if this is undesirable. I think we can have a separate conversation about that later.
Hey @rousik, we decided we want to try to get the parquet support completed by the end of the week, so I'm going to branch off of here and see if I can get a parallel PR over the finish line. Thanks for getting this most of the way there, though!
Thanks, that is awesome. I'm off the grid most of this week, so it's a great decision.
Parquet outputs have been merrrrged in #3296 |
This is experimental support for emitting data in parquet format. It is currently controlled via the env variables PUDL_WRITE_TO_PARQUET and PUDL_READ_FROM_PARQUET, and both these modes are currently disabled. Note that for now, we're always going to be emitting the data to sqlite as well, regardless of whether parquet files are used or not.