
Add ability to write data to both Parquet and SQLite during ETL #3232

Closed · wants to merge 20 commits

Conversation

@rousik rousik commented Jan 10, 2024

This adds experimental support for emitting data in parquet format. It is currently controlled via the env variables PUDL_WRITE_TO_PARQUET and PUDL_READ_FROM_PARQUET, and both of these modes are currently disabled.

Note that for now, we're always going to be emitting the data to sqlite as well, regardless of whether parquet files are used or not.

rousik and others added 4 commits January 6, 2024 17:12
Use pyarrow.parquet for writing/reading, add BaseSettings to
configure input/output behaviors. By default, let both modes
be disabled but allow overrides via PUDL_WRITE_TO_PARQUET and
PUDL_READ_FROM_PARQUET env variables.
rousik commented Jan 10, 2024

@zaneselvans once we actually support dual output formats (sqlite+parquet), it might be a good time to do a little cleanup in the io manager codebase. This has clearly been designed with sqlite only in mind, and the class structure seems a little clunky here, especially when it comes to the overlap/differences between the pudl/ferc io managers (the latter being only partial and fixed to sqlite formats) and pudl/epacems (which might now share more, due to both writing to parquet). I will think about how we could improve the situation, but this could get us going in the meantime.

rousik and others added 2 commits January 10, 2024 18:21
Conversion to pyarrow table was necessary before writing to parquet.
src/pudl/ferc_to_sqlite/cli.py (resolved)
src/pudl/io_managers.py (outdated, resolved)
@@ -66,6 +66,10 @@ def sqlite_db_uri(self, name: str) -> str:
# sqlite://{credentials}/{db_path}
return f"sqlite:///{self.sqlite_db_path(name)}"

def parquet_path(self, db_name: str, table_name: str) -> Path:
"""Return path to parquet file for given database and table."""
return self.output_dir / "parquet" / f"{table_name}.parquet"
rousik (Collaborator Author)

I thought this could be a natural place to abstract the naming away from specific sites like io managers, and it might allow for some degree of flexibility if needed. For now, db_name may not be necessary here.


codecov bot commented Jan 11, 2024

Codecov Report

Attention: 25 lines in your changes are missing coverage. Please review.

Comparison is base (f4f2470) 92.6% compared to head (4ab01d5) 92.5%.
Report is 6 commits behind head on main.

Files Patch % Lines
src/pudl/io_managers.py 64.6% 17 Missing ⚠️
...rc/pudl/analysis/record_linkage/embed_dataframe.py 92.1% 5 Missing ⚠️
src/pudl/etl/check_foreign_keys.py 50.0% 1 Missing ⚠️
src/pudl/etl/cli.py 0.0% 1 Missing ⚠️
src/pudl/workspace/setup.py 50.0% 1 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##            main   #3232     +/-   ##
=======================================
- Coverage   92.6%   92.5%   -0.1%     
=======================================
  Files        143     143             
  Lines      12979   13090    +111     
=======================================
+ Hits       12025   12114     +89     
- Misses       954     976     +22     


@zaneselvans zaneselvans (Member) left a comment

I'm fuzzy on where the configuration should go, but I'm pretty sure that Dagster already has a place where it expects to receive this kind of resource configuration, including via environment variables. But maybe @bendnorman knows off the top of his head.

It feels like it would be cleaner to keep the SQLite IO Manager dedicated to SQLite, and have another IO Manager that is dedicated to the Parquet logic, and then compose them together into a hybrid that can do both / either depending on how it's configured. Does that seem reasonable?
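A minimal sketch of the composition being described here. The class and attribute names are hypothetical, and the real version would subclass dagster's IOManager and receive dagster OutputContext/InputContext objects; the plain class below just shows the delegation shape.

```python
class PudlMixedFormatIOManager:
    """Compose a SQLite and a Parquet IO manager (hypothetical sketch).

    In real code this would subclass dagster.IOManager; the context
    arguments are whatever dagster passes to handle_output()/load_input().
    """

    def __init__(self, sqlite_mgr, parquet_mgr,
                 write_to_parquet=False, read_from_parquet=False):
        self._sqlite = sqlite_mgr
        self._parquet = parquet_mgr
        self.write_to_parquet = write_to_parquet
        self.read_from_parquet = read_from_parquet

    def handle_output(self, context, df):
        # SQLite is always written; parquet is an optional extra output.
        self._sqlite.handle_output(context, df)
        if self.write_to_parquet:
            self._parquet.handle_output(context, df)

    def load_input(self, context):
        # Reads come from exactly one backend, chosen by configuration.
        if self.read_from_parquet:
            return self._parquet.load_input(context)
        return self._sqlite.load_input(context)
```

Each backend stays single-format and independently testable, and the composite holds no format-specific logic of its own.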

src/pudl/ferc_to_sqlite/cli.py (resolved)
Comment on lines 408 to 411
# TODO(rousik): now that this experimentally supports also writing to parquet
# as an alternative storage format, we should probably rename this class
# to be less sqlite-centric.

Member

My understanding is that in Dagster the IO Managers are primarily designed to allow switching the whole system between say, writing the outputs to SQLite for testing vs. writing them to BigQuery in production, and that this dual-output use case is a little weird.

Could we stick with a pure PudlSQLiteIOManager, and a pure PudlParquetIOManager, and then create a dual PudlSQLiteAndParquetIOManager that is a composition of the two simple ones? It seems wrong to be mixing in this entire other kind of output in the most generic SQLiteIOManager that we've defined.

In pudl.etl we define the default resources which are used to construct the jobs.

default_resources = {
    "datastore": datastore,
    "pudl_sqlite_io_manager": pudl_sqlite_io_manager,
    "ferc1_dbf_sqlite_io_manager": ferc1_dbf_sqlite_io_manager,
    "ferc1_xbrl_sqlite_io_manager": ferc1_xbrl_sqlite_io_manager,
    "dataset_settings": dataset_settings,
    "ferc_to_sqlite_settings": ferc_to_sqlite_settings,
    "epacems_io_manager": epacems_io_manager,
}

And I think we can just swap in whichever IOManager we want to use here by changing pudl_sqlite_io_manager, and have it affect the whole ETL. Right now these jobs and the resources associated with them are hard-coded, but @bendnorman recently suggested that we migrate away from using pudl_etl as a wrapper and switch to using the Dagster CLI directly. Would that make it easier to switch between an SQLite-only and an SQLite + Parquet IOManager just using the Dagster resource configurations?

rousik (Collaborator Author)

I dug deeper into how we deal with io_managers and I'm a bit perplexed by the apparent complexity here. There are several partial implementations, some of which support reads and some only writes, and there's more duplicated code than there ideally should be, but I think this is a problem for a little later and shouldn't necessarily be solved as part of this.

After some more tinkering, I think that the idea of a composite IO manager that simply hands off reads/writes to either sqlite or parquet should work well, and should be relatively clean too.

The only remaining snag is that dagster's support for loading configuration from env variables seems kind of... incomplete... in that there doesn't seem to be support for booleans, or for supplying a default value for when the env variable is not set.

I was expecting something along the lines of: 1. use the built-in default, 2. replace it with the env variable if set, and 3. replace the env default with a custom value in case one is supplied via the dagster ui/job configuration. But either I'm misreading or this is not actually supported?!
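The three-level precedence described here is easy to express outside of dagster. A sketch, where `resolve_flag` and the None-means-unset convention for the UI value are assumptions for illustration, not dagster API:

```python
import os


def resolve_flag(env_var, ui_value=None, default=False):
    """Resolve a boolean feature flag with the precedence described above:
    built-in default < env variable < explicit UI/job configuration.
    """
    if ui_value is not None:
        # 3. explicit job/UI configuration wins
        return ui_value
    raw = os.environ.get(env_var)
    if raw is not None:
        # 2. env variable, parsed as a boolean rather than string truthiness
        return raw.strip().lower() in ("1", "true", "yes", "on")
    # 1. built-in default
    return default
```

A resource factory could call this once at construction time, so the rest of the IO manager never sees where the value came from.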

Member

Among us I think @bendnorman knows the most about how Dagster resource configuration is supposed to work. I only have a vague understanding, and am just hoping that we can avoid a proliferation of configuration systems, and ideally use whatever the canonical system is for the framework we've adopted.

Member

Do we have to use an env var to configure the PudlMixedFormatIOManager? I think just using vanilla boolean attributes would be fine:

write_to_parquet: bool = False
"""If true, data will be written to parquet files."""

read_from_parquet: bool = False
"""If true, data will be read from parquet files instead of sqlite."""

We could create an etl_full_gcp_build that configures the PudlMixedFormatIOManager to write and read from parquet.

You can also set a default value for an EnvVar using EnvVar.get_value(). We could do something like this:

write_to_parquet: bool = bool(EnvVar("PUDL_WRITE_TO_PARQUET").get_value(default=True))

Not the most elegant thing ever, but I tested it out and it allows you to have a default, read from an env var, and overwrite the value in the UI.

There is an IntEnvVar that casts env var values to ints. Maybe we could make one for booleans?
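A boolean-casting wrapper would dodge a real footgun here: env vars are strings, and `bool()` on any non-empty string is True, so `bool("False")` is True. A hypothetical BoolEnvVar-style parser (names and accepted spellings are assumptions):

```python
TRUTHY = {"1", "true", "yes", "on"}
FALSY = {"0", "false", "no", "off", ""}


def parse_bool(raw):
    """Parse an env-var string as a boolean, rejecting ambiguous values."""
    val = raw.strip().lower()
    if val in TRUTHY:
        return True
    if val in FALSY:
        return False
    raise ValueError(f"Cannot interpret {raw!r} as a boolean")


# The pitfall a plain bool() cast has:
assert bool("False") is True      # non-empty string is truthy
assert parse_bool("False") is False
```

Raising on unrecognized values catches typos like `PUDL_WRITE_TO_PARQUET=ture` instead of silently treating them as a boolean.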

Member

I think my preference is for PudlMixedFormatIOManager to be a subclass of IOManager and to configure it using the @io_manager(config_schema=...) decorator. These are the legacy concepts but I don't think it's great to have our IO managers use a mix of the legacy and current resource concepts.

Eventually we should move to using the new pydantic configurable resources. We explored migrating our setting configurations in #2842 but decided not to because it was a bigger change than we expected. We could create separate issues for migrating the resources in pudl.resources and pudl.io_managers.

Member

Good to know ConfigurableIOManager works with the @io_manager decorator; however, I think it might be duplicative given that @io_managers are also used to make the IO Manager configurable.

To be consistent with our other IO managers, PudlMixedFormatIOManager would look something like this:

class PudlMixedFormatIOManager(IOManager):
    def __init__(self, write_to_parquet: bool, read_from_parquet: bool):
        self.write_to_parquet: bool = write_to_parquet
        self.read_from_parquet: bool = read_from_parquet


@io_manager(
    config_schema={
        "write_to_parquet": Field(
            bool,
            default_value=False,
        ),
        "read_from_parquet": Field(
            bool,
            default_value=False,
        ),
    },
)
def pudl_io_manager(init_context) -> IOManager:
    """Create a mixed-format IO manager dagster resource for the pudl database."""
    write_to_parquet = init_context.resource_config["write_to_parquet"]
    read_from_parquet = init_context.resource_config["read_from_parquet"]
    return PudlMixedFormatIOManager(write_to_parquet, read_from_parquet)

Member

Eventually everything should be ConfigurableResources that are instantiated without an @io_manager decorator like in this example. I think it will be easier to switch all of the IO managers together instead of partially in this PR and partially in another.

rousik (Collaborator Author)

This is a fair point; the duplication of the config in both the ConfigurableIOManager and the config_schema here would be tedious.

rousik (Collaborator Author)

@bendnorman, fair point re/ use of config_schema here. I will make the necessary changes.

rousik (Collaborator Author)

Though it's tempting to keep using ConfigurableIOManager, as that takes care of building the constructor for me, which would be identical to what I'm about to write :-)

src/pudl/io_managers.py (several resolved threads)
@zaneselvans zaneselvans added output Exporting data from PUDL into other platforms or interchange formats. parquet Issues related to the Apache Parquet file format which we use for long tables. dagster Issues related to our use of the Dagster orchestrator labels Jan 11, 2024
@zaneselvans zaneselvans linked an issue Jan 11, 2024 that may be closed by this pull request

rousik commented Jan 15, 2024

Moving the conversation ahead a notch. The modified code that addresses some of the issues outlined here is still on my local machine and needs to be cleaned up before publishing; this is pending an open issue with dagster EnvVar.


rousik commented Jan 16, 2024

I've refactored the IO managers so that there's a sqlite one, a parquet one, and a combined one. Reading from env variables is kind of crummy, because dagster has much worse support for this than pydantic BaseSettings, but I don't think this is a major problem. Parquet writes/reads will be enabled whenever the associated env variables are non-empty (i.e. the truth value is based on whether the string is set, not on interpreting it as a boolean, which could lead to some confusion).

I feel like there's still some work to be done to clean things up, and ideally we would also use the same parquet io manager for epacems, but the presence of sharding will require some custom logic.

The pyarrow parquet writer does support sharding, but the naming scheme is usually base_dir/key=value/{uuid}.parquet, i.e. in our case it would be epacems/state=CA/..., which may not be what we want.
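To illustrate the naming concern without pulling in pyarrow, this is the shape of the Hive-style partitioned layout being described, as pure path arithmetic (function name and arguments are illustrative only; no I/O happens):

```python
import uuid
from pathlib import Path


def hive_partition_path(base_dir, table, key, value):
    """Shape of the path partition-aware parquet writers produce:
    base_dir/table/key=value/<random>.parquet (naming only, no I/O)."""
    fname = f"{uuid.uuid4().hex}.parquet"
    return Path(base_dir) / table / f"{key}={value}" / fname
```

So sharding EPA CEMS by state yields paths like `epacems/state=CA/<uuid>.parquet`, rather than a flat `epacems_ca.parquet` style name.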

@zaneselvans zaneselvans (Member)

@rousik It looks like it's not actually able to write to the DB because it has no connection engine?


rousik commented Jan 16, 2024

Oh, that would be tests reaching into the IO manager and assuming its internal structure. I'll fix it later today when I'm at the keyboard.

ETL itself should be working.

@zaneselvans zaneselvans changed the title Add basic parquet export capability to PudlSqliteIOManager Add ability to write data to both Parquet and SQLite during ETL Jan 16, 2024
@@ -536,9 +638,19 @@ def load_input(self, context: InputContext) -> pd.DataFrame:


@io_manager
def pudl_sqlite_io_manager(init_context) -> PudlSQLiteIOManager:
def pudl_sqlite_io_manager(init_context) -> IOManager:
rousik (Collaborator Author)

@bendnorman and @zaneselvans - given that this now returns a mixed format IO manager, could we rename this resource/io-manager from pudl_sqlite_io_manager to plain pudl_io_manager? I think that would be more appropriate. There's a bit of a risk that the very widely used word pudl refers both to the primary output database (pudl.sqlite up to this point) and to the overall project/ETL/codebase. This ambiguity may be okay, because I'm not quite sure what other word we could substitute here. Maybe main_output or something along those lines?

@bendnorman bendnorman (Member) commented Jan 17, 2024

I think pudl_io_manager is appropriate for this IO Manager now that it loads data to sqlite and parquet files. We're currently ok with the naming ambiguity of the code and data outputs. We explain the different parts of pudl in the readme.

Member

I think pudl_io_manager makes sense.

This is probably beyond the scope of this PR and maybe more of a @bendnorman thing but I think we should also change the way the default_resources are defined in pudl.etl.__init__.py:

default_resources = {
    "datastore": datastore,
    "pudl_sqlite_io_manager": pudl_sqlite_io_manager,
    "ferc1_dbf_sqlite_io_manager": ferc1_dbf_sqlite_io_manager,
    "ferc1_xbrl_sqlite_io_manager": ferc1_xbrl_sqlite_io_manager,
    "dataset_settings": dataset_settings,
    "ferc_to_sqlite_settings": ferc_to_sqlite_settings,
    "epacems_io_manager": epacems_io_manager,
}

It seems like the keys in this dictionary should not specify anything about the format / destination of the data, while the values (the actual IO managers) should have that specificity. So pudl_io_manager could be any kind of IO manager... but regardless it's the one that's going to be used to output PUDL data, whether it's going to SQLite, BigQuery, both SQLite + Parquet, etc. And the value associated with it determines the details.

I found this very confusing in thinking about how we were going to switch over to writing the data to not just SQLite, since we've hard-coded pudl_sqlite_io_manager all throughout the system. But it's not really an SQLite IO Manager. It could be any kind of IO Manager. But I didn't realize they were separable until reviewing this PR.

The FERC IO managers seem like they have a similar issue, and also, what IO Managers are used to output all of the FERC Forms that aren't Form 1?

Member

I agree with Zane, the keys shouldn't specify anything about format / destination.

The FERC IO managers are just for reading data out of the dbf and xbrl sqlite databases. We don't process any of the other ferc forms in the main PUDL ETL so we don't have IO managers for them.

Member

Also don't we need to tell it to use the new IO Manager here? Otherwise it won't actually generate any Parquet files will it?

rousik (Collaborator Author)

We are instantiating PudlMixedFormatIOManager here which uses either of the formats depending on the configuration.

@zaneselvans zaneselvans (Member) left a comment

Just one minor issue with what we're logging for the read/write formats.

src/pudl/io_managers.py Outdated Show resolved Hide resolved

Comment on lines 117 to 121
write_to_parquet: bool = bool(EnvVar("PUDL_WRITE_TO_PARQUET").get_value(False))
"""If true, data will be written to parquet files."""

read_from_parquet: bool = bool(EnvVar("PUDL_READ_FROM_PARQUET").get_value(False))
"""If true, data will be read from parquet files instead of sqlite."""
Member

What are the benefits of using environment variables to configure this IO manager? We currently aren't using them to configure our other resources.

rousik (Collaborator Author)

In short, practical concerns.

I'm treating this as a "feature flag", where we start by having the new functionality (writing to parquet files) off by default, but toggle-able on demand. While toggling this in the dagster UI is useful for local development, in many cases where we might want to test this (e.g. in CI, nightly builds, and other dockerized/remote scenarios) it's very easy to set an env variable and pass it to the ETL, and it's much more difficult to tweak dagster configuration from the outside, IMO.

This could be hooked into a command-line flag, but that is also somewhat clunky and would require ad-hoc wiring to be piped through, while env variables are easy and widely supported (both by github actions and when running the docker image on a vm/batch), so it should be fairly easy to toggle this new functionality just where we need it.

rousik (Collaborator Author)

If all goes well and this feature ends up being enabled everywhere and always, we can drop this toggle and hardcode the new behavior (always on). I'm unsure what the timeline for this would be, given the early stage this is in.

Member

If reading from Parquet ends up significantly speeding up the ETL it's easy to imagine this getting turned on immediately.

Once it's there, I'd experiment with using it to speed up the integration tests and data validations (which would mean relying on SQLite primarily for bulk distribution, validation of referential integrity, and maybe other schema checks / constraints that are only implemented by the to_sqlite() schema). Since we want to remove PudlTabl (probably in the Spring) and that will mean reorganizing the tests, that would probably be a good time to play around with it. I don't know exactly what fraction of the time in the integration tests and data validations is spent reading data, but I suspect it's significant, given the way we've wired up PudlTabl to pass the read requests through to the database and not cache anything in memory.

I can't remember where we were discussing it but I think @bendnorman suggested that since the PUDL ETL script is now a relatively thin wrapper around some slightly nonstandard Dagster structures, we might want to switch to using the Dagster CLI directly, which I think would make these job configurations more straightforward with the Dagster configs.



zaneselvans commented Jan 17, 2024

Also, don't we need to update the resource configuration in pudl/etl/__init__.py? Otherwise it won't actually use the new IO manager will it? And it seems like it should at least in the integration tests, or we won't know whether it's working.

Like if I check out this branch and run the ETL, it won't actually make any Parquet files will it?


rousik commented Jan 18, 2024

Okay. I have made some more changes, specifically renamed pudl_sqlite_io_manager to a generic pudl_io_manager.

I have left fake_pudl_sqlite_io_manager in place, as that is essentially the unit tests for PudlSQLiteIOManager itself. Personally, I think that switching to unittest.TestCase here would make things a little cleaner instead of passing around these fixtures, e.g. (based on https://docs.python.org/3/library/unittest.html#basic-example):

class PudlSQLiteIOManagerTests(unittest.TestCase):

  def setUp(self):
    # spin up temp database (in TemporaryDirectory perhaps)
    self.io_manager = PudlSQLiteIOManager(...)

  def test_delete_statement(self):
    self.io_manager.handle_output(...)
    ...

We're already using unittest style tests for excel, so this would not be the first case :)

@@ -33,7 +33,7 @@ def test_pudl_engine(

if check_foreign_keys:
# Raises ForeignKeyErrors if there are any
pudl_sql_io_manager.check_foreign_keys()
pudl_sql_io_manager._sqlite_io_manager.check_foreign_keys()
rousik (Collaborator Author)

This is dirty, but I'm not sure what the best way around it would be. Maybe implement a MixedFormatIOManager.get_sqlite_io_manager() -> PudlSqliteIOManager method that we could call when we really need to access sqlite-only features?

Member

I think implementing a getter that throws a helpful error if the MixedFormatIOManager doesn't have a sqlite_io_manager would be a good approach.

It might not be worth the complexity but we could make the MixedFormatIOManager more generalizable and accept an arbitrary number of IO managers to use. Then we could have some methods for viewing and grabbing specific io managers:

>>> MixedFormatIOManager.list_io_manager_names()
["sqlite_io_manager", "parquet_io_manager", "bq_io_manager", ...]
>>> MixedFormatIOManager.get_io_manager("sqlite_io_manager")

I think for now a simple MixedFormatIOManager.get_sqlite_io_manager() -> PudlSqliteIOManager method seems fine.
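That simple getter could look like this (a sketch; the private attribute names are assumed from this PR's discussion, and the message text is illustrative):

```python
class MixedFormatIOManager:
    """Illustrative subset: just the sqlite accessor under discussion."""

    def __init__(self, sqlite_io_manager=None, parquet_io_manager=None):
        self._sqlite_io_manager = sqlite_io_manager
        self._parquet_io_manager = parquet_io_manager

    def get_sqlite_io_manager(self):
        """Return the wrapped sqlite IO manager, failing loudly when absent.

        Callers needing sqlite-only features (e.g. foreign key checks)
        use this instead of reaching into private attributes.
        """
        if self._sqlite_io_manager is None:
            raise RuntimeError(
                "This MixedFormatIOManager was built without a sqlite "
                "backend, so sqlite-only features such as "
                "check_foreign_keys() are unavailable."
            )
        return self._sqlite_io_manager
```

Tests and the foreign-key check script would then call `get_sqlite_io_manager().check_foreign_keys()` rather than touching `_sqlite_io_manager` directly.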

rousik (Collaborator Author)

I think the generalization here is probably not warranted. I would only add that if/when we know we will need it; otherwise it may just be future-proofing for something that will never happen, and I agree that the complexity is not worth it.

What this does show, however, is that the sqlite and pudl-sqlite io managers are special in that their APIs and functionality are much broader than those of a typical, format-agnostic io manager, and this API is a bit vague.

It might be valuable to separate these concerns somewhat, perhaps by defining the interface that such a sqlite thingy should provide, and having a method on the mixed io manager that can return implementations of this clearly defined, separate interface. It could return the io_manager that implements this interface, but in most cases I see, we don't actually need the io_manager itself, just the specific sqlite validation/schema-management functionality.

It might even make sense to separate this sqlite/schema functionality into a standalone class that is embedded in the IO manager, but with which the IO manager communicates through this well defined API.

Note that the above may very well go past the scope of this PR and should likely be reserved for subsequent cleanup/refactoring. Providing a get_sqlite_io_manager() function is simple and deals with this dirtiness reasonably well.


rousik commented Jan 18, 2024

Refactored the PudlSQLiteIOManager tests to use the unittest framework, which IMO leads to much cleaner and leaner tests. Case in point: running these tests resulted in discovering two real bugs in the underlying schema validation code :)


rousik commented Jan 18, 2024

Looks like failing validation when there are unknown columns is not the right thing to do, as demonstrated by the failing ci-integration. Perhaps the assumption in the test_extra_column_error test is itself wrong? Halp!

@zaneselvans zaneselvans (Member)

@rousik whenever you introduce additional frameworks or systems that we're not using (like poetry or unittest) it creates a barrier to our understanding, maintaining, and extending the work, or it means we need to dedicate additional internal resources to educating everyone on how these other systems work, or to replacing what you've done with the systems that we've already agreed to use internally.

Comment on lines +54 to +59
io_manager: PudlSQLiteIOManager = pudl_io_manager(context)._sqlite_io_manager
# TODO(rousik): This reaches into the io_manager and assumes
# PudlSQLiteIOManager and the corresponding foreign key
# functionality. It's strange to be constructing dagster
# resources to achieve this, but alas, that's where the logic
# lies.
Member

Ideally we'd have a storage agnostic method for referential integrity checks? Maybe we can pull the logic out of the IO managers?

rousik (Collaborator Author)

We have the foreign key check on the schemas/Packages (do the foreign keys make sense at all), and right now we have the validation on the actual data, which is very much format-specific (it relies on sqlite pragmas). Building something format-agnostic may well be out of scope for this PR, but nevertheless, extracting the functionality out of the io_manager does make sense and I will do so. That will simplify things.
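Since SQLite exposes this check as a built-in pragma, the extracted version can be a free function that needs only a database path, no IO manager instance at all (a sketch; the function name and return shape are illustrative):

```python
import sqlite3


def check_foreign_keys(db_path):
    """Return foreign key violations found by SQLite's built-in check.

    Each row is (table, rowid, referenced_table, fk_index); an empty
    list means referential integrity holds for the whole database.
    """
    conn = sqlite3.connect(db_path)
    try:
        return conn.execute("PRAGMA foreign_key_check").fetchall()
    finally:
        conn.close()
```

Note this works even on databases written with foreign key enforcement off (sqlite3's default), which is exactly the after-the-fact validation scenario.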

Comment on lines +515 to +516
settings: controls how parquet files are used by the PUDL ETL, in particular,
whether they should be used for input/output of dataframes.
Member

Is this still relevant? I don't see any references to settings.

src/pudl/io_managers.py (resolved)
@@ -33,7 +33,7 @@ def test_pudl_engine(

if check_foreign_keys:
# Raises ForeignKeyErrors if there are any
pudl_sql_io_manager.check_foreign_keys()
pudl_sql_io_manager._sqlite_io_manager.check_foreign_keys()
Member

I think implementing a getter that throws a helpful error if the MixedFormatIOManager doesn't have a sqlite_io_manager would be worthwhile.

It might not be worth the complexity but we could make the MixedFormatIOManager more generalizable and accept an arbitrary number of IO managers to use. Then we could have some methods for viewing and grabbing specific io managers:

>>> MixedFormatIOManager.list_io_manager_names()
["sqlite_io_manager", "parquet_io_manager", "bq_io_manager", ...]
>>> MixedFormatIOManager.get_io_manager("sqlite_io_manager")

I think for now a simple MixedFormatIOManager.get_sqlite_io_manager() -> PudlSqliteIOManager method seems fine.
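A sketch of that registry idea (hypothetical; the MixedFormatIOManager in this PR hard-codes the sqlite and parquet managers rather than accepting an arbitrary set):

```python
class MixedFormatIOManager:
    """Fans I/O out to several format-specific IO managers.

    Managers are stored by name so callers can look up a specific one,
    e.g. to run SQLite-only foreign key checks.
    """

    def __init__(self, **io_managers):
        self._io_managers = dict(io_managers)

    def list_io_manager_names(self) -> list[str]:
        return sorted(self._io_managers)

    def get_io_manager(self, name: str):
        try:
            return self._io_managers[name]
        except KeyError:
            raise KeyError(
                f"No IO manager named {name!r}; "
                f"available: {self.list_io_manager_names()}"
            ) from None
```

The failure mode is the important part: a lookup for a manager that was never registered fails loudly with the list of names that do exist, instead of an opaque AttributeError.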

fake_pudl_sqlite_io_manager_fixture.load_input(input_context)
# TODO(rousik): is there difference between fake_sqlite_io_manager
# and real sqlite_io_manager in terms of functionality?!!
class PudlSQLiteIOManagerTest(unittest.TestCase):
Member

Why are you using unittest instead of pytest here?

Collaborator Author

See other threads on the PR. I thought that unittest offers generally cleaner tests with less verbosity. However, if this breaks convention (most of the code uses pytest, though there are some unittest-based tests too), I can revert these changes.

Collaborator Author

I do like being able to put all of the assumptions and expectations about a single class-under-test into a parallel unittest test class, since it provides nice encapsulation and visual separation compared to the flat, undifferentiated pytest style. But if we think consistency (pytest vs. unittest) is more valuable, I can revert these last changes.

@rousik
Collaborator Author

rousik commented Jan 18, 2024

@rousik whenever you introduce additional frameworks or systems that we're not using (like poetry or unittest) it creates a barrier to our understanding, maintaining, and extending the work, or it means we need to dedicate additional internal resources to educating everyone on how these other systems work, or to replacing what you've done with the systems that we've already agreed to use internally.

This is a fair point. I agree that adding new technologies to the mix has non-trivial costs, and we shouldn't do it haphazardly, which is perhaps how it looks from the outside. Let me provide more detailed reasoning for unittest:

  1. unittest style is already used in excel, so we're not strictly introducing it from scratch. It was me who added it back then, so it's still on me though :)
  2. unittest plays well with pytest, in that pytest can run unittest-based tests just fine. So it boils down to a different "style" of writing tests, while the overall usage remains unchanged and unaffected.
  3. It feels to me that unittest offers more encapsulated and cleaner ways to build tests. This is particularly nice when we are testing specific classes and their expectations (i.e., rather than having a flat collection of tests, we can group them into test classes geared toward specific functionality) with explicit setup/teardown logic, rather than relying on a nebulous cloud of fixtures.

It is point (3) in particular where I think departing from pytest might result in cleaner, leaner-looking tests, and that might well be worth it. We're also using fairly limited and intuitive features of unittest, so it shouldn't be too hard to maintain those tests; but perhaps I'm underestimating the burden, and overestimating the lower maintenance of these "nicer" tests.

I can roll these changes back if this is undesirable.

I think we can have a separate conversation re: poetry. There, the use is definitely more experimental, but also non-binding, in that it should be possible to switch back to another dependency management system.
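As a concrete (and deliberately generic) illustration of the class-based grouping being argued for here; this is not PUDL's actual test code, and pytest will collect and run a TestCase like this without any extra configuration:

```python
import unittest


class MixedFormatIOManagerTest(unittest.TestCase):
    """All expectations about one class-under-test, grouped in one place."""

    def setUp(self):
        # Explicit per-test setup instead of an injected pytest fixture.
        self.managers = {"sqlite": object(), "parquet": object()}

    def test_has_both_formats(self):
        self.assertEqual(sorted(self.managers), ["parquet", "sqlite"])

    def test_lookup_missing_format(self):
        with self.assertRaises(KeyError):
            self.managers["bigquery"]
```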

@zschira
Member

zschira commented Jan 24, 2024

Hey @rousik, we decided we want to try to get the parquet support completed by the end of the week, so I'm going to branch off of here and see if I can get a parallel PR through over the finish line. Thanks for getting this most of the way there, though!

@rousik
Collaborator Author

rousik commented Jan 24, 2024

Hey @rousik, we decided we want to try to get the parquet support completed by the end of the week, so I'm going to branch off of here and see if I can get a parallel PR through over the finish line. Thanks for getting this most of the way there, though!

Thanks, that is awesome. I'm off the grid most of this week so it's a great decision.

@zaneselvans
Member

Parquet outputs have been merrrrged in #3296

@zaneselvans zaneselvans closed this Feb 6, 2024
@zaneselvans zaneselvans deleted the export-to-parquet branch February 25, 2024 02:48
Labels
- community
- dagster: Issues related to our use of the Dagster orchestrator
- output: Exporting data from PUDL into other platforms or interchange formats.
- parquet: Issues related to the Apache Parquet file format which we use for long tables.