
Add ability to write data to both Parquet and SQLite during ETL #3232

Closed · wants to merge 20 commits

Conversation

@rousik rousik commented Jan 10, 2024

This adds experimental support for emitting data in parquet format. It is currently controlled via the env variables PUDL_WRITE_TO_PARQUET and PUDL_READ_FROM_PARQUET, and both of these modes are currently disabled.

Note that for now, we're always going to be emitting the data to sqlite as well, regardless of whether parquet files are used or not.

rousik and others added 4 commits January 6, 2024 17:12
Use pyarrow.parquet for writing/reading, add BaseSettings to
configure input/output behaviors. By default, let both modes
be disabled but allow overrides via PUDL_WRITE_TO_PARQUET and
PUDL_READ_FROM_PARQUET env variables.
rousik commented Jan 10, 2024

@zaneselvans once we actually support dual output formats (sqlite+parquet), it might be a good time to do a little cleanup in the io manager codebase. This has clearly been designed with sqlite only in mind, and the class structure seems a little clunky here, especially when it comes to the overlap/differences between the pudl/ferc io managers (the latter being only partial and fixed to sqlite formats) and pudl/epacems (which might now share more, due to both writing to parquet). I will think about how we could improve the situation, but this could get us going in the meantime.

rousik and others added 2 commits January 10, 2024 18:21
Conversion to pyarrow table was necessary before writing to parquet.
src/pudl/ferc_to_sqlite/cli.py (resolved)
src/pudl/io_managers.py (outdated, resolved)
@@ -66,6 +66,10 @@ def sqlite_db_uri(self, name: str) -> str:
# sqlite://{credentials}/{db_path}
return f"sqlite:///{self.sqlite_db_path(name)}"

def parquet_path(self, db_name: str, table_name: str) -> Path:
"""Return path to parquet file for given database and table."""
return self.output_dir / "parquet" / f"{table_name}.parquet"
rousik (Collaborator Author)

I thought this could be a natural place to abstract the naming away from specific sites like io managers, and it might allow for some degree of flexibility if needed. For now, db_name may not be necessary here.


codecov bot commented Jan 11, 2024

Codecov Report

Attention: 25 lines in your changes are missing coverage. Please review.

Comparison is base (f4f2470) 92.6% compared to head (4ab01d5) 92.5%.
Report is 6 commits behind head on main.

Files Patch % Lines
src/pudl/io_managers.py 64.6% 17 Missing ⚠️
...rc/pudl/analysis/record_linkage/embed_dataframe.py 92.1% 5 Missing ⚠️
src/pudl/etl/check_foreign_keys.py 50.0% 1 Missing ⚠️
src/pudl/etl/cli.py 0.0% 1 Missing ⚠️
src/pudl/workspace/setup.py 50.0% 1 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##            main   #3232     +/-   ##
=======================================
- Coverage   92.6%   92.5%   -0.1%     
=======================================
  Files        143     143             
  Lines      12979   13090    +111     
=======================================
+ Hits       12025   12114     +89     
- Misses       954     976     +22     


@zaneselvans zaneselvans (Member) left a comment

I'm fuzzy on where the configuration should go, but I'm pretty sure that Dagster already has a place where it expects to receive this kind of resource configuration, including via environment variables. But maybe @bendnorman knows off the top of his head.

It feels like it would be cleaner to keep the SQLite IO Manager dedicated to SQLite, and have another IO Manager that is dedicated to the Parquet logic, and then compose them together into a hybrid that can do both / either depending on how it's configured. Does that seem reasonable?
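A minimal sketch of the composition being described here. The class and attribute names are hypothetical, and the real version would subclass dagster's IOManager and receive dagster OutputContext/InputContext objects; the plain class below just shows the delegation shape.

```python
class PudlMixedFormatIOManager:
    """Compose a SQLite and a Parquet IO manager (hypothetical sketch).

    In real code this would subclass dagster.IOManager; the context
    arguments are whatever dagster passes to handle_output()/load_input().
    """

    def __init__(self, sqlite_mgr, parquet_mgr,
                 write_to_parquet=False, read_from_parquet=False):
        self._sqlite = sqlite_mgr
        self._parquet = parquet_mgr
        self.write_to_parquet = write_to_parquet
        self.read_from_parquet = read_from_parquet

    def handle_output(self, context, df):
        # SQLite is always written; parquet is an optional extra output.
        self._sqlite.handle_output(context, df)
        if self.write_to_parquet:
            self._parquet.handle_output(context, df)

    def load_input(self, context):
        # Reads come from exactly one backend, chosen by configuration.
        if self.read_from_parquet:
            return self._parquet.load_input(context)
        return self._sqlite.load_input(context)
```

Each backend stays single-format and independently testable, and the composite holds no format-specific logic of its own.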

src/pudl/ferc_to_sqlite/cli.py (resolved)
Comment on lines 408 to 411
# TODO(rousik): now that this experimentally supports also writing to parquet
# as an alternative storage format, we should probably rename this class
# to be less sqlite-centric.

Member

My understanding is that in Dagster the IO Managers are primarily designed to allow switching the whole system between say, writing the outputs to SQLite for testing vs. writing them to BigQuery in production, and that this dual-output use case is a little weird.

Could we stick with a pure PudlSQLiteIOManager, and a pure PudlParquetIOManager, and then create a dual PudlSQLiteAndParquetIOManager that is a composition of the two simple ones? It seems wrong to be mixing in this entire other kind of output in the most generic SQLiteIOManager that we've defined.

In pudl.etl we define the default resources which are used to construct the jobs.

default_resources = {
    "datastore": datastore,
    "pudl_sqlite_io_manager": pudl_sqlite_io_manager,
    "ferc1_dbf_sqlite_io_manager": ferc1_dbf_sqlite_io_manager,
    "ferc1_xbrl_sqlite_io_manager": ferc1_xbrl_sqlite_io_manager,
    "dataset_settings": dataset_settings,
    "ferc_to_sqlite_settings": ferc_to_sqlite_settings,
    "epacems_io_manager": epacems_io_manager,
}

And I think we can just swap in whichever IOManager we want to use here by changing pudl_sqlite_io_manager, and have it affect the whole ETL. Right now these jobs and the resources associated with them are hard-coded, but @bendnorman recently suggested that we migrate away from using pudl_etl as a wrapper and switch to using the Dagster CLI directly. Would that make it easier to switch between an SQLite-only and an SQLite + Parquet IOManager just using the Dagster resource configurations?

rousik (Collaborator Author)

I dug deeper into how we deal with io_managers and I'm a bit perplexed by the apparent complexity here. There are several partial implementations, some of which support reads and some only writes, and there's more duplicated code than there ideally should be, but I think this is a problem for a little later and shouldn't necessarily be solved as part of this.

After some more tinkering, I think that the idea of a composite IO manager that simply hands off reads/writes to either sqlite or parquet should work well, and should be relatively clean too.

The only remaining snag is that dagster's support for loading configuration from env variables seems kind of... incomplete... in that there doesn't seem to be support for booleans, or for supplying a default value for when the env variable is not set.

I was expecting something along the lines of: 1. use the built-in default, 2. replace it with the env variable if set, and 3. replace the env default with a custom value in case one is supplied via the dagster ui/job configuration. But either I'm misreading or this is not actually supported?!
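The three-level precedence described here is easy to express outside of dagster. A sketch, where `resolve_flag` and the None-means-unset convention for the UI value are assumptions for illustration, not dagster API:

```python
import os


def resolve_flag(env_var, ui_value=None, default=False):
    """Resolve a boolean feature flag with the precedence described above:
    built-in default < env variable < explicit UI/job configuration.
    """
    if ui_value is not None:
        # 3. explicit job/UI configuration wins
        return ui_value
    raw = os.environ.get(env_var)
    if raw is not None:
        # 2. env variable, parsed as a boolean rather than string truthiness
        return raw.strip().lower() in ("1", "true", "yes", "on")
    # 1. built-in default
    return default
```

A resource factory could call this once at construction time, so the rest of the IO manager never sees where the value came from.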

Member

Among us I think @bendnorman knows the most about how Dagster resource configuration is supposed to work. I only have a vague understanding, and am just hoping that we can avoid a proliferation of configuration systems, and ideally use whatever the canonical system is for the framework we've adopted.

Member

Do we have to use an env var to configure the PudlMixedFormatIOManager? I think just using vanilla boolean attributes would be fine:

write_to_parquet: bool = False
"""If true, data will be written to parquet files."""

read_from_parquet: bool = False
"""If true, data will be read from parquet files instead of sqlite."""

We could create an etl_full_gcp_build that configures the PudlMixedFormatIOManager to write and read from parquet.

You can also set a default value for an EnvVar using EnvVar.get_value(). We could do something like this:

write_to_parquet: bool = bool(EnvVar("PUDL_WRITE_TO_PARQUET").get_value(default=True))

Not the most elegant thing ever, but I tested it out and it allows you to have a default, read from an env var, and overwrite the value in the UI.

There is an IntEnvVar that casts env var values to ints. Maybe we could make one for booleans?
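A boolean-casting wrapper would dodge a real footgun here: env vars are strings, and `bool()` on any non-empty string is True, so `bool("False")` is True. A hypothetical BoolEnvVar-style parser (names and accepted spellings are assumptions):

```python
TRUTHY = {"1", "true", "yes", "on"}
FALSY = {"0", "false", "no", "off", ""}


def parse_bool(raw):
    """Parse an env-var string as a boolean, rejecting ambiguous values."""
    val = raw.strip().lower()
    if val in TRUTHY:
        return True
    if val in FALSY:
        return False
    raise ValueError(f"Cannot interpret {raw!r} as a boolean")


# The pitfall a plain bool() cast has:
assert bool("False") is True      # non-empty string is truthy
assert parse_bool("False") is False
```

Raising on unrecognized values catches typos like `PUDL_WRITE_TO_PARQUET=ture` instead of silently treating them as a boolean.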

Member

I think my preference is for PudlMixedFormatIOManager to be a subclass of IOManager and to configure it using the @io_manager(config_schema=...) decorator. These are the legacy concepts but I don't think it's great to have our IO managers use a mix of the legacy and current resource concepts.

Eventually we should move to using the new pydantic configurable resources. We explored migrating our setting configurations in #2842 but decided not to because it was a bigger change than we expected. We could create separate issues for migrating the resources in pudl.resources and pudl.io_managers.

Member

Good to know ConfigurableIOManager works with the @io_manager decorator; however, I think it might be duplicative given that @io_managers are also used to make the IO Manager configurable.

To be consistent with our other IO managers, PudlMixedFormatIOManager would look something like this:

class PudlMixedFormatIOManager(IOManager):
    def __init__(self, write_to_parquet: bool, read_from_parquet: bool):
        self.write_to_parquet: bool = write_to_parquet
        self.read_from_parquet: bool = read_from_parquet


@io_manager(
    config_schema={
        "write_to_parquet": Field(
            bool,
            default_value=False,
        ),
        "read_from_parquet": Field(
            bool,
            default_value=False,
        ),
    },
)
def pudl_io_manager(init_context) -> IOManager:
    """Create a mixed-format IO manager dagster resource for the pudl database."""
    write_to_parquet = init_context.resource_config["write_to_parquet"]
    read_from_parquet = init_context.resource_config["read_from_parquet"]
    return PudlMixedFormatIOManager(write_to_parquet, read_from_parquet)

Member

Eventually everything should be ConfigurableResources that are instantiated without an @io_manager decorator like in this example. I think it will be easier to switch all of the IO managers together instead of partially in this PR and partially in another.

rousik (Collaborator Author)

This is a fair point; the duplication of the config in both the ConfigurableIOManager and the config_schema here would be tedious.

rousik (Collaborator Author)

@bendnorman, fair point re/ use of config_schema here. I will make the necessary changes.

rousik (Collaborator Author)

Though it's tempting to keep using ConfigurableIOManager, as that takes care of building the constructor for me, which would be identical to what I'm about to write :-)

src/pudl/io_managers.py (several resolved threads)
@zaneselvans zaneselvans added output Exporting data from PUDL into other platforms or interchange formats. parquet Issues related to the Apache Parquet file format which we use for long tables. dagster Issues related to our use of the Dagster orchestrator labels Jan 11, 2024
@zaneselvans zaneselvans linked an issue Jan 11, 2024 that may be closed by this pull request

rousik commented Jan 15, 2024

Moving the conversation ahead a notch. The modified code that addresses some of the issues outlined here is still on my local machine and needs to be cleaned up before publishing; this is pending an open issue with dagster EnvVar.


rousik commented Jan 16, 2024

I've refactored the IO managers so that there's a sqlite one, a parquet one, and a combined one. Reading from env variables is kind of crummy, because dagster has much worse support for this than pydantic BaseSettings, but I don't think this is a major problem. Parquet writes/reads will be enabled whenever the associated env variables are non-empty (i.e. the truth value is based on whether the string is set, not on interpreting it as a boolean, which could lead to some confusion).

I feel like there's still some work to be done to clean things up, and ideally we would also use the same parquet io manager for epacems, but the presence of sharding will require some custom logic.

The pyarrow parquet writer does support sharding, but the naming scheme is usually base_dir/key=value/{uuid}.parquet, i.e. in our case it would be epacems/state=CA/..., which may not be what we want.
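To illustrate the naming concern without pulling in pyarrow, this is the shape of the Hive-style partitioned layout being described, as pure path arithmetic (function name and arguments are illustrative only; no I/O happens):

```python
import uuid
from pathlib import Path


def hive_partition_path(base_dir, table, key, value):
    """Shape of the path partition-aware parquet writers produce:
    base_dir/table/key=value/<random>.parquet (naming only, no I/O)."""
    fname = f"{uuid.uuid4().hex}.parquet"
    return Path(base_dir) / table / f"{key}={value}" / fname
```

So sharding EPA CEMS by state yields paths like `epacems/state=CA/<uuid>.parquet`, rather than a flat `epacems_ca.parquet` style name.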

@zaneselvans zaneselvans (Member)

@rousik It looks like it's not actually able to write to the DB because it has no connection engine?


rousik commented Jan 16, 2024

Oh, that would be tests reaching into the IO manager and assuming its internal structure. I'll fix it later today when I'm at the keyboard.

ETL itself should be working.

@zaneselvans zaneselvans changed the title Add basic parquet export capability to PudlSqliteIOManager Add ability to write data to both Parquet and SQLite during ETL Jan 16, 2024
@@ -536,9 +638,19 @@ def load_input(self, context: InputContext) -> pd.DataFrame:


@io_manager
def pudl_sqlite_io_manager(init_context) -> PudlSQLiteIOManager:
def pudl_sqlite_io_manager(init_context) -> IOManager:
rousik (Collaborator Author)

@bendnorman and @zaneselvans - given that this now returns a mixed format IO manager, could we rename this resource/io-manager from pudl_sqlite_io_manager to plain pudl_io_manager? I think that would be more appropriate. There's a bit of a risk that the very widely used word pudl refers both to the primary output database (pudl.sqlite up to this point) and to the overall project/ETL/codebase. This ambiguity may be okay, because I'm not quite sure what other word we could substitute here. Maybe main_output or something along those lines?

@bendnorman bendnorman (Member) commented Jan 17, 2024

I think pudl_io_manager is appropriate for this IO Manager now that it loads data to sqlite and parquet files. We're currently ok with the naming ambiguity of the code and data outputs. We explain the different parts of pudl in the readme.

Member

I think pudl_io_manager makes sense.

This is probably beyond the scope of this PR and maybe more of a @bendnorman thing but I think we should also change the way the default_resources are defined in pudl.etl.__init__.py:

default_resources = {
    "datastore": datastore,
    "pudl_sqlite_io_manager": pudl_sqlite_io_manager,
    "ferc1_dbf_sqlite_io_manager": ferc1_dbf_sqlite_io_manager,
    "ferc1_xbrl_sqlite_io_manager": ferc1_xbrl_sqlite_io_manager,
    "dataset_settings": dataset_settings,
    "ferc_to_sqlite_settings": ferc_to_sqlite_settings,
    "epacems_io_manager": epacems_io_manager,
}

It seems like the keys in this dictionary should not specify anything about the format / destination of the data, while the values (the actual IO managers) should have that specificity. So pudl_io_manager could be any kind of IO manager... but regardless it's the one that's going to be used to output PUDL data, whether it's going to SQLite, BigQuery, both SQLite + Parquet, etc. And the value associated with it determines the details.

I found this very confusing in thinking about how we were going to switch over to writing the data to not just SQLite, since we've hard-coded pudl_sqlite_io_manager all throughout the system. But it's not really an SQLite IO Manager. It could be any kind of IO Manager. But I didn't realize they were separable until reviewing this PR.

The FERC IO managers seem like they have a similar issue, and also, what IO Managers are used to output all of the FERC Forms that aren't Form 1?

Member

I agree with Zane, the keys shouldn't specify anything about format / destination.

The FERC IO managers are just for reading data out of the dbf and xbrl sqlite databases. We don't process any of the other ferc forms in the main PUDL ETL so we don't have IO managers for them.

Member

Also don't we need to tell it to use the new IO Manager here? Otherwise it won't actually generate any Parquet files will it?

rousik (Collaborator Author)

We are instantiating PudlMixedFormatIOManager here which uses either of the formats depending on the configuration.

@zaneselvans zaneselvans (Member) left a comment

Just one minor issue with what we're logging for the read/write formats.

src/pudl/io_managers.py Outdated Show resolved Hide resolved

Comment on lines 117 to 121
write_to_parquet: bool = bool(EnvVar("PUDL_WRITE_TO_PARQUET").get_value(False))
"""If true, data will be written to parquet files."""

read_from_parquet: bool = bool(EnvVar("PUDL_READ_FROM_PARQUET").get_value(False))
"""If true, data will be read from parquet files instead of sqlite."""
Member

What are the benefits of using environment variables to configure this IO manager? We currently aren't using them to configure our other resources.

rousik (Collaborator Author)

In short, practical concerns.

I'm treating this as a "feature flag", where we start by having the new functionality (writing to parquet files) off by default, but toggle-able on demand. While toggling this in the dagster UI is useful for local development, in many cases where we might want to test this (e.g. in CI, nightly builds, and other dockerized/remote scenarios) it's very easy to set an env variable and pass it to the ETL, and it's much more difficult to tweak dagster configuration from the outside, IMO.

This could be hooked into a command-line flag, but that is also somewhat clunky and would require ad-hoc wiring to be piped through, while env variables are easy and widely supported (both by github actions and when running the docker image on a vm/batch), so it should be fairly easy to toggle this new functionality just where we need it.

rousik (Collaborator Author)

If all goes well and this feature ends up being enabled everywhere and always, we can drop this toggle and hardcode the new behavior (always on). I'm unsure what the timeline for this would be, given the early stage this is in.

Member

If reading from Parquet ends up significantly speeding up the ETL it's easy to imagine this getting turned on immediately.

Once it's there, I'd experiment with using it to speed up the integration tests and data validations (which would mean relying on SQLite primarily for bulk distribution, validation of referential integrity, and maybe other schema checks / constraints that are only implemented by the to_sqlite() schema). Since we want to remove PudlTabl (probably in the Spring) and that will mean reorganizing the tests, that would probably be a good time to play around with it. I don't know exactly what fraction of the time in the integration tests and data validations is spent reading data, but I suspect it's significant, given the way we've wired up PudlTabl to pass the read requests through to the database and not cache anything in memory.

I can't remember where we were discussing it but I think @bendnorman suggested that since the PUDL ETL script is now a relatively thin wrapper around some slightly nonstandard Dagster structures, we might want to switch to using the Dagster CLI directly, which I think would make these job configurations more straightforward with the Dagster configs.



zaneselvans commented Jan 17, 2024

Also, don't we need to update the resource configuration in pudl/etl/__init__.py? Otherwise it won't actually use the new IO manager will it? And it seems like it should at least in the integration tests, or we won't know whether it's working.

Like if I check out this branch and run the ETL, it won't actually make any Parquet files will it?


rousik commented Jan 18, 2024

Okay. I have made some more changes, specifically renamed pudl_sqlite_io_manager to a generic pudl_io_manager.

I have left fake_pudl_sqlite_io_manager in place, as that is essentially the unit tests for PudlSQLiteIOManager itself. Personally, I think that switching to unittest.TestCase here would make things a little cleaner instead of passing around these fixtures, e.g. (based on https://docs.python.org/3/library/unittest.html#basic-example):

class PudlSQLiteIOManagerTests(unittest.TestCase):

  def setUp(self):
    # spin up temp database (in TemporaryDirectory perhaps)
    self.io_manager = PudlSQLiteIOManager(...)

  def test_delete_statement(self):
    self.io_manager.handle_output(...)
    ...

We're already using unittest style tests for excel, so this would not be the first case :)

@@ -33,7 +33,7 @@ def test_pudl_engine(

if check_foreign_keys:
# Raises ForeignKeyErrors if there are any
pudl_sql_io_manager.check_foreign_keys()
pudl_sql_io_manager._sqlite_io_manager.check_foreign_keys()
rousik (Collaborator Author)

This is dirty, but I'm not sure what the best way around it would be. Maybe implement a MixedFormatIOManager.get_sqlite_io_manager() -> PudlSqliteIOManager method that we could call when we really need to access sqlite-only features?

Member

I think implementing a getter that throws a helpful error if the MixedFormatIOManager doesn't have a sqlite_io_manager would be a good approach.

It might not be worth the complexity but we could make the MixedFormatIOManager more generalizable and accept an arbitrary number of IO managers to use. Then we could have some methods for viewing and grabbing specific io managers:

>>> MixedFormatIOManager.list_io_manager_names()
["sqlite_io_manager", "parquet_io_manager", "bq_io_manager", ...]
>>> MixedFormatIOManager.get_io_manager("sqlite_io_manager")

I think for now a simple MixedFormatIOManager.get_sqlite_io_manager() -> PudlSqliteIOManager method seems fine.
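That simple getter could look like this (a sketch; the private attribute names are assumed from this PR's discussion, and the message text is illustrative):

```python
class MixedFormatIOManager:
    """Illustrative subset: just the sqlite accessor under discussion."""

    def __init__(self, sqlite_io_manager=None, parquet_io_manager=None):
        self._sqlite_io_manager = sqlite_io_manager
        self._parquet_io_manager = parquet_io_manager

    def get_sqlite_io_manager(self):
        """Return the wrapped sqlite IO manager, failing loudly when absent.

        Callers needing sqlite-only features (e.g. foreign key checks)
        use this instead of reaching into private attributes.
        """
        if self._sqlite_io_manager is None:
            raise RuntimeError(
                "This MixedFormatIOManager was built without a sqlite "
                "backend, so sqlite-only features such as "
                "check_foreign_keys() are unavailable."
            )
        return self._sqlite_io_manager
```

Tests and the foreign-key check script would then call `get_sqlite_io_manager().check_foreign_keys()` rather than touching `_sqlite_io_manager` directly.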

rousik (Collaborator Author)

I think the generalization here is probably not warranted. I would only add that if/when we know we will need it; otherwise it may just be future-proofing for something that will never happen, and I agree that the complexity is not worth it.

What this does show, however, is that the sqlite and pudl-sqlite io managers are special in that their APIs and functionality are much broader than those of a typical, format-agnostic io manager, and this API is a bit vague.

It might be valuable to separate these concerns somewhat, perhaps by defining the interface that such a sqlite thingy should provide, and having a method on the mixed io manager that can return implementations of this clearly defined, separate interface. It could return the io_manager that implements this interface, but in most cases I see, we don't actually need the io_manager itself, just the specific sqlite validation/schema-management functionality.

It might even make sense to separate this sqlite/schema functionality into a standalone class that is embedded in the IO manager, but with which the IO manager communicates through this well defined API.

Note that the above may very well go past the scope of this PR and should likely be reserved for subsequent cleanup/refactoring. Providing a get_sqlite_io_manager() function is simple and deals with this dirtiness reasonably well.


rousik commented Jan 18, 2024

Refactored the PudlSQLiteIOManager tests to use the unittest framework, which IMO leads to much cleaner and leaner tests. Case in point: running these tests resulted in discovering two real bugs in the underlying schema validation code :)


rousik commented Jan 18, 2024

Looks like failing validation when there are unknown columns is not the right thing to do, as demonstrated by the failing ci-integration. Perhaps the assumption in the test_extra_column_error test is itself wrong? Halp!

@zaneselvans zaneselvans (Member)

@rousik whenever you introduce additional frameworks or systems that we're not using (like poetry or unittest) it creates a barrier to our understanding, maintaining, and extending the work, or it means we need to dedicate additional internal resources to educating everyone on how these other systems work, or to replacing what you've done with the systems that we've already agreed to use internally.

Comment on lines +54 to +59
io_manager: PudlSQLiteIOManager = pudl_io_manager(context)._sqlite_io_manager
# TODO(rousik): This reaches into the io_manager and assumes
# PudlSQLiteIOManager and the corresponding foreign key
# functionality. It's strange to be constructing dagster
# resources to achieve this, but alas, that's where the logic
# lies.
Member

Ideally we'd have a storage agnostic method for referential integrity checks? Maybe we can pull the logic out of the IO managers?

rousik (Collaborator Author)

We have the foreign key check on the schemas/Packages (do the foreign keys make sense at all), and right now we have the validation on the actual data, which is very much format-specific (it relies on sqlite pragmas). Building something format-agnostic may well be out of scope for this PR, but nevertheless, extracting the functionality out of the io_manager does make sense and I will do so. That will simplify things.
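Since SQLite exposes this check as a built-in pragma, the extracted version can be a free function that needs only a database path, no IO manager instance at all (a sketch; the function name and return shape are illustrative):

```python
import sqlite3


def check_foreign_keys(db_path):
    """Return foreign key violations found by SQLite's built-in check.

    Each row is (table, rowid, referenced_table, fk_index); an empty
    list means referential integrity holds for the whole database.
    """
    conn = sqlite3.connect(db_path)
    try:
        return conn.execute("PRAGMA foreign_key_check").fetchall()
    finally:
        conn.close()
```

Note this works even on databases written with foreign key enforcement off (sqlite3's default), which is exactly the after-the-fact validation scenario.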

Comment on lines +515 to +516
settings: controls how parquet files are used by the PUDL ETL, in particular,
whether they should be used for input/output of dataframes.
Member

Is this still relevant? I don't see any references to settings.

src/pudl/io_managers.py (resolved)
@@ -33,7 +33,7 @@ def test_pudl_engine(

if check_foreign_keys:
# Raises ForeignKeyErrors if there are any
pudl_sql_io_manager.check_foreign_keys()
pudl_sql_io_manager._sqlite_io_manager.check_foreign_keys()
Member

I think implementing a getter that throws a helpful error if the MixedFormatIOManager doesn't have a sqlite_io_manager would be worthwhile.

It might not be worth the complexity but we could make the MixedFormatIOManager more generalizable and accept an arbitrary number of IO managers to use. Then we could have some methods for viewing and grabbing specific io managers:

>>> MixedFormatIOManager.list_io_manager_names()
["sqlite_io_manager", "parquet_io_manager", "bq_io_manager", ...]
>>> MixedFormatIOManager.get_io_manager("sqlite_io_manager")

I think for now a simple MixedFormatIOManager.get_sqlite_io_manager() -> PudlSqliteIOManager method seems fine.
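A sketch of that registry idea (hypothetical; the MixedFormatIOManager in this PR hard-codes the sqlite and parquet managers rather than accepting an arbitrary set):

```python
class MixedFormatIOManager:
    """Fans I/O out to several format-specific IO managers.

    Managers are stored by name so callers can look up a specific one,
    e.g. to run SQLite-only foreign key checks.
    """

    def __init__(self, **io_managers):
        self._io_managers = dict(io_managers)

    def list_io_manager_names(self) -> list[str]:
        return sorted(self._io_managers)

    def get_io_manager(self, name: str):
        try:
            return self._io_managers[name]
        except KeyError:
            raise KeyError(
                f"No IO manager named {name!r}; "
                f"available: {self.list_io_manager_names()}"
            ) from None
```

The failure mode is the important part: a lookup for a manager that was never registered fails loudly with the list of names that do exist, instead of an opaque AttributeError.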

fake_pudl_sqlite_io_manager_fixture.load_input(input_context)
# TODO(rousik): is there difference between fake_sqlite_io_manager
# and real sqlite_io_manager in terms of functionality?!!
class PudlSQLiteIOManagerTest(unittest.TestCase):
Member

Why are you using unittest instead of pytest here?

Collaborator Author

See other threads on the PR. I thought that unittest offers generally cleaner tests with less verbosity. However, if this breaks convention (most of the code uses pytest, though there are some unittest-based tests too), I can revert these changes.

Collaborator Author

I do like being able to put all of the assumptions and expectations about a single class-under-test into a parallel unittest test class, since it provides nice encapsulation and visual separation compared to the flat, undifferentiated pytest style. But if we think consistency (pytest vs. unittest) is more valuable, I can revert these last changes.

@rousik
Collaborator Author

rousik commented Jan 18, 2024

@rousik whenever you introduce additional frameworks or systems that we're not using (like poetry or unittest) it creates a barrier to our understanding, maintaining, and extending the work, or it means we need to dedicate additional internal resources to educating everyone on how these other systems work, or to replacing what you've done with the systems that we've already agreed to use internally.

This is a fair point. I agree that adding new technologies to the mix has non-trivial costs, and we shouldn't do it haphazardly, which is perhaps how it looks from the outside. Let me provide more detailed reasoning for unittest:

  1. unittest style is already used in excel, so we're not strictly introducing it from scratch. It was me who added it back then, so it's still on me though :)
  2. unittest plays well with pytest, in that pytest can run unittest-based tests just fine. So it boils down to a different "style" of writing tests, while the overall usage remains unchanged and unaffected.
  3. It feels to me that unittest offers more encapsulated and cleaner ways to build tests. This is particularly nice when we are testing specific classes and their expectations (i.e., rather than having a flat collection of tests, we can group them into test classes geared toward specific functionality) with explicit setup/teardown logic, rather than relying on a nebulous cloud of fixtures.

It is point (3) in particular where I think departing from pytest might result in cleaner, leaner-looking tests, and that might well be worth it. We're also using fairly limited and intuitive features of unittest, so it shouldn't be too hard to maintain those tests; but perhaps I'm underestimating the burden, and overestimating the lower maintenance of these "nicer" tests.

I can roll these changes back if this is undesirable.

I think we can have a separate conversation re: poetry. There, the use is definitely more experimental, but also non-binding, in that it should be possible to switch back to another dependency management system.
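As a concrete (and deliberately generic) illustration of the class-based grouping being argued for here; this is not PUDL's actual test code, and pytest will collect and run a TestCase like this without any extra configuration:

```python
import unittest


class MixedFormatIOManagerTest(unittest.TestCase):
    """All expectations about one class-under-test, grouped in one place."""

    def setUp(self):
        # Explicit per-test setup instead of an injected pytest fixture.
        self.managers = {"sqlite": object(), "parquet": object()}

    def test_has_both_formats(self):
        self.assertEqual(sorted(self.managers), ["parquet", "sqlite"])

    def test_lookup_missing_format(self):
        with self.assertRaises(KeyError):
            self.managers["bigquery"]
```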

@zschira
Member

zschira commented Jan 24, 2024

Hey @rousik, we decided we want to try to get the parquet support completed by the end of the week, so I'm going to branch off of here and see if I can get a parallel PR through over the finish line. Thanks for getting this most of the way there, though!

@rousik
Collaborator Author

rousik commented Jan 24, 2024

Hey @rousik, we decided we want to try to get the parquet support completed by the end of the week, so I'm going to branch off of here and see if I can get a parallel PR through over the finish line. Thanks for getting this most of the way there, though!

Thanks, that is awesome. I'm off the grid most of this week so it's a great decision.

@zaneselvans
Member

Parquet outputs have been merrrrged in #3296

@zaneselvans zaneselvans closed this Feb 6, 2024
@zaneselvans zaneselvans deleted the export-to-parquet branch February 25, 2024 02:48
Labels
- community
- dagster: Issues related to our use of the Dagster orchestrator
- output: Exporting data from PUDL into other platforms or interchange formats.
- parquet: Issues related to the Apache Parquet file format which we use for long tables.