Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use context managers for opening zipfiles #3369

Merged
merged 4 commits into from
Feb 9, 2024
Merged

Conversation

bendnorman
Copy link
Member

@bendnorman bendnorman commented Feb 7, 2024

Overview

Hopefully, using context managers to close connections to zip files will resolve our sporadic BadZipFile error #3364.

Testing

How did you make sure this worked? How can a reviewer verify this?

To-do list

@bendnorman bendnorman linked an issue Feb 7, 2024 that may be closed by this pull request
@zaneselvans zaneselvans linked an issue Feb 8, 2024 that may be closed by this pull request
@zaneselvans zaneselvans added bug Things that are just plain broken. nightly-builds Anything having to do with nightly builds or continuous deployment. labels Feb 8, 2024
Copy link
Member

@jdangerx jdangerx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sweet, glad to see it's so straightforward :) There are a few other places where we use get_zipfile_resource outside of a context manager (dbf.py, eia176.py, epacems.py, etl_test.py, glue_assets.py) - any particular reason not to change those as well?

@bendnorman
Copy link
Member Author

bendnorman commented Feb 9, 2024

Ugh I goofed this rebase. @jdangerx any idea how to resolve it? I only meant to push the latest commit.

@bendnorman
Copy link
Member Author

I went back and added context managers to where we use get_zipfile_resource except for pudl.extract.dbf.FercDbfReader.

@lru_cache # noqa: B019
def get_archive(self: Self, year: int, **filters) -> FercDbfArchive:
"""Returns single dbf archive matching given filters."""
nfilters = self._normalize(filters)
return FercDbfArchive(
self.datastore.get_zipfile_resource(self.dataset, year=year, **nfilters),
dbc_path=self._dbc_path[year],
partition=filters,
table_file_map=self._table_file_map,
field_parser=self.field_parser,
)

It doesn't seem right to pass an open ZipFile object into an objection initialization because it's not super clear when it should be closed. I think it'd be better to pass a zipfile path or the arguments of get_zipfile_resource into this class so it's methods can open and close the file as needed. Given we haven't seen this bit of code produce the BadZipError I think it's fine to keep it as is for now. I can create an issue to redesign the classes that accept open zipfiles.

I'm also more optimistic adding context managers will resolve the BadZipFile("File is not a zip file") error. The zipfile docs say:

You must call close() before exiting your program or essential records will not be written.

This makes me think the "essential records" weren't written and the PHMSA and FERC 714 extractors tried to reopen these corrupted zipfiles. However, I don't know what "essential records" would need to be written when you are reading the file 🤷

@jdangerx
Copy link
Member

jdangerx commented Feb 9, 2024

Leaving the FercDbfReader as is seems fine! Maybe a comment in-line to leave a breadcrumb trail would be helpful for future Ben :)

Your comment about essential records led me down this rabbit hole: https://github.com/python/cpython/blob/3.12/Lib/zipfile/__init__.py#L1211

@bendnorman bendnorman marked this pull request as ready for review February 9, 2024 19:42
Copy link
Member

@jdangerx jdangerx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤐

@bendnorman bendnorman added this pull request to the merge queue Feb 9, 2024
Merged via the queue into main with commit bbf79ea Feb 9, 2024
12 checks passed
@bendnorman bendnorman deleted the resolve-badzipfile-error branch February 9, 2024 21:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Things that are just plain broken. nightly-builds Anything having to do with nightly builds or continuous deployment.
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

BadZipFile error sometimes shows up in nightly builds Nightly Build Failure 2024-02-05
3 participants