Handling of archives (tar, zip) not well defined #160

Open
ptgolden opened this issue Nov 8, 2024 · 2 comments

Comments


ptgolden commented Nov 8, 2024

Koza supports reading data from two archive formats: zip and tar. However, an archive is handled differently depending on which setting it is listed in.

If a zip archive is listed in a transform's files setting, then only the first file in the archive's file list is read. That read is streamed (i.e. the archive is not extracted to disk first):

    if is_zipfile(resource):
        with ZipFile(resource, 'r') as zip_file:
            file = TextIOWrapper(zip_file.open(zip_file.namelist()[0], 'r'))  # , encoding='utf-8')
            # file = zip_file.read(zip_file.namelist()[0], 'r').decode('utf-8')

If a zip or tar archive is listed in a transform's file_archive setting, it is first extracted to disk, and then every extracted file is added to the files setting (unless files already lists specific paths):

    def extract_archive(self):
        archive_path = Path(self.file_archive).parent  # .absolute()
        if self.file_archive.endswith(".tar.gz") or self.file_archive.endswith(".tar"):
            with tarfile.open(self.file_archive) as archive:
                archive.extractall(archive_path)
        elif self.file_archive.endswith(".zip"):
            with zipfile.ZipFile(self.file_archive, "r") as archive:
                archive.extractall(archive_path)
        else:
            raise ValueError("Error extracting archive. Supported archive types: .tar.gz, .zip")
        if self.files:
            files = [os.path.join(archive_path, file) for file in self.files]
        else:
            files = [os.path.join(archive_path, file) for file in os.listdir(archive_path)]
        return files

There are a few issues here:

  1. In the zip-in-files case, every file in the archive after the first one is silently ignored.
  2. In the archive-in-file_archive case, extracting the archive wastes CPU time and disk space when its contents could be streamed instead.
  3. Also in the archive-in-file_archive case, reading will fail if the archive contains files in a format other than the one declared. For example, the following configuration will fail or (worse) silently read garbage data if data.zip contains the files data.csv and README.txt:

     file_archive: data.zip
     format: csv
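
For reference, the first issue can be reproduced outside of Koza with a few lines of standard-library Python. This is only a minimal sketch: the member names a.csv and b.csv and their contents are made up for illustration, and the archive is built in memory rather than read from a real resource.

```python
import io
from zipfile import ZipFile

# Build a two-member zip in memory (names and contents are illustrative only).
buffer = io.BytesIO()
with ZipFile(buffer, "w") as zf:
    zf.writestr("a.csv", "id,name\n1,foo\n")
    zf.writestr("b.csv", "id,name\n2,bar\n")
buffer.seek(0)

with ZipFile(buffer, "r") as zf:
    names = zf.namelist()
    # Same pattern as the zip-in-files snippet: only namelist()[0] is opened.
    with io.TextIOWrapper(zf.open(names[0], "r"), encoding="utf-8") as fh:
        first_only = fh.read()

print(names)       # both members are listed
print(first_only)  # but only the content of a.csv was read
```

Nothing warns that b.csv was skipped, which is what makes this easy to miss.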

ptgolden commented Nov 8, 2024

Related: #124

A possible solution might deal with compression transparently. This should be easy: it would just mirror the behavior in the zip-in-files case and add similar behavior for tar files.
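
A rough sketch of what streaming every member of either archive type could look like, using only the standard library. The helper name iter_archive_members is hypothetical and not part of Koza, and UTF-8 text members are assumed. Nothing is written to disk.

```python
import io
import tarfile
import zipfile
from typing import Iterator, TextIO, Tuple


def iter_archive_members(path: str, encoding: str = "utf-8") -> Iterator[Tuple[str, TextIO]]:
    """Yield (member_name, text_stream) for each regular file in a zip or tar archive.

    Hypothetical helper, not Koza's API. Each stream is closed as soon as the
    caller advances to the next member, so it must be consumed inside the loop.
    """
    if zipfile.is_zipfile(path):
        with zipfile.ZipFile(path, "r") as archive:
            for info in archive.infolist():
                if info.is_dir():
                    continue  # skip directory entries
                with io.TextIOWrapper(archive.open(info, "r"), encoding=encoding) as text:
                    yield info.filename, text
    elif tarfile.is_tarfile(path):
        # Mode "r:*" lets tarfile detect gzip/bz2/xz compression by itself.
        with tarfile.open(path, "r:*") as archive:
            for member in archive:
                if not member.isfile():
                    continue  # skip directories, links, devices
                with io.TextIOWrapper(archive.extractfile(member), encoding=encoding) as text:
                    yield member.name, text
    else:
        raise ValueError(f"Not a zip or tar archive: {path}")
```

A caller would loop with `for name, stream in iter_archive_members("data.zip"): ...` and read each stream inside the loop body. Detecting the type with is_zipfile/is_tarfile instead of the file extension is a design choice of this sketch, not something the current code does.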

Dealing with the third issue above would only be possible by detecting the types of the files contained within the archive, probably through their file extensions. The easier thing to do might be to document that, when reading from an archive, all files in that archive are expected to be in the declared format.
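
To make the extension-based option concrete, here is a minimal sketch. The FORMAT_EXTENSIONS mapping and the function name members_matching_format are assumptions for illustration; neither exists in Koza, and the real set of formats and extensions would need to be decided.

```python
from pathlib import PurePosixPath
from typing import Iterable, List

# Assumed mapping from a declared format to acceptable extensions (illustrative only).
FORMAT_EXTENSIONS = {
    "csv": {".csv", ".tsv"},
    "json": {".json"},
    "jsonl": {".jsonl"},
    "yaml": {".yaml", ".yml"},
}


def members_matching_format(names: Iterable[str], declared_format: str) -> List[str]:
    """Keep only archive members whose extension fits the declared format."""
    allowed = FORMAT_EXTENSIONS[declared_format]
    # Archive member names use "/" separators, hence PurePosixPath.
    return [name for name in names if PurePosixPath(name).suffix.lower() in allowed]


print(members_matching_format(["data.csv", "README.txt"], "csv"))  # ['data.csv']
```

A filter like this would skip README.txt in the data.zip example, but it would also skip legitimately formatted files with unusual extensions, which is the main argument for the documentation-only approach.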

ptgolden commented

Issues 1 and 2 were fixed by 4f95c13.

3 remains an open question. It would require some way of declaring "I only want to use files a.csv and b.csv in data_archive.zip".
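
One way such a declaration could be honored, sketched with the standard library only. The helper read_selected_members is hypothetical, and how the list of wanted names would be expressed in a transform's configuration is left open here. The sketch reads each member fully into memory for brevity and assumes UTF-8 text.

```python
import io
import zipfile
from typing import Dict, Sequence


def read_selected_members(archive: zipfile.ZipFile, wanted: Sequence[str]) -> Dict[str, str]:
    """Return {member_name: text} for only the requested members (hypothetical helper)."""
    available = set(archive.namelist())
    missing = [name for name in wanted if name not in available]
    if missing:
        # Fail loudly rather than silently reading a subset.
        raise KeyError(f"Not found in archive: {missing}")
    return {name: archive.read(name).decode("utf-8") for name in wanted}


# Illustrative archive built in memory; member names follow the example above.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("a.csv", "id\n1\n")
    zf.writestr("b.csv", "id\n2\n")
    zf.writestr("README.txt", "not tabular data\n")
buffer.seek(0)

with zipfile.ZipFile(buffer, "r") as zf:
    selected = read_selected_members(zf, ["a.csv", "b.csv"])

print(sorted(selected))  # ['a.csv', 'b.csv']
```

Raising on a missing name is a deliberate choice in this sketch, since silently skipping files is the behavior this issue set out to remove.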
