Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

zipfile.BadZipFile: File is not a zip file when loading an .xlsx file #54

Open
craigastill opened this issue May 15, 2023 · 7 comments · May be fixed by #56
Open

zipfile.BadZipFile: File is not a zip file when loading an .xlsx file #54

craigastill opened this issue May 15, 2023 · 7 comments · May be fixed by #56

Comments

@craigastill
Copy link
Contributor

Not sure if this is a misconfiguration on my side, but I get the error: zipfile.BadZipFile: File is not a zip file both from reading an .xlsx from local/S3, or a pytest testcase saving and reloading an openpyxl created workbook.

Debugging/Workaround

My current workaround is to modify: https://github.com/ets/tap-spreadsheets-anywhere/blob/main/tap_spreadsheets_anywhere/excel_handler.py#L64-L65 from:

def get_row_iterator(table_spec, file_handle):
    workbook = openpyxl.load_workbook(file_handle, read_only=True)

to:

def get_row_iterator(table_spec, file_handle):
    workbook = openpyxl.load_workbook(file_handle.name, read_only=True)

Reproduction:

  • Created an example testcase:
    from openpyxl import Workbook
    from tap_spreadsheets_anywhere.format_handler import get_row_iterator
    
    
    def get_worksheet():
        """Create a basic workbook that can be manipulated for tests.
        See: https://openpyxl.readthedocs.io/en/stable/usage.html.
        """
        wb = Workbook()
        ws = wb.active
        tree_data = [
            ["Type", "Leaf Color", "Height"],
            ["Maple", "Red", 549],
            ["Oak", "Green", 783],
            ["Pine", "Green", 1204]
        ]
        exp_tree_data = [
            {'type': 'Maple', 'leaf_color': 'Red', 'height': 549},
            {'type': 'Oak', 'leaf_color': 'Green', 'height': 783},
            {'type': 'Pine', 'leaf_color': 'Green', 'height': 1204},
        ]
        [ws.append(row) for row in tree_data]
        return ws, wb, tree_data, exp_tree_data
    
    
    class TestFormatHandlerExcelSkipInitial:
        """pytests to validate Skip Initial for Excel files works as expected."""
        def test_parse_data(self, tmpdir):
            xlsx = tmpdir / "fake_test.xlsx"
            uri = f"file://{xlsx}"
            worksheet, workbook, _, exp = get_worksheet()
            workbook.save(xlsx)
    
            iterator = get_row_iterator({"format": "excel"}, uri)
            assert next(iterator) == exp[0]
  • Current output:
    cd ~/github_forks/tap-spreadsheets-anywhere/ && source .venv/bin/activate && pytest -vvvv -p no:warnings -k TestFormatHandlerExcelSkipInitial
    ============================================= test session starts ==============================================
    platform darwin -- Python 3.11.3, pytest-7.3.1, pluggy-1.0.0 -- /Users/craigastill/github_forks/tap-spreadsheets-anywhere/.venv/bin/python3.11
    cachedir: .pytest_cache
    rootdir: /Users/craigastill/github_forks/tap-spreadsheets-anywhere
    collected 30 items / 29 deselected / 1 selected                                                                
    
    tap_spreadsheets_anywhere/test/test_format.py::TestFormatHandlerExcelSkipInitial::test_parse_data FAILED [100%]
    
    =================================================== FAILURES ===================================================
    ______________________________ TestFormatHandlerExcelSkipInitial.test_parse_data _______________________________
    
    self = <tap_spreadsheets_anywhere.test.test_format.TestFormatHandlerExcelSkipInitial object at 0x1055e9e50>
    tmpdir = local('/private/var/folders/yx/p09k0hh165scp5p_ntthr29c0000gn/T/pytest-of-craigastill/pytest-6/test_parse_data0')
    
        def test_parse_data(self, tmpdir):
            xlsx = tmpdir / "fake_test.xlsx"
            uri = f"file://{xlsx}"
            worksheet, workbook, _, exp = get_worksheet()
            workbook.save(xlsx)
    
    >       iterator = get_row_iterator({"format": "excel"}, uri)
    
    tap_spreadsheets_anywhere/test/test_format.py:218: 
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
    tap_spreadsheets_anywhere/format_handler.py:164: in get_row_iterator
        iterator = tap_spreadsheets_anywhere.excel_handler.get_row_iterator(table_spec, reader)
    tap_spreadsheets_anywhere/excel_handler.py:66: in get_row_iterator
        workbook = openpyxl.load_workbook(file_handle, read_only=True)
    .venv/lib/python3.11/site-packages/openpyxl/reader/excel.py:344: in load_workbook
        reader = ExcelReader(filename, read_only, keep_vba,
    .venv/lib/python3.11/site-packages/openpyxl/reader/excel.py:123: in __init__
        self.archive = _validate_archive(fn)
    .venv/lib/python3.11/site-packages/openpyxl/reader/excel.py:95: in _validate_archive
        archive = ZipFile(filename, 'r')
    /usr/local/Cellar/python@3.11/3.11.3/Frameworks/Python.framework/Versions/3.11/lib/python3.11/zipfile.py:1301: in __init__
        self._RealGetContents()
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
    
    self = <zipfile.ZipFile [closed]>
    
        def _RealGetContents(self):
            """Read in the table of contents for the ZIP file."""
            fp = self.fp
            try:
                endrec = _EndRecData(fp)
            except OSError:
                raise BadZipFile("File is not a zip file")
            if not endrec:
    >           raise BadZipFile("File is not a zip file")
    E           zipfile.BadZipFile: File is not a zip file
    
    /usr/local/Cellar/python@3.11/3.11.3/Frameworks/Python.framework/Versions/3.11/lib/python3.11/zipfile.py:1368: BadZipFile
    --------------------------------------------- Captured stdout call ---------------------------------------------
    <_io.TextIOWrapper name='/private/var/folders/yx/p09k0hh165scp5p_ntthr29c0000gn/T/pytest-of-craigastill/pytest-6/test_parse_data0/fake_test.xlsx' mode='r' encoding='utf-8'>
    =========================================== short test summary info ============================================
    FAILED tap_spreadsheets_anywhere/test/test_format.py::TestFormatHandlerExcelSkipInitial::test_parse_data - zi...
    ======================================= 1 failed, 29 deselected in 1.22s =======================================

Workaround:

Adding the above mentioned workaround results in success, but not tested this in anger.

cd ~/github_forks/tap-spreadsheets-anywhere/ && source .venv/bin/activate && pytest -vvvv -p no:warnings -k TestFormatHandlerExcelSkipInitial
============================= test session starts ==============================
platform darwin -- Python 3.11.3, pytest-7.3.1, pluggy-1.0.0 -- /Users/craigastill/github_forks/tap-spreadsheets-anywhere/.venv/bin/python3.11
cachedir: .pytest_cache
rootdir: /Users/craigastill/github_forks/tap-spreadsheets-anywhere
collected 30 items / 29 deselected / 1 selected                                

tap_spreadsheets_anywhere/test/test_format.py::TestFormatHandlerExcelSkipInitial::test_parse_data PASSED [100%]

====================== 1 passed, 29 deselected in 23.37s =======================
craigastill added a commit to craigastill/tap-spreadsheets-anywhere that referenced this issue May 16, 2023
Use the name of the `.xlsx`, when opening the workbook, since the following
error is thrown when attempting to read the `io.TextIOWrapper` instance:

`zipfile.BadZipFile: File is not a zip file when loading an .xlsx file`

Issue: ets#54.
craigastill added a commit to craigastill/tap-spreadsheets-anywhere that referenced this issue May 16, 2023
Use the name of the `.xlsx`, when opening the workbook, since the following
error is thrown when attempting to read the `io.TextIOWrapper` instance:

`zipfile.BadZipFile: File is not a zip file when loading an .xlsx file`

Issue: ets#54.
@craigastill
Copy link
Contributor Author

My quick .name hack works for local files, but is throwing: FileNotFoundError: [Errno 2] No such file or directory: <sub-folder>/<file>.xlsx, when running meltano invoke tap-<custom_tap> --dev when attempting to extract the same file from an S3 bucket.

Edit: redacted output:

  INFO Using supplied catalog /path/to/meltano_project/.meltano/run/tap-custom-tap/tap.properties.json.
  INFO Processing 1 selected streams from Catalog
  INFO Syncing stream:account_transactions
  {"type": "SCHEMA", "stream": "account_transactions", "schema": {"properties": {"date": {"type": ["null", "string"]}, "contact": {"type": ["null", "string"]}, "description": {"type": ["null", "string"]}, "invoice_number": {"type": ["null", "string"]}, "reference": {"type": ["null", "string"]}, "debit_gbp": {"type": ["null", "string"]}, "credit_gbp": {"type": ["null", "string"]}, "gross_gbp": {"type": ["null", "string"]}, "net_gbp": {"type": ["null", "string"]}, "vat_gbp": {"type": ["null", "string"]}, "account_code": {"type": ["null", "integer"]}, "account": {"type": ["null", "string"]}, "account_type": {"type": ["null", "string"]}, "revenue_type": {"type": ["null", "string"]}, "source": {"type": ["null", "string"]}, "contact_group": {"type": ["null", "string"]}, "debit": {"type": ["null", "string"]}, "credit": {"type": ["null", "string"]}, "gross": {"type": ["null", "string"]}, "net": {"type": ["null", "string"]}, "vat": {"type": ["null", "string"]}, "vat_rate": {"type": ["null", "integer"]}, "vat_rate_name": {"type": ["null", "string"]}, "region": {"type": ["null", "string"]}, "related_account": {"type": ["null", "string"]}, "_smart_source_bucket": {"type": "string"}, "_smart_source_file": {"type": "string"}, "_smart_source_lineno": {"type": "integer"}}, "selected": true, "type": "object"}, "key_properties": []}
  INFO Loading cached SSO token for default
  INFO Found 2 files.
  INFO Checking 2 resolved objects for any that match regular expression "account_transactions/.*xlsx$" and were modified since 1970-01-01 00:00:00+00:00
  INFO Processing 1 resolved objects that met our criteria. Enable debug verbosity logging for more details.
  INFO Syncing file "account_transactions/issue_52_bad_sample_file.xlsx".
  CRITICAL [Errno 2] No such file or directory: 'account_transactions/issue_52_bad_sample_file.xlsx'
  Traceback (most recent call last):
    File "/path/to/meltano_project/.meltano/extractors/tap-custom-tap/venv/bin/tap-spreadsheets-anywhere", line 8, in <module>
      sys.exit(main())
               ^^^^^^
    File "/path/to/meltano_project/.meltano/extractors/tap-custom-tap/venv/lib/python3.11/site-packages/singer/utils.py", line 235, in wrapped
      return fnc(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^
    File "/path/to/meltano_project/.meltano/extractors/tap-custom-tap/venv/lib/python3.11/site-packages/tap_spreadsheets_anywhere/__init__.py", line 162, in main
      sync(tables_config, args.state, catalog)
    File "/path/to/meltano_project/.meltano/extractors/tap-custom-tap/venv/lib/python3.11/site-packages/tap_spreadsheets_anywhere/__init__.py", line 117, in sync
      records_streamed += file_utils.write_file(t_file['key'], table_spec, merged_schema, max_records=max_records_per_run-records_streamed)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/path/to/meltano_project/.meltano/extractors/tap-custom-tap/venv/lib/python3.11/site-packages/tap_spreadsheets_anywhere/file_utils.py", line 46, in write_file
      iterator = tap_spreadsheets_anywhere.format_handler.get_row_iterator(table_spec, target_uri)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/path/to/meltano_project/.meltano/extractors/tap-custom-tap/venv/lib/python3.11/site-packages/tap_spreadsheets_anywhere/format_handler.py", line 164, in get_row_iterator
      iterator = tap_spreadsheets_anywhere.excel_handler.get_row_iterator(table_spec, reader)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/path/to/meltano_project/.meltano/extractors/tap-custom-tap/venv/lib/python3.11/site-packages/tap_spreadsheets_anywhere/excel_handler.py", line 72, in get_row_iterator
      workbook = openpyxl.load_workbook(file_handle.name, read_only=True)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/path/to/meltano_project/.meltano/extractors/tap-custom-tap/venv/lib/python3.11/site-packages/openpyxl/reader/excel.py", line 344, in load_workbook
      reader = ExcelReader(filename, read_only, keep_vba,
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/path/to/meltano_project/.meltano/extractors/tap-custom-tap/venv/lib/python3.11/site-packages/openpyxl/reader/excel.py", line 123, in __init__
      self.archive = _validate_archive(fn)
                     ^^^^^^^^^^^^^^^^^^^^^
    File "/path/to/meltano_project/.meltano/extractors/tap-custom-tap/venv/lib/python3.11/site-packages/openpyxl/reader/excel.py", line 95, in _validate_archive
      archive = ZipFile(filename, 'r')
                ^^^^^^^^^^^^^^^^^^^^^^
    File "/usr/local/Cellar/python@3.11/3.11.3/Frameworks/Python.framework/Versions/3.11/lib/python3.11/zipfile.py", line 1283, in __init__
      self.fp = io.open(file, filemode)
                ^^^^^^^^^^^^^^^^^^^^^^^
  FileNotFoundError: [Errno 2] No such file or directory: 'account_transactions/issue_52_bad_sample_file.xlsx'

NOTE: Without this change, I get: zipfile.BadZipFile: File is not a zip file, in both local/S3 scenarios.

@craigastill
Copy link
Contributor Author

Diving in with pdb. With the code in main and using my sample file in S3, this is hitting the exception in the builtin: zipfile (https://github.com/python/cpython/blob/3.11/Lib/zipfile.py#L247). Which is throwing:

(Pdb) fpin.seek(-sizeEndCentDir, 2)
fpin.seek(-sizeEndCentDir, 2)
*** io.UnsupportedOperation: can't do nonzero end-relative seeks

Which is then re-raised on: https://github.com/python/cpython/blob/3.11/Lib/zipfile.py#L1367 as: BadZipFile("File is not a zip file"). zipfile Code:

    def _RealGetContents(self):
        """Read in the table of contents for the ZIP file."""
        fp = self.fp
        try:
            endrec = _EndRecData(fp)
        except OSError:
            raise BadZipFile("File is not a zip file")
        if not endrec:
            raise BadZipFile("File is not a zip file")

Was worried that my S3 sourced file was not completely streamed, but did the same check with the local .xlsx file as done by the built-in zipfile._EndRecData(fpin) (https://github.com/python/cpython/blob/3.11/Lib/zipfile.py#L285-L293). ie.

>>> f = open("/path/to/file.xlsx", errors="surrogateescape")  # `errors` is used to match current code + avoid: `*** UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd7 in position 10: invalid continuation byte` when read.
>>> f.seek(0, 2)
>>> f.tell()
<int>  # Confirmed same for local vs S3 file pulled file during pdb session.

@craigastill
Copy link
Contributor Author

I don't know from the lack of details, but could this potentially be a duplicate of #26 ? With the problem being around: zipfile.ZipFile(<_io.TextWrapper>), returned by smart_open.open(<s3_uri>) ??

@craigastill
Copy link
Contributor Author

I've got a dirty workaround that reads the file contents from the smart_open.open(), to a temporary local file before passing into: openpyxl.load_workbook(). I'm not happy with having a second copy of the file in memory, but it might be a starting point for a cleaner fix:

from pathlib import Path
from tempfile import TemporaryDirectory

import smart_open
from openpyxl import load_workbook

S3_URI = "s3://<s3_bucket>/<prefix>/<file>.xlsx"


# Pull down `.xlsx` from S3 and work around `BadZipFile` error thrown by
# the internals of `openpyxl.load_workbook()` when passed the
# `_io.TextIOWrapper` that `smart_open.open()` returns, by dumping to a
# temporary file.
with TemporaryDirectory() as tmp_dir:
    _tmp_dir_path = Path(tmp_dir)
    _tmp_file = _tmp_dir_path / "tmp.xlsx"
    with open(_tmp_file, "wb") as tf:
        with smart_open.open(S3_URI, "rb") as f:
            for line in f:
                tf.write(line)
    # Following no longer throws: `BadZipFile`.
    wb = load_workbook(ttf)

@aminebeh
Copy link

Hi @craigastill, stuble upon the same issue when using the tap for local files.

I wanted to extract a local .xlsx file, but I got the File is not a zip file error.

After digging in the code, I found that the smart_open lib overwrite the binary mode when reading a file when an encoding is specified:

A fix can be to update the code here with:

elif format == 'excel':
    if uri.lower().endswith(".xls"):
        reader = get_streamreader(uri, universal_newlines=universal_newlines,newline=None, open_mode='rb')
        iterator = tap_spreadsheets_anywhere.excel_handler.get_legacy_row_iterator(table_spec, reader)
    else:
        reader = get_streamreader(uri, universal_newlines=universal_newlines,newline=None, open_mode='rb', encoding=None) # Adding encoding `None` to ensure smart_open will use binary mode
        iterator = tap_spreadsheets_anywhere.excel_handler.get_row_iterator(table_spec, reader)

We can add a check to ensure file extension is xlsx as this format require to open the file in binary mode.

@craigastill Can you test this fix with your S3 use case ?

I'll write a PR as soon as you've tested it 😉

@craigastill
Copy link
Contributor Author

Hi Amine. When I roll back onto the data project that is using tap-spreadsheets-anywhere, I'll try and give your fix a go.

Interesting find on the smart_open side!

@radbrt
Copy link
Contributor

radbrt commented Sep 21, 2023

@aminebeh I stumbled across this fix (after I made a much less elegant fix myself), it works excellently for me. My own branch is https://github.com/radbrt/tap-spreadsheets-anywhere/tree/excelbinary, I don't see any reason not to make a PR with the fix. Any thoughts @menzenski ?

mjsqu added a commit to mjsqu/tap-spreadsheets-anywhere that referenced this issue Oct 3, 2023
mjsqu added a commit to mjsqu/tap-spreadsheets-anywhere that referenced this issue Oct 3, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants