BUG: to_json fails writing to GCS with compression #39985

Closed
2 of 3 tasks
dariobig opened this issue Feb 23, 2021 · 9 comments · Fixed by #40010
Labels
Bug; IO JSON (read_json, to_json, json_normalize); Regression (Functionality that used to work in a prior pandas version)
Milestone

Comments

@dariobig

dariobig commented Feb 23, 2021

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

import pandas as pd
df = pd.DataFrame({'numbers': list(range(1, 10))})
df.to_json('gcs://test-bucket/test.json.gz')

Problem description

Writing a compressed stream to GCS raises an error. Without compression, the same write works fine.

import pandas as pd...
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/workspaces/charybdis/test_pg.py in <module>
      254 
      255 df = pd.DataFrame({'numbers': list(range(1, 10))})
----> 256 df.to_json('gcs://river-categorizer-data-us-central1/test.jsonl.gz')

/workspaces/charybdis/.venv/lib/python3.8/site-packages/pandas/core/generic.py in to_json(self, path_or_buf, orient, date_format, double_precision, force_ascii, date_unit, default_handler, lines, compression, index, indent, storage_options)
   2463         indent = indent or 0
   2464 
-> 2465         return json.to_json(
   2466             path_or_buf=path_or_buf,
   2467             obj=self,

/workspaces/charybdis/.venv/lib/python3.8/site-packages/pandas/io/json/_json.py in to_json(path_or_buf, obj, orient, date_format, double_precision, force_ascii, date_unit, default_handler, lines, compression, index, indent, storage_options)
    100     if path_or_buf is not None:
    101         # apply compression and byte/text conversion
--> 102         with get_handle(
    103             path_or_buf, "wt", compression=compression, storage_options=storage_options
    104         ) as handles:

/workspaces/charybdis/.venv/lib/python3.8/site-packages/pandas/io/common.py in get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options)
    590                 )
    591             else:
--> 592                 handle = gzip.GzipFile(
    593                     fileobj=handle,  # type: ignore[arg-type]
    594                     mode=ioargs.mode,

/usr/local/lib/python3.8/gzip.py in __init__(self, filename, mode, compresslevel, fileobj, mtime)
    202 
    203         if self.mode == WRITE:
--> 204             self._write_gzip_header(compresslevel)
    205 
    206     @property

/usr/local/lib/python3.8/gzip.py in _write_gzip_header(self, compresslevel)
    230 
    231     def _write_gzip_header(self, compresslevel):
--> 232         self.fileobj.write(b'\037\213')             # magic header
    233         self.fileobj.write(b'\010')                 # compression method

Expected Output

No error

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit : 7d32926
python : 3.8.7.final.0
python-bits : 64
OS : Linux
OS-release : 4.19.121-linuxkit
Version : #1 SMP Tue Dec 1 17:50:32 UTC 2020
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.2.2
numpy : 1.20.1
pytz : 2021.1
dateutil : 2.8.1
pip : 20.2.2
setuptools : 49.6.0
Cython : None
pytest : 6.2.2
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.6.2
html5lib : None
pymysql : None
psycopg2 : 2.8.6 (dt dec pq3 ext lo64)
jinja2 : 2.11.3
IPython : 7.20.0
pandas_datareader: None
bs4 : 4.9.3
bottleneck : None
fsspec : 0.8.5
fastparquet : None
gcsfs : 0.7.2
matplotlib : 3.3.4
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 3.0.0
pyxlsb : None
s3fs : None
scipy : 1.6.0
sqlalchemy : 1.3.23
tables : None
tabulate : 0.8.7
xarray : None
xlrd : None
xlwt : None
numba : None

@dariobig added the Bug and Needs Triage (Issue that has not been reviewed by a pandas team member) labels Feb 23, 2021
@twoertwein
Member

@dariobig thank you for the report! I assume the error is something like "str is expected but bytes is provided", is that correct? Does it work with pandas <1.2?

@dariobig
Author

> @dariobig thank you for the report! I assume the error is something like "str is expected but bytes is provided", is that correct? Does it work with pandas <1.2?

Not sure; this is new code. I'll give it a try.
I saw that to_json sets the file mode to "wt" and didn't see any logic to change it to "wb" when compression is used, but I don't understand the codebase well enough to be sure.

@dariobig
Author

dariobig commented Feb 23, 2021

@twoertwein I've retried with pandas 1.1.5 and instead of failing I get an empty file whether I use compression or not:

image

These are the package versions used:

  • Installing pandas (1.1.5)
  • Installing fsspec (0.8.6)
  • Installing gcsfs (0.7.2)

@twoertwein
Member

Thanks for testing! I think there are two possible ways to address this: 1) look into why we need "wt" for JSON, since maybe that can be changed, or 2) implement #39383 and use it to wrap files opened in text mode so that compression can be applied to them (gzip and the other compression handlers require file handles in binary mode).
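The binary-mode requirement mentioned above can be illustrated with the standard library alone. This is a minimal sketch, not the pandas implementation: the buffer names and the JSON string are made up for the example. It shows that a text buffer rejects the bytes gzip writes as its header (the same call that appears at the bottom of the traceback), and that layering a text wrapper on top of gzip on top of a binary buffer works.

```python
import gzip
import io

# Stand-in for the text that to_json would produce (illustrative only).
json_text = '{"numbers": {"0": 1, "1": 2}}'

# A text-mode buffer rejects bytes such as the gzip magic header.
try:
    io.StringIO().write(b"\037\213")
    text_buffer_accepts_bytes = True
except TypeError:
    text_buffer_accepts_bytes = False

# Working layering: binary buffer -> gzip -> text wrapper.
raw = io.BytesIO()
gz = gzip.GzipFile(fileobj=raw, mode="wb")
with io.TextIOWrapper(gz, encoding="utf-8") as text_handle:
    text_handle.write(json_text)  # closing the wrapper also closes gz

# The binary buffer stays open, so the compressed bytes can be inspected.
round_tripped = gzip.decompress(raw.getvalue()).decode("utf-8")
```

The text layer has to sit above the compression layer; putting gzip on top of a handle opened in text mode is what fails.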

@twoertwein added the IO JSON (read_json, to_json, json_normalize) label and removed the Needs Triage (Issue that has not been reviewed by a pandas team member) label Feb 23, 2021
@lithomas1 added the IO Network (Local or Cloud (AWS, GCS, etc.) IO Issues) label Feb 23, 2021
@roeps

roeps commented Feb 23, 2021

For what it's worth, I can confirm dariobig's findings. This works with pandas 1.2.1. The regression appears to have been introduced by #39440.

My findings (much of this repeats dariobig's):

Reproduction:

from io import BytesIO
import pandas as pd

dataframe = pd.DataFrame([1, 2, 3], columns=['a'])
object_stream = BytesIO()
dataframe.to_json(object_stream, compression='gzip', orient='records', lines=True)

Result:

        if path_or_buf is not None:
            # apply compression and byte/text conversion
            with get_handle(
                path_or_buf, "wt", compression=compression, storage_options=storage_options
            ) as handles:
>               handles.handle.write(s)
E               TypeError: a bytes-like object is required, not 'str'

../../.env/lib/python3.7/site-packages/pandas/io/json/_json.py:105: TypeError

From what I can tell, to_json in _json.py calls get_handle with a hard-coded 'wt' mode. With the latest changes to _is_binary_mode in common.py, that mode now always makes _is_binary_mode return False instead of checking the type of path_or_buf, so the b flag is never added to the mode passed to the rest of get_handle.
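The behaviour described above can be sketched as follows. This is a simplified illustration of the idea, not the actual pandas source: the function name `is_binary_mode` and the type check are assumptions made for the example. An explicit "t" or "b" in the mode decides the outcome, and only a mode without either falls back to inspecting the handle.

```python
import io


def is_binary_mode(handle, mode):
    """Hypothetical, simplified stand-in for pandas' internal check."""
    # An explicit text or binary flag in the mode wins.
    if "t" in mode or "b" in mode:
        return "b" in mode
    # Otherwise infer from the handle type (assumption for this sketch).
    return isinstance(handle, (io.BufferedIOBase, io.RawIOBase))


binary_buffer = io.BytesIO()
forced_text = is_binary_mode(binary_buffer, "wt")     # "t" forces text mode
auto_detected = is_binary_mode(binary_buffer, "w")    # inferred from BytesIO
text_buffer = is_binary_mode(io.StringIO(), "w")      # StringIO is text
```

Under this sketch, a hard-coded "wt" treats a BytesIO target as text, while "w" leaves room for auto-detection.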

@lithomas1 removed the IO Network (Local or Cloud (AWS, GCS, etc.) IO Issues) label Feb 23, 2021
@twoertwein
Member

Thank you for your investigation! I think we do not need mode="wt" for JSON and can use mode="w" instead (I'm currently running pytest). That will still allow users to override pandas's binary/text auto-detection by specifying a mode that contains a "t" or a "b" (at least for functions that expose mode).

@twoertwein
Member

@dariobig and @roeps I created a PR that should fix this: #40010. Do you mind changing "wt" to "w" in your pandas installation to confirm whether that fixes it?

I think read_json doesn't support reading from user-provided binary files, does it (or did it in <1.2.2)?

@twoertwein added the Regression (Functionality that used to work in a prior pandas version) label Feb 24, 2021
@jreback added this to the 1.2.3 milestone Feb 24, 2021
@roeps

roeps commented Feb 25, 2021

> @dariobig and @roeps I created a PR that should fix this: #40010. Do you mind changing "wt" to "w" in your pandas installation to confirm whether that fixes it?
>
> I think read_json doesn't support reading from user-provided binary files, does it (or did it in <1.2.2)?

Yes, the tests in my code pass once again when I manually change the mode in pandas 1.2.2 to just 'w'.

@dariobig
Author

dariobig commented Feb 26, 2021

> @dariobig and @roeps I created a PR that should fix this: #40010. Do you mind changing "wt" to "w" in your pandas installation to confirm whether that fixes it?
>
> I think read_json doesn't support reading from user-provided binary files, does it (or did it in <1.2.2)?

Works for me too! 🍾
I'm reading back the compressed file, so I'd say it should work now:

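import pandas as pd

gcs_path = 'gcs://test-bucket/test.json.gz'  # placeholder path, as in the original code sample
df = pd.DataFrame({'numbers': list(range(1, 10))})
df.to_json(gcs_path, compression='gzip')  # write gzip-compressed JSON to GCS
pd.read_json(gcs_path, compression='gzip')  # read it back to confirm the round trip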

I don't know about earlier versions. As I said, 1.1.5 doesn't work at all for me (empty file).
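For anyone stuck on a pandas version without the fix from #40010, one possible workaround for the BytesIO reproduction earlier in the thread is to serialize to a string first and compress it by hand. This is only a sketch: `write_gzipped_json` is a hypothetical helper, and the literal string stands in for whatever to_json returns when called without a path.

```python
import gzip
import io


def write_gzipped_json(json_text, binary_handle):
    """Hypothetical helper: gzip-compress JSON text and write it to a binary handle."""
    payload = gzip.compress(json_text.encode("utf-8"))
    binary_handle.write(payload)
    return len(payload)


# Stand-in for the string returned by to_json with no path (illustrative only).
json_text = '{"a":1}\n{"a":2}\n{"a":3}\n'

object_stream = io.BytesIO()
written = write_gzipped_json(json_text, object_stream)

# Decompress to confirm the stream holds the original text.
restored = gzip.decompress(object_stream.getvalue()).decode("utf-8")
```

The same helper should work with any handle opened in binary mode, since it only calls write with bytes.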
