BUG: to_json fails writing to GCS with compression #39985

dariobig · 2021-02-23T04:07:53Z

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.

Code Sample, a copy-pastable example

import pandas as pd
df = pd.DataFrame({'numbers': list(range(1, 10))})
df.to_json('gcs://test-bucket/test.json.gz')

Problem description

Error writing compressed stream using gcs. Removing compression works fine.

import pandas as pd...
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/workspaces/charybdis/test_pg.py in 
      254 
      255 df = pd.DataFrame({'numbers': list(range(1, 10))})
----> 256 df.to_json('gcs://river-categorizer-data-us-central1/test.jsonl.gz')

/workspaces/charybdis/.venv/lib/python3.8/site-packages/pandas/core/generic.py in to_json(self, path_or_buf, orient, date_format, double_precision, force_ascii, date_unit, default_handler, lines, compression, index, indent, storage_options)
   2463         indent = indent or 0
   2464 
-> 2465         return json.to_json(
   2466             path_or_buf=path_or_buf,
   2467             obj=self,

/workspaces/charybdis/.venv/lib/python3.8/site-packages/pandas/io/json/_json.py in to_json(path_or_buf, obj, orient, date_format, double_precision, force_ascii, date_unit, default_handler, lines, compression, index, indent, storage_options)
    100     if path_or_buf is not None:
    101         # apply compression and byte/text conversion
--> 102         with get_handle(
    103             path_or_buf, "wt", compression=compression, storage_options=storage_options
    104         ) as handles:

/workspaces/charybdis/.venv/lib/python3.8/site-packages/pandas/io/common.py in get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options)
    590                 )
    591             else:
--> 592                 handle = gzip.GzipFile(
    593                     fileobj=handle,  # type: ignore[arg-type]
    594                     mode=ioargs.mode,

/usr/local/lib/python3.8/gzip.py in __init__(self, filename, mode, compresslevel, fileobj, mtime)
    202 
    203         if self.mode == WRITE:
--> 204             self._write_gzip_header(compresslevel)
    205 
    206     @property

/usr/local/lib/python3.8/gzip.py in _write_gzip_header(self, compresslevel)
    230 
    231     def _write_gzip_header(self, compresslevel):
--> 232         self.fileobj.write(b'\037\213')             # magic header
    233         self.fileobj.write(b'\010')                 # compression method

Expected Output

No error

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit : 7d32926
python : 3.8.7.final.0
python-bits : 64
OS : Linux
OS-release : 4.19.121-linuxkit
Version : #1 SMP Tue Dec 1 17:50:32 UTC 2020
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.2.2
numpy : 1.20.1
pytz : 2021.1
dateutil : 2.8.1
pip : 20.2.2
setuptools : 49.6.0
Cython : None
pytest : 6.2.2
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.6.2
html5lib : None
pymysql : None
psycopg2 : 2.8.6 (dt dec pq3 ext lo64)
jinja2 : 2.11.3
IPython : 7.20.0
pandas_datareader: None
bs4 : 4.9.3
bottleneck : None
fsspec : 0.8.5
fastparquet : None
gcsfs : 0.7.2
matplotlib : 3.3.4
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 3.0.0
pyxlsb : None
s3fs : None
scipy : 1.6.0
sqlalchemy : 1.3.23
tables : None
tabulate : 0.8.7
xarray : None
xlrd : None
xlwt : None
numba : None

twoertwein · 2021-02-23T16:17:06Z

@dariobig thank you for the report! I assume the error is something like "str is expected but bytes is provided", is that correct? Does it work with pandas <1.2?

dariobig · 2021-02-23T16:28:24Z

> @dariobig thank you for the report! I assume the error is something like "str is expected but bytes is provided", is that correct? Does it work with pandas <1.2?

Not sure; this is new code. I'll give it a try.
I saw that to_json sets file mode to "wt" and didn't see any logic to change it to "wb" in case of compression, but I don't understand the codebase well enough.

dariobig · 2021-02-23T17:30:52Z

@twoertwein I've retried with pandas 1.1.5 and instead of failing I get an empty file whether I use compression or not:

These are the package versions used:

  • Installing pandas (1.1.5)
  • Installing fsspec (0.8.6)
  • Installing gcsfs (0.7.2)

twoertwein · 2021-02-23T18:09:25Z

Thanks for testing! I think there are two possible ways to address this: 1) look into why we need "wt" for JSON, since maybe that can be changed, or 2) implement #39383 and then use it to wrap files opened in text mode so that compression can be applied to them (gzip and the other compressors require file handles in binary mode).
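
To illustrate the binary-mode constraint mentioned in option 2, here is a minimal standard-library sketch. It is not pandas code and not the approach proposed in #39383. It shows that a text-mode handle rejects the gzip magic bytes seen in the traceback above, and that the usual layering is a binary handle, then gzip, then a text wrapper on top. The helper name write_gzipped_text is hypothetical.

```python
import gzip
import io


def write_gzipped_text(raw, text):
    """Compress `text` into `raw`, a handle opened in binary mode."""
    gz = gzip.GzipFile(fileobj=raw, mode="wb")
    # The text wrapper encodes str to bytes before gzip sees it.
    # Closing it also closes `gz`, which flushes the gzip trailer into
    # `raw`; `raw` itself stays open for the caller.
    with io.TextIOWrapper(gz, encoding="utf-8") as wrapper:
        wrapper.write(text)


# A text-mode handle cannot accept the first bytes gzip writes.
text_handle = io.StringIO()
try:
    text_handle.write(b"\037\213")  # gzip magic header
    text_handle_accepts_bytes = True
except TypeError:
    text_handle_accepts_bytes = False

# A binary handle works once the layers are stacked in this order.
binary_handle = io.BytesIO()
write_gzipped_text(binary_handle, '{"numbers": [1, 2, 3]}')
round_trip = gzip.decompress(binary_handle.getvalue()).decode("utf-8")
```

This is the same layering that gzip.open uses for text modes.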

roeps · 2021-02-23T22:09:35Z

For what it's worth I can confirm dariobig's findings. This works with Pandas 1.2.1. The regression appears to be with change #39440.

My findings (much of this repeats what dariobig reported):

Reproduction:

from io import BytesIO
import pandas as pd

dataframe = pd.DataFrame([1, 2, 3], columns=['a'])
object_stream = BytesIO()
dataframe.to_json(object_stream, compression='gzip', orient='records', lines=True)

Result:

        if path_or_buf is not None:
            # apply compression and byte/text conversion
            with get_handle(
                path_or_buf, "wt", compression=compression, storage_options=storage_options
            ) as handles:
>               handles.handle.write(s)
E               TypeError: a bytes-like object is required, not 'str'

../../.env/lib/python3.7/site-packages/pandas/io/json/_json.py:105: TypeError

From what I can tell, to_json in _json calls get_handle with a static 'wt' mode. With the latest changes to common's _is_binary_mode, that mode now always makes _is_binary_mode return False rather than checking the path_or_buf instance type, so the b flag is omitted from the mode passed to the rest of get_handle.
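
The behavior described above can be sketched with a simplified, hypothetical detection function. It illustrates the described logic only and is not pandas's actual _is_binary_mode; the name is_binary_mode and its two-step structure are assumptions made for this example.

```python
import io


def is_binary_mode(handle, mode):
    """Hypothetical, simplified stand-in for the detection described above."""
    # An explicit "t" or "b" in the requested mode decides the outcome.
    if "t" in mode or "b" in mode:
        return "b" in mode
    # Otherwise, inspect the handle itself.
    return isinstance(handle, (io.BufferedIOBase, io.RawIOBase))


buffer = io.BytesIO()

# A hard-coded "wt" makes the sketch report text mode even for BytesIO.
forced_text = is_binary_mode(buffer, "wt")

# If the payload is then written as str, a binary buffer rejects it.
try:
    buffer.write('{"a": 1}')
    str_write_failed = False
except TypeError:
    str_write_failed = True

# Without the "t", the handle type is consulted instead.
detected_binary = is_binary_mode(buffer, "w")
```

In this sketch, a mode of "w" leaves the decision to the handle type, while "wt" or "wb" overrides it.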

twoertwein · 2021-02-23T23:46:04Z

Thank you for your investigation! I think we do not need mode="wt" for JSON and can use mode="w" instead (currently running pytest). That will still allow users to override pandas's binary/text auto-detection by specifying a mode that contains a "t" or a "b" (at least for functions that expose mode).

twoertwein · 2021-02-24T00:16:24Z

@dariobig and @roeps I created a PR that should fix this: #40010. Do you mind changing "wt" to "w" in your pandas installation to confirm whether that fixes it?

I think read_json doesn't support reading from user-provided binary files, does it (or did it in <1.2.2)?

roeps · 2021-02-25T23:24:16Z

> @dariobig and @roeps I created a PR that should fix this: #40010. Do you mind changing "wt" to "w" in your pandas installation to confirm whether that fixes it?
>
> I think read_json doesn't support reading from user-provided binary files, does it (or did it in <1.2.2)?

Yes, the tests in my code pass once again when I manually change 'wt' to just 'w' in pandas 1.2.2.

dariobig · 2021-02-26T00:08:31Z

> @dariobig and @roeps I created a PR that should fix this: #40010. Do you mind changing "wt" to "w" in your pandas installation to confirm whether that fixes it?
>
> I think read_json doesn't support reading from user-provided binary files, does it (or did it in <1.2.2)?

Works for me too! 🍾
I'm reading back the compressed file, so I'd say it should work now:

df = pd.DataFrame({'numbers': list(range(1, 10))})
df.to_json(gcs_path, compression='gzip')
pd.read_json(gcs_path, compression='gzip')

I don't know about before. As I said 1.1.5 doesn't work at all for me (empty file).

dariobig added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Feb 23, 2021

twoertwein added IO JSON read_json, to_json, json_normalize and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Feb 23, 2021

lithomas1 added the IO Network Local or Cloud (AWS, GCS, etc.) IO Issues label Feb 23, 2021

lithomas1 removed the IO Network Local or Cloud (AWS, GCS, etc.) IO Issues label Feb 23, 2021

twoertwein mentioned this issue Feb 24, 2021

REGR: compressed to_json with URL-like paths and binary objects #40010

Merged

4 tasks

twoertwein added the Regression Functionality that used to work in a prior pandas version label Feb 24, 2021

jreback added this to the 1.2.3 milestone Feb 24, 2021

jreback closed this as completed in #40010 Feb 24, 2021
