BUG/ENH: compression for google cloud storage in to_csv #35681

twoertwein · 2020-08-12T01:38:04Z

closes #35677 (to_csv to Google Cloud Storage ignores compression), closes #26124 (to_csv to Google Cloud Storage ignores encoding), and closes #32392 (read_csv from Google Cloud Storage ignores encoding)
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

By inferring the compression before converting the path to a file object, df.to_csv("gs://mybucket/test2.csv.gz", compression="infer", mode="wb") works. By wrapping fsspec file objects in a TextIOWrapper, df.to_csv("gs://mybucket/test2.csv", mode="wb") works as well. Path-like objects that are internally converted to file-like objects (in get_filepath_or_buffer) are now always opened in binary mode, unless text mode is explicitly requested, and the potentially changed mode is returned, so there is no need to specify mode="wb" for Google Cloud files. As long as the Google Cloud file is opened in binary mode (which is now always the case), the requested encoding is also honored.
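The two mechanisms described above can be illustrated without pandas or GCS. The following is a minimal sketch using only the standard library; infer_compression_from_path and write_text are hypothetical helper names (not pandas functions), the extension table is an assumption, and a BytesIO stands in for the binary file object that fsspec would return.

```python
import gzip
import io

# Assumed extension table for illustration; not taken from pandas.
_EXTENSIONS = {".gz": "gzip", ".bz2": "bz2", ".zip": "zip", ".xz": "xz"}


def infer_compression_from_path(path):
    """Infer the compression from the path, before any file object exists."""
    for extension, name in _EXTENSIONS.items():
        if path.lower().endswith(extension):
            return name
    return None


def write_text(binary_buffer, text, encoding="utf-8", compression=None):
    """Write text to a binary file object, honoring encoding and gzip."""
    target = binary_buffer
    if compression == "gzip":
        target = gzip.GzipFile(fileobj=binary_buffer, mode="wb")
    # The text layer is added on top of the binary handle, so the
    # requested encoding is applied regardless of how the file was opened.
    wrapper = io.TextIOWrapper(target, encoding=encoding, newline="")
    wrapper.write(text)
    wrapper.flush()
    wrapper.detach()  # leave the underlying binary buffer open
    if target is not binary_buffer:
        target.close()  # writes the gzip trailer; binary_buffer stays open


buffer = io.BytesIO()  # stand-in for a binary fsspec file object
compression = infer_compression_from_path("gs://mybucket/test2.csv.gz")
write_text(buffer, "a,b\n1,2\n", encoding="utf-8", compression=compression)
```

Because the compression is decided from the path string, it does not matter that the file object handed over later carries no usable file name.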

This PR also fixes Zip compression for file objects not having a name.
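To illustrate the name-less case, here is a minimal standard-library sketch. write_zip_member is a hypothetical helper, and the fallback member name "zip" is an assumption made for this example; it is not claimed to be what _BytesZipFile does.

```python
import io
import zipfile


def write_zip_member(file, data, archive_name=None):
    """Write data as the single member of a zip archive.

    Falls back to a fixed member name when neither archive_name nor a
    file name is available, as for an in-memory buffer.
    """
    with zipfile.ZipFile(file, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        # ZipFile.filename is None when the file object has no name attribute.
        name = archive_name or zf.filename or "zip"
        zf.writestr(name, data)


buffer = io.BytesIO()  # has no name attribute
write_zip_member(buffer, b"a,b\n1,2\n")

# Read the archive back to confirm that a member was written.
with zipfile.ZipFile(buffer) as zf:
    member_names = zf.namelist()
    content = zf.read(member_names[0])
```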

pandas/tests/io/test_gcs.py

pandas/io/common.py

pandas/_libs/parsers.pyx

pandas/io/common.py

twoertwein · 2020-08-14T02:22:33Z

It seems that only the Windows py37 machine on Azure actually tests the Google Cloud interface?!

EDIT: and one Linux machine on Travis

pandas/io/common.py

pandas/core/frame.py

pandas/io/common.py

jreback · 2020-08-19T20:35:21Z

pandas/io/common.py

+    # use binary mode when converting path-like objects to file-like objects
+    # except when text mode is explicitly requested
+    binary_mode = mode or "rb"
+    if "t" not in binary_mode and "b" not in binary_mode:


does this change a simple mode='w' ?

or are you not actually using this (except to return)?

binary_mode is used to open fsspec files and is only returned in this case. If fsspec is used and the caller passed mode="w", mode="wb" is returned, because fsspec files are now always opened in binary mode unless the mode contains a "t" to explicitly request text mode (to_json does that). If fsspec isn't used to open a file, the initial mode is returned.

If we do not do that, users who want to write a compressed CSV file to Google Cloud need to specify mode="wb" manually. If we are fine with that, we wouldn't need to change the interface of get_filepath_or_buffer. But it would be a little unintuitive to require mode when a path/string is provided.

Probably should rename binary_mode to something like fsspec_mode (renamed).
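The mode handling discussed in this thread can be sketched as a small standalone function. resolve_fsspec_mode is a hypothetical name; the first two statements come from the diff hunk above, while the body of the if-branch is an assumption inferred from the description (mode="w" becomes "wb").

```python
def resolve_fsspec_mode(mode):
    """Return the mode used to open a path: binary unless text is explicit."""
    fsspec_mode = mode or "rb"
    if "t" not in fsspec_mode and "b" not in fsspec_mode:
        # Assumed branch body: neither text nor binary was requested,
        # so default to binary.
        fsspec_mode += "b"
    return fsspec_mode


modes = {mode: resolve_fsspec_mode(mode) for mode in (None, "w", "wb", "wt", "r")}
```

With this rule, a caller passing a plain "w" gets a binary handle, while "wt" is left untouched.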

pandas/io/common.py

pep8speaks · 2020-08-20T17:34:23Z

Hello @twoertwein! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-08-31 23:19:00 UTC

alimcmaster1 · 2020-08-22T16:22:00Z

pandas/tests/io/test_gcs.py

+    df.to_csv(buffer, compression=compression, encoding=encoding, mode="wb")
+
+    # emulate GCS
+    gcs_buffer = BytesIO()


We have the same MockGCSFileSystem in test_to_csv_gcs could we extract into a fixture?

I will try to do that! Do you know why we need registry.target.clear()  # noqa  # remove state? ~~Does it remove previously applied monkeypatches?~~

I'm honestly not familiar with the monkeypatch-magic, I just followed the existing pattern.

@alimcmaster1 done, but please double check the changes in pandas/tests/io/test_gcs.py

pandas/tests/io/test_gcs.py

jreback

Thanks for adding IOargs. I would use it everywhere and not use tuple unpacking at all in any call to get_filepath_or_buffer.

pandas/core/frame.py

pandas/core/generic.py

pandas/io/common.py

pandas/io/excel/_base.py

pandas/io/feather_format.py

pandas/io/formats/csvs.py

pandas/io/json/_json.py

twoertwein · 2020-08-25T00:42:17Z

thanks for your review @jreback! I was lazy and used tuple unpacking to keep the diff small. I will use the named tuple explicitly in the next commits.

The return types of get_filepath_or_buffer were previously not checked (return types were only mentioned in the docstring, and were partly inconsistent with the input arguments). I will double check the types and add asserts instead of ignore comments.

jreback · 2020-08-25T01:58:00Z

thanks @twoertwein yeah i really like the named tuple for its readability

(you could actually use a dataclass) but not sure if that adds anything

pandas/io/json/_json.py

pandas/io/orc.py

twoertwein · 2020-08-26T03:14:15Z

pandas/io/common.py

@@ -162,13 +165,13 @@ def is_fsspec_url(url: FilePathOrBuffer) -> bool:
    )


-def get_filepath_or_buffer(
+def get_filepath_or_buffer(  # type: ignore[assignment]


My local mypy needs that for lines 170 and 172, but the CI mypy apparently needs it at that line (TypeVars cannot have default values; could be fixed with @overload).

twoertwein · 2020-08-26T03:17:51Z

pandas/io/common.py

@@ -583,12 +638,15 @@ def __init__(
        self.archive_name = archive_name
        kwargs_zip: Dict[str, Any] = {"compression": zipfile.ZIP_DEFLATED}
        kwargs_zip.update(kwargs)
-        super().__init__(file, mode, **kwargs_zip)
+        super().__init__(file, mode, **kwargs_zip)  # type: ignore[arg-type]


mypy complains about file being IOBase, but we cannot add an assert not isinstance(file, IOBase) since io.StringIO inherits from IOBase
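A short check shows why such an assert would be too broad. This sketch uses only the standard library; is_iobase and is_text_buffer are hypothetical helper names for illustration.

```python
import io


def is_iobase(obj):
    """True for any io-module stream, text or binary."""
    return isinstance(obj, io.IOBase)


def is_text_buffer(obj):
    """True only for text streams such as io.StringIO."""
    return isinstance(obj, io.TextIOBase)


text_buffer = io.StringIO()
binary_buffer = io.BytesIO()

# Both in-memory buffers derive from IOBase, so rejecting IOBase
# instances would also reject ordinary text buffers.
checks = {
    "StringIO is IOBase": is_iobase(text_buffer),
    "BytesIO is IOBase": is_iobase(binary_buffer),
    "StringIO is text": is_text_buffer(text_buffer),
    "BytesIO is text": is_text_buffer(binary_buffer),
}
```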

pandas/io/json/_json.py

twoertwein · 2020-08-26T03:28:09Z

pandas/io/stata.py

            fname, mode="wb", compression=compression, storage_options=storage_options,
        )
-        f, _ = get_handle(path_or_buf, "wb", compression=compression, is_text=False)
-        return f, True, compression
+        f, _ = get_handle(


I assume the auxiliary file handles should be closed as well? I think that only happens for compressed files. But that would require yet another interface change.

twoertwein · 2020-08-26T03:39:52Z

Typing of filepath_or_buffer is a mess. I also added IOBase to FilePathOrBuffer (io.BytesIO and io.StringIO do not seem to be covered by isinstance(..., typing.IO)).

I ended up using a dataclass for IOargs, NamedTuple does not support TypeVars in it. I use TypeVars for encoding (stays None/str) and mode (stays None/str or it changes from None to str).

If IOargs was retrieved inside an if-branch, I 'unpack' it (using the tuple names) into the previous variables; otherwise I use the ioargs. prefix instead of the previous variable names.
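A generic dataclass along these lines might look like the following sketch. The class name IOargsSketch, the field names, and the TypeVar constraints are assumptions chosen for illustration; they are not copied from pandas/io/common.py.

```python
from dataclasses import dataclass
from typing import Any, Generic, Optional, TypeVar, Union

# The TypeVars tie the type of the returned field to the type passed in:
# an encoding of None stays None, a str stays str.
ModeVar = TypeVar("ModeVar", str, None)
EncodingVar = TypeVar("EncodingVar", str, None)


@dataclass
class IOargsSketch(Generic[ModeVar, EncodingVar]):
    """Hypothetical container for the values returned when resolving a path."""

    filepath_or_buffer: Any
    encoding: EncodingVar
    compression: Optional[dict]
    should_close: bool
    # The mode may change from None to a str, hence the Union.
    mode: Union[ModeVar, str]


ioargs = IOargsSketch(
    filepath_or_buffer="gs://mybucket/test2.csv.gz",
    encoding=None,
    compression={"method": "gzip"},
    should_close=True,
    mode="wb",
)

# Fields are read by name rather than by tuple position.
resolved_mode = ioargs.mode
```

Named access keeps call sites readable when the container grows another field, which positional tuple unpacking does not.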

jreback

looks really good. a couple of comments.

pandas/io/common.py

pandas/io/feather_format.py

pandas/io/json/_json.py

jreback · 2020-08-27T02:38:39Z

pandas/io/stata.py

+            compression=ioargs.compression,
+            is_text=False,
+        )
+        return f, True, ioargs.compression


might need a slight refactoring but can you use IOargs here?

it returns similar information but in this case f is always BinaryIO (according to the current type annotations). If we were to use IOargs, we would give it a broader type.

I think the following options would make sense for this function:

1. inline it (it is used only one time and it is a private function)

2. simplify it (the first if-block is imho not necessary, can be handled by the elif-block)

3. simplification from (2) and move the function to io/common and use it more than once: the pattern to convert the compression string/dict to a dict, calling get_filepath_or_buffer, and then calling get_handle is present in many to_* (and read_*) functions.

I'm tempted to go for the second option in this PR and leave the third option for a future PR.

I will not touch that for now: I don't understand the difference between self._output_file and self._file in StataWriter. Option (2) would affect these two variables. Having had only a brief look at StataWriter, it seems that this class distinguishes between writing to a compressed file and writing to a buffer (get_handle should take care of that, after that you should be able to treat them the same?).

yeah, this can all be addressed later. This writer might need some restructuring.

pandas/tests/io/test_common.py

jreback · 2020-08-27T02:40:20Z

pandas/tests/io/test_gcs.py

-def test_read_csv_gcs(monkeypatch):
+@pytest.fixture
+def gcs_buffer(monkeypatch):
+    """Emulate GCS using a binary buffer."""
    from fsspec import AbstractFileSystem, registry

    registry.target.clear()  # noqa  # remove state



looks fine, does this make it easier / cleaner?

@alimcmaster1 commented on some duplication across the GCS tests. I'm not 100% sure how the monkeypatch part works, but the CI seems to be happy about it. Double/triple checking that I didn't invalidate these tests would be good.

jreback · 2020-08-27T02:41:48Z

@WillAyd @TomAugspurger @simonjayhawkins if any comments; also using Generic protocols now so if you guys can review types (not to get bogged down of course.....)

get_handle: fsspec file objects need to be wrapped
get_filepath_or_buffer: path-like objects that are internally converted to file-like objects are opened in binary mode; named tuple
_BytesZipFile: work with filename-less objects

jreback

lgtm @twoertwein

thanks for the patch!

twoertwein marked this pull request as draft August 12, 2020 01:38

VelizarVESSELINOV reviewed Aug 12, 2020: pandas/tests/io/test_gcs.py (outdated, resolved)

VelizarVESSELINOV reviewed Aug 12, 2020: pandas/tests/io/test_gcs.py (outdated, resolved)

twoertwein commented Aug 12, 2020: pandas/io/common.py (resolved)

twoertwein commented Aug 13, 2020: pandas/_libs/parsers.pyx (outdated, resolved)

twoertwein marked this pull request as ready for review August 13, 2020 22:03

jreback added the Compat and IO Google labels Aug 14, 2020

jreback added this to the 1.2 milestone Aug 14, 2020

VelizarVESSELINOV reviewed Aug 14, 2020: pandas/io/common.py (outdated, resolved)

twoertwein mentioned this pull request Aug 14, 2020

to_csv to Google Cloud Storage ignores compression #35677

Closed

twoertwein mentioned this pull request Aug 15, 2020

BUG/ENH: to_pickle/read_pickle support compression for file ojects #35736

Merged

5 tasks

twoertwein requested a review from jreback August 18, 2020 16:41

twoertwein commented Aug 18, 2020: pandas/io/common.py (resolved)

twoertwein commented Aug 18, 2020: pandas/core/frame.py (outdated, resolved)

jreback requested changes Aug 19, 2020

alimcmaster1 requested changes Aug 22, 2020

alimcmaster1 reviewed Aug 22, 2020: pandas/tests/io/test_gcs.py (outdated, resolved)

jreback requested changes Aug 25, 2020

twoertwein commented Aug 25, 2020: pandas/io/json/_json.py (resolved)

twoertwein commented Aug 25, 2020: pandas/io/orc.py (outdated, resolved)

twoertwein commented Aug 26, 2020: pandas/io/json/_json.py (outdated, resolved)

twoertwein commented Aug 26, 2020

jreback reviewed Aug 27, 2020

twoertwein added 3 commits August 31, 2020 16:14

bind input type of encding and mode with the returned type; removed ignore statements (mypy will compile about filepath_or_buffer) (935fc4b)

use named tuple; remove some unused variables; closed some file handles; refine type for filepath_or_buffer (475e8e8)

jreback approved these changes Sep 3, 2020

jreback merged commit 361166f into pandas-dev:master Sep 3, 2020

twoertwein deleted the google_storage branch September 3, 2020 15:56

arw2019 mentioned this pull request Sep 23, 2020

TST: DataFrame.to_parquet accepts pathlib.Path with partition_cols defined #36491

Merged

4 tasks

kesmit13 pushed a commit to kesmit13/pandas that referenced this pull request Nov 2, 2020

BUG/ENH: compression for google cloud storage in to_csv (pandas-dev#35681) (327707b)
