Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENH] Add read support for Google Cloud Storage #20729

Merged
merged 3 commits into from
Jun 26, 2018

Conversation

bnaul
Copy link
Contributor

@bnaul bnaul commented Apr 18, 2018

Couple of remaining issues:

  • I had to do some weird logging magic to suppress the output of gcsfs when catching authentication exceptions
  • Proper Parquet support seems to require more specialization so I didn't try
  • I'm not sure how to properly test this functionality without a pandas-test bucket and/or a mock GCS library like moto

cc @martindurant who might have some thoughts on other things I am doing wrong

@bnaul
Copy link
Contributor Author

bnaul commented Apr 18, 2018

I'm also not sure what the all the various requirements files I touched are, I just mimicked the s3fs pattern; looks like Circle is unhappy with that, should I revert everything in ci/ or is there some other step I'm missing?

@@ -275,6 +275,7 @@ Optional Dependencies

* `Jinja2 <http://jinja.pocoo.org/>`__: Template engine for conditional HTML formatting.
* `s3fs <http://s3fs.readthedocs.io/>`__: necessary for Amazon S3 access (s3fs >= 0.0.7).
* `gcsfs <http://gcsfs.readthedocs.io/>`__: necessary for Google Cloud Storage access (gcsfs >= 0.6.0).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0.0.6?

@martindurant
Copy link
Contributor

Testing in gcsfs is done via the VCR package, which records all HTTP calls and their responses from the server, for playback during the running of a test. It is rather finicky/awkward to use.

Parquet is difficult because you need to supply the open function rather than file-like objects?

Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure about testing yet. VCR has sounded like more trouble than it's worth.

"""Check for a gcs url"""
try:
return parse_url(url).scheme in ['gcs', 'gs']
except: # noqa
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make this except Exception, or even better figure out what parse_url can raise, and catch those (plus AttributeError).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From looking over urlparse it seems that it will also only raise AttributeError so I used that. For consistency I changed the other url parsers excepts as well.

pandas/io/gcs.py Outdated
mode = 'rb'

gcsfs_logger_disabled = gcsfs.core.logger.disabled
gcsfs.core.logger.disabled = True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why disable this? I'd rather leave that up to the user if possible.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was probably an overly complicated solution; the problem I was facing was that gcsfs retries several times if it fails to authenticate, and each failure prints a long traceback. But if that's the behavior of the library then I guess there's no sense in using hacks to get around it. For now I removed all of the logging magic.

pandas/io/gcs.py Outdated
try:
filepath_or_buffer = fs.open(filepath_or_buffer, mode)
except (compat.FileNotFoundError, GoogleAuthError, gcsfs.utils.HtmlError):
fs = gcsfs.GCSFileSystem(token='anon')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note: this was originally a hack for not breaking backwards compat with pre-s3fs pandas. I'd like to eventually move to something like dask's store_options here, a dict of keywords that gets passed down to the backend (#19904)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree this would be the best way to do things. For now I simplified things and am just using gcsfs's default authentication preferences. If allowing anonymous access to public files is important we could try to think of a better way to accomplish that, but arguably gcsfs should handle that logic?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that gcsfs defaults to anon access if no other mechanisms worked - but this will not happen if one of the other mechanisms contained real credentials that turned out to be no longer valid.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's what I thought but it didn't seem to be the case when I tested it. In a python:3.6.4 Docker image w/ gcsfs installed I get:

# default auth fails
In [2]: gcsfs.GCSFileSystem().open('gs://gcp-public-data-landsat/index.csv.gz').read(10)
# lots of error logs from failing to access /computeMetadata/v1/instance/service-accounts/default/?recursive=true
...
RefreshError: HTTPConnectionPool(host='metadata.google.internal', port=80): Max retries exceeded with url: /computeMetadata/v1/instance/service-accounts/default/?recursive=true (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fab36881518>: Failed to establish a new connection: [Errno -2] Name or service not known',))

# specifying token='anon' works
In [3]: gcsfs.GCSFileSystem(token='anon').open('gs://gcp-public-data-landsat/index.csv.gz').read(10)
Out[3]: b'\x1f\x8b\x08\x08\x89\xd8\xd6Z\x02\xff'

Should I create a separate gcsfs issue for this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, OK, it should be an issue - but worth checking whether it's still the case with current master.

pandas/io/gcs.py Outdated
fs = gcsfs.GCSFileSystem(token='anon')
filepath_or_buffer = fs.open(filepath_or_buffer, mode)
finally:
gcsfs.core.logger.disabled = gcsfs_logger_disabled
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the anon version fails, this line is not hit, correct? It would have to be wrapped in a try / finally as well (or ideally the logger stuff would be removed).

@codecov
Copy link

codecov bot commented Apr 18, 2018

Codecov Report

Merging #20729 into master will decrease coverage by <.01%.
The diff coverage is 90%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #20729      +/-   ##
==========================================
- Coverage   91.91%    91.9%   -0.01%     
==========================================
  Files         153      154       +1     
  Lines       49532    49550      +18     
==========================================
+ Hits        45525    45541      +16     
- Misses       4007     4009       +2
Flag Coverage Δ
#multiple 90.3% <90%> (-0.01%) ⬇️
#single 41.86% <40%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/util/_print_versions.py 15.71% <ø> (ø) ⬆️
pandas/io/json/json.py 92.47% <ø> (ø) ⬆️
pandas/io/common.py 71.2% <100%> (+0.92%) ⬆️
pandas/io/gcs.py 80% <80%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1033e8b...2745359. Read the comment docs.

@bnaul bnaul force-pushed the gcsfs branch 2 times, most recently from 23e34bc to 0168784 Compare April 20, 2018 16:42
@bnaul
Copy link
Contributor Author

bnaul commented Apr 20, 2018

The issue with anonymous credentials in gcsfs appears to be resolved by fsspec/gcsfs#100, so once that is released I think this PR should cover most use cases.

Any more thoughts re: testing? Would simply mocking out gcsfs.GCSFileSystem.read to return a simple CSV string be sufficient?

@martindurant
Copy link
Contributor

martindurant commented Apr 20, 2018

This was not implemented in Dask (yet), but the MemoryFileSystem might be a nice and simple mock for GCSFileSystem.

@martindurant
Copy link
Contributor

gcsfs v0.0.7 being released: conda-forge/gcsfs-feedstock#7

@bnaul
Copy link
Contributor Author

bnaul commented May 7, 2018

Any more thoughts about testing? Would a limited set of tests based on mock.patching GCSFileSystem.open be sufficient for now?

@jreback
Copy link
Contributor

jreback commented May 8, 2018

I think we would need some tests, mocking would be ok

mode = 'rb'

fs = gcsfs.GCSFileSystem()
filepath_or_buffer = fs.open(filepath_or_buffer, mode)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bnaul do you think this is an appropriate place to mock? fs.open could return a BytesIO object?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a simple mock-based test here; any other methods I should include besides read_csv?

@bnaul bnaul force-pushed the gcsfs branch 2 times, most recently from b11b2f8 to c722d60 Compare May 23, 2018 21:15
Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you also

  • add a whatsnew 0.24 probably
  • rerun python scripts/convert_deps.py so that gcsfs is in the optional pip deps.

with patch('gcsfs.GCSFileSystem') as MockFileSystem:
instance = MockFileSystem.return_value
instance.open.return_value = BytesIO(b'a,b\n1,2\n3,4')
df = read_csv('gs://test/test.csv')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mock let's you assert that certain functions are called, right?

I guess, maybe we could have a second test that ensures pandas.io.gcs.get_filepath_or_buffer is called for gs:// type filepath_or_buffer?

@pep8speaks
Copy link

pep8speaks commented Jun 7, 2018

Hello @bnaul! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on June 25, 2018 at 23:43 Hours UTC

@bnaul bnaul force-pushed the gcsfs branch 2 times, most recently from 31e7ef5 to 6d6e727 Compare June 7, 2018 00:42
@bnaul
Copy link
Contributor Author

bnaul commented Jun 7, 2018

@TomAugspurger sorry for the delay, was away for a while; made the requested changes!

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Jun 7, 2018 via email

@bnaul bnaul force-pushed the gcsfs branch 2 times, most recently from 5e270e8 to f6aba54 Compare June 8, 2018 21:21
@@ -9,4 +9,4 @@ python-dateutil>=2.5.0
pytz
setuptools>=24.2.0
sphinx
sphinxcontrib-spelling
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what changed here? can you revert this file

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -5,6 +5,7 @@ channels:
dependencies:
- beautifulsoup4
- cython
- gcsfs
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so I would remove gcsfs from the slow ci files (not tested anyhow as these tests are not slow)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -5,6 +5,7 @@ channels:
dependencies:
- beautifulsoup4
- cython
- gcsfs
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's remove from here to make sure the tests are skipped properly (as tested in 2.7 / 3.6 builds already)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.


@td.skip_if_no('gcsfs')
def test_read_csv_gcs():
try:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm we do install mock in py27, can you move this to pandas.util.testing (and then just import patch from there)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a custom version is already in util.testing; it's only used in one place so I swapped it out for mock.patch

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should make changes to util/testing.py. We don't want to require mock (for py27), and testing will be awkward if it's in pandas/util.

Better to make a little fixture or something for it.

Copy link
Contributor Author

@bnaul bnaul Jun 25, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TomAugspurger do you mean a fixture that does use mock, but in another directory that won't get imported outside of testing? Or a fixture that doesn't use mock at all?

Edit: for now I just reverted to the try/except Import in the test. I'm not sure how to reconcile this w/ @jreback 's comment but it seems like any way is fine (I'm surprised there aren't already more places in the tests where mock objects need to be used...).

instance = MockFileSystem.return_value
instance.open.return_value = BytesIO(b'a,b\n1,2\n3,4')
df = read_csv('gs://test/test.csv')

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you use tm.assert_from_equal so dtypes are tested

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

False)
df = read_csv('gs://test/test.csv')

assert isinstance(df, DataFrame)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add other dtypes such as float, datetime, string, and some NaNs

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@bnaul bnaul force-pushed the gcsfs branch 5 times, most recently from f996427 to 456734b Compare June 25, 2018 16:49
@TomAugspurger
Copy link
Contributor

TomAugspurger commented Jun 25, 2018 via email

@bnaul
Copy link
Contributor Author

bnaul commented Jun 25, 2018

Made a pytest fixture; if it belongs in somewhere more generic let me know, the various testing utils seem spread out over quite a few different places and it's a bit confusing to me

@TomAugspurger
Copy link
Contributor

Not quite what I had in mind. I'll push a change in a minute :)

@bnaul
Copy link
Contributor Author

bnaul commented Jun 25, 2018

Sounds good 🙃

@TomAugspurger
Copy link
Contributor

OK, pushed. Only real changes were

  1. returning mock instead of mock.patch
  2. moving to conftest.py, since it's a generally useful fixture
  3. skipping if neither unittest.mock nor mock are available

@TomAugspurger
Copy link
Contributor

Fixed the linting error I introduced.

May have a random failure on PY37, will investigate more later.

@bnaul
Copy link
Contributor Author

bnaul commented Jun 25, 2018

Oops might have overwritten that w/ the same change but I assume it's fine

@bnaul
Copy link
Contributor Author

bnaul commented Jun 25, 2018

@jreback looking 🍏

@TomAugspurger TomAugspurger merged commit 95427d5 into pandas-dev:master Jun 26, 2018
@TomAugspurger
Copy link
Contributor

Thanks @bnaul!

@bnaul bnaul deleted the gcsfs branch June 26, 2018 18:23
@jorisvandenbossche jorisvandenbossche added this to the 0.24.0 milestone Jul 2, 2018
This was referenced Sep 13, 2018
Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this pull request Oct 1, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH Adding read_* support for Google Cloud Storage gs:// URLs
6 participants