[ENH] Add read support for Google Cloud Storage #20729
I'm also not sure what all the various requirements files I touched are; I just mimicked the s3fs pattern. Looks like Circle is unhappy with that, should I revert everything in |
doc/source/install.rst
@@ -275,6 +275,7 @@ Optional Dependencies

* `Jinja2 <http://jinja.pocoo.org/>`__: Template engine for conditional HTML formatting.
* `s3fs <http://s3fs.readthedocs.io/>`__: necessary for Amazon S3 access (s3fs >= 0.0.7).
* `gcsfs <http://gcsfs.readthedocs.io/>`__: necessary for Google Cloud Storage access (gcsfs >= 0.6.0).

0.0.6?
Testing in gcsfs is done via the VCR package, which records all HTTP calls and their responses from the server, for playback during the running of a test. It is rather finicky/awkward to use. Parquet is difficult because you need to supply the |
Not sure about testing yet. VCR has sounded like more trouble than it's worth.
pandas/io/common.py
    """Check for a gcs url"""
    try:
        return parse_url(url).scheme in ['gcs', 'gs']
    except:  # noqa

Make this `except Exception`, or even better, figure out what `parse_url` can raise and catch those (plus `AttributeError`).

From looking over `urlparse`, it seems that it will also only raise `AttributeError`, so I used that. For consistency I changed the other URL parsers' `except`s as well.
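The narrowed exception handling discussed above might look roughly like the following. This is a minimal sketch, not the code from the PR: it uses the standard library's `urllib.parse.urlparse` in place of pandas' `parse_url` alias, and additionally catching `TypeError` is a conservative assumption rather than something the thread established.

```python
from urllib.parse import urlparse


def is_gcs_url(url):
    """Return True if `url` uses the gcs:// or gs:// scheme."""
    try:
        return urlparse(url).scheme in ("gcs", "gs")
    except (AttributeError, TypeError):
        # Non-string inputs (e.g. an int or a file-like object) cannot be
        # parsed as a URL; TypeError is included here only as an assumption.
        return False
```

Catching specific exceptions rather than using a bare `except` keeps unrelated failures, such as a `KeyboardInterrupt`, from being silently swallowed.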
pandas/io/gcs.py
mode = 'rb'

gcsfs_logger_disabled = gcsfs.core.logger.disabled
gcsfs.core.logger.disabled = True

Why disable this? I'd rather leave that up to the user if possible.

This was probably an overly complicated solution; the problem I was facing was that `gcsfs` retries several times if it fails to authenticate, and each failure prints a long traceback. But if that's the behavior of the library then I guess there's no sense in using hacks to get around it. For now I removed all of the logging magic.
pandas/io/gcs.py
try:
    filepath_or_buffer = fs.open(filepath_or_buffer, mode)
except (compat.FileNotFoundError, GoogleAuthError, gcsfs.utils.HtmlError):
    fs = gcsfs.GCSFileSystem(token='anon')

Note: this was originally a hack for not breaking backwards compat with pre-s3fs pandas. I'd like to eventually move to something like dask's `storage_options` here, a dict of keywords that gets passed down to the backend (#19904).

I agree this would be the best way to do things. For now I simplified things and am just using `gcsfs`'s default authentication preferences. If allowing anonymous access to public files is important we could try to think of a better way to accomplish that, but arguably `gcsfs` should handle that logic?

Note that gcsfs defaults to anon access if no other mechanisms worked - but this will not happen if one of the other mechanisms contained real credentials that turned out to be no longer valid.

That's what I thought but it didn't seem to be the case when I tested it. In a `python:3.6.4` Docker image w/ `gcsfs` installed I get:
# default auth fails
In [2]: gcsfs.GCSFileSystem().open('gs://gcp-public-data-landsat/index.csv.gz').read(10)
# lots of error logs from failing to access /computeMetadata/v1/instance/service-accounts/default/?recursive=true
...
RefreshError: HTTPConnectionPool(host='metadata.google.internal', port=80): Max retries exceeded with url: /computeMetadata/v1/instance/service-accounts/default/?recursive=true (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fab36881518>: Failed to establish a new connection: [Errno -2] Name or service not known',))
# specifying token='anon' works
In [3]: gcsfs.GCSFileSystem(token='anon').open('gs://gcp-public-data-landsat/index.csv.gz').read(10)
Out[3]: b'\x1f\x8b\x08\x08\x89\xd8\xd6Z\x02\xff'
Should I create a separate `gcsfs` issue for this?

Hm, OK, it should be an issue - but worth checking whether it's still the case with current master.
pandas/io/gcs.py
    fs = gcsfs.GCSFileSystem(token='anon')
    filepath_or_buffer = fs.open(filepath_or_buffer, mode)
finally:
    gcsfs.core.logger.disabled = gcsfs_logger_disabled

If the anon version fails, this line is not hit, correct? It would have to be wrapped in a `try / finally` as well (or ideally the logger stuff would be removed).
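For reference, one way to guarantee that a logger's `disabled` flag is restored regardless of what the wrapped code does is a small context manager built on `try` / `finally`. This is only a sketch using the standard `logging` module; the name `logger_disabled` is hypothetical, and the PR ultimately removed the logger handling rather than adopting anything like this.

```python
import logging
from contextlib import contextmanager


@contextmanager
def logger_disabled(logger):
    """Temporarily silence `logger`, restoring its previous state on exit."""
    previous = logger.disabled
    logger.disabled = True
    try:
        yield logger
    finally:
        # Runs whether the body returned normally or raised.
        logger.disabled = previous
```

Any exception raised inside the `with` block still propagates to the caller; only the flag is reset.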
Codecov Report
@@ Coverage Diff @@
## master #20729 +/- ##
==========================================
- Coverage 91.91% 91.9% -0.01%
==========================================
Files 153 154 +1
Lines 49532 49550 +18
==========================================
+ Hits 45525 45541 +16
- Misses 4007 4009 +2
Continue to review full report at Codecov.
23e34bc
to
0168784
Compare
The issue with anonymous credentials in Any more thoughts re: testing? Would simply mocking out |
This was not implemented in Dask (yet), but the MemoryFileSystem might be a nice and simple mock for GCSFileSystem. |
gcsfs v0.0.7 being released: conda-forge/gcsfs-feedstock#7 |
Any more thoughts about testing? Would a limited set of tests based on |
I think we would need some tests, mocking would be ok |
mode = 'rb'

fs = gcsfs.GCSFileSystem()
filepath_or_buffer = fs.open(filepath_or_buffer, mode)

@bnaul do you think this is an appropriate place to mock? `fs.open` could return a BytesIO object?

Added a simple mock-based test here; any other methods I should include besides `read_csv`?
b11b2f8
to
c722d60
Compare
Could you also
- add a whatsnew (0.24, probably)
- rerun `python scripts/convert_deps.py` so that gcsfs is in the optional pip deps.
pandas/tests/io/test_gcs.py
with patch('gcsfs.GCSFileSystem') as MockFileSystem:
    instance = MockFileSystem.return_value
    instance.open.return_value = BytesIO(b'a,b\n1,2\n3,4')
    df = read_csv('gs://test/test.csv')

Mock lets you assert that certain functions are called, right?
I guess, maybe we could have a second test that ensures `pandas.io.gcs.get_filepath_or_buffer` is called for `gs://` type `filepath_or_buffer`?
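The call-assertion idea can be illustrated without pandas or gcsfs installed. The sketch below is an assumption-laden stand-in, not the PR's test: `read_rows` is a hypothetical helper that plays the role of `read_csv`, and a plain `mock.Mock()` plays the role of a `GCSFileSystem` instance.

```python
import csv
import io
from unittest import mock


def read_rows(fs, path):
    """Open `path` through `fs` and parse the bytes as CSV rows."""
    raw = fs.open(path, "rb")
    text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
    return list(csv.reader(text))


# Fake file system whose open() hands back an in-memory buffer.
fake_fs = mock.Mock()
fake_fs.open.return_value = io.BytesIO(b"a,b\n1,2\n3,4")

rows = read_rows(fake_fs, "gs://test/test.csv")

# The mock records its calls, so the test can check how it was used.
fake_fs.open.assert_called_once_with("gs://test/test.csv", "rb")
```

The same `assert_called_once_with` check would apply to a patched function when verifying that a `gs://` path is routed to the GCS code path.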
Hello @bnaul! Thanks for updating the PR. Cheers ! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on June 25, 2018 at 23:43 Hours UTC |
31e7ef5
to
6d6e727
Compare
@TomAugspurger sorry for the delay, was away for a while; made the requested changes! |
Changes look good, but the CI is failing on builds that don't have gcsfs.
Could you
1. Add a `td.skip_if_no('gcsfs')` decorator to the tests that use gcsfs
2. Add a test that is skipped if gcsfs *is* present, and checks that not
having gcsfs raises with the right error message.
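The second request, checking that a missing optional dependency fails with a clear message, depends on the import being wrapped somewhere. A minimal sketch of such a wrapper follows; `import_optional` and its message wording are hypothetical and are not taken from pandas.

```python
import importlib


def import_optional(name, purpose):
    """Import `name`, or raise ImportError explaining why it is needed."""
    try:
        return importlib.import_module(name)
    except ImportError:
        raise ImportError(
            "The {} library is required to {}.".format(name, purpose)
        ) from None
```

A test that runs only when the library is absent can then call the wrapper and assert on the text of the raised `ImportError`.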
5e270e8
to
f6aba54
Compare
ci/requirements_dev.txt
@@ -9,4 +9,4 @@ python-dateutil>=2.5.0
pytz
setuptools>=24.2.0
sphinx
sphinxcontrib-spelling

what changed here? can you revert this file

Done.
ci/travis-36-slow.yaml
@@ -5,6 +5,7 @@ channels:
dependencies:
- beautifulsoup4
- cython
- gcsfs

so I would remove gcsfs from the slow ci files (not tested anyhow as these tests are not slow)

Done.
ci/circle-36-locale.yaml
@@ -5,6 +5,7 @@ channels:
dependencies:
- beautifulsoup4
- cython
- gcsfs

let's remove from here to make sure the tests are skipped properly (as tested in 2.7 / 3.6 builds already)

Done.
pandas/tests/io/test_gcs.py
@td.skip_if_no('gcsfs')
def test_read_csv_gcs():
    try:

hmm we do install mock in py27, can you move this to pandas.util.testing (and then just import patch from there)

Looks like a custom version is already in `util.testing`; it's only used in one place so I swapped it out for `mock.patch`.

I don't think we should make changes to `util/testing.py`. We don't want to require `mock` (for py27), and testing will be awkward if it's in `pandas/util`.
Better to make a little fixture or something for it.
@TomAugspurger do you mean a fixture that does use mock, but in another directory that won't get imported outside of testing? Or a fixture that doesn't use mock at all?
Edit: for now I just reverted to the try/except Import in the test. I'm not sure how to reconcile this w/ @jreback 's comment but it seems like any way is fine (I'm surprised there aren't already more places in the tests where mock objects need to be used...).
pandas/tests/io/test_gcs.py
    instance = MockFileSystem.return_value
    instance.open.return_value = BytesIO(b'a,b\n1,2\n3,4')
    df = read_csv('gs://test/test.csv')

can you use `tm.assert_frame_equal` so dtypes are tested

Done
pandas/tests/io/test_gcs.py
False)
df = read_csv('gs://test/test.csv')

assert isinstance(df, DataFrame)

same

can you add other dtypes such as float, datetime, string, and some NaNs

Done
f996427
to
456734b
Compare
I meant a fixture that provides the imported module. Like pytest importorskip, but with the try / except for the two locations.
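The "two locations" idea can be sketched as a plain helper that tries each candidate module in turn. The helper name `import_first_available` is hypothetical; wrapping it in a `pytest` fixture that skips when it returns `None` is an assumption about how it might be used, not what was eventually pushed.

```python
import importlib


def import_first_available(*names):
    """Return the first module in `names` that imports, else None."""
    for name in names:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    return None


# Standard-library location first, then the py27 backport package.
mock = import_first_available("unittest.mock", "mock")
```

Tests that receive the module this way do not need their own `try` / `except ImportError` blocks.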
In pandas/tests/io/test_gcs.py:
+
+from pandas import DataFrame, read_csv
+from pandas.compat import BytesIO
+from pandas.io.common import is_gcs_url
+from pandas.util import _test_decorators as td
+
+
+def test_is_gcs_url():
+ assert is_gcs_url("gcs://pandas/somethingelse.com")
+ assert is_gcs_url("gs://pandas/somethingelse.com")
+ assert not is_gcs_url("s3://pandas/somethingelse.com")
+
+
+@td.skip_if_no('gcsfs')
+def test_read_csv_gcs():
+ try:
Made a pytest fixture; if it belongs in somewhere more generic let me know, the various testing utils seem spread out over quite a few different places and it's a bit confusing to me |
Not quite what I had in mind. I'll push a change in a minute :) |
Sounds good 🙃 |
OK, pushed. Only real changes were
|
Fixed the linting error I introduced. May have a random failure on PY37, will investigate more later. |
Oops might have overwritten that w/ the same change but I assume it's fine |
@jreback looking 🍏 |
Thanks @bnaul! |
* Google Cloud Storage support using gcsfs
git diff upstream/master -u -- "*.py" | flake8 --diff
Couple of remaining issues:
- `gcsfs` when catching authentication exceptions
- `pandas-test` bucket and/or a mock GCS library like `moto`
cc @martindurant who might have some thoughts on other things I am doing wrong