COMPAT/REF: Use s3fs for s3 IO #13137
return filepath_or_buffer, None, compression
return filepath_or_buffer, encoding, compression
except FileNotFoundError:
    raise boto.exception.S3ResponseError
Is this an appropriate error? Why is FileNotFoundError OK here?
Actually, is this try/except needed? (It's not around fs.open anyhow.)
Not sure why I added that... Doesn't seem to be necessary.
need to add
This is just refactoring the reading for now. We should be able to support writing to s3 buckets pretty easily now too. A few things
s3fs now includes the original boto3 exceptions when something goes wrong (fsspec/s3fs#42). For your question above, I believe s3: and s3n: work exactly the same from our point of view, but we don't touch the block filesystem emulation (this is ancient; I don't know if it's used in the wild). The failure above appears to be a typo:
I haven't yet tried this PR, but I've had some issues with s3fs lately [1], and I'd like to try this out before it gets merged. Assuming liberal permissions on s3 buckets and objects and dealing with profiles has been a bit of an issue across third-party tools for me lately. It could be an issue with permissions, but I suspect it's going to continue to come up. E.g., I can do this
but not
and I don't see a way to set the profile explicitly in S3File and it doesn't seem to matter if I set my
[1] fsspec/s3fs#38
@jseabold yeah, I've been hitting the same issue. For your specific case, maybe try
import s3fs
fs = s3fs.S3FileSystem(profile_name='myprofile', anon=False)
f = fs.open('s3://' + BUCKET + '/' + KEY)
If I use
I spent last night trying to figure out a way to handle both, no luck yet.
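One possible reading of "handle both" is: try credentialed access first and fall back to anonymous access when the credentials are missing or rejected. The sketch below illustrates that pattern only; it is not the code in this PR. `open_with_fallback`, `FakeFS`, and the choice of `PermissionError` as the signal are all hypothetical. With s3fs, `make_fs` would presumably be `s3fs.S3FileSystem`, and `auth_errors` would need to name whatever exception it actually raises for bad or missing credentials.

```python
def open_with_fallback(make_fs, path, auth_errors=(PermissionError,)):
    """Open `path` with credentials, falling back to anonymous access.

    `make_fs` is any callable accepting an `anon` keyword and returning an
    object with an `open(path)` method (an assumption for this sketch).
    """
    try:
        return make_fs(anon=False).open(path)
    except auth_errors:
        # Credentials were missing or rejected: retry without them.
        return make_fs(anon=True).open(path)


class FakeFS:
    """Hypothetical stand-in for a filesystem object, so the sketch runs offline."""

    def __init__(self, anon):
        self.anon = anon

    def open(self, path):
        if not self.anon:
            raise PermissionError("no usable credentials")
        return "anonymous handle for " + path


handle = open_with_fallback(FakeFS, "s3://bucket/key")
print(handle)
```

The trade-off is that a genuine permissions problem on a private bucket gets retried anonymously and surfaces as a different error, which may be part of why this is hard to get right for every setup.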
Ah, sure. Isn't there a try/except for that in the boto code that's removed by this PR? I'm just really worried that this is going to break everything for me. After having to downgrade dask, use a patched conda, use a patched odo, and run boto master on Python 3 until the last release, I'm super wary about these things.
This won't be until 0.19, I think; we have it marked in any event.
Yep, I was hoping to avoid that in pandas, but it might be necessary. @jseabold I have a bunch of stuff on AWS too and I share your concern about this breaking things.
I'm not sure if we're dealing with one issue here or more:
I don't have boto profiles, just a [default] section in my .aws/credentials (no relevant env variables). How should I go about recreating your situation? In any case, I suggest the discussion should happen over on s3fs.
@martindurant I've started today trying to collect the info for my typical permissions, so you can replicate this stuff.
Is the original motivating bug in boto fixed with the newest release? I'm not sure if it's the same config parsing one I alluded to with conda that I fixed with a monkey patch. What's the motivation for this over using boto3? I need to read the linked thread more carefully. I'd tentatively suggest not to add reliance on another third party package and introduce some more complexity here. The
Yep, not wanting to manage the boto to boto3 switch within pandas was what initially prompted this.
That's true, but you have control over what is getting implemented, so you don't have extra calls around things that a filesystem mimic might want, but a single stream a key to memory thing might not. To be fair, it looks like the This is probably out of scope for this PR, but I've often wanted to be able to pass a boto(3)
@jseabold this is to REDUCE complexity in the code. This makes it so pandas can simply use these w/o having to worry about messy details like authentication and such which the service can simply care about (and upgrade / fix as needed).
You might be interested that https://github.com/dask/hdfs3 provides the very same interface to HDFS - perhaps a less used data source for pandas, but an option that might be useful in the future. We may also make the same interface for other cloud storage services, depending on demand/complexity.
I agree with the sentiment and am all for reducing the complexity (in pandas) around csv reading, file-like objects, location discovery, and authentication around those locations.
@TomAugspurger This may be what's going on with
update for 0.19.0?
@@ -15,7 +15,7 @@
 from pandas.tseries.offsets import Day, MonthEnd


-class TestPickle():
+class TestPickle(tm.TestCase):
Does it run test generators? See docstring:
NOTE: TestPickle can't be a subclass of tm.Testcase to use test generator.
http://stackoverflow.com/questions/6689537/nose-test-generators-inside-class
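For context on the docstring's warning, here is a small standard-library sketch of what goes wrong. It uses plain `unittest.TestCase` rather than `tm.TestCase`, on the assumption that `tm.TestCase` inherits the same behaviour; `check_positive` and `TestGenerated` are hypothetical names. A nose-style generator test yields `(callable, argument)` pairs, but `unittest` just calls the method, gets a generator object back, and never iterates it, so the yielded checks do not run.

```python
import unittest

calls = []  # records every value the check function actually receives


def check_positive(value):
    calls.append(value)
    assert value > 0


class TestGenerated(unittest.TestCase):
    def test_values(self):
        # nose-style test generator: yields (callable, argument) pairs
        for value in (1, -1):
            yield check_positive, value


suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestGenerated)
result = unittest.TestResult()
suite.run(result)

print("tests run:", result.testsRun)
print("checks executed:", calls)
```

The check for `-1` should fail, yet `calls` stays empty because the generator body is never entered. That matches the concern raised here: subclassing a `TestCase` could silently stop the generated pickle tests from running.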
I think this was a stray commit from a different branch, trying to get pytest running. Will remove.
On Jul 14, 2016, at 19:23, Sinhrks notifications@github.com wrote:
In pandas/io/tests/test_pickle.py:
@@ -15,7 +15,7 @@
 from pandas.tseries.offsets import Day, MonthEnd
-class TestPickle():
+class TestPickle(tm.TestCase):
Does it run test generators? See docstring:
NOTE: TestPickle can't be a subclass of tm.Testcase to use test generator.
http://stackoverflow.com/questions/6689537/nose-test-generators-inside-class
I'll revisit this mid next week.
@TomAugspurger What's the status of this?
@@ -40,6 +40,12 @@
import pandas.lib as lib
import pandas.parser as _parser

try:
move to io/common.py
Let's do this after #14576, as that changes the paths for the io code.
Current coverage is 84.64% (diff: 21.05%)
@@             master   #13137   diff @@
==========================================
  Files           144      144
  Lines         51057    51016    -41
  Methods           0        0
  Messages          0        0
  Branches          0        0
==========================================
+ Hits          43180    43184     +4
+ Misses         7877     7832    -45
  Partials          0        0
332fcd6 to d28bd7c
I think this is ready to go, if anyone has further comments.
@@ -93,17 +93,15 @@ Backwards incompatible API changes

.. _whatsnew_0200.api:


- pandas now uses `s3fs <http://s3fs.readthedocs.io/>`_ for handling S3 connections. This shouldn't break
maybe make this a note / warning box?
lgtm.
@@ -97,13 +97,16 @@ Backwards incompatible API changes
- ``CParserError`` has been renamed to ``ParserError`` in ``pd.read_csv`` and will be removed in the future (:issue:`12665`)
- ``SparseArray.cumsum()`` and ``SparseSeries.cumsum()`` will now always return ``SparseArray`` and ``SparseSeries`` respectively (:issue:`12855`)

S3 File Handling
maybe add a ref tag
pandas now uses `s3fs <http://s3fs.readthedocs.io/>`_ for handling S3 connections. This shouldn't break
do we need a doc section itself somewhere? so that we can put a url in various doc-strings, e.g. read_csv? (can always do as a followup)
No comments from me on the pandas code.
@@ -262,7 +262,7 @@ Optional Dependencies
 * `XlsxWriter <https://pypi.python.org/pypi/XlsxWriter>`__: Alternative Excel writer

 * `Jinja2 <http://jinja.pocoo.org/>`__: Template engine for conditional HTML formatting.
-* `boto <https://pypi.python.org/pypi/boto>`__: necessary for Amazon S3 access.
+* `s3fs <http://s3fs.readthedocs.io/>`__: necessary for Amazon S3 access.
do we have a min requirement for s3fs? (0.7.0)?
Not yet, may as well require the latest? (0.0.7)
I anticipate making a 0.0.8 soon, as there were a couple of additions since August, including boto3 version stuff. Best make it the latest, then.
I think older versions work though, so maybe this is not an issue.
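If a minimum version does end up being enforced, the check could look roughly like the sketch below. This is an illustration under stated assumptions, not what pandas does: `parse_version`, `meets_minimum`, and `import_optional` are hypothetical helpers, the parser only understands plain dotted integers such as `0.0.7`, and it assumes the optional module exposes a `__version__` attribute.

```python
import importlib


def parse_version(text):
    """Turn '0.0.7' into (0, 0, 7); only plain dotted integers are handled."""
    return tuple(int(part) for part in text.split("."))


def meets_minimum(installed, minimum):
    """Compare versions numerically, component by component."""
    return parse_version(installed) >= parse_version(minimum)


def import_optional(name, minimum=None):
    """Return the module, or None if it is missing or older than `minimum`."""
    try:
        module = importlib.import_module(name)
    except ImportError:
        return None
    version = getattr(module, "__version__", None)
    if minimum is not None and version is not None:
        if not meets_minimum(version, minimum):
            return None
    return module


# The two version strings mentioned in this thread are far apart numerically.
print(meets_minimum("0.0.8", "0.0.7"))
print(meets_minimum("0.0.7", "0.7.0"))
```

Comparing integer tuples rather than raw strings matters once a component reaches two digits: as strings, `"0.0.10"` sorts before `"0.0.7"`.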
@TomAugspurger give this a rebase and test.
e5f6b18 to 5c66be6
Travis is green.
thanks @TomAugspurger I changed a name in the excel tests.
closes pandas-dev#11915

Author: Tom Augspurger <tom.augspurger88@gmail.com>

Closes pandas-dev#13137 from TomAugspurger/s3fs and squashes the following commits:

92ac063 [Tom Augspurger] CI: Update deps, docs
81690b5 [Tom Augspurger] COMPAT/REF: Use s3fs for s3 IO
git diff upstream/master | flake8 --diff