Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

COMPAT/REF: Use s3fs for s3 IO #13137

Closed
wants to merge 2 commits into from

Conversation

TomAugspurger
Copy link
Contributor

@TomAugspurger TomAugspurger commented May 11, 2016

@TomAugspurger TomAugspurger added the IO Data IO issues that don't fit into a more specific label label May 11, 2016
@TomAugspurger TomAugspurger added this to the 0.19.0 milestone May 11, 2016
return filepath_or_buffer, None, compression
return filepath_or_buffer, encoding, compression
except FileNotFoundError:
raise boto.exception.S3ResponseError
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this an appropriate error? why is FileNotFoundError ok here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually is this try/except needed? (as its not around fs.open anyhow)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure why I added that... Doesn't seem to be necessary.

@jreback
Copy link
Contributor

jreback commented May 11, 2016

need to add s3fs to .pip (where we have boto now)

@TomAugspurger
Copy link
Contributor Author

This is just refactoring the reading for now. We should be able to support writing to s3 buckets pretty easily now too. A few things

  • pandas accepts s3://, s3n://, and s3a:// protocols, but s3fs only expects the regular s3://. I work around that here. This was the best source I found on the differences, but I don't know if simply chaning the protocol will break anything. At the end of the day we're just pulling down bytes, which should presumably represent a CSV
  • API Changes: Before we raised OSError when the s3 file wasn't found, and reraised botos PermissionError when you didn't have permission to read the file. I think s3fs exposes both of these as IOErrors here. We might be able to work around this

cc @jreback @martindurant

@martindurant
Copy link
Contributor

s3fs now includes original boto3 exceptions when something goes wrong fsspec/s3fs#42

For your question above, I believe s3: and s3n: work exactly the same from our point of view; but we don't touch the block filesystem emulation (this is ancient, I don't know if it's used in the wild).

The failure above appears to be a typo:

File "/home/travis/build/pydata/pandas/pandas/io/parsers.py", line 36, in <module>
    need_text_wrapping = (compat.BtyesIO,)
AttributeError: module 'pandas.compat' has no attribute 'BtyesIO'

@jseabold
Copy link
Contributor

I haven't yet tried this PR, but I've had some issues with s3fs lately [1], and I'd like to try this out before it gets merged. Assuming liberal permissions on s3 buckets and objects and dealing with profiles has been a bit of an issue across third-party tools for me lately. It could be an issue with permissions, but I suspect it's going to continue to come up.

E.g., I can do this

import boto3
session = boto3.Session(profile_name='myprofile')
s3 = session.client('s3')
key = s3.get_object(Bucket=BUCKET, Key=KEY)

but not

import s3fs
fs = s3fs.S3FileSystem(profile_name='myprofile')
f = fs.open('s3://' + BUCKET + '/' + KEY)

and I don't see a way to set the profile explicitly in S3File and it doesn't seem to matter if I set my AWS_PROFILE correctly, but I haven't gone deep here.

[1] fsspec/s3fs#38

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented May 18, 2016

@jseabold yeah, I've been hitting the same issue. For your specific case, maybe try

import s3fs
fs = s3fs.S3FileSystem(profile_name='myprofile', anon=False)
f = fs.open('s3://' + BUCKET + '/' + KEY)

If I use anon=False though in pandas though, I'm unable to read from publicly available files like s3://pandas-test/tips.csv'. Withanon=None` (the default) I can't read from private buckets.

I spent last night trying to figure out a way to handle both, no luck yet.

@jseabold
Copy link
Contributor

jseabold commented May 18, 2016

Ah, sure. Isn't there a try/except for that in the boto code that's removed by this PR?

I'm just really worried that this is going to break everything for me. After having to downgrade dask, use a patched conda, use a patched odo, and run boto master on Python 3 until the last release, I'm super wary about these things.

@jreback
Copy link
Contributor

jreback commented May 18, 2016

this won't be till 0.19 I think we have it marked in any event

@TomAugspurger
Copy link
Contributor Author

Isn't there a try/except for that in the boto code that's removed by this PR?

Yep, I was hoping to avoid that in pandas, but it might be necessary.

@jseabold I have a bunch of stuff on AWS too and I share your concern about this breaking things.
I'll be writing many more tests. This is still a ways from being done, but I'll keep you updated.

@martindurant
Copy link
Contributor

martindurant commented May 18, 2016

I'm not sure if we're dealing with one issue here or more:

  • the profile_name keyword to S3FileSystem is designed to be passed along to boto3.Session, just like above
  • connections are attempted with anon=False first, and then fallback to True if not specified;
  • publicly available buckets should always be available, regardless of anon.

I don't have boto profiles, just a [default] section in my .aws/credentials (no relevant env variables). How should I go about recreating your situation? In any case, I suggest the discussion should happen over on s3fs.

@jseabold
Copy link
Contributor

@martindurant I've started today trying to collect the info for my typical permissions, so you can replicate this stuff.

@jseabold
Copy link
Contributor

Is the original motivating bug in boto fixed with the newest release? I'm not sure if it's the same config parsing one I alluded to with conda that I fixed with a monkey patch.

What's the motivation for this over using boto3? I need to read the linked thread more carefully. I'd tentatively suggest not to add reliance on another third party package and introduce some more complexity here. The read_csv code is already pretty hard to follow with the conflation of source location with file format and all the options without having to dig through another package in addition to the Amazon ones.

@TomAugspurger
Copy link
Contributor Author

Is the original motivating bug in boto fixed with the newest release? I'm not sure if it's the same config parsing one I alluded to with conda that I fixed with a monkey patch.

Yep, not wanting to manage the boto to boto3 switch within pandas was what initially prompted this.
It's a tradeoff between the complexity of (re)implementing it within pandas vs. the complexity of accepting a dependency. I'll try to get something more polished done tonight.

@jseabold
Copy link
Contributor

It's a tradeoff between the complexity of (re)implementing it within pandas vs. the complexity of accepting a dependency.

That's true, but you have control over what is getting implemented, so you don't have extra calls around things that a filesystem mimic might want, but a single stream a key to memory thing might not. To be fair, it looks like the S3File objects avoid some of the problems that I've been running in to with permissions.

This is probably out of scope for this PR, but I've often wanted to be able to pass a boto(3) Key object to read_csv rather than having to check how pandas is doing the session creation. If S3File is sufficiently file-like I'd love to skip the discovery and make me an s3 client steps, so I don't have to remind myself.

@jreback
Copy link
Contributor

jreback commented May 18, 2016

@jseabold this is to REDUCE complexity in the code. boto3 is somewhat raw. This makes us just treat it directly as a filesystem and is pretty generic. This is how all file-ilke things should be done, abstract them away from pandas directly. (e.g.this is done implicity with compression / unicode conversions, though these should be abstracted a bit more, but they are usually just a line or 2).

This makes it so pandas can simply use these w/o having to worry about messy details like authentication and such which the service can simply care about (and upgrade / fix as needed).

@martindurant
Copy link
Contributor

You might be interested that https://github.com/dask/hdfs3 provides the very same interface to HDFS - perhaps a less used data source for pandas, but an option that might be useful in the future. We may also make the same interface for other cloud storage services, depending on demand/complexity.

@jseabold
Copy link
Contributor

I agree with the sentiment and am all for reducing the complexity (in pandas) around csv reading, file-like objects, location discovery, and authentication around those locations.

@jseabold
Copy link
Contributor

@TomAugspurger This may be what's going on with anon for you.

@jorisvandenbossche jorisvandenbossche modified the milestones: 0.20.0, 0.19.0 Jul 8, 2016
@jreback
Copy link
Contributor

jreback commented Jul 15, 2016

update for 0.19.0?

@@ -15,7 +15,7 @@
from pandas.tseries.offsets import Day, MonthEnd


class TestPickle():
class TestPickle(tm.TestCase):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it run test generators? see docstring:

    NOTE: TestPickle can't be a subclass of tm.Testcase to use test generator.
    http://stackoverflow.com/questions/6689537/
    nose-test-generators-inside-class

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this was a stray commit from a different branch, trying to get pytest running. Will remove.

On Jul 14, 2016, at 19:23, Sinhrks notifications@github.com wrote:

In pandas/io/tests/test_pickle.py:

@@ -15,7 +15,7 @@
from pandas.tseries.offsets import Day, MonthEnd

-class TestPickle():
+class TestPickle(tm.TestCase):
Does it run test generators? see docstring:

NOTE: TestPickle can't be a subclass of tm.Testcase to use test generator.
http://stackoverflow.com/questions/6689537/
nose-test-generators-inside-class


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.

@TomAugspurger
Copy link
Contributor Author

I'll revisit this mid next week.

On Jul 14, 2016, at 19:10, Jeff Reback notifications@github.com wrote:

update for 0.19.0?


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.

@jreback
Copy link
Contributor

jreback commented Jul 24, 2016

@TomAugspurger ?

@jorisvandenbossche
Copy link
Member

@TomAugspurger What's the status of this?

@@ -40,6 +40,12 @@
import pandas.lib as lib
import pandas.parser as _parser

try:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move to io/common.py

@jreback
Copy link
Contributor

jreback commented Dec 7, 2016

let's do this after #14576 as that's changes the paths for the io code

@codecov-io
Copy link

codecov-io commented Dec 7, 2016

Current coverage is 84.64% (diff: 21.05%)

Merging #13137 into master will increase coverage by 0.07%

@@             master     #13137   diff @@
==========================================
  Files           144        144          
  Lines         51057      51016    -41   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43180      43184     +4   
+ Misses         7877       7832    -45   
  Partials          0          0          

Powered by Codecov. Last update f1cfe5b...92ac063

@TomAugspurger
Copy link
Contributor Author

I think this is ready to go, if anyone has further comments.

@@ -93,17 +93,15 @@ Backwards incompatible API changes

.. _whatsnew_0200.api:


- pandas now uses `s3fs <http://s3fs.readthedocs.io/>`_ for handling S3 connections. This shouldn't break
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe make this a note / warning box?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Warning looked a bit strange, so I made it a separate section (same level as "Other API Changes").

screen shot 2016-12-16 at 7 54 30 am

@jreback
Copy link
Contributor

jreback commented Dec 16, 2016

lgtm.

@martindurant ?

@@ -97,13 +97,16 @@ Backwards incompatible API changes
- ``CParserError`` has been renamed to ``ParserError`` in ``pd.read_csv`` and will be removed in the future (:issue:`12665`)
- ``SparseArray.cumsum()`` and ``SparseSeries.cumsum()`` will now always return ``SparseArray`` and ``SparseSeries`` respectively (:issue:`12855`)

S3 File Handling
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe add a ref tag



pandas now uses `s3fs <http://s3fs.readthedocs.io/>`_ for handling S3 connections. This shouldn't break
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need a doc section itself somewhere? so that we can put a url in various doc-strings, e.g. read_csv? (can always do as a followup)

@martindurant
Copy link
Contributor

No comments from me on the pandas code.
I do expect potential problems in the future as boto3/botocore changes internal things, but that would have been true for any wrapping. Will need to keep an eye on that.

@@ -262,7 +262,7 @@ Optional Dependencies
* `XlsxWriter <https://pypi.python.org/pypi/XlsxWriter>`__: Alternative Excel writer

* `Jinja2 <http://jinja.pocoo.org/>`__: Template engine for conditional HTML formatting.
* `boto <https://pypi.python.org/pypi/boto>`__: necessary for Amazon S3 access.
* `s3fs <http://s3fs.readthedocs.io/>`__: necessary for Amazon S3 access.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we have a min requirement for s3fs? (0.7.0)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not yet, may as well require the latest? (0.0.7)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I anticipate making a 0.0.8 soon, as there were a couple of additions since August, including boto3 version stuff. Best make it the latest, then.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think older versions work though, so maybe this is not an issue.

@dhimmel
Copy link
Contributor

dhimmel commented Dec 16, 2016

It probably makes sense to merge #14880 before this, which is part 2 of the compression refactoring. #14880 changes a few of the S3 tests.

@jreback
Copy link
Contributor

jreback commented Dec 17, 2016

@TomAugspurger give this a rebase and test.

@TomAugspurger
Copy link
Contributor Author

Travis is green.

@jreback jreback closed this in dc4b070 Dec 19, 2016
@jreback
Copy link
Contributor

jreback commented Dec 19, 2016

thanks @TomAugspurger

I changed a name in the excel tests.

ShaharBental pushed a commit to ShaharBental/pandas that referenced this pull request Dec 26, 2016
closes pandas-dev#11915

Author: Tom Augspurger <tom.augspurger88@gmail.com>

Closes pandas-dev#13137 from TomAugspurger/s3fs and squashes the following commits:

92ac063 [Tom Augspurger] CI: Update deps, docs
81690b5 [Tom Augspurger] COMPAT/REF: Use s3fs for s3 IO
@TomAugspurger TomAugspurger deleted the s3fs branch April 5, 2017 02:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO Data IO issues that don't fit into a more specific label
Projects
None yet
Development

Successfully merging this pull request may close these issues.

io/common.py: boto3 with python 3.5
8 participants