COMPAT/REF: Use s3fs for s3 IO #13137

TomAugspurger · 2016-05-11T12:30:13Z

closes io/common.py: boto3 with python 3.5 #11915
tests added / passed
passes git diff upstream/master | flake8 --diff
whatsnew entry

jreback · 2016-05-11T12:35:07Z

pandas/io/common.py

-        return filepath_or_buffer, None, compression
+            return filepath_or_buffer, encoding, compression
+        except FileNotFoundError:
+            raise boto.exception.S3ResponseError


is this an appropriate error? why is FileNotFoundError ok here?

actually is this try/except needed? (as its not around fs.open anyhow)

Not sure why I added that... Doesn't seem to be necessary.

jreback · 2016-05-11T12:36:40Z

need to add s3fs to .pip (where we have boto now)

TomAugspurger · 2016-05-11T12:37:04Z

This is just refactoring the reading for now. We should be able to support writing to s3 buckets pretty easily now too. A few things

pandas accepts s3://, s3n://, and s3a:// protocols, but s3fs only expects the regular s3://. I work around that here. This was the best source I found on the differences, but I don't know if simply chaning the protocol will break anything. At the end of the day we're just pulling down bytes, which should presumably represent a CSV
API Changes: Before we raised OSError when the s3 file wasn't found, and reraised botos PermissionError when you didn't have permission to read the file. I think s3fs exposes both of these as IOErrors here. We might be able to work around this

cc @jreback @martindurant

martindurant · 2016-05-17T14:25:00Z

s3fs now includes original boto3 exceptions when something goes wrong fsspec/s3fs#42

For your question above, I believe s3: and s3n: work exactly the same from our point of view; but we don't touch the block filesystem emulation (this is ancient, I don't know if it's used in the wild).

The failure above appears to be a typo:

File "/home/travis/build/pydata/pandas/pandas/io/parsers.py", line 36, in <module>
    need_text_wrapping = (compat.BtyesIO,)
AttributeError: module 'pandas.compat' has no attribute 'BtyesIO'

jseabold · 2016-05-18T02:48:26Z

I haven't yet tried this PR, but I've had some issues with s3fs lately [1], and I'd like to try this out before it gets merged. Assuming liberal permissions on s3 buckets and objects and dealing with profiles has been a bit of an issue across third-party tools for me lately. It could be an issue with permissions, but I suspect it's going to continue to come up.

E.g., I can do this

import boto3
session = boto3.Session(profile_name='myprofile')
s3 = session.client('s3')
key = s3.get_object(Bucket=BUCKET, Key=KEY)

but not

import s3fs
fs = s3fs.S3FileSystem(profile_name='myprofile')
f = fs.open('s3://' + BUCKET + '/' + KEY)

and I don't see a way to set the profile explicitly in S3File and it doesn't seem to matter if I set my AWS_PROFILE correctly, but I haven't gone deep here.

[1] fsspec/s3fs#38

TomAugspurger · 2016-05-18T03:08:50Z

@jseabold yeah, I've been hitting the same issue. For your specific case, maybe try

import s3fs
fs = s3fs.S3FileSystem(profile_name='myprofile', anon=False)
f = fs.open('s3://' + BUCKET + '/' + KEY)

If I use anon=False though in pandas though, I'm unable to read from publicly available files like s3://pandas-test/tips.csv'. Withanon=None` (the default) I can't read from private buckets.

I spent last night trying to figure out a way to handle both, no luck yet.

jseabold · 2016-05-18T03:19:35Z

Ah, sure. Isn't there a try/except for that in the boto code that's removed by this PR?

I'm just really worried that this is going to break everything for me. After having to downgrade dask, use a patched conda, use a patched odo, and run boto master on Python 3 until the last release, I'm super wary about these things.

jreback · 2016-05-18T03:21:21Z

this won't be till 0.19 I think we have it marked in any event

TomAugspurger · 2016-05-18T03:32:41Z

Isn't there a try/except for that in the boto code that's removed by this PR?

Yep, I was hoping to avoid that in pandas, but it might be necessary.

@jseabold I have a bunch of stuff on AWS too and I share your concern about this breaking things.
I'll be writing many more tests. This is still a ways from being done, but I'll keep you updated.

martindurant · 2016-05-18T03:43:24Z

I'm not sure if we're dealing with one issue here or more:

the profile_name keyword to S3FileSystem is designed to be passed along to boto3.Session, just like above
connections are attempted with anon=False first, and then fallback to True if not specified;
publicly available buckets should always be available, regardless of anon.

I don't have boto profiles, just a [default] section in my .aws/credentials (no relevant env variables). How should I go about recreating your situation? In any case, I suggest the discussion should happen over on s3fs.

jseabold · 2016-05-18T03:53:54Z

@martindurant I've started today trying to collect the info for my typical permissions, so you can replicate this stuff.

jseabold · 2016-05-18T05:16:08Z

Is the original motivating bug in boto fixed with the newest release? I'm not sure if it's the same config parsing one I alluded to with conda that I fixed with a monkey patch.

What's the motivation for this over using boto3? I need to read the linked thread more carefully. I'd tentatively suggest not to add reliance on another third party package and introduce some more complexity here. The read_csv code is already pretty hard to follow with the conflation of source location with file format and all the options without having to dig through another package in addition to the Amazon ones.

TomAugspurger · 2016-05-18T12:13:08Z

Is the original motivating bug in boto fixed with the newest release? I'm not sure if it's the same config parsing one I alluded to with conda that I fixed with a monkey patch.

Yep, not wanting to manage the boto to boto3 switch within pandas was what initially prompted this.
It's a tradeoff between the complexity of (re)implementing it within pandas vs. the complexity of accepting a dependency. I'll try to get something more polished done tonight.

jseabold · 2016-05-18T13:45:56Z

It's a tradeoff between the complexity of (re)implementing it within pandas vs. the complexity of accepting a dependency.

That's true, but you have control over what is getting implemented, so you don't have extra calls around things that a filesystem mimic might want, but a single stream a key to memory thing might not. To be fair, it looks like the S3File objects avoid some of the problems that I've been running in to with permissions.

This is probably out of scope for this PR, but I've often wanted to be able to pass a boto(3) Key object to read_csv rather than having to check how pandas is doing the session creation. If S3File is sufficiently file-like I'd love to skip the discovery and make me an s3 client steps, so I don't have to remind myself.

jreback · 2016-05-18T14:00:49Z

@jseabold this is to REDUCE complexity in the code. boto3 is somewhat raw. This makes us just treat it directly as a filesystem and is pretty generic. This is how all file-ilke things should be done, abstract them away from pandas directly. (e.g.this is done implicity with compression / unicode conversions, though these should be abstracted a bit more, but they are usually just a line or 2).

This makes it so pandas can simply use these w/o having to worry about messy details like authentication and such which the service can simply care about (and upgrade / fix as needed).

martindurant · 2016-05-18T14:04:27Z

You might be interested that https://github.com/dask/hdfs3 provides the very same interface to HDFS - perhaps a less used data source for pandas, but an option that might be useful in the future. We may also make the same interface for other cloud storage services, depending on demand/complexity.

jseabold · 2016-05-18T14:51:19Z

I agree with the sentiment and am all for reducing the complexity (in pandas) around csv reading, file-like objects, location discovery, and authentication around those locations.

jseabold · 2016-05-21T13:23:46Z

@TomAugspurger This may be what's going on with anon for you.

jreback · 2016-07-15T00:10:52Z

update for 0.19.0?

sinhrks · 2016-07-15T00:23:46Z

pandas/io/tests/test_pickle.py

@@ -15,7 +15,7 @@
 from pandas.tseries.offsets import Day, MonthEnd


-class TestPickle():
+class TestPickle(tm.TestCase):


Does it run test generators? see docstring:

NOTE: TestPickle can't be a subclass of tm.Testcase to use test generator. http://stackoverflow.com/questions/6689537/ nose-test-generators-inside-class

I think this was a stray commit from a different branch, trying to get pytest running. Will remove.

On Jul 14, 2016, at 19:23, Sinhrks notifications@github.com wrote:

In pandas/io/tests/test_pickle.py:

@@ -15,7 +15,7 @@
from pandas.tseries.offsets import Day, MonthEnd

-class TestPickle():
+class TestPickle(tm.TestCase):
Does it run test generators? see docstring:

NOTE: TestPickle can't be a subclass of tm.Testcase to use test generator. http://stackoverflow.com/questions/6689537/ nose-test-generators-inside-class

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.

TomAugspurger · 2016-07-15T02:27:52Z

I'll revisit this mid next week.

On Jul 14, 2016, at 19:10, Jeff Reback notifications@github.com wrote:

update for 0.19.0?

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.

jreback · 2016-07-24T15:47:46Z

@TomAugspurger ?

jorisvandenbossche · 2016-08-18T21:37:28Z

@TomAugspurger What's the status of this?

jreback · 2016-12-07T11:58:37Z

pandas/io/parsers.py

@@ -40,6 +40,12 @@
 import pandas.lib as lib
 import pandas.parser as _parser

+try:


move to io/common.py

jreback · 2016-12-07T12:02:15Z

let's do this after #14576 as that's changes the paths for the io code

codecov-io · 2016-12-07T14:04:49Z

Current coverage is 84.64% (diff: 21.05%)

Merging #13137 into master will increase coverage by 0.07%

@@             master     #13137   diff @@
==========================================
  Files           144        144          
  Lines         51057      51016    -41   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43180      43184     +4   
+ Misses         7877       7832    -45   
  Partials          0          0

Powered by Codecov. Last update f1cfe5b...92ac063

TomAugspurger · 2016-12-16T13:03:35Z

I think this is ready to go, if anyone has further comments.

jreback · 2016-12-16T13:37:08Z

doc/source/whatsnew/v0.20.0.txt

@@ -93,17 +93,15 @@ Backwards incompatible API changes

 .. _whatsnew_0200.api:

-
+- pandas now uses `s3fs <http://s3fs.readthedocs.io/>`_ for handling S3 connections. This shouldn't break


maybe make this a note / warning box?

Warning looked a bit strange, so I made it a separate section (same level as "Other API Changes").

jreback · 2016-12-16T13:38:18Z

lgtm.

@martindurant ?

jreback · 2016-12-16T14:29:40Z

doc/source/whatsnew/v0.20.0.txt

@@ -97,13 +97,16 @@ Backwards incompatible API changes
 - ``CParserError`` has been renamed to ``ParserError`` in ``pd.read_csv`` and will be removed in the future (:issue:`12665`)
 - ``SparseArray.cumsum()`` and ``SparseSeries.cumsum()`` will now always return ``SparseArray`` and ``SparseSeries`` respectively (:issue:`12855`)

+S3 File Handling


maybe add a ref tag

jreback · 2016-12-16T14:30:21Z

doc/source/whatsnew/v0.20.0.txt


-
+pandas now uses `s3fs <http://s3fs.readthedocs.io/>`_ for handling S3 connections. This shouldn't break


do we need a doc section itself somewhere? so that we can put a url in various doc-strings, e.g. read_csv? (can always do as a followup)

martindurant · 2016-12-16T14:40:42Z

No comments from me on the pandas code.
I do expect potential problems in the future as boto3/botocore changes internal things, but that would have been true for any wrapping. Will need to keep an eye on that.

jreback · 2016-12-16T14:59:51Z

doc/source/install.rst

@@ -262,7 +262,7 @@ Optional Dependencies
  * `XlsxWriter <https://pypi.python.org/pypi/XlsxWriter>`__: Alternative Excel writer

 * `Jinja2 <http://jinja.pocoo.org/>`__: Template engine for conditional HTML formatting.
-* `boto <https://pypi.python.org/pypi/boto>`__: necessary for Amazon S3 access.
+* `s3fs <http://s3fs.readthedocs.io/>`__: necessary for Amazon S3 access.


do we have a min requirement for s3fs? (0.7.0)?

Not yet, may as well require the latest? (0.0.7)

I anticipate making a 0.0.8 soon, as there were a couple of additions since August, including boto3 version stuff. Best make it the latest, then.

I think older versions work though, so maybe this is not an issue.

dhimmel · 2016-12-16T20:57:07Z

It probably makes sense to merge #14880 before this, which is part 2 of the compression refactoring. #14880 changes a few of the S3 tests.

jreback · 2016-12-17T23:10:09Z

@TomAugspurger give this a rebase and test.

TomAugspurger · 2016-12-19T12:30:34Z

Travis is green.

jreback · 2016-12-19T12:56:33Z

thanks @TomAugspurger

I changed a name in the excel tests.

closes pandas-dev#11915 Author: Tom Augspurger <tom.augspurger88@gmail.com> Closes pandas-dev#13137 from TomAugspurger/s3fs and squashes the following commits: 92ac063 [Tom Augspurger] CI: Update deps, docs 81690b5 [Tom Augspurger] COMPAT/REF: Use s3fs for s3 IO

TomAugspurger added the IO Data IO issues that don't fit into a more specific label label May 11, 2016

TomAugspurger added this to the 0.19.0 milestone May 11, 2016

jreback reviewed May 11, 2016
View reviewed changes

jorisvandenbossche modified the milestones: 0.20.0, 0.19.0 Jul 8, 2016

sinhrks reviewed Jul 15, 2016
View reviewed changes

jreback mentioned this pull request Aug 8, 2016

Reading SAS files from AWS S3 #13939

Closed

jreback reviewed Dec 7, 2016

View reviewed changes

pandas/io/parsers.py

@@ -40,6 +40,12 @@

import pandas.lib as lib

import pandas.parser as _parser

try:

Copy link

Contributor

jreback Dec 7, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move to io/common.py

TomAugspurger force-pushed the s3fs branch from 63f8239 to 6dcaa0d Compare December 7, 2016 14:03

TomAugspurger force-pushed the s3fs branch from 6dcaa0d to e5f33f4 Compare December 8, 2016 02:45

dhimmel mentioned this pull request Dec 13, 2016

Create tests for reading compressed tables from S3 URLs #14875

Closed

TomAugspurger force-pushed the s3fs branch 2 times, most recently from 332fcd6 to d28bd7c Compare December 15, 2016 15:30

jreback reviewed Dec 16, 2016

View reviewed changes

TomAugspurger force-pushed the s3fs branch from d28bd7c to c844a66 Compare December 16, 2016 13:55

jreback reviewed Dec 16, 2016

View reviewed changes

TomAugspurger force-pushed the s3fs branch from c844a66 to 2da3e7b Compare December 16, 2016 15:38

jreback mentioned this pull request Dec 16, 2016

DOC for refactored compression (GH14576) + BUG: bz2-compressed URL with C engine (GH14874) #14880

Closed

1 task

COMPAT/REF: Use s3fs for s3 IO

81690b5

TomAugspurger force-pushed the s3fs branch 2 times, most recently from e5f6b18 to 5c66be6 Compare December 18, 2016 23:48

CI: Update deps, docs

92ac063

TomAugspurger force-pushed the s3fs branch from 5c66be6 to 92ac063 Compare December 18, 2016 23:49

jreback closed this in dc4b070 Dec 19, 2016

TomAugspurger deleted the s3fs branch April 5, 2017 02:07

		@@ -93,17 +93,15 @@ Backwards incompatible API changes

		.. _whatsnew_0200.api:


		- pandas now uses `s3fs <http://s3fs.readthedocs.io/>`_ for handling S3 connections. This shouldn't break



		pandas now uses `s3fs <http://s3fs.readthedocs.io/>`_ for handling S3 connections. This shouldn't break

COMPAT/REF: Use s3fs for s3 IO #13137

COMPAT/REF: Use s3fs for s3 IO #13137

Conversation

TomAugspurger commented May 11, 2016 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jreback commented May 11, 2016

TomAugspurger commented May 11, 2016

martindurant commented May 17, 2016

jseabold commented May 18, 2016

TomAugspurger commented May 18, 2016 • edited Loading

jseabold commented May 18, 2016 • edited Loading

jreback commented May 18, 2016

TomAugspurger commented May 18, 2016

martindurant commented May 18, 2016 • edited Loading

jseabold commented May 18, 2016

jseabold commented May 18, 2016

TomAugspurger commented May 18, 2016

jseabold commented May 18, 2016

jreback commented May 18, 2016

martindurant commented May 18, 2016

jseabold commented May 18, 2016

jseabold commented May 21, 2016

jreback commented Jul 15, 2016

Choose a reason for hiding this comment

Choose a reason for hiding this comment

TomAugspurger commented Jul 15, 2016

jreback commented Jul 24, 2016

jorisvandenbossche commented Aug 18, 2016

Choose a reason for hiding this comment

jreback commented Dec 7, 2016

codecov-io commented Dec 7, 2016 • edited Loading

Current coverage is 84.64% (diff: 21.05%)

TomAugspurger commented Dec 16, 2016

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jreback commented Dec 16, 2016

Choose a reason for hiding this comment

Choose a reason for hiding this comment

martindurant commented Dec 16, 2016

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dhimmel commented Dec 16, 2016

jreback commented Dec 17, 2016

TomAugspurger commented Dec 19, 2016

jreback commented Dec 19, 2016

TomAugspurger commented May 11, 2016 •

edited

Loading

TomAugspurger commented May 18, 2016 •

edited

Loading

jseabold commented May 18, 2016 •

edited

Loading

martindurant commented May 18, 2016 •

edited

Loading

codecov-io commented Dec 7, 2016 •

edited

Loading