
Allow writing to S3 paths #8508

Closed · hunterowens opened this issue Oct 8, 2014 · 19 comments · Fixed by #29920
Labels: IO CSV (read_csv, to_csv), IO Data (IO issues that don't fit into a more specific label), IO Network (Local or Cloud (AWS, GCS, etc.) IO issues)

Comments

@hunterowens (Contributor)

It would be really great if to_(filetype) supported writing to S3.

Here is an example upload-to-S3 function that takes a local file and places it in an S3 bucket.

import os
import sys

import boto


def upload_to_s3(local_file_path, file_name, bucket_name, s3_directory):
    """
    Upload a local file to the appropriate S3 key, printing progress dots.

    Parameters
    ----------
    local_file_path : str
        e.g. 'my/local/path'
    file_name : str
        e.g. 'cleaned_data.csv'
    bucket_name : str
        e.g. 'dsapp-edu-data'
    s3_directory : str
        e.g. 'NC-Cabarrus/cleaned_data'
    """

    def percent_cb(complete, total):
        """Print a dot for each uploaded chunk to show progress."""
        sys.stdout.write('.')
        sys.stdout.flush()

    conn = boto.connect_s3()
    bucket = conn.get_bucket(bucket_name)
    full_key_name = os.path.join(s3_directory, file_name)
    key = bucket.new_key(full_key_name)
    full_filepath = os.path.join(local_file_path, file_name)
    key.set_contents_from_filename(full_filepath, cb=percent_cb, num_cb=10)
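
A hypothetical call, reusing the example values from the docstring:

upload_to_s3('my/local/path', 'cleaned_data.csv', 'dsapp-edu-data', 'NC-Cabarrus/cleaned_data')
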
@TomAugspurger added the IO Data label on Nov 11, 2014
@TomAugspurger added this to the 0.16.0 milestone on Nov 11, 2014
@TomAugspurger (Contributor)

I'll try to get to this by 0.16

@jreback modified the milestones: 0.16.0 → Next Major Release on Mar 6, 2015
@andrewgiessel

Any PRs open on this? I thought of this too and think it'd be great. I can take a shot at it if not.

@TomAugspurger (Contributor)

Go for it!


@swarajban
I also wanted the ability to write a DF as a CSV to S3 (acronym overload...), and wrote the following snippet:

from io import StringIO

import boto3

aws_session = boto3.Session()  # assumes credentials are configured elsewhere
s3_bucket = 'my-bucket'        # placeholder bucket name

# Write dataframe to buffer
csv_buffer = StringIO()
df.to_csv(csv_buffer, index=False)  # df: the DataFrame to upload

# Upload CSV to S3
s3_key = 'test.csv'
s3_resource = aws_session.resource('s3')
s3_resource.Object(s3_bucket, s3_key).put(Body=csv_buffer.getvalue())

Hope this helps anyone trying to do the same thing. IMO, this feature shouldn't be added to pandas; I think this snippet (or a better version) should simply be documented.

Edit: this is using Python 3.5 and boto3. I'm sure a similar snippet will work for 2.7 or the old boto.

@maximveksler (Contributor)

Any hope of supporting writing to S3 for the new release? Now that Parquet is supported, this becomes doubly interesting.

@jreback (Contributor) commented Nov 26, 2017

@maximveksler this is not very hard to do, as we already have a dependency for S3 interactions (https://pypi.python.org/pypi/s3fs), and pyarrow also supports S3. Want to do a pull request?
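
As a minimal sketch of the pieces mentioned here (not the eventual pandas implementation): s3fs provides a writable binary handle, and pyarrow writes the Parquet bytes through it. The bucket/key below are placeholders, and credentials are assumed to be configured in the environment.

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import s3fs

fs = s3fs.S3FileSystem(anon=False)
table = pa.Table.from_pandas(pd.DataFrame({"a": range(5)}))

# s3fs hands back a binary file object; pyarrow writes Parquet into it.
with fs.open("my-bucket/data.parquet", "wb") as f:  # placeholder bucket/key
    pq.write_table(table, f)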

@maximveksler (Contributor)

@jreback sure, please point me to the relevant locations and I'll gladly open a PR.

I'd appreciate some guidance on which parts of s3fs and pyarrow I should be looking into.

@jreback (Contributor) commented Nov 27, 2017

http://pandas.pydata.org/pandas-docs/stable/contributing.html#

Here are the tests for reading:

pandas/tests/io/parser/test_network.py:    def test_parse_public_s3n_bucket(self, s3_resource):
pandas/tests/io/parser/test_network.py:    def test_parse_public_s3a_bucket(self, s3_resource):
pandas/tests/io/parser/test_network.py:    def test_parse_public_s3_bucket_nrows(self, s3_resource):
pandas/tests/io/parser/test_network.py:    def test_parse_public_s3_bucket_chunked(self, s3_resource):
pandas/tests/io/parser/test_network.py:    def test_parse_public_s3_bucket_chunked_python(self, s3_resource):
pandas/tests/io/parser/test_network.py:    def test_parse_public_s3_bucket_python(self, s3_resource):
pandas/tests/io/parser/test_network.py:    def test_infer_s3_compression(self, s3_resource):
pandas/tests/io/parser/test_network.py:    def test_parse_public_s3_bucket_nrows_python(self, s3_resource):
pandas/tests/io/parser/test_network.py:    def test_s3_fails(self, s3_resource):
pandas/tests/io/parser/test_network.py:                                             s3_resource,
pandas/tests/io/parser/test_network.py:        s3_object = s3_resource.meta.client.get_object(

Writing routines should go in pandas/io/s3.py.

@CrossNox

Hi! Was the PR created?

@maximveksler (Contributor)

This is related: #19135

@CrossNox

@maximveksler what about other writers, like .to_csv()?

@maximveksler (Contributor)

@CrossNox looking at the implementation, I think it "should work". Could you please test and report back?

@CrossNox

Python v2.7.12
Pandas v0.22.0
Example snippet:

import pandas as pd
import s3fs

fs = s3fs.S3FileSystem(anon=False)
df = pd.read_csv('s3://***/xxx.csv')  # in a Jupyter notebook, .head(5) displays the df nicely
df.to_csv("s3://***/xxx2.csv")

Raises the following error:

---------------------------------------------------------------------------
IOError                                   Traceback (most recent call last)
<ipython-input-4-51fa14b98bc4> in <module>()
----> 1 df.to_csv("s3://***/xxx2.csv")

/usr/local/lib/python2.7/dist-packages/pandas/core/frame.pyc in to_csv(self, path_or_buf, sep, na_rep, float_format, columns, header, index, index_label, mode, encoding, compression, quoting, quotechar, line_terminator, chunksize, tupleize_cols, date_format, doublequote, escapechar, decimal)
   1522                                      doublequote=doublequote,
   1523                                      escapechar=escapechar, decimal=decimal)
-> 1524         formatter.save()
   1525 
   1526         if path_or_buf is None:

/usr/local/lib/python2.7/dist-packages/pandas/io/formats/format.pyc in save(self)
   1635             f, handles = _get_handle(self.path_or_buf, self.mode,
   1636                                      encoding=encoding,
-> 1637                                      compression=self.compression)
   1638             close = True
   1639 

/usr/local/lib/python2.7/dist-packages/pandas/io/common.pyc in _get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text)
    385         if compat.PY2:
    386             # Python 2
--> 387             f = open(path_or_buf, mode)
    388         elif encoding:
    389             # Python 3 and encoding

IOError: [Errno 2] No such file or directory: 's3://***/xxx2.csv'

So, right now I'm saving the file as:

with fs.open('s3://***/xxx2.csv','wb') as f:
    df.to_csv(f)

@datapythonista modified the milestones: Contributions Welcome → Someday on Jul 8, 2018
@bnaul (Contributor) commented Sep 13, 2018

@TomAugspurger I would like to sort this out, as well as writing to GCS, as a follow-up to #20729. Is there a reason that reading and writing generally seem to use two different methods for accessing file-like objects (get_filepath_or_buffer vs. _get_handle)? If that logic were unified, there'd be no extra work needed.

For S3 specifically, there is another issue (sort of captured by #9712), which is that writing CSVs in mode='wb' currently doesn't work and s3fs only supports binary writes. For GCS, just adding

path_or_buf, *_ = get_filepath_or_buffer(path_or_buf, encoding=encoding,
                                         compression=compression, mode=mode)

to CSVFormatter or to_csv seems like it would be enough.
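
For illustration, here is one user-level way around the binary-only constraint described above today: wrap the s3fs handle in a TextIOWrapper so that to_csv's str output is encoded to bytes on the way through. This is a sketch with a placeholder bucket, not pandas internals.

import io

import pandas as pd
import s3fs

fs = s3fs.S3FileSystem(anon=False)
df = pd.DataFrame({"a": range(5)})

# s3fs only supports binary writes, so encode to_csv's text output
# before it reaches S3.
with fs.open("my-bucket/out.csv", "wb") as binary_handle:  # placeholder bucket
    with io.TextIOWrapper(binary_handle, encoding="utf-8") as text_handle:
        df.to_csv(text_handle)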

@bnaul mentioned this issue on Sep 14, 2018
@TomAugspurger (Contributor)

Great! I'm not sure about get_filepath_or_buffer vs. _get_handle; I'm not that familiar with the parser code.

@prakhar19 commented Jul 30, 2019

The code df.to_csv("s3://***/xxx2.csv") still does not work. Please, can this be fixed?

@oguzhanogreden (Contributor)

take

@oguzhanogreden (Contributor) commented Nov 27, 2019

Using version 0.25.1, I can do the following:

df = pd.DataFrame({"a": range(5)})

df.to_csv("s3://test-key/test.csv")

So only the documentation is missing.

@jbrockmendel added the IO Network label on Dec 11, 2019