
Allow writing to S3 paths #8508

Closed · hunterowens opened this issue Oct 8, 2014 · 19 comments · Fixed by #29920
Labels: IO CSV (read_csv, to_csv), IO Data (IO issues that don't fit into a more specific label), IO Network (Local or Cloud (AWS, GCS, etc.) IO issues)

Comments

@hunterowens (Contributor)

It would be really great if to_(filetype) supported writing to S3.

Here is an example upload-to-S3 function that takes a local file and places it in an S3 bucket.

import os
import sys

import boto


def upload_to_s3(local_file_path, file_name, bucket_name, s3_directory):
    """
    Upload a local file to the appropriate S3 key, printing progress dots.

    Parameters
    ----------
    local_file_path : str
        e.g. 'my/local/path'
    file_name : str
        e.g. 'cleaned_data.csv'
    bucket_name : str
        e.g. 'dsapp-edu-data'
    s3_directory : str
        e.g. 'NC-Cabarrus/cleaned_data'
    """

    def percent_cb(complete, total):
        """Print a dot for each uploaded chunk to show progress."""
        sys.stdout.write('.')
        sys.stdout.flush()

    conn = boto.connect_s3()
    bucket = conn.get_bucket(bucket_name)
    full_key_name = os.path.join(s3_directory, file_name)
    key = bucket.new_key(full_key_name)
    full_filepath = os.path.join(local_file_path, file_name)
    key.set_contents_from_filename(full_filepath, cb=percent_cb, num_cb=10)
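
A hypothetical call, reusing the example values from the docstring:

upload_to_s3('my/local/path', 'cleaned_data.csv', 'dsapp-edu-data', 'NC-Cabarrus/cleaned_data')
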
@TomAugspurger added the IO Data label on Nov 11, 2014
@TomAugspurger added this to the 0.16.0 milestone on Nov 11, 2014
@TomAugspurger (Contributor)

I'll try to get to this by 0.16

@jreback modified the milestones: 0.16.0 → Next Major Release on Mar 6, 2015
@andrewgiessel

Any PRs open on this? I thought of this too and think it'd be great. I can take a shot at it if not.

@TomAugspurger (Contributor)

Go for it!


@swarajban
I also wanted the ability to write a DF as a CSV to S3 (acronym overload...), and wrote the following snippet:

from io import StringIO

import boto3

aws_session = boto3.Session()  # assumes credentials are configured elsewhere
s3_bucket = 'my-bucket'        # placeholder bucket name

# Write dataframe to buffer
csv_buffer = StringIO()
df.to_csv(csv_buffer, index=False)  # df: the DataFrame to upload

# Upload CSV to S3
s3_key = 'test.csv'
s3_resource = aws_session.resource('s3')
s3_resource.Object(s3_bucket, s3_key).put(Body=csv_buffer.getvalue())

Hope this helps anyone trying to do the same thing. IMO, this feature shouldn't be added to pandas; I think this snippet (or a better version) should simply be documented.

Edit: this is using Python 3.5 and boto3. I'm sure a similar snippet will work for 2.7 or the old boto.

@maximveksler (Contributor)

Any hope of supporting writing to S3 for the new release? Now that Parquet is supported, this becomes doubly interesting.

@jreback (Contributor) commented Nov 26, 2017

@maximveksler this is not very hard to do, as we already have a dependency for S3 interactions (https://pypi.python.org/pypi/s3fs), and pyarrow also supports S3. Want to do a pull request?
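
As a minimal sketch of the pieces mentioned here (not the eventual pandas implementation): s3fs provides a writable binary handle, and pyarrow writes the Parquet bytes through it. The bucket/key below are placeholders, and credentials are assumed to be configured in the environment.

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import s3fs

fs = s3fs.S3FileSystem(anon=False)
table = pa.Table.from_pandas(pd.DataFrame({"a": range(5)}))

# s3fs hands back a binary file object; pyarrow writes Parquet into it.
with fs.open("my-bucket/data.parquet", "wb") as f:  # placeholder bucket/key
    pq.write_table(table, f)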

@maximveksler (Contributor)

@jreback sure, please point me to the relevant locations and I'll gladly open a PR.

I'd appreciate some guidance on which parts of s3fs and pyarrow I should be looking into.

@jreback (Contributor) commented Nov 27, 2017

http://pandas.pydata.org/pandas-docs/stable/contributing.html#

Here are the tests for reading:

pandas/tests/io/parser/test_network.py:    def test_parse_public_s3n_bucket(self, s3_resource):
pandas/tests/io/parser/test_network.py:    def test_parse_public_s3a_bucket(self, s3_resource):
pandas/tests/io/parser/test_network.py:    def test_parse_public_s3_bucket_nrows(self, s3_resource):
pandas/tests/io/parser/test_network.py:    def test_parse_public_s3_bucket_chunked(self, s3_resource):
pandas/tests/io/parser/test_network.py:    def test_parse_public_s3_bucket_chunked_python(self, s3_resource):
pandas/tests/io/parser/test_network.py:    def test_parse_public_s3_bucket_python(self, s3_resource):
pandas/tests/io/parser/test_network.py:    def test_infer_s3_compression(self, s3_resource):
pandas/tests/io/parser/test_network.py:    def test_parse_public_s3_bucket_nrows_python(self, s3_resource):
pandas/tests/io/parser/test_network.py:    def test_s3_fails(self, s3_resource):
pandas/tests/io/parser/test_network.py:                                             s3_resource,
pandas/tests/io/parser/test_network.py:        s3_object = s3_resource.meta.client.get_object(

Writing routines should go in pandas/io/s3.py.

@CrossNox

Hi! Was the PR created?

@maximveksler (Contributor)

This is related: #19135

@CrossNox

@maximveksler what about other writers, like .to_csv()?

@maximveksler (Contributor)

@CrossNox looking at the implementation, I think it "should work". Could you please test and report back?

@CrossNox

Python v2.7.12
Pandas v0.22.0
Example snippet:

import pandas as pd
import s3fs

fs = s3fs.S3FileSystem(anon=False)
df = pd.read_csv('s3://***/xxx.csv')  # in a Jupyter notebook, .head(5) displays the df nicely
df.to_csv("s3://***/xxx2.csv")

Raises the following error:

---------------------------------------------------------------------------
IOError                                   Traceback (most recent call last)
<ipython-input-4-51fa14b98bc4> in <module>()
----> 1 df.to_csv("s3://***/xxx2.csv")

/usr/local/lib/python2.7/dist-packages/pandas/core/frame.pyc in to_csv(self, path_or_buf, sep, na_rep, float_format, columns, header, index, index_label, mode, encoding, compression, quoting, quotechar, line_terminator, chunksize, tupleize_cols, date_format, doublequote, escapechar, decimal)
   1522                                      doublequote=doublequote,
   1523                                      escapechar=escapechar, decimal=decimal)
-> 1524         formatter.save()
   1525 
   1526         if path_or_buf is None:

/usr/local/lib/python2.7/dist-packages/pandas/io/formats/format.pyc in save(self)
   1635             f, handles = _get_handle(self.path_or_buf, self.mode,
   1636                                      encoding=encoding,
-> 1637                                      compression=self.compression)
   1638             close = True
   1639 

/usr/local/lib/python2.7/dist-packages/pandas/io/common.pyc in _get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text)
    385         if compat.PY2:
    386             # Python 2
--> 387             f = open(path_or_buf, mode)
    388         elif encoding:
    389             # Python 3 and encoding

IOError: [Errno 2] No such file or directory: 's3://***/xxx2.csv'

So, right now I'm saving the file as:

with fs.open('s3://***/xxx2.csv','wb') as f:
    df.to_csv(f)

@datapythonista modified the milestones: Contributions Welcome → Someday on Jul 8, 2018
@bnaul (Contributor) commented Sep 13, 2018

@TomAugspurger I would like to sort this out, as well as writing to GCS, as a follow-up to #20729. Is there a reason that reading and writing generally seem to use two different methods for accessing file-like objects (get_filepath_or_buffer vs. _get_handle)? If that logic were unified, there'd be no extra work needed.

For S3 specifically, there is another issue (sort of captured by #9712), which is that writing CSVs in mode='wb' currently doesn't work and s3fs only supports binary writes. For GCS, just adding

path_or_buf, *_ = get_filepath_or_buffer(path_or_buf, encoding=encoding,
                                         compression=compression, mode=mode)

to CSVFormatter or to_csv seems like it would be enough.
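
For illustration, here is one user-level way around the binary-only constraint described above today: wrap the s3fs handle in a TextIOWrapper so that to_csv's str output is encoded to bytes on the way through. This is a sketch with a placeholder bucket, not pandas internals.

import io

import pandas as pd
import s3fs

fs = s3fs.S3FileSystem(anon=False)
df = pd.DataFrame({"a": range(5)})

# s3fs only supports binary writes, so encode to_csv's text output
# before it reaches S3.
with fs.open("my-bucket/out.csv", "wb") as binary_handle:  # placeholder bucket
    with io.TextIOWrapper(binary_handle, encoding="utf-8") as text_handle:
        df.to_csv(text_handle)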

@bnaul mentioned this issue on Sep 14, 2018
@TomAugspurger (Contributor)

Great! I'm not sure about get_filepath_or_buffer vs. _get_handle; I'm not that familiar with the parser code.

@prakhar19 commented Jul 30, 2019

The code df.to_csv("s3://***/xxx2.csv") still does not work. Please, can this be fixed?

@oguzhanogreden (Contributor)

take

@oguzhanogreden (Contributor) commented Nov 27, 2019

Using version 0.25.1, I can do the following:

df = pd.DataFrame({"a": range(5)})

df.to_csv("s3://test-key/test.csv")

So only the documentation is missing.

@jbrockmendel added the IO Network label on Dec 11, 2019