Support writing CSV to GCS #22704

Merged
merged 2 commits into from
Oct 12, 2018

Conversation

bnaul
Contributor

@bnaul bnaul commented Sep 14, 2018

This seems to work as-is and doesn't break any of the IO tests. As I mentioned in #8508 (comment), getting S3 to work is a little more complicated, but maybe still not bad. Either way, this would be a step in the right direction.

cc @TomAugspurger

@pep8speaks

Hello @bnaul! Thanks for submitting the PR.

@bnaul bnaul changed the title [WIP] Support writing CSV to GCS Support writing CSV to GCS Sep 14, 2018
@gfyoung gfyoung added Enhancement IO Data IO issues that don't fit into a more specific label labels Sep 14, 2018
def test_to_csv_gcs(mock):
    df1 = DataFrame({'int': [1, 3], 'float': [2.0, np.nan], 'str': ['t', 's'],
                     'dt': date_range('2018-06-18', periods=2)})
    with mock.patch('gcsfs.GCSFileSystem') as MockFileSystem:
Member

Might be missing the point but if you are patching this what is actually getting tested for gcs?

Contributor Author

@bnaul bnaul Sep 15, 2018

Yeah, it's kind of the same problem that we discussed in #20729. This does at least test the logic that I touched here; I think ultimately what the mocks assume is that gcsfs.GCSFileSystem can read/write strings and everything else is using the real pandas methods.
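The mocking approach described above can be sketched with the standard library alone. This is an illustration, not the pandas test: `RemoteFileSystem` and `write_csv` are hypothetical stand-ins for `gcsfs.GCSFileSystem` and the pandas writer. Only the remote `open` call is replaced with an in-memory buffer, so the serialization logic still runs for real.

```python
import csv
import io
from unittest import mock


class RemoteFileSystem:
    """Hypothetical stand-in for a remote filesystem class such as gcsfs.GCSFileSystem."""

    def open(self, path, mode="r"):
        # A real implementation would talk to the network here.
        raise RuntimeError("network access is not available in unit tests")


def write_csv(path, rows):
    """Hypothetical writer: serialize rows through the handle the filesystem returns."""
    handle = RemoteFileSystem().open(path, "w")
    csv.writer(handle).writerows(rows)


rows = [["int", "str"], ["1", "t"], ["3", "s"]]
buffer = io.StringIO()

# Replace only the remote call; the CSV serialization above still runs unmocked.
with mock.patch.object(RemoteFileSystem, "open", return_value=buffer):
    write_csv("gs://test/test.csv", rows)

# Parse what landed in the buffer and compare it with the input rows.
round_tripped = list(csv.reader(io.StringIO(buffer.getvalue())))
```

As in the discussion, such a test only assumes that the filesystem object can hand back something that accepts strings. It says nothing about the behavior of the real remote service.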

Member

@WillAyd WillAyd left a comment

Thanks for the clarification on the mock

instance = MockFileSystem.return_value
instance.open.return_value = s

df1.to_csv('gs://test/test.csv', index=True)
Member

Any particular reason you are explicitly stating index=True here instead of using the default?

Contributor Author

df.to_csv(f) and pd.read_csv(f) handle the index differently so I wanted to be extra clear that the index is also being checked in the round tripping
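A small sketch of the asymmetry mentioned here, using an in-memory string rather than a GCS path (the frame and its values are made up for illustration): `to_csv` writes the index by default, while `read_csv` treats that first column as ordinary data unless `index_col` is given.

```python
import io

import pandas as pd

# Illustrative frame with a non-default index so the difference is visible.
df = pd.DataFrame({"value": [1.5, 2.5]}, index=[10, 20])

# index=True is the default, so the index is written as the first column.
csv_text = df.to_csv()

# Without index_col, the written index comes back as a regular column.
naive = pd.read_csv(io.StringIO(csv_text))

# With index_col=0, the first column is restored as the index.
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)
```

Passing `index=True` explicitly in the test, together with `index_col=0` on the read side, makes it obvious that the index is part of what gets round-tripped.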

instance.open.return_value = s

df1.to_csv('gs://test/test.csv', index=True)
df2 = read_csv(StringIO(s.getvalue()), parse_dates=['dt'], index_col=0)
Member

Related to above comment

Contributor Author

See above

@WillAyd
Member

WillAyd commented Sep 19, 2018

Could also use a related issue and whatsnew note for v0.24

@TomAugspurger
Contributor

When reading S3, we have to wrap it in a TextIOWrapper. Do we need to do the same for writing?

@codecov

codecov bot commented Oct 11, 2018

Codecov Report

Merging #22704 into master will not change coverage.
The diff coverage is 100%.

@@           Coverage Diff           @@
##           master   #22704   +/-   ##
=======================================
  Coverage    92.2%    92.2%           
=======================================
  Files         169      169           
  Lines       50924    50924           
=======================================
  Hits        46952    46952           
  Misses       3972     3972
Flag Coverage Δ
#multiple 90.62% <100%> (ø) ⬆️
#single 42.3% <100%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/io/formats/csvs.py 98.21% <100%> (ø) ⬆️

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c8ce3d0...b2f97cc. Read the comment docs.

@bnaul
Contributor Author

bnaul commented Oct 11, 2018

Hi @WillAyd @TomAugspurger,

  • Just added a whatsnew/related issue
  • The TextIOWrapper thing would indeed be needed for s3fs, but for gcsfs it isn't; I'd like to keep this small, so for now I'm going to take care of the easier case first
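For context, a minimal sketch of what the TextIOWrapper wrapping looks like on the write side, assuming a file object that only accepts bytes. Here `io.BytesIO` is a stand-in for such a handle; it is not the s3fs API, and this is not the pandas implementation.

```python
import csv
import io

# Stand-in for a remote file object that only accepts bytes.
binary_handle = io.BytesIO()

# Wrap the binary handle so text can be written to it; newline="" leaves the
# line endings produced by the csv module untouched.
text_handle = io.TextIOWrapper(binary_handle, encoding="utf-8", newline="")
csv.writer(text_handle).writerows([["int", "str"], ["1", "t"]])

# Push buffered text through to the underlying bytes object.
text_handle.flush()
payload = binary_handle.getvalue()

# Detach so the wrapper does not close the underlying handle when it is discarded.
text_handle.detach()
```

A handle that already accepts strings, which is what the comment above reports for gcsfs, would not need this extra layer.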

If anyone else wants to take a look at this I would be grateful since touching anything related to to_csv makes me a bit nervous 😬

Contributor

@TomAugspurger TomAugspurger left a comment

Feel free to merge if you're comfortable @WillAyd.

@WillAyd WillAyd merged commit 241bde1 into pandas-dev:master Oct 12, 2018
@WillAyd
Member

WillAyd commented Oct 12, 2018

Thanks @bnaul !

tm9k1 pushed a commit to tm9k1/pandas that referenced this pull request Nov 19, 2018
@bnaul bnaul deleted the gcsfs branch January 17, 2019 18:21
@5amfung

5amfung commented Aug 16, 2019

@bnaul Looks like there's no documentation on how to use this feature. Hate to see this being implemented but no one is aware of it at all.

@TomAugspurger
Contributor

@5amfung can you submit a PR?

Labels
Enhancement IO Data IO issues that don't fit into a more specific label
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Allow writing to GCS paths
6 participants