Support writing CSV to GCS #22704
Conversation
Hello @bnaul! Thanks for submitting the PR.
def test_to_csv_gcs(mock):
    df1 = DataFrame({'int': [1, 3], 'float': [2.0, np.nan], 'str': ['t', 's'],
                     'dt': date_range('2018-06-18', periods=2)})
    with mock.patch('gcsfs.GCSFileSystem') as MockFileSystem:
Might be missing the point, but if you are patching this, what is actually getting tested for GCS?
Yeah, it's kind of the same problem that we discussed in #20729. This does at least test the logic that I touched here; ultimately the mocks assume that gcsfs.GCSFileSystem can read/write strings, while everything else uses the real pandas methods. (A self-contained sketch of this pattern follows the test snippet below.)
Thanks for the clarification on the mock.
        instance = MockFileSystem.return_value
        instance.open.return_value = s

        df1.to_csv('gs://test/test.csv', index=True)
Any particular reason you are explicitly stating index=True here instead of using the default?
df.to_csv(f) and pd.read_csv(f) handle the index differently, so I wanted to be extra clear that the index is also being checked in the round-tripping.
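A small standalone illustration of that asymmetry (not part of the PR; only pandas and the standard library are assumed): to_csv writes the index by default, but read_csv only restores it as the index when index_col is passed.

from io import StringIO

import pandas as pd

df = pd.DataFrame({'a': [1, 2]}, index=['x', 'y'])
csv_text = df.to_csv()  # the index is written out as the first, unnamed column

# Without index_col=0 the old index would come back as an ordinary column;
# with it, the round trip restores the original index.
round_tripped = pd.read_csv(StringIO(csv_text), index_col=0)
print(round_tripped)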
        instance.open.return_value = s

        df1.to_csv('gs://test/test.csv', index=True)
        df2 = read_csv(StringIO(s.getvalue()), parse_dates=['dt'], index_col=0)
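For readers piecing the quoted fragments together, here is a self-contained sketch of the test pattern shown above. It assumes a pandas version contemporary with this PR and that gcsfs is importable; the names are otherwise taken from the diff. The point is that gcsfs.GCSFileSystem is patched to hand back an in-memory buffer, while to_csv and read_csv themselves run unmodified.

from io import StringIO
from unittest import mock

import numpy as np
from pandas import DataFrame, date_range, read_csv
from pandas.util.testing import assert_frame_equal


def test_to_csv_gcs_sketch():
    df1 = DataFrame({'int': [1, 3], 'float': [2.0, np.nan], 'str': ['t', 's'],
                     'dt': date_range('2018-06-18', periods=2)})
    with mock.patch('gcsfs.GCSFileSystem') as MockFileSystem:
        s = StringIO()
        # Any open() of a gs:// path now returns the in-memory buffer instead
        # of a real GCS object.
        instance = MockFileSystem.return_value
        instance.open.return_value = s

        df1.to_csv('gs://test/test.csv', index=True)
        # Read the CSV text back and check the round trip, including the index.
        df2 = read_csv(StringIO(s.getvalue()), parse_dates=['dt'], index_col=0)

    assert_frame_equal(df1, df2)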
Related to above comment
See above
Could also use a related issue and a whatsnew note for v0.24.
When reading from S3 we have to wrap the file object in a TextIOWrapper. Do we need to do the same for writing?
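For context, a generic illustration (standard library only, not pandas internals) of what that wrapping looks like: TextIOWrapper adapts a bytes-mode file object so that text such as CSV output can be written to it.

import io

# Stand-in for a binary remote-file handle, e.g. what an S3/GCS filesystem
# object might return when opened in 'wb' mode.
binary_buffer = io.BytesIO()

# Wrap it so that str data can be written; the wrapper handles the encoding.
text_handle = io.TextIOWrapper(binary_buffer, encoding='utf-8')
text_handle.write('col1,col2\n1,2\n')
text_handle.flush()

print(binary_buffer.getvalue())  # b'col1,col2\n1,2\n'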
Codecov Report

@@            Coverage Diff            @@
##           master   #22704   +/-   ##
=======================================
  Coverage    92.2%    92.2%
=======================================
  Files         169      169
  Lines       50924    50924
=======================================
  Hits        46952    46952
  Misses       3972     3972

Continue to review the full report at Codecov.
If anyone else wants to take a look at this I would be grateful, since touching anything related to
Feel free to merge if you're comfortable @WillAyd.
Thanks @bnaul!
@bnaul Looks like there's no documentation on how to use this feature. Hate to see it implemented while no one is aware of it at all.
@5amfung can you submit a PR?
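In the meantime, a hedged usage sketch of what such documentation might show, assuming gcsfs is installed and GCS credentials are configured (the bucket and path below are placeholders):

import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3.0, 4.0]})

# Writing: a gs:// path is dispatched to gcsfs under the hood.
df.to_csv('gs://my-bucket/my-data.csv')

# Reading back works the same way; index_col=0 restores the written index.
df_back = pd.read_csv('gs://my-bucket/my-data.csv', index_col=0)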
git diff upstream/master -u -- "*.py" | flake8 --diff
This seems to work as-is and doesn't break any of the IO tests; as I mentioned in #8508 (comment), getting S3 to work is a little more complicated but maybe still not bad. This would be a step in the right direction regardless.
cc @TomAugspurger