Read and write compressed CSVs to S3 #359

Closed
JasonSanchez opened this issue Aug 24, 2020 · 10 comments
Labels
enhancement New feature or request feature minor release Will be addressed in the next minor release ready to release
@JasonSanchez

pandas-dev/pandas#35129 was recently merged into pandas-dev:master.

fixes #22555: some compression algorithms did not set the mode to be writeable for file handles. Together with the re-factoring in the second commit, it is now possible to write to binary file handles with compression!

The comment below is unclear: it reads as if the feature were removed from pandas, when in fact the feature will soon be available in pandas (and hopefully, therefore, in wrangler):

By now Pandas does not support in-memory CSV compression. pandas-dev/pandas#22555 So the compression will not be supported on Wrangler too.

Anyway, I hope this is on your roadmap. Thanks for the great tool!
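For context, the pandas capability the linked PR enables is writing a compressed CSV to an in-memory binary file handle, which is exactly what a library like wrangler needs to compress before uploading to S3. A minimal sketch (assuming pandas >= 1.2.0; the DataFrame contents are illustrative):

```python
import io

import pandas as pd

# Illustrative data; any DataFrame works the same way.
df = pd.DataFrame({"id": [1, 2], "value": ["foo", "boo"]})

# With pandas >= 1.2.0, to_csv can compress while writing to an
# in-memory binary file handle (the capability pandas-dev/pandas#35129 adds).
buf = io.BytesIO()
df.to_csv(buf, index=False, compression="gzip")

# The buffer now holds gzip bytes (magic number 0x1f 0x8b) that could be
# uploaded to S3 as-is.
payload = buf.getvalue()

# Round-trip: read it back, decompressing explicitly.
roundtrip = pd.read_csv(io.BytesIO(payload), compression="gzip")
```

On pandas versions before 1.2.0, the `to_csv` call above raises instead of compressing, which is the limitation tracked in pandas-dev/pandas#22555.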

@JasonSanchez JasonSanchez added the enhancement New feature or request label Aug 24, 2020
@igorborgest

@JasonSanchez good to know about this PR; it is definitely something we are interested in. We will add support for it as soon as Pandas 1.2.0 becomes available.

P.S. If we cut a new Wrangler release before then, we will update/fix the comment mentioned above.

Thanks!

@igorborgest igorborgest added the blocked Something is blocking the development label Aug 26, 2020
@gvermillion

Now that pandas 1.2.0 has been released, have there been any updates on when awswrangler will be able to write compressed CSVs directly to S3?

@igorborgest

Hi @gvermillion @JasonSanchez

We've added support for it in the PR above 👆 .
Could you give it a try before the official release? You can install it directly from the dev branch:

pip install git+https://github.com/awslabs/aws-data-wrangler.git@write-compressed-text
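Putting the thread's install instructions together in one place (as noted later in the thread, the feature branch keeps the same version number, so a prior install must be removed first or pip will consider it already satisfied):

```shell
# Remove any prior install first: the dev branch keeps the same version
# number, so pip would otherwise skip the reinstall.
pip uninstall -y awswrangler
# Install the feature branch directly from GitHub.
pip install git+https://github.com/awslabs/aws-data-wrangler.git@write-compressed-text
```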

@igorborgest igorborgest self-assigned this Jan 4, 2021
@igorborgest igorborgest added feature minor release Will be addressed in the next minor release ready to release and removed blocked Something is blocking the development labels Jan 4, 2021
@igorborgest igorborgest added this to the 2.3.0 milestone Jan 4, 2021
@gvermillion

Hey, @igorborgest. Thanks for the quick response!

I'll give this a go later this afternoon when I get pandas updated.

@gvermillion

gvermillion commented Jan 4, 2021

@igorborgest ,

It appears to be working if I just read/write a gzip-compressed CSV directly to S3. However, I was following this notebook you authored to attempt a partitioned CSV dataset (I literally replaced [to,from]_parquet with [to,from]_csv). When I attempt to read in the partition with a filter, I lose the header information and the formatting is not as it was when I called .to_csv(). This is the case both with and without compression="gzip".

I'm not sure if this latter bit is expected behavior or not. I have not used CSV partitions before.

EDIT: For completeness, when I attempt to read in the partition with the filter lambda x: x['value'].endswith('oo') I get the following results:

Empty DataFrame
Columns: [1, 2, value, 0]
Index: []
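The symptom above (integer-like column names such as 1, 2, 0 and missing headers) is consistent with the CSV files being written without a header row. This doesn't reproduce awswrangler's exact dataset code path, but a minimal plain-pandas sketch (illustrative data) shows the same header-loss mechanism:

```python
import io

import pandas as pd

df = pd.DataFrame({"id": [1, 2], "value": ["foo", "boo"]})

# Write the CSV *without* a header row, mimicking a dataset writer that
# doesn't support CSV headers.
buf = io.StringIO()
df.to_csv(buf, index=False, header=False)
buf.seek(0)

# Reading it back with default settings, pandas promotes the first data
# row to the header, so the original column names are lost.
broken = pd.read_csv(buf)
```

After the round trip, `broken.columns` holds the first row of data rather than the original names, so any filter referencing a column like `value` no longer matches.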

@igorborgest

Hi @gvermillion , thanks for testing it.

Actually it is not related to the compression itself; it was a limitation in the CSV Datasets implementation, which didn't support CSV headers. I've just updated the branch to add support for that.
I also updated the two tutorials, which I recommend you take a look at.

Could you give it another try?

p.s. Explicitly uninstall the previous installation (pip uninstall awswrangler) before installing again. We are keeping the same version number, so pip will not update it automatically.

@gvermillion

Hello @igorborgest ,

The updates worked as expected for me. Is there any estimate for when this update will be released?

Thanks for the quick turnaround and addressing my follow-up question!

@igorborgest

It should be released in version 2.3.0 next week.

@gvermillion

Great. Thanks so much!

@igorborgest

Released in version 2.3.0 🚀
