Read and write compressed CSVs to S3 #359

Closed
JasonSanchez opened this issue Aug 24, 2020 · 10 comments
Labels
enhancement New feature or request feature minor release Will be addressed in the next minor release ready to release
@JasonSanchez

pandas-dev/pandas#35129 was recently merged into pandas-dev:master.

fixes #22555: some compression algorithms did not set the mode to be writeable for file handles. Together with the re-factoring in the second commit, it is now possible to write to binary file handles with compression!

The comment below is unclear: it reads as if the feature were removed from pandas, when in fact the feature will soon be available in pandas (and hopefully, therefore, in wrangler):

By now Pandas does not support in-memory CSV compression. pandas-dev/pandas#22555 So the compression will not be supported on Wrangler too.

Anyway, I hope this is on your roadmap. Thanks for the great tool!
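For context, the pandas capability the linked PR enables is writing a compressed CSV to an in-memory binary file handle, which is exactly what a library like wrangler needs to compress before uploading to S3. A minimal sketch (assuming pandas >= 1.2.0; the DataFrame contents are illustrative):

```python
import io

import pandas as pd

# Illustrative data; any DataFrame works the same way.
df = pd.DataFrame({"id": [1, 2], "value": ["foo", "boo"]})

# With pandas >= 1.2.0, to_csv can compress while writing to an
# in-memory binary file handle (the capability pandas-dev/pandas#35129 adds).
buf = io.BytesIO()
df.to_csv(buf, index=False, compression="gzip")

# The buffer now holds gzip bytes (magic number 0x1f 0x8b) that could be
# uploaded to S3 as-is.
payload = buf.getvalue()

# Round-trip: read it back, decompressing explicitly.
roundtrip = pd.read_csv(io.BytesIO(payload), compression="gzip")
```

On pandas versions before 1.2.0, the `to_csv` call above raises instead of compressing, which is the limitation tracked in pandas-dev/pandas#22555.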

@JasonSanchez JasonSanchez added the enhancement New feature or request label Aug 24, 2020
@igorborgest

@JasonSanchez good to know about this PR; it is definitely something we are interested in. We will add support for it as soon as Pandas 1.2.0 becomes available.

P.S. If we cut a new Wrangler release before then, we will update/fix the comment mentioned above.

Thanks!

@igorborgest igorborgest added the blocked Something is blocking the development label Aug 26, 2020
@gvermillion

Now that pandas 1.2.0 has been released, have there been any updates on when awswrangler will be able to write compressed CSVs directly to S3?

@igorborgest

Hi @gvermillion @JasonSanchez

We've added support for it in the PR above 👆 .
Could you give it a try before the official release? You can install it directly from the dev branch:

pip install git+https://github.com/awslabs/aws-data-wrangler.git@write-compressed-text
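Putting the thread's install instructions together in one place (as noted later in the thread, the feature branch keeps the same version number, so a prior install must be removed first or pip will consider it already satisfied):

```shell
# Remove any prior install first: the dev branch keeps the same version
# number, so pip would otherwise skip the reinstall.
pip uninstall -y awswrangler
# Install the feature branch directly from GitHub.
pip install git+https://github.com/awslabs/aws-data-wrangler.git@write-compressed-text
```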

@igorborgest igorborgest self-assigned this Jan 4, 2021
@igorborgest igorborgest added feature minor release Will be addressed in the next minor release ready to release and removed blocked Something is blocking the development labels Jan 4, 2021
@igorborgest igorborgest added this to the 2.3.0 milestone Jan 4, 2021
@gvermillion

Hey, @igorborgest. Thanks for the quick response!

I'll give this a go later this afternoon when I get pandas updated.

@gvermillion

gvermillion commented Jan 4, 2021

@igorborgest ,

It appears to be working if I just read/write a gzip-compressed CSV directly to S3. However, I was following this notebook you authored to attempt a partitioned CSV dataset (I literally replaced [to,from]_parquet with [to,from]_csv). When I attempt to read in the partition with a filter, I lose the header information and the formatting is not as it was when I called .to_csv(). This is the case both with and without compression="gzip".

I'm not sure if this latter bit is expected behavior or not. I have not used CSV partitions before.

EDIT: For completeness, when I attempt to read in the partition with the filter lambda x: x['value'].endswith('oo') I get the following results:

Empty DataFrame
Columns: [1, 2, value, 0]
Index: []
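The symptom above (integer-like column names such as 1, 2, 0 and missing headers) is consistent with the CSV files being written without a header row. This doesn't reproduce awswrangler's exact dataset code path, but a minimal plain-pandas sketch (illustrative data) shows the same header-loss mechanism:

```python
import io

import pandas as pd

df = pd.DataFrame({"id": [1, 2], "value": ["foo", "boo"]})

# Write the CSV *without* a header row, mimicking a dataset writer that
# doesn't support CSV headers.
buf = io.StringIO()
df.to_csv(buf, index=False, header=False)
buf.seek(0)

# Reading it back with default settings, pandas promotes the first data
# row to the header, so the original column names are lost.
broken = pd.read_csv(buf)
```

After the round trip, `broken.columns` holds the first row of data rather than the original names, so any filter referencing a column like `value` no longer matches.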

@igorborgest

Hi @gvermillion , thanks for testing it.

Actually it is not related to the compression itself; it was a limitation in the CSV Datasets implementation, which didn't support CSV headers. I've just updated the branch to add support for that.
I also updated the two tutorials, which I recommend you take a look at.

Could you give it another try?

p.s. Explicitly uninstall the previous installation (pip uninstall awswrangler) before installing again. We are keeping the same version number, so pip will not update it automatically.

@gvermillion

Hello @igorborgest ,

The updates worked as expected for me. Is there any estimate for when this update will be released?

Thanks for the quick turnaround and addressing my follow-up question!

@igorborgest

It should be released in version 2.3.0 next week.

@gvermillion

Great. Thanks so much!

@igorborgest

Released in version 2.3.0 🚀
