
Reading remote file with content-encoding: gzip throws error #720

Closed
yuhuishi-convect opened this issue Jul 30, 2021 · 2 comments

@yuhuishi-convect

Problem

I have a pre-compressed (gzip) CSV file on S3: https://convect-test-data.s3.us-west-2.amazonaws.com/tx_3_target_time_series.csv.gz
where the object metadata is set to

content-type: text/csv
content-encoding: gzip

When I read it as a DataFrame

import pandas as pd
import fsspec as fs

url = "https://convect-test-data.s3.us-west-2.amazonaws.com/tx_3_target_time_series.csv.gz"
with fs.open(url, 'r') as f:
  df = pd.read_csv(f)
it throws the following error:

/usr/local/anaconda3/envs/dask-sql/lib/python3.9/site-packages/pandas/io/parsers.py in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, dialect, error_bad_lines, warn_bad_lines, delim_whitespace, low_memory, memory_map, float_precision)
    686     )
    687
--> 688     return _read(filepath_or_buffer, kwds)
    689
    690

/usr/local/anaconda3/envs/dask-sql/lib/python3.9/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
    458
    459     try:
--> 460         data = parser.read(nrows)
    461     finally:
    462         parser.close()

/usr/local/anaconda3/envs/dask-sql/lib/python3.9/site-packages/pandas/io/parsers.py in read(self, nrows)
   1196     def read(self, nrows=None):
   1197         nrows = _validate_integer("nrows", nrows)
-> 1198         ret = self._engine.read(nrows)
   1199
   1200         # May alter columns / col_dict

/usr/local/anaconda3/envs/dask-sql/lib/python3.9/site-packages/pandas/io/parsers.py in read(self, nrows)
   2155     def read(self, nrows=None):
   2156         try:
-> 2157             data = self._reader.read(nrows)
   2158         except StopIteration:
   2159             if self._first_chunk:

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.read()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_low_memory()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_rows()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._tokenize_rows()

pandas/_libs/parsers.pyx in pandas._libs.parsers.raise_parser_error()

/usr/local/anaconda3/envs/dask-sql/lib/python3.9/site-packages/fsspec/asyn.py in f()
     53             if callback_timeout is not None:
     54                 future = asyncio.wait_for(future, callback_timeout)
---> 55             result[0] = await future
     56         except Exception:
     57             error[0] = sys.exc_info()

/usr/local/anaconda3/envs/dask-sql/lib/python3.9/site-packages/fsspec/implementations/http.py in async_fetch_range(self, start, end)
    385             if r.status == 206:
    386                 # partial content, as expected
--> 387                 out = await r.read()
    388             elif "Content-Length" in r.headers:
    389                 cl = int(r.headers["Content-Length"])

/usr/local/anaconda3/envs/dask-sql/lib/python3.9/site-packages/aiohttp/client_reqrep.py in read(self)
   1030         if self._body is None:
   1031             try:
-> 1032                 self._body = await self.content.read()
   1033                 for trace in self._traces:
   1034                     await trace.send_response_chunk_received(

/usr/local/anaconda3/envs/dask-sql/lib/python3.9/site-packages/aiohttp/streams.py in read(self, n)
    368             blocks = []
    369             while True:
--> 370                 block = await self.readany()
    371                 if not block:
    372                     break

/usr/local/anaconda3/envs/dask-sql/lib/python3.9/site-packages/aiohttp/streams.py in readany(self)
    390         # without feeding any data
    391         while not self._buffer and not self._eof:
--> 392             await self._wait("readany")
    393
    394         return self._read_nowait(-1)

/usr/local/anaconda3/envs/dask-sql/lib/python3.9/site-packages/aiohttp/streams.py in _wait(self, func_name)
    304             if self._timer:
    305                 with self._timer:
--> 306                     await waiter
    307             else:
    308                 await waiter

ClientPayloadError: 400, message='Can not decode content-encoding: gzip'

Meanwhile, the following is fine:

import fsspec as fs 

url = "https://convect-test-data.s3.us-west-2.amazonaws.com/tx_3_target_time_series.csv.gz"
with fs.open(url, 'r') as f:
  print(f.readline())

It prints out the plain CSV file content; it looks like the file is already decompressed when opened.

Reading with pandas's own reader directly is also fine:

import pandas as pd

url = "https://convect-test-data.s3.us-west-2.amazonaws.com/tx_3_target_time_series.csv.gz"
df = pd.read_csv(url)

If I remove content-encoding: gzip from the S3 object metadata, then the following works fine:

import pandas as pd
import fsspec as fs

url = "https://convect-test-data.s3.us-west-2.amazonaws.com/tx_3_target_time_series.csv.gz"
with fs.open(url, 'r') as f:
  df = pd.read_csv(f, compression='gzip')
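For reference, an equivalent workaround (a sketch that likewise assumes the content-encoding: gzip metadata has been removed, so the stored gzip bytes reach the client untouched) is to let fsspec do the decompression instead of pandas:

import pandas as pd
import fsspec as fs

url = "https://convect-test-data.s3.us-west-2.amazonaws.com/tx_3_target_time_series.csv.gz"
# open in binary mode and have fsspec decompress the gzip file format locally
with fs.open(url, 'rb', compression='gzip') as f:
  df = pd.read_csv(f)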

Related issues

I have raised this issue with the Dask community: dask/dask#7959
I think this issue might also be related to aio-libs/aiohttp#4462

Environment

aiohttp                   3.7.4.post0              pypi_0    pypi
fsspec                    0.8.4                      py_0    conda-forge
@martindurant (Member) commented Jul 30, 2021

I believe you have misunderstood the intent of the Content-Encoding header (e.g., https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Encoding). This refers to additional encoding applied to the payload of the transfer, which must be reversed upon receipt. In your case the payload is pre-compressed; it is not a compression applied to the transfer.
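A quick way to see what the object advertises (a minimal sketch; requests is used here only as an example HTTP client) is a HEAD request against the URL:

import requests

url = "https://convect-test-data.s3.us-west-2.amazonaws.com/tx_3_target_time_series.csv.gz"
# inspect the response headers without downloading the body
r = requests.head(url)
print(r.headers.get("Content-Type"))      # text/csv
print(r.headers.get("Content-Encoding"))  # gzip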

Comment from the aiohttp issue:
" The Content-Encoding header is not related to data contents actually. It is connected with a way data is transferred over HTTP."
Exactly.
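To illustrate (a minimal sketch, not how fsspec configures its aiohttp session internally): the client normally reverses the Content-Encoding on receipt, and only hands back the stored gzip bytes if you ask it not to:

import asyncio
import aiohttp

url = "https://convect-test-data.s3.us-west-2.amazonaws.com/tx_3_target_time_series.csv.gz"

async def fetch_raw():
    # auto_decompress=False tells aiohttp not to reverse the Content-Encoding,
    # so the stored gzip bytes come back instead of decoded CSV text
    async with aiohttp.ClientSession(auto_decompress=False) as session:
        async with session.get(url) as resp:
            return await resp.read()

raw = asyncio.run(fetch_raw())
print(raw[:2] == b"\x1f\x8b")  # gzip magic number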

Note that when you specify "gzip" to read_csv or file open, it is understood as file compression. There may be a difference between gzip (the file format) and gzip (the stream compression codec).
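As a standard-library sketch of that distinction (hypothetical data, not tied to this file): the gzip file format and a bare zlib stream wrap the same deflate data, but with different headers and checksums, so a decoder for one will not accept the other:

import gzip, zlib

data = b"a,b\n1,2\n"
gz = gzip.compress(data)   # gzip file format (RFC 1952): 1f 8b magic, deflate body, CRC-32 trailer
zz = zlib.compress(data)   # zlib stream format (RFC 1950): different header, Adler-32 trailer
print(gzip.decompress(gz) == data)   # True
print(zlib.decompress(zz) == data)   # True
try:
    gzip.decompress(zz)              # a zlib stream is not a valid gzip file
except gzip.BadGzipFile:
    print("zlib stream is not a gzip file")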

@yuhuishi-convect (Author)

Thanks for the reply, @martindurant.

What I still don't quite understand is the discrepancy between the behaviors of these two snippets:

import fsspec as fs
import pandas as pd

url = "https://convect-test-data.s3.us-west-2.amazonaws.com/tx_3_target_time_series.csv.gz"

# this will return the correct decompressed file content
with fs.open(url, 'r') as f:
  print(f.readline())

# this throws the 400 cannot decode content-encoding: gzip error
with fs.open(url, 'r') as f:
  df = pd.read_csv(f)
