
Reading remote file with content-encoding: gzip throws error #720

Closed
yuhuishi-convect opened this issue Jul 30, 2021 · 2 comments

@yuhuishi-convect

Problem

I have a pre-compressed (gzip) CSV file on S3: https://convect-test-data.s3.us-west-2.amazonaws.com/tx_3_target_time_series.csv.gz
where the object metadata is set to

content-type: text/csv
content-encoding: gzip

When I read it as a DataFrame

import pandas as pd
import fsspec as fs

url = "https://convect-test-data.s3.us-west-2.amazonaws.com/tx_3_target_time_series.csv.gz"
with fs.open(url, 'r') as f:
  df = pd.read_csv(f)
it throws the following error:

/usr/local/anaconda3/envs/dask-sql/lib/python3.9/site-packages/pandas/io/parsers.py in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, dialect, error_bad_lines, warn_bad_lines, delim_whitespace, low_memory, memory_map, float_precision)
    686     )
    687
--> 688     return _read(filepath_or_buffer, kwds)
    689
    690

/usr/local/anaconda3/envs/dask-sql/lib/python3.9/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
    458
    459     try:
--> 460         data = parser.read(nrows)
    461     finally:
    462         parser.close()

/usr/local/anaconda3/envs/dask-sql/lib/python3.9/site-packages/pandas/io/parsers.py in read(self, nrows)
   1196     def read(self, nrows=None):
   1197         nrows = _validate_integer("nrows", nrows)
-> 1198         ret = self._engine.read(nrows)
   1199
   1200         # May alter columns / col_dict

/usr/local/anaconda3/envs/dask-sql/lib/python3.9/site-packages/pandas/io/parsers.py in read(self, nrows)
   2155     def read(self, nrows=None):
   2156         try:
-> 2157             data = self._reader.read(nrows)
   2158         except StopIteration:
   2159             if self._first_chunk:

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.read()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_low_memory()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_rows()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._tokenize_rows()

pandas/_libs/parsers.pyx in pandas._libs.parsers.raise_parser_error()

/usr/local/anaconda3/envs/dask-sql/lib/python3.9/site-packages/fsspec/asyn.py in f()
     53             if callback_timeout is not None:
     54                 future = asyncio.wait_for(future, callback_timeout)
---> 55             result[0] = await future
     56         except Exception:
     57             error[0] = sys.exc_info()

/usr/local/anaconda3/envs/dask-sql/lib/python3.9/site-packages/fsspec/implementations/http.py in async_fetch_range(self, start, end)
    385             if r.status == 206:
    386                 # partial content, as expected
--> 387                 out = await r.read()
    388             elif "Content-Length" in r.headers:
    389                 cl = int(r.headers["Content-Length"])

/usr/local/anaconda3/envs/dask-sql/lib/python3.9/site-packages/aiohttp/client_reqrep.py in read(self)
   1030         if self._body is None:
   1031             try:
-> 1032                 self._body = await self.content.read()
   1033                 for trace in self._traces:
   1034                     await trace.send_response_chunk_received(

/usr/local/anaconda3/envs/dask-sql/lib/python3.9/site-packages/aiohttp/streams.py in read(self, n)
    368             blocks = []
    369             while True:
--> 370                 block = await self.readany()
    371                 if not block:
    372                     break

/usr/local/anaconda3/envs/dask-sql/lib/python3.9/site-packages/aiohttp/streams.py in readany(self)
    390         # without feeding any data
    391         while not self._buffer and not self._eof:
--> 392             await self._wait("readany")
    393
    394         return self._read_nowait(-1)

/usr/local/anaconda3/envs/dask-sql/lib/python3.9/site-packages/aiohttp/streams.py in _wait(self, func_name)
    304             if self._timer:
    305                 with self._timer:
--> 306                     await waiter
    307             else:
    308                 await waiter

ClientPayloadError: 400, message='Can not decode content-encoding: gzip'

Meanwhile, the following is fine:

import fsspec as fs 

url = "https://convect-test-data.s3.us-west-2.amazonaws.com/tx_3_target_time_series.csv.gz"
with fs.open(url, 'r') as f:
  print(f.readline())

It prints out the plain CSV file content; it looks like the file is already decompressed when opened.

Reading with pandas's own reader directly is also fine:

import pandas as pd

url = "https://convect-test-data.s3.us-west-2.amazonaws.com/tx_3_target_time_series.csv.gz"
df = pd.read_csv(url)

If I remove content-encoding: gzip from the S3 object metadata, then the following works fine:

import pandas as pd
import fsspec as fs

url = "https://convect-test-data.s3.us-west-2.amazonaws.com/tx_3_target_time_series.csv.gz"
with fs.open(url, 'r') as f:
  df = pd.read_csv(f, compression='gzip')
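For reference, an equivalent workaround (a sketch that likewise assumes the content-encoding: gzip metadata has been removed, so the stored gzip bytes reach the client untouched) is to let fsspec do the decompression instead of pandas:

import pandas as pd
import fsspec as fs

url = "https://convect-test-data.s3.us-west-2.amazonaws.com/tx_3_target_time_series.csv.gz"
# open in binary mode and have fsspec decompress the gzip file format locally
with fs.open(url, 'rb', compression='gzip') as f:
  df = pd.read_csv(f)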

Related issues

I have raised this issue with the Dask community: dask/dask#7959
I think this issue might also be related to aio-libs/aiohttp#4462

Environment

aiohttp                   3.7.4.post0              pypi_0    pypi
fsspec                    0.8.4                      py_0    conda-forge
@martindurant (Member) commented Jul 30, 2021

I believe you have misunderstood the intent of the Content-Encoding header (e.g., https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Encoding). This refers to additional encoding applied to the payload of the transfer, which must be reversed upon receipt. In your case the payload is pre-compressed; it is not a compression applied to the transfer.
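A quick way to see what the object advertises (a minimal sketch; requests is used here only as an example HTTP client) is a HEAD request against the URL:

import requests

url = "https://convect-test-data.s3.us-west-2.amazonaws.com/tx_3_target_time_series.csv.gz"
# inspect the response headers without downloading the body
r = requests.head(url)
print(r.headers.get("Content-Type"))      # text/csv
print(r.headers.get("Content-Encoding"))  # gzip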

Comment from the aiohttp issue:
" The Content-Encoding header is not related to data contents actually. It is connected with a way data is transferred over HTTP."
Exactly.
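To illustrate (a minimal sketch, not how fsspec configures its aiohttp session internally): the client normally reverses the Content-Encoding on receipt, and only hands back the stored gzip bytes if you ask it not to:

import asyncio
import aiohttp

url = "https://convect-test-data.s3.us-west-2.amazonaws.com/tx_3_target_time_series.csv.gz"

async def fetch_raw():
    # auto_decompress=False tells aiohttp not to reverse the Content-Encoding,
    # so the stored gzip bytes come back instead of decoded CSV text
    async with aiohttp.ClientSession(auto_decompress=False) as session:
        async with session.get(url) as resp:
            return await resp.read()

raw = asyncio.run(fetch_raw())
print(raw[:2] == b"\x1f\x8b")  # gzip magic number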

Note that when you specify "gzip" to read_csv or file open, it is understood as file compression. There may be a difference between gzip (the file format) and gzip (the stream compression codec).
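As a standard-library sketch of that distinction (hypothetical data, not tied to this file): the gzip file format and a bare zlib stream wrap the same deflate data, but with different headers and checksums, so a decoder for one will not accept the other:

import gzip, zlib

data = b"a,b\n1,2\n"
gz = gzip.compress(data)   # gzip file format (RFC 1952): 1f 8b magic, deflate body, CRC-32 trailer
zz = zlib.compress(data)   # zlib stream format (RFC 1950): different header, Adler-32 trailer
print(gzip.decompress(gz) == data)   # True
print(zlib.decompress(zz) == data)   # True
try:
    gzip.decompress(zz)              # a zlib stream is not a valid gzip file
except gzip.BadGzipFile:
    print("zlib stream is not a gzip file")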

@yuhuishi-convect (Author)

Thanks for the reply, @martindurant.

What I still don't quite understand is the discrepancy between the behaviors of these two snippets:

import fsspec as fs
import pandas as pd

url = "https://convect-test-data.s3.us-west-2.amazonaws.com/tx_3_target_time_series.csv.gz"

# this will return the correct decompressed file content
with fs.open(url, 'r') as f:
  print(f.readline())

# this throws the 400 cannot decode content-encoding: gzip error
with fs.open(url, 'r') as f:
  df = pd.read_csv(f)
