The "file" source is not able to open with pandas CDS files previously downloaded #180

malmans2 · 2023-09-15T10:09:27Z

What happened?

I'm able to open CDS files downloaded using from_source("cds", ...), but I'm not able to open them if they've been previously downloaded (i.e., using from_source("file", ...)).

In this specific case, it looks like from_source("cds", ...) passes additional arguments (e.g., comment="#") to the library used under the hood to read the data (pandas).

Is there a way to open a local file previously downloaded from the CDS exactly as from_source("cds", ...) would do?

What are the steps to reproduce the bug?

import cdsapi
import earthkit.data

collection_id = "insitu-observations-gruan-reference-network"
request = {
    "format": "csv-lev.zip",
    "year": "2006",
    "month": "05",
    "variable": ["air_temperature", "altitude"],
    "day": ["21", "22"],
}

data_cds = earthkit.data.from_source("cds", collection_id, **request)
data_cds.to_pandas()  # OK

data_file = earthkit.data.from_source("file", data_cds.path)
data_file.to_pandas()  # ParserError

client = cdsapi.Client()
data_cdsapi = earthkit.data.from_source(
    "file", client.retrieve(collection_id, request).download()
)
data_cdsapi.to_pandas()  # ParserError

Version

0.3.1

Platform (OS and architecture)

Darwin MacBook-Pro-3.local 22.6.0 Darwin Kernel Version 22.6.0: Wed Jul 5 22:21:56 PDT 2023; root:xnu-8796.141.3~6/RELEASE_X86_64 x86_64

Relevant log output

---------------------------------------------------------------------------
ParserError                               Traceback (most recent call last)
Cell In[1], line 17
     14 data_cds.to_pandas()  # OK
     16 data_file = earthkit.data.from_source("file", data_cds.path)
---> 17 data_file.to_pandas()  # ParserError

File ~/mambaforge/envs/earthkit/lib/python3.10/site-packages/earthkit/data/sources/file.py:107, in FileSource.to_pandas(self, **kwargs)
    105 def to_pandas(self, **kwargs):
    106     LOG.debug("Calling reader.to_pandas %s", self)
--> 107     return self._reader.to_pandas(**kwargs)

File ~/mambaforge/envs/earthkit/lib/python3.10/site-packages/earthkit/data/readers/csv.py:144, in CSVReader.to_pandas(self, **kwargs)
    141     pandas_read_csv_kwargs["compression"] = self.compression
    143 LOG.debug("pandas.read_csv(%s,%s)", self.path, pandas_read_csv_kwargs)
--> 144 return pandas.read_csv(self.path, **pandas_read_csv_kwargs)

File ~/mambaforge/envs/earthkit/lib/python3.10/site-packages/pandas/io/parsers/readers.py:948, in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, date_format, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options, dtype_backend)
    935 kwds_defaults = _refine_defaults_read(
    936     dialect,
    937     delimiter,
   (...)
    944     dtype_backend=dtype_backend,
    945 )
    946 kwds.update(kwds_defaults)
--> 948 return _read(filepath_or_buffer, kwds)

File ~/mambaforge/envs/earthkit/lib/python3.10/site-packages/pandas/io/parsers/readers.py:617, in _read(filepath_or_buffer, kwds)
    614     return parser
    616 with parser:
--> 617     return parser.read(nrows)

File ~/mambaforge/envs/earthkit/lib/python3.10/site-packages/pandas/io/parsers/readers.py:1748, in TextFileReader.read(self, nrows)
   1741 nrows = validate_integer("nrows", nrows)
   1742 try:
   1743     # error: "ParserBase" has no attribute "read"
   1744     (
   1745         index,
   1746         columns,
   1747         col_dict,
-> 1748     ) = self._engine.read(  # type: ignore[attr-defined]
   1749         nrows
   1750     )
   1751 except Exception:
   1752     self.close()

File ~/mambaforge/envs/earthkit/lib/python3.10/site-packages/pandas/io/parsers/c_parser_wrapper.py:234, in CParserWrapper.read(self, nrows)
    232 try:
    233     if self.low_memory:
--> 234         chunks = self._reader.read_low_memory(nrows)
    235         # destructive to chunks
    236         data = _concatenate_chunks(chunks)

File parsers.pyx:843, in pandas._libs.parsers.TextReader.read_low_memory()

File parsers.pyx:904, in pandas._libs.parsers.TextReader._read_rows()

File parsers.pyx:879, in pandas._libs.parsers.TextReader._tokenize_rows()

File parsers.pyx:890, in pandas._libs.parsers.TextReader._check_tokenize_status()

File parsers.pyx:2058, in pandas._libs.parsers.raise_parser_error()

ParserError: Error tokenizing data. C error: Expected 1 fields in line 15, saw 11

Accompanying data

No response

Organisation

B-Open / CADS-EQC


sandorkertesz · 2023-09-18T09:33:17Z

Dear @malmans2, thank you for reporting this issue.

The method to save the results from CDS to a given file is save:

import earthkit.data

collection_id = "insitu-observations-gruan-reference-network"
request = {
    "format": "csv-lev.zip",
    "year": "2006",
    "month": "05",
    "variable": ["air_temperature", "altitude"],
    "day": ["21", "22"],
}

data_cds = earthkit.data.from_source("cds", collection_id, **request)
data_cds.to_pandas()  # OK
data_cds.save("my_cds_data.zip")

The path on data_cds points to a cache file describing the retrieval and its results, so it cannot be used as an input to from_source("file", ...).

malmans2 · 2023-09-18T09:39:43Z

Got it, thanks.

What about the last method in my comment?

import cdsapi
import earthkit.data

client = cdsapi.Client()
data_cdsapi = earthkit.data.from_source(
    "file", client.retrieve(collection_id, request).download()
)
data_cdsapi.to_pandas()  # ParserError

I.e., are earthkit users not supposed to read CDS data already available on disk?

sandorkertesz · 2023-09-18T10:02:28Z

You should be able to read the previously downloaded CDS data as a "file" source. The problem is that pandas' read_csv() function, which to_pandas() calls under the hood, requires extra arguments to handle your data. You can pass them with pandas_read_csv_kwargs. This code works:

import cdsapi
import earthkit.data

client = cdsapi.Client()
data_cdsapi = earthkit.data.from_source(
    "file", client.retrieve(collection_id, request).download()
)
df = data_cdsapi.to_pandas(pandas_read_csv_kwargs={"comment": "#"})

Now, it is a good question whether "comment": "#" should be set by default inside the to_pandas call. It requires further consideration.
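To illustrate why the extra argument matters, here is a minimal sketch that uses pandas only. It assumes the downloaded CSV starts with a few "#"-prefixed header lines, as the traceback and the comment="#" fix suggest; the CSV_TEXT sample and the helper names are made up for illustration and are not taken from an actual CDS file.

```python
import io

import pandas as pd

# Made-up stand-in for a CSV file that begins with "#" comment lines
# (assumption: real CDS files of this kind have a similar layout).
CSV_TEXT = (
    "# dataset: example\n"
    "# license: example\n"
    "# generated for illustration\n"
    "station,altitude,air_temperature\n"
    "A,100,280.5\n"
    "B,250,279.1\n"
)


def read_with_defaults(text):
    """Parse with pandas defaults, so the "#" lines are treated as data."""
    return pd.read_csv(io.StringIO(text))


def read_skipping_comments(text):
    """Parse while ignoring lines that start with "#"."""
    return pd.read_csv(io.StringIO(text), comment="#")


# With the defaults, the one-field comment lines set the expected width,
# so the first multi-column line triggers a ParserError.
try:
    read_with_defaults(CSV_TEXT)
    default_parse_failed = False
except pd.errors.ParserError:
    default_parse_failed = True

# With comment="#", the comment lines are skipped and the real header is used.
df = read_skipping_comments(CSV_TEXT)
print(default_parse_failed)
print(df)
```

The same keyword is what pandas_read_csv_kwargs={"comment": "#"} forwards to pandas.read_csv in the snippet above.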

malmans2 · 2023-09-18T10:18:45Z

Understood. Thanks for the clarification.

EddyCMWF · 2023-10-02T13:53:12Z

I am reopening as I think there are some issues that we can address here to attempt some consistency across sources. There may be some differences, but I think we can do better than the current implementation.

Further details: these are the current default pandas_read_csv_kwargs for the different sources.

file source:

pandas_read_csv_kwargs = {}

cds_source:

pandas_read_csv_kwargs = dict(
    comment="#",  # This is what creates the inconsistency, I like it but may be safer to drop
    parse_dates=["report_timestamp"],  # This is not even correct for all CDS csv datasets, so should go
    skip_blank_lines=True,  # This is the default value, so unnecessary
    compression="zip",  # the csv-reader overwrites this value, so unnecessary here
)

ecmwf_api source:

pandas_read_csv_kwargs = dict(
    sep="\t",  # This creates inconsistency
    comment="#",  # This creates inconsistency
    skip_blank_lines=True,  # This is the default value, so unnecessary
    skipinitialspace=True,  # This creates inconsistency
    compression="zip",  # the csv-reader overwrites this value, so unnecessary here
)
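One possible way to get consistent behaviour is to resolve the keyword arguments in a single place, so that every source starts from the same defaults and the user's values always win. The sketch below is only an illustration of that idea: SHARED_CSV_DEFAULTS and resolve_read_csv_kwargs are hypothetical names, not part of earthkit-data, and whether comment="#" belongs in the shared defaults is exactly the open question here.

```python
# Hypothetical names for illustration only; none of this is earthkit-data API.
# Assumption: comment="#" is kept as the single shared default.
SHARED_CSV_DEFAULTS = {"comment": "#"}


def resolve_read_csv_kwargs(user_kwargs=None, source_defaults=None):
    """Merge keyword arguments for pandas.read_csv.

    Precedence, lowest to highest: shared defaults, source-specific
    defaults, then the values passed by the user.
    """
    merged = dict(SHARED_CSV_DEFAULTS)
    merged.update(source_defaults or {})
    merged.update(user_kwargs or {})
    return merged


# The "file" source with no extra arguments gets the shared defaults.
file_kwargs = resolve_read_csv_kwargs()

# A source that needs a tab separator only declares what differs.
tab_kwargs = resolve_read_csv_kwargs(source_defaults={"sep": "\t"})

# A user can still override any default explicitly.
user_kwargs = resolve_read_csv_kwargs(user_kwargs={"comment": None})

print(file_kwargs, tab_kwargs, user_kwargs)
```

With this arrangement, the values flagged above as unnecessary (skip_blank_lines, compression) would simply not be listed anywhere.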

malmans2 added the bug Something isn't working label Sep 15, 2023

malmans2 changed the title ~~The "file" source is not able to open CDS files previously downloaded~~ The "file" source is not able to open with pandas CDS files previously downloaded Sep 15, 2023

sandorkertesz self-assigned this Sep 19, 2023

sandorkertesz closed this as completed Sep 21, 2023

EddyCMWF reopened this Oct 2, 2023

EddyCMWF mentioned this issue Oct 2, 2023

consistent pandas_read_csv_kwargs for file and CDS sources #211

Merged

EddyCMWF linked a pull request Oct 2, 2023 that will close this issue

consistent pandas_read_csv_kwargs for file and CDS sources #211

Merged

sandorkertesz closed this as completed in #211 Oct 11, 2023
