The "file" source is not able to open with pandas CDS files previously downloaded #180

Closed
malmans2 opened this issue Sep 15, 2023 · 5 comments · Fixed by #211
Labels
bug Something isn't working


@malmans2
Contributor

malmans2 commented Sep 15, 2023

What happened?

I'm able to open CDS files downloaded using from_source("cds", ...), but I'm not able to open them if they've been previously downloaded (i.e., using from_source("file", ...)).

In this specific case, it looks like from_source("cds", ...) passes additional arguments (for pandas, comment="#") to the libraries used under the hood to read the data.

Is there a way to open a local file previously downloaded from the CDS exactly as from_source("cds", ...) would do?

What are the steps to reproduce the bug?

import cdsapi
import earthkit.data

collection_id = "insitu-observations-gruan-reference-network"
request = {
    "format": "csv-lev.zip",
    "year": "2006",
    "month": "05",
    "variable": ["air_temperature", "altitude"],
    "day": ["21", "22"],
}

data_cds = earthkit.data.from_source("cds", collection_id, **request)
data_cds.to_pandas()  # OK

data_file = earthkit.data.from_source("file", data_cds.path)
data_file.to_pandas()  # ParserError

client = cdsapi.Client()
data_cdsapi = earthkit.data.from_source(
    "file", client.retrieve(collection_id, request).download()
)
data_cdsapi.to_pandas()  # ParserError

Version

0.3.1

Platform (OS and architecture)

Darwin MacBook-Pro-3.local 22.6.0 Darwin Kernel Version 22.6.0: Wed Jul 5 22:21:56 PDT 2023; root:xnu-8796.141.3~6/RELEASE_X86_64 x86_64

Relevant log output

---------------------------------------------------------------------------
ParserError                               Traceback (most recent call last)
Cell In[1], line 17
     14 data_cds.to_pandas()  # OK
     16 data_file = earthkit.data.from_source("file", data_cds.path)
---> 17 data_file.to_pandas()  # ParserError

File ~/mambaforge/envs/earthkit/lib/python3.10/site-packages/earthkit/data/sources/file.py:107, in FileSource.to_pandas(self, **kwargs)
    105 def to_pandas(self, **kwargs):
    106     LOG.debug("Calling reader.to_pandas %s", self)
--> 107     return self._reader.to_pandas(**kwargs)

File ~/mambaforge/envs/earthkit/lib/python3.10/site-packages/earthkit/data/readers/csv.py:144, in CSVReader.to_pandas(self, **kwargs)
    141     pandas_read_csv_kwargs["compression"] = self.compression
    143 LOG.debug("pandas.read_csv(%s,%s)", self.path, pandas_read_csv_kwargs)
--> 144 return pandas.read_csv(self.path, **pandas_read_csv_kwargs)

File ~/mambaforge/envs/earthkit/lib/python3.10/site-packages/pandas/io/parsers/readers.py:948, in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, date_format, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options, dtype_backend)
    935 kwds_defaults = _refine_defaults_read(
    936     dialect,
    937     delimiter,
   (...)
    944     dtype_backend=dtype_backend,
    945 )
    946 kwds.update(kwds_defaults)
--> 948 return _read(filepath_or_buffer, kwds)

File ~/mambaforge/envs/earthkit/lib/python3.10/site-packages/pandas/io/parsers/readers.py:617, in _read(filepath_or_buffer, kwds)
    614     return parser
    616 with parser:
--> 617     return parser.read(nrows)

File ~/mambaforge/envs/earthkit/lib/python3.10/site-packages/pandas/io/parsers/readers.py:1748, in TextFileReader.read(self, nrows)
   1741 nrows = validate_integer("nrows", nrows)
   1742 try:
   1743     # error: "ParserBase" has no attribute "read"
   1744     (
   1745         index,
   1746         columns,
   1747         col_dict,
-> 1748     ) = self._engine.read(  # type: ignore[attr-defined]
   1749         nrows
   1750     )
   1751 except Exception:
   1752     self.close()

File ~/mambaforge/envs/earthkit/lib/python3.10/site-packages/pandas/io/parsers/c_parser_wrapper.py:234, in CParserWrapper.read(self, nrows)
    232 try:
    233     if self.low_memory:
--> 234         chunks = self._reader.read_low_memory(nrows)
    235         # destructive to chunks
    236         data = _concatenate_chunks(chunks)

File parsers.pyx:843, in pandas._libs.parsers.TextReader.read_low_memory()

File parsers.pyx:904, in pandas._libs.parsers.TextReader._read_rows()

File parsers.pyx:879, in pandas._libs.parsers.TextReader._tokenize_rows()

File parsers.pyx:890, in pandas._libs.parsers.TextReader._check_tokenize_status()

File parsers.pyx:2058, in pandas._libs.parsers.raise_parser_error()

ParserError: Error tokenizing data. C error: Expected 1 fields in line 15, saw 11

Accompanying data

No response

Organisation

B-Open / CADS-EQC

@malmans2 malmans2 added the bug Something isn't working label Sep 15, 2023
@malmans2 malmans2 changed the title The "file" source is not able to open CDS files previously downloaded The "file" source is not able to open with pandas CDS files previously downloaded Sep 15, 2023
@sandorkertesz
Collaborator

Dear @malmans2, thank you for reporting this issue.

The method to save the results from CDS to a given file is save:

collection_id = "insitu-observations-gruan-reference-network"
request = {
    "format": "csv-lev.zip",
    "year": "2006",
    "month": "05",
    "variable": ["air_temperature", "altitude"],
    "day": ["21", "22"],
}

data_cds = earthkit.data.from_source("cds", collection_id, **request)
data_cds.to_pandas()  # OK
data_cds.save("my_cds_data.zip")

The path on data_cds points to a cache file describing the retrieval and its results, so it cannot be used as an input to from_source("file", ...).

@malmans2
Contributor Author

Got it, thanks.

What about the last method in my comment?

import cdsapi
import earthkit.data

client = cdsapi.Client()
data_cdsapi = earthkit.data.from_source(
    "file", client.retrieve(collection_id, request).download()
)
data_cdsapi.to_pandas()  # ParserError

I.e., are earthkit users not supposed to read CDS data that is already available on disk?

@sandorkertesz
Collaborator

sandorkertesz commented Sep 18, 2023

You should be able to read the previously downloaded CDS data as a "file" source. The problem is that pandas' read_csv() function, which to_pandas() calls under the hood, requires extra arguments to handle your data. You can pass them with pandas_read_csv_kwargs. This code works:

import cdsapi
import earthkit.data

client = cdsapi.Client()
data_cdsapi = earthkit.data.from_source(
    "file", client.retrieve(collection_id, request).download()
)
df = data_cdsapi.to_pandas(pandas_read_csv_kwargs={"comment": "#"}) 

Now, whether "comment": "#" should be set by default inside the to_pandas call is a good question. It requires further consideration.
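
For readers who want to see the effect of comment="#" without downloading anything, here is a minimal sketch using pandas alone. The CSV text is made up for illustration and only mimics the general shape of a file with "#"-prefixed metadata lines above the real header; it is not the actual CDS file, and the helper name read_csv_text is hypothetical.

```python
import io

import pandas as pd

# Made-up CSV text: three "#" metadata lines (no commas), then the real header.
CSV_TEXT = """\
# Hypothetical metadata line one
# Hypothetical metadata line two
# Hypothetical metadata line three
station,altitude,air_temperature
A,100,280.5
B,200,279.1
"""


def read_csv_text(text, **kwargs):
    """Hypothetical helper: parse CSV text with pandas.read_csv."""
    return pd.read_csv(io.StringIO(text), **kwargs)


# With default arguments the first "#" line is taken as a one-column header,
# so the later three-field lines make the tokenizer fail.
try:
    read_csv_text(CSV_TEXT)
    default_read_failed = False
except pd.errors.ParserError:
    default_read_failed = True

# With comment="#" the fully commented lines are skipped and the real
# header row is used for the column names.
df = read_csv_text(CSV_TEXT, comment="#")
print(default_read_failed, list(df.columns))
```

The same keyword is what pandas_read_csv_kwargs={"comment": "#"} forwards to pandas in the snippet above.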

@malmans2
Contributor Author

Understood. Thanks for the clarification.

@sandorkertesz sandorkertesz self-assigned this Sep 19, 2023
@EddyCMWF
Contributor

EddyCMWF commented Oct 2, 2023

I am reopening this, as I think there are some issues we can address here to improve consistency across sources. Some differences may remain, but I think we can do better than the current implementation.

For further detail, these are the current default pandas_read_csv_kwargs for the different sources:

file source:

pandas_read_csv_kwargs = {}

cds source:

pandas_read_csv_kwargs = dict(
    comment="#",  # This is what creates the inconsistency, I like it but may be safer to drop
    parse_dates=["report_timestamp"],  # This is not even correct for all CDS csv datasets, so should go
    skip_blank_lines=True,  # This is the default value, so unnecessary
    compression="zip",  # the csv-reader overwrites this value, so unnecessary here
)

ecmwf_api source:

pandas_read_csv_kwargs = dict(
    sep="\t",  # This creates inconsistency
    comment="#",  # This creates inconsistency
    skip_blank_lines=True,  # This is the default value, so unnecessary
    skipinitialspace=True,  # This creates inconsistency
    compression="zip",  # the csv-reader overwrites this value, so unnecessary here
)
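
As a rough illustration of how per-source defaults and caller-supplied arguments could interact, the sketch below merges the defaults listed above with user overrides in plain Python. This is an assumption for discussion only, not how earthkit-data implements it: the names SOURCE_DEFAULTS and effective_read_csv_kwargs are hypothetical, and compression is left out because the csv-reader is said above to overwrite it.

```python
# Defaults copied from the lists above; "compression" is omitted because the
# csv-reader reportedly overwrites it. The dictionary itself is hypothetical.
SOURCE_DEFAULTS = {
    "file": {},
    "cds": {
        "comment": "#",
        "parse_dates": ["report_timestamp"],
        "skip_blank_lines": True,
    },
    "ecmwf_api": {
        "sep": "\t",
        "comment": "#",
        "skip_blank_lines": True,
        "skipinitialspace": True,
    },
}


def effective_read_csv_kwargs(source, user_kwargs=None):
    """Hypothetical helper: source defaults first, caller overrides last."""
    merged = dict(SOURCE_DEFAULTS[source])  # copy, so defaults stay untouched
    merged.update(user_kwargs or {})
    return merged


# The "file" source only gets what the caller passes explicitly.
file_kwargs = effective_read_csv_kwargs("file", {"comment": "#"})

# A caller can switch off a default that does not fit their dataset.
cds_kwargs = effective_read_csv_kwargs("cds", {"parse_dates": False})
print(file_kwargs, cds_kwargs)
```

Seen this way, the inconsistency is simply that the "file" entry is empty while the other two sources carry defaults the user never sees.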


3 participants