Metadata missing timezone breaks s3.read_parquet in awswrangler 3 #2667

Closed
MSDuncan82 opened this issue Feb 9, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@MSDuncan82

Describe the bug

We recently upgraded awswrangler from 2.20.1 to 3.5.2 (with pyarrow 15.0.0), and we now see an issue when calling s3.read_parquet on Parquet files created with pyarrow 8.0.0.

We have files that when read have metadata like below:

IN: print(table.schema.metadata)
{b'pandas': b'{"index_columns": [], "column_indexes": [], "columns": [{"name": "finish_time", "field_name": "finish_time", "pandas_type": "datetimetz", "numpy_type": "object", "metadata": null}], 
"creator": {"library": "pyarrow", "version": "8.0.0"}, "pandas_version": "1.5.1"}'}

Note the "metadata": null entry for the finish_time column in the data above.

This used to be okay in awswrangler 2:

def _apply_timezone(df: pd.DataFrame, metadata: Dict[str, Any]) -> pd.DataFrame:
    for c in metadata["columns"]:
        if "field_name" in c and c["field_name"] is not None:
            col_name = str(c["field_name"])
        elif "name" in c and c["name"] is not None:
            col_name = str(c["name"])
        else:
            continue
        if col_name in df.columns and c["pandas_type"] == "datetimetz":
            column_metadata: Dict[str, Any] = c["metadata"] if c.get("metadata") else {}
            timezone_str: Optional[str] = column_metadata.get("timezone")
            if timezone_str:
                timezone: datetime.tzinfo = pa.lib.string_to_tzinfo(timezone_str)
                _logger.debug("applying timezone (%s) on column %s", timezone, col_name)
                if hasattr(df[col_name].dtype, "tz") is False:
                    df[col_name] = df[col_name].dt.tz_localize(tz="UTC")
                df[col_name] = df[col_name].dt.tz_convert(tz=timezone)
    return df

But doesn't work in awswrangler 3:

def _apply_timezone(df: pd.DataFrame, metadata: dict[str, Any]) -> pd.DataFrame:
    for c in metadata["columns"]:
        if "field_name" in c and c["field_name"] is not None:
            col_name = str(c["field_name"])
        elif "name" in c and c["name"] is not None:
            col_name = str(c["name"])
        else:
            continue
        if col_name in df.columns and c["pandas_type"] == "datetimetz":
            timezone: datetime.tzinfo = pa.lib.string_to_tzinfo(c["metadata"]["timezone"])
            _logger.debug("applying timezone (%s) on column %s", timezone, col_name)
            if hasattr(df[col_name].dtype, "tz") is False:
                df[col_name] = df[col_name].dt.tz_localize(tz="UTC")
            if timezone is not None and timezone != pytz.UTC and hasattr(df[col_name].dt, "tz_convert"):
                df[col_name] = df[col_name].dt.tz_convert(tz=timezone)
    return df

The issue is this line:

timezone: datetime.tzinfo = pa.lib.string_to_tzinfo(c["metadata"]["timezone"])

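It throws an error because c["metadata"] is null (None in Python), so it cannot be indexed with ["timezone"]. The awswrangler 2 code avoided this by falling back to an empty dict when c["metadata"] was missing or null.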
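The failing lookup can be reproduced from the metadata alone, without the Parquet file. Below is a minimal sketch using only the standard library and the metadata printed earlier. The names `unguarded_timezone` and `guarded_timezone` are hypothetical and are not awswrangler functions; the guarded version just mirrors the fallback the awswrangler 2 code used, and is not necessarily what the eventual fix looks like.

```python
import json
from typing import Any, Optional

# Pandas metadata copied from the schema printed in this issue
# (a single datetimetz column whose "metadata" is null).
PANDAS_METADATA = json.loads(
    '{"index_columns": [], "column_indexes": [], "columns": [{"name": "finish_time", '
    '"field_name": "finish_time", "pandas_type": "datetimetz", "numpy_type": "object", '
    '"metadata": null}], "creator": {"library": "pyarrow", "version": "8.0.0"}, '
    '"pandas_version": "1.5.1"}'
)


def unguarded_timezone(column: dict[str, Any]) -> str:
    """Same lookup as the awswrangler 3 line: assumes "metadata" is a dict."""
    return column["metadata"]["timezone"]


def guarded_timezone(column: dict[str, Any]) -> Optional[str]:
    """Hypothetical helper: treat null or missing "metadata" as empty, as v2 did."""
    column_metadata = column.get("metadata") or {}
    return column_metadata.get("timezone")


column = PANDAS_METADATA["columns"][0]

try:
    unguarded_timezone(column)
except TypeError as exc:
    # Indexing None with ["timezone"] raises TypeError.
    print(f"unguarded lookup failed: {exc}")

# The guarded lookup returns None, so the caller can skip timezone handling.
print(guarded_timezone(column))
```

With the guarded lookup, a column with no timezone information is simply left as it was read, which matches the behaviour we relied on in awswrangler 2.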

Why did this functionality change and can we change it back?

How to Reproduce

Can't reproduce without attaching file

Expected behavior

No response

Your project

No response

Screenshots

No response

OS

Mac

Python version

3.10.12

AWS SDK for pandas version

3.5.1

Additional context

No response

@MSDuncan82 MSDuncan82 added the bug Something isn't working label Feb 9, 2024
@kukushking
Contributor

Hi @MSDuncan82, this was fixed in https://github.com/aws/aws-sdk-pandas/pull/1840/files, but it looks like an earlier version missing that fix ended up in _arrow.py after a refactor. I've opened a PR to fix this.
