Added timezone type to dfs when the corresponding pd Df also has timezones #1954

Open
wants to merge 2 commits into base: main

Conversation

frederiksteiner

Please answer these questions before submitting your pull requests. Thanks!

  1. What GitHub issue is this PR addressing? Make sure that there is an accompanying issue to your PR.

    Fixes SNOW-1444940: Write to pandas with datetime with timezone has no timezone in the resulting dataframe #1952

  2. Fill out the following pre-review checklist:

    • I am adding a new automated test(s) to verify correctness of my new code
    • I am adding new logging messages
    • I am adding a new telemetry message
    • I am modifying authorization mechanisms
    • I am adding new credentials
    • I am modifying OCSP code
    • I am adding a new dependency
  3. Please describe how your code solves the related issue.

    Checks whether the DataFrame has any timezone-aware columns (tz_columns) and adjusts the column mapping so that the resulting type is a timezone type.

Collaborator

@sfc-gh-aalam sfc-gh-aalam left a comment

The overall change looks good to me. I just need elaboration on one change.

Comment on lines 355 to 357
tz_columns = [
str(c).replace('"', '""') for c in df.columns if pandas.api.types.is_datetime64tz_dtype(df[c])
]
Collaborator

Can you explain why this change is necessary?

Author

Same reason as explained in the comment starting at line 352:

# if the column name contains a double quote, we need to escape it by replacing with two double quotes
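
To illustrate the two pieces discussed here, the following is a minimal sketch (not the connector's code) that selects timezone-aware columns and doubles any embedded double quote in their names. The sample column names are made up, and `isinstance(..., pd.DatetimeTZDtype)` is used as an assumed stand-in for `is_datetime64tz_dtype`, which newer pandas releases deprecate.

```python
import pandas as pd

# Hypothetical frame: one tz-aware column whose name contains a double quote,
# and one plain integer column.
df = pd.DataFrame(
    {
        'ts"utc': pd.to_datetime(["2024-07-01 08:54:18"]).tz_localize("UTC"),
        "n": [1],
    }
)

# Keep only tz-aware columns and escape '"' as '""' so the name can later be
# wrapped in double quotes as a quoted identifier.
tz_columns = [
    str(c).replace('"', '""')
    for c in df.columns
    if isinstance(df[c].dtype, pd.DatetimeTZDtype)
]

print(tz_columns)
```

Without the escaping step, a name such as `ts"utc` would end the quoted identifier early once it is placed between double quotes.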

@frederiksteiner
Author

Didn't manage to install tox on my device, so some checks might be failing :/

Collaborator

@sfc-gh-aalam sfc-gh-aalam left a comment

I retested this with your changes and I'm afraid this is not enough to fix the issue. For example, if you create a dataframe like so:

import pytz
import pandas as pd
from datetime import datetime

df = pd.DataFrame({"DT": [
    datetime.now(tz=pytz.timezone("Europe/Amsterdam")),
    datetime.now(tz=pytz.timezone("UTC")),
    ]})

print("is tz type =", pd.api.types.is_datetime64tz_dtype(df["DT"]))

The result here is False, so we will miss this case. Even if this is fixed, I notice that we are not correctly reading timezone information from the parquet file. I'll check with the internal team and update this.
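
The mixed-timezone case can be reproduced without pytz. The sketch below uses fixed UTC offsets from the standard library (an assumption made to keep it self-contained) and shows that the column does not get a timezone-aware dtype, while `pd.to_datetime(..., utc=True)` is one way to normalize it. It is an illustration only, not a proposed fix for `write_pandas`.

```python
from datetime import datetime, timedelta, timezone

import pandas as pd

# Two values describing the same instant, expressed with different UTC offsets.
df = pd.DataFrame(
    {
        "DT": [
            datetime(2024, 7, 1, 8, 54, 18, tzinfo=timezone(timedelta(hours=2))),
            datetime(2024, 7, 1, 6, 54, 18, tzinfo=timezone.utc),
        ]
    }
)

# A column holds a single dtype, so mixed offsets do not yield a tz-aware dtype.
is_tz_dtype = isinstance(df["DT"].dtype, pd.DatetimeTZDtype)
print("is tz type =", is_tz_dtype)

# Converting every value to UTC gives a proper tz-aware column.
normalized = pd.to_datetime(df["DT"], utc=True)
print(normalized.dtype)
```

Normalizing to UTC keeps the instants intact but discards the original per-row offsets, which may or may not be acceptable for a given table.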

@frederiksteiner
Author

Isn't that to be expected? Dtypes are per column, so a column that mixes timezones cannot be represented correctly anyway. But I think it gets handled correctly when saving to parquet:

import pytz
import pandas as pd
import pyarrow.parquet as pq
from datetime import datetime
df = pd.DataFrame({"DT": [
    datetime.now(tz=pytz.timezone("Europe/Amsterdam")),
    datetime.now(tz=pytz.timezone("UTC")),
    ]})

df.to_parquet("test.parquet")
pq.read_schema("test.parquet")

This gives the following output:
DT: timestamp[us, tz=Europe/Amsterdam]
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 366

And the resulting dataframe looks as follows when re-reading it:
DT
0 2024-07-01 08:54:18.603081+02:00
1 2024-07-01 08:54:18.603099+02:00
with dtypes: DT datetime64[us, Europe/Amsterdam].

@frederiksteiner
Author

@pytest.mark.parametrize("use_logical_type", [None, True, False])
def test_write_pandas_use_logical_type(
    conn_cnx: Callable[..., Generator[SnowflakeConnection, None, None]],
    use_logical_type: bool | None,
):
    table_name = random_string(5, "USE_LOCAL_TYPE_").upper()
    col_name = "DT"
    create_sql = f"CREATE OR REPLACE TABLE {table_name} ({col_name} TIMESTAMP_TZ)"
    select_sql = f"SELECT * FROM {table_name}"
    drop_sql = f"DROP TABLE IF EXISTS {table_name}"
    timestamp = datetime(
        year=2020,
        month=1,
        day=2,
        hour=3,
        minute=4,
        second=5,
        microsecond=6,
        tzinfo=timezone(timedelta(hours=2)),
    )
    # ----- changed/new lines start -----
    timestamp_2 = datetime(
        year=2020,
        month=1,
        day=2,
        hour=3,
        minute=4,
        second=5,
        microsecond=6,
        tzinfo=timezone(timedelta(hours=4)),
    )
    df_write = pandas.DataFrame({col_name: [timestamp, timestamp_2]})
    # ----- changed/new lines end -----

    with conn_cnx() as cnx:  # type: SnowflakeConnection
        cnx.cursor().execute(create_sql).fetchall()

        write_pandas_kwargs = dict(
            conn=cnx,
            df=df_write,
            use_logical_type=use_logical_type,
            auto_create_table=False,
            table_name=table_name,
        )

        try:
            # When use_logical_type = True, datetimes with timestamps should be
            # correctly written to Snowflake.
            if use_logical_type:
                write_pandas(**write_pandas_kwargs)
                df_read = cnx.cursor().execute(select_sql).fetch_pandas_all()
                assert all(df_write == df_read)
                assert pandas.api.types.is_datetime64tz_dtype(df_read[col_name])
            # For other use_logical_type values, a UserWarning should be displayed.
            else:
                with pytest.warns(UserWarning, match="Dataframe contains a datetime.*"):
                    write_pandas(**write_pandas_kwargs)
        finally:
            cnx.execute_string(drop_sql)

When I add an example similar to yours to the unit test, everything works with use_logical_type=True, but not when it is None or False, since the check here then fails.

Hence this check probably needs adaptation.
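
One possible direction for such an adaptation is sketched below. The helper name `has_tz_aware_datetimes` is hypothetical and this is not the connector's actual check; it only assumes that a column should count as timezone-aware either when it has a tz-aware dtype or when it is an object column containing tz-aware `datetime` values.

```python
from datetime import datetime, timedelta, timezone

import pandas as pd


def has_tz_aware_datetimes(col: pd.Series) -> bool:
    """Return True if the column holds timezone-aware datetimes (hypothetical helper)."""
    # Single-timezone columns carry the information in the dtype itself.
    if isinstance(col.dtype, pd.DatetimeTZDtype):
        return True
    # Mixed-timezone columns fall back to object dtype, so inspect the values.
    if col.dtype == object:
        return any(
            isinstance(v, datetime) and v.tzinfo is not None for v in col.dropna()
        )
    return False


mixed = pd.Series(
    [
        datetime(2020, 1, 2, 3, 4, 5, 6, tzinfo=timezone(timedelta(hours=2))),
        datetime(2020, 1, 2, 3, 4, 5, 6, tzinfo=timezone(timedelta(hours=4))),
    ]
)
naive = pd.Series([datetime(2020, 1, 2, 3, 4, 5, 6)])

print(has_tz_aware_datetimes(mixed), has_tz_aware_datetimes(naive))
```

Scanning object columns value by value costs time on large frames, so a real implementation might sample or stop at the first match, as `any` does here.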

… adapted test such that we have a case with two different timezones
Successfully merging this pull request may close these issues.

SNOW-1444940: Write to pandas with datetime with timezone has no timezone in the resulting dataframe
3 participants