Added timezone type to dfs when the corresponding pd Df also has timezones #1954

frederiksteiner · 2024-05-24T08:15:16Z

Please answer these questions before submitting your pull requests. Thanks!

What GitHub issue is this PR addressing? Make sure that there is an accompanying issue to your PR.

Fixes SNOW-1444940: Write to pandas with datetime with timezone has no timezone in the resulting dataframe #1952
Fill out the following pre-review checklist:
- I am adding a new automated test(s) to verify correctness of my new code
- I am adding new logging messages
- I am adding a new telemetry message
- I am modifying authorization mechanisms
- I am adding new credentials
- I am modifying OCSP code
- I am adding a new dependency
Please describe how your code solves the related issue.

Checks if there are any tz_columns and changes the column mapping such that the resulting type is a timezone type

…so has timezone

sfc-gh-aalam

The overall change looks good to me. Just need elaboration on one change

sfc-gh-aalam · 2024-06-28T17:20:41Z

src/snowflake/connector/pandas_tools.py

+        tz_columns = [
+            str(c).replace('"', '""') for c in df.columns if pandas.api.types.is_datetime64tz_dtype(df[c])
+        ]


can you explain why this change is necessary?

Same reason as explained in the comment starting at line 352

snowflake-connector-python/src/snowflake/connector/pandas_tools.py

Line 352 in 92808e5

# if the column name contains a double quote, we need to escape it by replacing with two double quotes

frederiksteiner · 2024-06-28T20:20:34Z

Didn‘t manage to install tox on my device, so some checks might be failing :/

sfc-gh-aalam

I retested this with your changes and I'm afraid this is not enough to fix the issue. For example if you create dataframe like so

df = pd.DataFrame({"DT": [
    datetime.now(tz=pytz.timezone("Europe/Amsterdam")),
    datetime.now(tz=pytz.timezone("UTC")),
    ]})

print("is tz type =", pd.api.types.is_datetime64tz_dtype(df["DT"]))

the result here is False so we will miss this case. Even if this is fixed, I notice that we are not correctly reading timezone information from the parquet file. I'll check with internal team and update this.

frederiksteiner · 2024-07-01T07:44:12Z

Isn't that to be expected? Aren't types per column and hence it cannot be done correclty anyways. But I think it gets handled correctly when saving to parquet:

import pytz
import pandas as pd
import pyarrow.parquet as pq
from datetime import datetime
df = pd.DataFrame({"DT": [
    datetime.now(tz=pytz.timezone("Europe/Amsterdam")),
    datetime.now(tz=pytz.timezone("UTC")),
    ]})

df.to_parquet("test.parquet")
pq.read_schema("test.parquet")

This gives the following output:
DT: timestamp[us, tz=Europe/Amsterdam]
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 366
And the resulting dataframe looks as follows if re reading it:
DT
0 2024-07-01 08:54:18.603081+02:00
1 2024-07-01 08:54:18.603099+02:00
with dtypes: DT datetime64[us, Europe/Amsterdam].

frederiksteiner · 2024-07-01T07:57:40Z

@pytest.mark.parametrize("use_logical_type", [None, True, False])
def test_write_pandas_use_logical_type(
    conn_cnx: Callable[..., Generator[SnowflakeConnection, None, None]],
    use_logical_type: bool | None,
):
    table_name = random_string(5, "USE_LOCAL_TYPE_").upper()
    col_name = "DT"
    create_sql = f"CREATE OR REPLACE TABLE {table_name} ({col_name} TIMESTAMP_TZ)"
    select_sql = f"SELECT * FROM {table_name}"
    drop_sql = f"DROP TABLE IF EXISTS {table_name}"
    timestamp = datetime(
        year=2020,
        month=1,
        day=2,
        hour=3,
        minute=4,
        second=5,
        microsecond=6,
        tzinfo=timezone(timedelta(hours=2)),
    )
###################### changed/new lines start
    timestamp_2 = datetime(
        year=2020,
        month=1,
        day=2,
        hour=3,
        minute=4,
        second=5,
        microsecond=6,
        tzinfo=timezone(timedelta(hours=4)),
    )
    df_write = pandas.DataFrame({col_name: [timestamp, timestamp_2]})
#####changed lines end

    with conn_cnx() as cnx:  # type: SnowflakeConnection
        cnx.cursor().execute(create_sql).fetchall()

        write_pandas_kwargs = dict(
            conn=cnx,
            df=df_write,
            use_logical_type=use_logical_type,
            auto_create_table=False,
            table_name=table_name,
        )

        try:
            # When use_logical_type = True, datetimes with timestamps should be
            # correctly written to Snowflake.
            if use_logical_type:
                write_pandas(**write_pandas_kwargs)
                df_read = cnx.cursor().execute(select_sql).fetch_pandas_all()
                assert all(df_write == df_read)
                assert pandas.api.types.is_datetime64tz_dtype(df_read[col_name])
            # For other use_logical_type values, a UserWarning should be displayed.
            else:
                with pytest.warns(UserWarning, match="Dataframe contains a datetime.*"):
                    write_pandas(**write_pandas_kwargs)
        finally:
            cnx.execute_string(drop_sql)

When adding a similar example as yours to the unittest, then everything works when setting use_logical_type=True. But not when it is None or False, since then the check here fails.

Hence this check probably needs adaptation.

… adapted test such that we have a case with two different timezones

Added timezone type to dfs when the corresponding pandas dataframe al…

92808e5

…so has timezone

sfc-gh-sghosh mentioned this pull request May 29, 2024

SNOW-1444940: Write to pandas with datetime with timezone has no timezone in the resulting dataframe #1952

Closed

sfc-gh-aalam reviewed Jun 28, 2024

View reviewed changes

sfc-gh-aalam approved these changes Jun 28, 2024

View reviewed changes

sfc-gh-aalam requested changes Jun 28, 2024

View reviewed changes

Updated check of datetimes when not setting use_logical_type=True and…

7a944a1

… adapted test such that we have a case with two different timezones

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Added timezone type to dfs when the corresponding pd Df also has timezones #1954

Added timezone type to dfs when the corresponding pd Df also has timezones #1954

frederiksteiner commented May 24, 2024

sfc-gh-aalam left a comment

sfc-gh-aalam Jun 28, 2024

frederiksteiner Jun 28, 2024

frederiksteiner commented Jun 28, 2024

sfc-gh-aalam left a comment

frederiksteiner commented Jul 1, 2024

frederiksteiner commented Jul 1, 2024

Added timezone type to dfs when the corresponding pd Df also has timezones #1954

Are you sure you want to change the base?

Added timezone type to dfs when the corresponding pd Df also has timezones #1954

Conversation

frederiksteiner commented May 24, 2024

sfc-gh-aalam left a comment

Choose a reason for hiding this comment

sfc-gh-aalam Jun 28, 2024

Choose a reason for hiding this comment

frederiksteiner Jun 28, 2024

Choose a reason for hiding this comment

frederiksteiner commented Jun 28, 2024

sfc-gh-aalam left a comment

Choose a reason for hiding this comment

frederiksteiner commented Jul 1, 2024

frederiksteiner commented Jul 1, 2024