
convert time columns to dbtime by default in to_dataframe #862

Closed
tswast opened this issue Aug 6, 2021 · 19 comments
Assignees
Labels
api: bigquery — Issues related to the googleapis/python-bigquery API.
semver: major — Hint for users that this is an API breaking change.
type: feature request — ‘Nice-to-have’ improvement, new feature or different behavior or design.

Comments

@tswast
Contributor

tswast commented Aug 6, 2021

Currently, TIME columns are just exposed as string objects. Converting them by default would be a better experience and align better with the expectations for working with time series in pandas: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html

Presumably one could combine a date column with a time column to create a datetime by adding them.
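For example (a sketch with made-up column names), if the TIME column were surfaced as timedelta64, the addition would work directly:

```python
import pandas as pd

# Hypothetical frame: date_col as datetime64[ns], time_col as timedelta64[ns].
df = pd.DataFrame({
    "date_col": pd.to_datetime(["2021-08-06", "2021-08-07"]),
    "time_col": pd.to_timedelta(["09:30:00", "14:40:00"]),
})
# datetime64 + timedelta64 vectorizes into a datetime column.
df["datetime_col"] = df["date_col"] + df["time_col"]
print(df["datetime_col"].dtype)  # datetime64[ns]
```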

@product-auto-label product-auto-label bot added the api: bigquery Issues related to the googleapis/python-bigquery API. label Aug 6, 2021
@tswast tswast added semver: major Hint for users that this is an API breaking change. type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. labels Aug 6, 2021
@tswast tswast self-assigned this Aug 6, 2021
@tswast tswast assigned jimfulton and plamut and unassigned tswast Aug 31, 2021
@jimfulton
Contributor

Why timedelta and not time?

Do people use BQ TIME columns to represent time deltas?

If we're looking for a way to express timedeltas, I wonder if it might be better to leverage structs somehow.

@tswast
Contributor Author

tswast commented Sep 1, 2021

Is there a time dtype? I didn't see it here: https://pandas.pydata.org/pandas-docs/stable/user_guide/basics.html#basics-dtypes

timedelta seemed to make the most sense to me as date + timedelta = datetime

@tswast
Contributor Author

tswast commented Sep 1, 2021

Note: this is related to the expected dtypes in the v3 branch:

# Check for expected dtypes.
# Keep these in sync with tests/system/test_pandas.py
assert df.dtypes["bignumeric_col"].name == "object"
assert df.dtypes["bool_col"].name == "boolean"
assert df.dtypes["bytes_col"].name == "object"
assert df.dtypes["date_col"].name == "object"
assert df.dtypes["datetime_col"].name == "datetime64[ns]"
assert df.dtypes["float64_col"].name == "float64"
assert df.dtypes["int64_col"].name == "Int64"
assert df.dtypes["numeric_col"].name == "object"
assert df.dtypes["string_col"].name == "object"
assert df.dtypes["time_col"].name == "object"
assert df.dtypes["timestamp_col"].name == "datetime64[ns, UTC]"

@tswast
Contributor Author

tswast commented Sep 1, 2021

Looking at that, I think date_col should have a datetime64[ns] dtype too (at least when within the allowed range in pandas).

https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#generating-ranges-of-timestamps

That's what #861 is about -- picking the right dtype for date columns by default.

@jimfulton
Contributor

Is there a time dtype? I didn't see it here: https://pandas.pydata.org/pandas-docs/stable/user_guide/basics.html#basics-dtypes

Well, I was thinking of datetime.time, which would have a dtype of object.

@jimfulton
Contributor

timedelta seemed to make the most sense to me as date + timedelta = datetime

Sure, but would BQ users expect TIME columns to be used that way? I think I'd be surprised.

>>> df = pandas.DataFrame(dict(t=[datetime.time(0, 0, 0), datetime.time(14, 40, 0)]))
>>> df
          t
0  00:00:00
1  14:40:00
>>> df = pandas.DataFrame(dict(t=[datetime.timedelta(hours=0, minutes=0, seconds=0), datetime.timedelta(hours=14, minutes=40, seconds=0)]))
>>> df
                t
0 0 days 00:00:00
1 0 days 14:40:00

🤷

In Python, you can't add datetime.datetime and datetime.time values.
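A quick stdlib-only illustration: the addition raises, while datetime.combine is the supported route for merging a date with a time:

```python
import datetime

d = datetime.datetime(2021, 9, 1)
t = datetime.time(14, 40)

try:
    d + t  # unsupported operand types: datetime + time
    added = True
except TypeError:
    added = False

# datetime.combine is how the stdlib merges a date and a time.
combined = datetime.datetime.combine(d.date(), t)
print(added, combined)  # False 2021-09-01 14:40:00
```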

@jimfulton
Contributor

jimfulton commented Sep 1, 2021

Looking at that, I think date_col should have a datetime64[ns] dtype too (at least when within allowed range in pandas)

A datetime/timestamp represents an instantaneous point in time.

A date represents a time interval (typically from midnight to midnight). Mapping a date to a datetime is similar, if less egregious :), to mapping a month or year to a datetime.

https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#generating-ranges-of-timestamps

That's what #861 is about -- picking the right dtype for date columns by default.

Sure, but "right" is somewhat relative, and sometimes hard to pin down. Is rightness judged by accuracy or convenience?

@jimfulton
Contributor

Anyway, I'll implement whatever you wish. :)

@jimfulton
Contributor

@tswast at your convenience, please let me know what you want me to do or if you'd like to discuss.

@tswast
Contributor Author

tswast commented Sep 3, 2021

Some thoughts on the pandas connector:

  • I'd like to avoid object dtype whenever there exists a lossless, null-safe conversion to a native pandas dtype (or possibly an extension type, if widely used and compatibly licensed, etc.), because:
    (1) object dtype uses quite a bit more memory than a native dtype.
    (2) object columns are much more likely to be slower, as pandas can't "vectorize" operations on them.
  • When deciding between dtypes, if there are common operations we expect folks to be doing with them, the most common operations should be made easy.
  • If there is a convention in the pandas community, we should follow it (unless it is lossy, such as using float64 to represent nullable integers).
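The memory point can be checked directly (a rough sketch; exact byte counts vary by platform and pandas version):

```python
import datetime
import pandas as pd

times = [datetime.time(h, 0) for h in range(24)] * 1000

# object dtype: one boxed datetime.time per row, plus a pointer each.
as_objects = pd.Series(times)
# timedelta64[ns]: a fixed 8 bytes per row.
as_timedelta = pd.Series(pd.to_timedelta([t.isoformat() for t in times]))

print(as_objects.dtype, as_objects.memory_usage(deep=True))
print(as_timedelta.dtype, as_timedelta.memory_usage(deep=True))
# The object column is several times larger and can't use vectorized ops.
```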

In favor of timedelta:

  • Not object :-)
  • Easy to combine with a date column (represented as datetime64 dtype) to create a datetime column.

In favor of datetime.time (object dtype):

Confusing parts:

Conclusion

We should probably match read_sql and use object dtype with datetime.time objects. I'm still not certain about this... Hmm...

Longer-term, we should probably make a native time dtype or extension package and encourage the pandas community to adopt it.

@tswast tswast changed the title convert time columns to timedelta by default in to_dataframe convert time columns to datetime.time by default in to_dataframe Sep 3, 2021
@tswast
Contributor Author

tswast commented Sep 3, 2021

Unless it turns out to be relatively easy to make a time dtype... In that case, maybe we should just skip straight to that???

@tswast
Contributor Author

tswast commented Sep 3, 2021

Pro timedelta: pandas-dev/pandas#10329 -- folks do want to add them...

@tswast tswast changed the title convert time columns to datetime.time by default in to_dataframe convert time columns to timedelta by default in to_dataframe Sep 3, 2021
@jimfulton
Contributor

Unless it turns out to be relatively easy to make a time dtype... In that case, maybe we should just skip straight to that???

I think adding a time dtype will be straightforward using the newish extension types (ExtensionDtype and ExtensionArray). This is the mechanism used to implement the new StringDtype, which, BTW, still uses the numpy object type under the hood.
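For reference (not this repo's code), the extension-type plumbing is visible on StringDtype itself; a dbtime dtype would register through the same interface:

```python
import pandas as pd
from pandas.api.extensions import ExtensionDtype

# StringDtype is an ExtensionDtype registered under the name "string";
# a TimeDtype could be registered under a name like "dbtime" the same way.
s = pd.Series(["a", "b", None], dtype="string")
print(isinstance(s.dtype, ExtensionDtype), s.dtype.name)  # True string
```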

@jimfulton
Contributor

jimfulton commented Sep 7, 2021

Pro timedelta: pandas-dev/pandas#10329 -- folks do want to add them...

In that issue, they acknowledge that "I should have converted it to a timedelta" and are mainly complaining about the error message. Just because someone expects something to work doesn't mean it should. :)

IMO, it would be most natural to convert interval data to timedelta; however, see #949 and #950.

IMO, it should also be possible to combine date and time values to create datetime values.

@tswast tswast assigned tswast and unassigned jimfulton Dec 15, 2021
@tswast
Contributor Author

tswast commented Dec 15, 2021

Reopening, as I notice that when I check out v3 locally, I don't always get the dbtime dtype back.

(dev-3.9) ➜  python-bigquery-pandas git:(issue365-extreme-datetimes) ✗ pytest tests/system/test_read_gbq.py::test_default_dtypes'[times-nonnull-normal-range-True]'
=============================================== test session starts ================================================
platform darwin -- Python 3.9.5, pytest-6.2.5, py-1.10.0, pluggy-0.13.1
rootdir: /Users/swast/src/github.com/googleapis/python-bigquery-pandas
plugins: cov-2.12.1, asyncio-0.15.1, anyio-3.3.0, requests-mock-1.9.3
collected 1 item                                                                                                   

tests/system/test_read_gbq.py F                                                                              [100%]

===================================================== FAILURES =====================================================
_______________________________ test_default_dtypes[times-nonnull-normal-range-True] _______________________________

read_gbq = functools.partial(<function read_gbq at 0x117b27b80>, project_id='swast-scratch', credentials=<google.oauth2.credentials.Credentials object at 0x117b0d0d0>)
query = "\nSELECT\n  row_num,\n  time_col\nFROM\n  UNNEST([\n      STRUCT(1 AS row_num, CAST('00:00:00.000000' AS TIME) AS tim...   STRUCT(3 AS row_num, CAST('23:59:59.999999' AS TIME) AS time_col) ]) AS `times`\nORDER BY row_num ASC\n            "
use_bqstorage_api = True
expected =    row_num         time_col
0        1         00:00:00
1        2  09:08:07.654321
2        3  23:59:59.999999

    @pytest.mark.parametrize(["use_bqstorage_api"], [(True,), (False,)])
    @pytest.mark.parametrize(
        ["query", "expected"],
        [
.
.
.
            pytest.param(
                """
    SELECT
      row_num,
      time_col
    FROM
      UNNEST([
          STRUCT(1 AS row_num, CAST('00:00:00.000000' AS TIME) AS time_col),
          STRUCT(2 AS row_num, CAST('09:08:07.654321' AS TIME) AS time_col),
          STRUCT(3 AS row_num, CAST('23:59:59.999999' AS TIME) AS time_col) ]) AS `times`
    ORDER BY row_num ASC
                """,
                pandas.DataFrame(
                    {
                        "row_num": pandas.Series([1, 2, 3], dtype="Int64"),
                        "time_col": pandas.Series(
                            ["00:00:00.000000", "09:08:07.654321", "23:59:59.999999"],
                            dtype=db_dtypes.TimeDtype(),
                        )
                    }
                ),
                id="times-nonnull-normal-range",
            ),
        ],
    )
    def test_default_dtypes(read_gbq, query, use_bqstorage_api, expected):
        result = read_gbq(query, use_bqstorage_api=use_bqstorage_api)
>       pandas.testing.assert_frame_equal(result, expected)
E       AssertionError: Attributes of DataFrame.iloc[:, 1] (column name="time_col") are different
E       
E       Attribute "dtype" are different
E       [left]:  object
E       [right]: dbtime

tests/system/test_read_gbq.py:158: AssertionError
----------------------------------------------- Captured stderr call -----------------------------------------------
Downloading: 100%|██████████| 3/3 [00:02<00:00,  1.13rows/s]
============================================= short test summary info ==============================================
FAILED tests/system/test_read_gbq.py::test_default_dtypes[times-nonnull-normal-range-True] - AssertionError: Attr...
================================================ 1 failed in 5.05s =================================================

@tswast tswast changed the title convert time columns to timedelta by default in to_dataframe convert time columns to dbtime by default in to_dataframe Dec 15, 2021
@plamut
Contributor

plamut commented Dec 15, 2021

@tswast Is it possible that the most recent sync with main (through the type-check PR) caused this? Is this reproducible on v3~1, i.e. 3b3ebff?

@tswast
Contributor Author

tswast commented Dec 15, 2021

Good idea. I'll give that a try.

@tswast
Contributor Author

tswast commented Dec 15, 2021

I get the same error with 3d1af95. Seeing as we have a test for this dtype at https://github.com/googleapis/python-bigquery/blob/v3/tests/system/test_pandas.py#L1033, I'll look more closely at googleapis/python-bigquery-pandas#444 to see why I'm getting this. Possibly pandas-gbq is doing something to cast TIME columns to object.

@tswast
Contributor Author

tswast commented Dec 15, 2021

@tswast tswast closed this as completed Dec 15, 2021

3 participants