Pandera timezone-agnostic datetime type #1352

Open
max-raphael opened this issue Sep 26, 2023 · 8 comments
Labels
enhancement New feature or request

Comments

@max-raphael

max-raphael commented Sep 26, 2023

Is your feature request related to a problem? Please describe.
When defining a class that inherits from DataFrameModel, I want to define a field whose values are timezone-aware datetimes. However, I cannot know at class-definition time which timezone that will be. In other words, in dataframe A they may be datetimes with tz="America/New_York", and in dataframe B they may be datetimes with tz="America/Los_Angeles". As far as I can tell, there is no type I can assign that accepts datetimes with a timezone without naming a specific timezone in the type hint.

Describe the solution you'd like
I would like there to be a type that I can use to say "this field will be datetimes, but I can't say what the timezone will be."

Describe alternatives you've considered

When setting the type of the field to datetime.datetime, pandera.dtypes.DateTime, etc., I get a pandera SchemaError saying that the series was expected to have type datetime64[ns] but got datetime64[ns, America/New_York] (for example).

I have also tried with DatetimeTZDtype, but that won't work because I need to specify the timezone I want (which I can't do upfront).
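To illustrate the mismatch described above, here is a minimal sketch using pandas only. It shows that two tz-aware series carry different dtypes, and that an `isinstance` check against `pd.DatetimeTZDtype` accepts any timezone. The helper name `has_any_timezone` is hypothetical and not part of pandera.

```python
import pandas as pd


def has_any_timezone(series: pd.Series) -> bool:
    """Return True if the series holds tz-aware datetimes, whatever the zone.

    Hypothetical helper for illustration; pandera does not provide it.
    """
    return isinstance(series.dtype, pd.DatetimeTZDtype)


# A naive series and two localized variants of it.
naive = pd.Series(pd.to_datetime(["2023-09-26 12:00"]))
new_york = naive.dt.tz_localize("America/New_York")
los_angeles = naive.dt.tz_localize("America/Los_Angeles")

# The timezone is part of the dtype, so the two tz-aware dtypes differ
# and a single concrete DatetimeTZDtype cannot match both.
dtypes_differ = new_york.dtype != los_angeles.dtype
```

A check like this only inspects the dtype; it says nothing about which timezone the values should end up in, which is where coercion becomes a question.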

Additional context

Example Schema:

class MySchema(DataFrameModel):
    local_datetime: <what type do I set here?>

@max-raphael max-raphael added the enhancement New feature or request label Sep 26, 2023
@cosmicBboy
Collaborator

Hi @max-raphael this is somewhat of a challenging use case to fulfill with datetimes because if we have a timezone-agnostic datetime, how do we deal with coercion?

Imagine we support something like:

class MySchema(DataFrameModel):
    local_datetime: DateTime(has_tz=True)  # just checks that the datetimes have any timezone

    class Config:
        coerce = True

If we do coerce=True, what timezone should we coerce to? Solutions here would be:

  • Default to UTC
  • Raise an exception

@cosmicBboy
Collaborator

This is similar to the problem of having a generic Number type: this can check if the data type is any of the int or float types, but when we coerce, what data type should it default to?
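The Number analogy can be sketched in plain Python. This is an illustration only, not pandera code; `is_number` and `coerce_candidates` are hypothetical names. Checking membership in a family of types is unambiguous, but coercion needs one concrete target, and more than one target satisfies the check.

```python
def is_number(value: object) -> bool:
    """Accept any int or float (bool is excluded on purpose)."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def coerce_candidates(raw: str) -> list:
    """Coerce the raw value to every type the generic check would accept."""
    return [target(raw) for target in (int, float)]


# Both results pass the generic check, so the check alone cannot decide
# which one coercion should produce.
candidates = coerce_candidates("3")
all_valid = all(is_number(value) for value in candidates)
```

A timezone-agnostic datetime type has the same shape: every timezone passes the check, so coercion needs an explicit default or an explicit refusal.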

@max-raphael
Author

max-raphael commented Sep 27, 2023

I hear you, that does pose a tricky problem. Thinking about it from my perspective as a user, I think I would prefer to have this as an option but be disallowed from coercing this field (via some Exception) due to the ambiguous nature of the data type rather than not have it accessible to me at all.

Perhaps even an Exception is too much. Pandera could still allow users to specify coerce=True and coerce other fields, and add a warning level log statement that informs the user that this field cannot be coerced due to its data type.

@cosmicBboy
Collaborator

> Perhaps even an Exception is too much. We could still allow users to specify coerce=True and coerce other fields, and add a warning level log statement that informs the user that this field cannot be coerced due to its data type.

How would you feel about defaulting to UTC on coercion (if the incoming raw data is not TZ-aware) and raising a warning that the dtypes are coerced to UTC? I generally like to do something rather than nothing on coercion to prevent propagation of surprise (i.e. a non-TZ aware dataframe after validation with coerce=True).
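The behaviour proposed here could look roughly like the following sketch, written against pandas only. The function name `coerce_to_any_timezone` and its `default_tz` parameter are assumptions for illustration; this is not how pandera implements coercion.

```python
import warnings

import pandas as pd


def coerce_to_any_timezone(series: pd.Series, default_tz: str = "UTC") -> pd.Series:
    """Leave tz-aware data untouched; localize naive data to default_tz.

    Hypothetical helper: emits a warning whenever it has to pick a timezone.
    """
    if isinstance(series.dtype, pd.DatetimeTZDtype):
        # Already tz-aware: keep whatever timezone the data came with.
        return series
    warnings.warn(
        f"Series {series.name!r} is not tz-aware; coercing to {default_tz}.",
        UserWarning,
        stacklevel=2,
    )
    # Parse to datetimes if needed, then attach the default timezone.
    return pd.to_datetime(series).dt.tz_localize(default_tz)
```

With this rule, a frame validated with coerce=True always comes out tz-aware, and the warning tells the user when the timezone was chosen for them.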

@max-raphael
Author

That seems acceptable to me. I think if incoming data is not tz-aware, then that's a reasonable approach so long as Pandera logs the warning and includes it in the documentation!

@max-raphael
Author

@cosmicBboy Hi, just following up here. Are we aligned on the feature? If so, what are the next steps? Thanks again for engaging with this, I think it would be helpful to many Pandera users.

@cosmicBboy
Collaborator

Yep! Feel free to make a PR with changes to the DateTime type: https://github.com/unionai-oss/pandera/blob/main/pandera/engines/pandas_engine.py#L792C3-L792C3 and add new unit tests in the appropriate test module.

Also check out the contributing guide if it's your first time contributing: https://pandera.readthedocs.io/en/stable/CONTRIBUTING.html

max-raphael added a commit to max-raphael/pandera that referenced this issue Nov 1, 2023
@erinboyle

I'm interested in this, too. Adding a comment under the "Union" issue for this reason. #1152 (comment)
