Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stats for datetimes #3007

Open
wants to merge 52 commits into
base: main
Choose a base branch
from
Open

Stats for datetimes #3007

wants to merge 52 commits into from

Conversation

polinaeterna
Copy link
Contributor

No description provided.



class Histogram(TypedDict):
hist: list[int]
bin_edges: list[Union[int, float]]


class DatetimeHistogram(TypedDict):
hist: list[int]
bin_edges: list[str] # edges are string representations of dates
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

orjson supports datetime serialization though, maybe I should return datetimes then?

transformed_column_name: str,
min_date: datetime.datetime,
) -> pl.DataFrame:
return data.select((pl.col(column_name) - min_date).dt.total_seconds().alias(transformed_column_name))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

operate on seconds

not sure it works as expected
test fail is not reproduced locally
to debug why test is not passed in CI but works locally
@polinaeterna
Copy link
Contributor Author

polinaeterna commented Jan 13, 2025

I have no idea how datetime.strptime("2024-01-01 00:00:00CET", "%Y-%m-%d %H:%M:%S%Z") might work locally but fail on CI 😭

upd: this is actually not a common timestamp format, so maybe i'd ignore it

Copy link
Member

@lhoestq lhoestq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cool, not sure we can merge as is though, does it need any change on the front end side ? cc @severo for viz

Also this change requires implementing its associated filtering for /filter. I think DuckDB can filter after casting the strings to timestamps

services/worker/src/worker/statistics_utils.py Outdated Show resolved Hide resolved
@severo
Copy link
Collaborator

severo commented Jan 14, 2025

Super useful. For the frontend, first: we will need the openapi spec to be updated, so that the API is clear. But I think it should work directly. One detail though: the edges are long strings, and THIS might break the design, so, surely something to add in moonlanding.

n_samples=n_samples,
)
return stats
except Exception as error:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've decided to fallback to string stats if datetime compute failed.
This would allow cases like this (here there are multiple datetime formats in Gold published date column) to display something instead of empty stats.
I test this case in "datetime_string_error" column in test dataset.

This will probably make filter part trickier though.

if isinstance(data[column_name].dtype, pl.String):
# let polars identify format itself. provide manually in case of error
try:
original_timezone = get_timezone(data[column_name][0])
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I track original timezone here because polars converts datetimes to UTC when format is not provided and we need to switch it back later

@polinaeterna polinaeterna marked this pull request as ready for review January 17, 2025 12:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants