Stats for datetimes #3007
class Histogram(TypedDict):
    hist: list[int]
    bin_edges: list[Union[int, float]]


class DatetimeHistogram(TypedDict):
    hist: list[int]
    bin_edges: list[str]  # edges are string representations of dates
orjson supports datetime serialization, though, so maybe I should return datetimes instead?
    transformed_column_name: str,
    min_date: datetime.datetime,
) -> pl.DataFrame:
    return data.select((pl.col(column_name) - min_date).dt.total_seconds().alias(transformed_column_name))
operate on seconds
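To illustrate the "operate on seconds" idea together with string bin edges, here is a minimal standard-library sketch. It is not the PR's implementation: `datetime_histogram`, its `n_bins` parameter, and the equal-width binning are assumptions made for illustration only. The datetimes are turned into seconds elapsed since the minimum, binned as plain numbers, and the edges are converted back to strings.

```python
import datetime


def datetime_histogram(values: list[datetime.datetime], n_bins: int = 4) -> dict:
    """Hypothetical helper: bin datetimes via seconds elapsed since the minimum."""
    min_date = min(values)
    # Same idea as (pl.col(column_name) - min_date).dt.total_seconds()
    seconds = [(v - min_date).total_seconds() for v in values]
    max_seconds = max(seconds)
    if max_seconds == 0:
        # All values are identical: a single bin holds everything.
        return {"hist": [len(values)], "bin_edges": [str(min_date), str(min_date)]}
    bin_size = max_seconds / n_bins
    hist = [0] * n_bins
    for s in seconds:
        # The maximum value goes into the last bin (right edge inclusive).
        idx = min(int(s // bin_size), n_bins - 1)
        hist[idx] += 1
    # Convert the numeric edges back to datetimes, then to strings.
    edges = [
        str(min_date + datetime.timedelta(seconds=i * bin_size))
        for i in range(n_bins + 1)
    ]
    return {"hist": hist, "bin_edges": edges}


dates = [datetime.datetime(2024, 1, 1) + datetime.timedelta(days=d) for d in (0, 1, 2, 5, 8)]
result = datetime_histogram(dates, n_bins=4)
print(result)
```

Working on seconds keeps the binning logic identical to the numeric case; only the final edge conversion is datetime-specific.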
currently not passing, bug on polars side?
…er into datetime-stats
not sure it works as expected
test failure is not reproduced locally
debug why the test fails in CI but passes locally
I have no idea how. Update: this is actually not a common timestamp format, so maybe I'd ignore it.
Cool, but I'm not sure we can merge as is. Does it need any change on the frontend side? cc @severo for viz
Also this change requires implementing its associated filtering for /filter. I think DuckDB can filter after casting the strings to timestamps
Super useful. For the frontend, first: we will need the OpenAPI spec to be updated, so that the API is clear. But I think it should work directly. One detail though: the edges are long strings, and this might break the design, so it is surely something to address in moonlanding.
Co-authored-by: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com>
polars switches dates to utc when casting from string while we want to preserve original dates
provide manually only in case of failure
            n_samples=n_samples,
        )
        return stats
    except Exception as error:
I've decided to fall back to string stats if the datetime computation fails.
This would allow cases like this (here there are multiple datetime formats in the "Gold published date" column) to display something instead of empty stats.
I test this case in the "datetime_string_error" column of the test dataset.
This will probably make the filter part trickier, though.
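A rough sketch of that fallback, using only the standard library. The function names, the returned dictionaries, and the use of `datetime.fromisoformat` as the parser are all hypothetical stand-ins, not the PR's code; the point is only the try/except shape.

```python
import datetime


def compute_datetime_stats(values: list[str]) -> dict:
    # Hypothetical stand-in: raises ValueError if any value is not ISO-like.
    parsed = [datetime.datetime.fromisoformat(v) for v in values]
    return {"type": "datetime", "min": str(min(parsed)), "max": str(max(parsed))}


def compute_string_stats(values: list[str]) -> dict:
    # Hypothetical stand-in for string statistics: here, only text lengths.
    lengths = [len(v) for v in values]
    return {"type": "string", "min": min(lengths), "max": max(lengths)}


def compute_stats_with_fallback(values: list[str]) -> dict:
    try:
        return compute_datetime_stats(values)
    except Exception:
        # Mixed or unparseable formats: show string stats instead of nothing.
        return compute_string_stats(values)


clean = ["2024-01-01 10:00:00", "2024-03-05 08:30:00"]
mixed = ["2024-01-01 10:00:00", "01/02/2024"]  # two different formats
clean_stats = compute_stats_with_fallback(clean)
mixed_stats = compute_stats_with_fallback(mixed)
print(clean_stats)
print(mixed_stats)
```

With this shape, a column whose values share one parseable format gets datetime stats, while a column with mixed formats still returns something displayable.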
if isinstance(data[column_name].dtype, pl.String):
    # let polars identify format itself. provide manually in case of error
    try:
        original_timezone = get_timezone(data[column_name][0])
I track the original timezone here because polars converts datetimes to UTC when the format is not provided, and we need to switch it back later.
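The round trip can be sketched with the standard library alone. This is an illustration under assumptions, not the PR's code: `get_original_timezone` is a hypothetical helper (not the `get_timezone` used in the diff), the format string is assumed, and, like the diff, it assumes the first value's offset applies to the whole column.

```python
import datetime
from typing import Optional

FMT = "%Y-%m-%dT%H:%M:%S%z"  # assumed format for this sketch


def get_original_timezone(value: str, fmt: str = FMT) -> Optional[datetime.tzinfo]:
    """Hypothetical helper: read the UTC offset from one sample string."""
    return datetime.datetime.strptime(value, fmt).tzinfo


raw = ["2024-01-01T10:00:00+0200", "2024-01-02T23:30:00+0200"]

# Remember the offset of the first value before any normalization happens.
original_tz = get_original_timezone(raw[0])

# Simulate a parser that normalizes every value to UTC.
as_utc = [
    datetime.datetime.strptime(v, FMT).astimezone(datetime.timezone.utc)
    for v in raw
]

# Switch back to the original offset so the displayed dates match the source.
restored = [d.astimezone(original_tz) for d in as_utc]
print([d.isoformat() for d in restored])
```

If a column mixes several offsets, taking the offset from the first value only would shift the others, so this approach relies on the column being homogeneous.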