feat: resolver for unique values for string dimensions #303
Changes from all commits
70c7c57
26a4f92
d924e31
373786e
1ffc4d6
0abb5a1
d6a67c0
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,4 +1,9 @@ | ||
from dataclasses import dataclass | ||
from functools import cached_property | ||
from itertools import chain | ||
from typing import Any, Callable, List | ||
|
||
import pandas as pd | ||
|
||
from .dimension_data_type import DimensionDataType | ||
from .dimension_type import DimensionType | ||
|
@@ -9,3 +14,14 @@ class Dimension: | |
name: str | ||
data_type: DimensionDataType | ||
type: DimensionType | ||
data: Callable[[], List["pd.Series[Any]"]] | ||
Actually a bit confused about having the data be attached to the
The string values themselves are cached below, so this part is just a pointer. Dimensions are themselves singletons, at least on the conceptual level, so they can cache things just fine. While datasets can come and go, Model and Dimensions are platonic abstractions, so it makes sense (at least to me) to have them hold pointers to "implementations", which are the datasets.
I think where things might get messy here is if we need to delineate between the primary and the reference data (e.g. figure out distinct values) - then having the
||
|
||
@cached_property | ||
def categories(self) -> List[str]: | ||
if self.data_type != DimensionDataType.CATEGORICAL: | ||
return [] | ||
return sorted( | ||
value | ||
for value in set(chain.from_iterable(series.unique() for series in self.data())) | ||
if isinstance(value, str) | ||
) |
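The caching argument in the review thread above can be sketched in isolation. This is a minimal illustration, not the project's code: `DimensionSketch`, `is_categorical`, and `load` are hypothetical stand-ins, and the `data_type` check is simplified to a boolean. It shows that the `data` callable is only a pointer, and that the sorted string values are computed once on first access.

```python
from dataclasses import dataclass
from functools import cached_property
from itertools import chain
from typing import Any, Callable, List

import pandas as pd


@dataclass
class DimensionSketch:  # hypothetical stand-in for the Dimension class in the diff
    name: str
    is_categorical: bool  # assumption: simplified replacement for the data_type check
    data: Callable[[], List["pd.Series[Any]"]]

    @cached_property
    def categories(self) -> List[str]:
        if not self.is_categorical:
            return []
        # Collect distinct values across all series, keep only strings, and sort.
        return sorted(
            value
            for value in set(chain.from_iterable(s.unique() for s in self.data()))
            if isinstance(value, str)
        )


calls: List[int] = []


def load() -> List[pd.Series]:
    # Hypothetical loader; records each call so the caching is observable.
    calls.append(1)
    return [pd.Series(["b", "a", None]), pd.Series(["c", "a"])]


dim = DimensionSketch(name="color", is_categorical=True, data=load)
first = dim.categories   # triggers load() and computes the sorted values
second = dim.categories  # served from the cache; load() is not called again
```

Because `cached_property` stores the result on the instance, a dataset that changes after the first access would not be reflected in `categories`.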
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -77,6 +77,18 @@ def _get_dimensions( | |
name=name, | ||
data_type=self._infer_dimension_data_type(name), | ||
type=dimension_type, | ||
data=( | ||
lambda name: ( | ||
lambda: ( | ||
Use lazy evaluation so it doesn't fail existing unit tests.
||
[primary_dataset.dataframe.loc[:, name]] | ||
+ ( | ||
[reference_dataset.dataframe.loc[:, name]] | ||
if reference_dataset is not None | ||
else [] | ||
) | ||
) | ||
) | ||
)(name), | ||
) | ||
) | ||
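The nested lambdas above do two jobs: the inner lambda defers the dataframe access until the data is needed, and the immediately called outer lambda fixes the value of `name` for that inner lambda. The second part matters if the dimensions are built in a loop over names, which is an assumption here since the diff does not show the enclosing code. A minimal sketch with hypothetical data, using only the standard library:

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for the dataframe columns in the diff.
columns: Dict[str, List[str]] = {"city": ["Oslo"], "country": ["Norway"]}

# Naive version: each lambda looks up `name` only when it is called,
# by which time the loop has finished and `name` holds the last key.
late: List[Callable[[], List[str]]] = []
for name in columns:
    late.append(lambda: columns[name])

# Pattern from the diff: the outer lambda is called immediately, so each
# inner lambda closes over its own `name` parameter.
bound: List[Callable[[], List[str]]] = []
for name in columns:
    bound.append((lambda name: (lambda: columns[name]))(name))

late_results = [f() for f in late]    # both entries refer to "country"
bound_results = [f() for f in bound]  # one entry per column, as intended
```

A default argument (`lambda name=name: columns[name]`) is a common alternative that binds the value in the same way.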
|
||
|
Follows the terminology of the pandas API.
The difference here is that we return only observed values, whereas pandas returns all allowed categories regardless of whether they have any observations.
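The distinction can be seen on a small pandas categorical. This is an illustrative sketch with made-up values: `cat.categories` lists every declared category, while filtering `unique()` for strings, as the PR's `categories` property does, yields only the values that actually occur.

```python
import pandas as pd

# Hypothetical data: "c" is an allowed category but never observed.
series = pd.Series(pd.Categorical(["a", "b", "a"], categories=["a", "b", "c"]))

# pandas terminology: all allowed categories, observed or not.
declared = list(series.cat.categories)

# Approach in this PR: only values that appear in the data.
observed = sorted(v for v in series.unique() if isinstance(v, str))
```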