-
Notifications
You must be signed in to change notification settings - Fork 285
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: resolver for unique values for string dimensions #303
Conversation
@@ -77,6 +77,18 @@ def _get_dimensions( | |||
name=name, | |||
data_type=self._infer_dimension_data_type(name), | |||
type=dimension_type, | |||
data=( | |||
lambda name: ( | |||
lambda: ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use lazy eval so it doesn't fail existing unit tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left some minor thoughts, feel free to come to a good conclusion. Happy to re-visit too.
@@ -9,3 +14,14 @@ class Dimension: | |||
name: str | |||
data_type: DimensionDataType | |||
type: DimensionType | |||
data: Callable[[], List["pd.Series[Any]"]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually a bit confused about having the data be attached to the Dimension
here - feels more like the model would just keep track if the basic structure (e.g. the schema) and we could resolve the values in the resolver itself. That way this data structure can stay a bit more light-weight. It certainly wouldn't be coachable using static_property
but I think it would make the code cleaner and I have a fair amount of control over the caching on the frontend. We can always build a separate caching mechanism for the API itself too if needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The string values themselves are cached below, so this part is just a pointer. Dimensions are singletons themselves at least on the conceptual level, so they can cache things just fine. While datasets can come and go, Model and Dimensions are platonic abstractions, so it makes sense (at least to me) to have them hold pointers to "implementations" which are the datasets.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think where things might get messy here is if we need to delineate between the primary and the reference data (e.g. figure out distinct values) - then having the data
be the merge of the two dataframe columns gets a bit messy here. No need to refactor now but just thinking a bit ahead.
app/schema.graphql
Outdated
""" | ||
Returns unique string values in lexicographical order. Non-string values and missing values are ignored. | ||
""" | ||
uniqueStringValues: [String!]! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💭 I think the API now has two terms being used "categorical" and "string". Let's pick one for consistency as it will inform the domain model and make things a bit more intuitive on the data consumption side.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Categorical" is more generally applicable. Values can be numbers but are still considered categorical (like hurricane categories), and this is in fact how it's understood in pandas (https://pandas.pydata.org/docs/reference/api/pandas.Categorical.html), i.e. a numeric array equipped with a dictionary (it also comes in ordered
and unordered
varieties).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call out, though I think we conflate "categorical" to mean "string" right now. Maybe we should re-visit the dtype
and refactor that to be the underlying pseudo-primative string
or number
so that we could potentially use categorical
in the more semantic sense as you point out.
""" | ||
Returns the categories of a categorical dimension (usually a dimension of string values) as a list of unique string labels sorted in lexicographical order. Missing values are excluded. Non-categorical dimensions return an empty list. | ||
""" | ||
categories: [String!]! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Follows the terminology in pandas api
The difference here is that we only return observed values, while pandas returns all allowed categories regardless whether they have any observations.
resolves #257