Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: resolver for unique values for string dimensions #303

Merged
merged 7 commits into from
Mar 1, 2023

Conversation

RogerHYang
Copy link
Contributor

resolves #257

@RogerHYang RogerHYang marked this pull request as draft February 27, 2023 17:33
@RogerHYang RogerHYang changed the title feat: resolver for unique string values feat: resolver for unique values for string dimensions Feb 27, 2023
@@ -77,6 +77,18 @@ def _get_dimensions(
name=name,
data_type=self._infer_dimension_data_type(name),
type=dimension_type,
data=(
lambda name: (
lambda: (
Copy link
Contributor Author

@RogerHYang RogerHYang Feb 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use lazy eval so it doesn't fail existing unit tests

@RogerHYang RogerHYang marked this pull request as ready for review February 27, 2023 22:10
Copy link
Contributor

@mikeldking mikeldking left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some minor thoughts, feel free to come to a good conclusion. Happy to re-visit too.

src/phoenix/core/dimension.py Outdated Show resolved Hide resolved
src/phoenix/core/dimension.py Outdated Show resolved Hide resolved
@@ -9,3 +14,14 @@ class Dimension:
name: str
data_type: DimensionDataType
type: DimensionType
data: Callable[[], List["pd.Series[Any]"]]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually a bit confused about having the data be attached to the Dimension here - feels more like the model would just keep track if the basic structure (e.g. the schema) and we could resolve the values in the resolver itself. That way this data structure can stay a bit more light-weight. It certainly wouldn't be coachable using static_property but I think it would make the code cleaner and I have a fair amount of control over the caching on the frontend. We can always build a separate caching mechanism for the API itself too if needed.

Copy link
Contributor Author

@RogerHYang RogerHYang Feb 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The string values themselves are cached below, so this part is just a pointer. Dimensions are singletons themselves at least on the conceptual level, so they can cache things just fine. While datasets can come and go, Model and Dimensions are platonic abstractions, so it makes sense (at least to me) to have them hold pointers to "implementations" which are the datasets.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think where things might get messy here is if we need to delineate between the primary and the reference data (e.g. figure out distinct values) - then having the data be the merge of the two dataframe columns gets a bit messy here. No need to refactor now but just thinking a bit ahead.

Comment on lines 47 to 50
"""
Returns unique string values in lexicographical order. Non-string values and missing values are ignored.
"""
uniqueStringValues: [String!]!
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💭 I think the API now has two terms being used "categorical" and "string". Let's pick one for consistency as it will inform the domain model and make things a bit more intuitive on the data consumption side.

Copy link
Contributor Author

@RogerHYang RogerHYang Feb 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Categorical" is more generally applicable. Values can be numbers but are still considered categorical (like hurricane categories), and this is in fact how it's understood in pandas (https://pandas.pydata.org/docs/reference/api/pandas.Categorical.html), i.e. a numeric array equipped with a dictionary (it also comes in ordered and unordered varieties).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call out, though I think we conflate "categorical" to mean "string" right now. Maybe we should re-visit the dtype and refactor that to be the underlying pseudo-primative string or number so that we could potentially use categorical in the more semantic sense as you point out.

"""
Returns the categories of a categorical dimension (usually a dimension of string values) as a list of unique string labels sorted in lexicographical order. Missing values are excluded. Non-categorical dimensions return an empty list.
"""
categories: [String!]!
Copy link
Contributor Author

@RogerHYang RogerHYang Mar 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follows the terminology in pandas api

The difference here is that we only return observed values, while pandas returns all allowed categories regardless whether they have any observations.

@RogerHYang RogerHYang merged commit f24324b into main Mar 1, 2023
@RogerHYang RogerHYang deleted the uniq_str_vals branch March 1, 2023 03:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[gql] Dimension Values resolver
2 participants