feat: resolver for unique values for string dimensions #303

RogerHYang · 2023-02-27T17:05:10Z

resolves #257

RogerHYang · 2023-02-27T22:09:41Z

src/phoenix/core/model.py

@@ -77,6 +77,18 @@ def _get_dimensions(
                        name=name,
                        data_type=self._infer_dimension_data_type(name),
                        type=dimension_type,
+                        data=(
+                            lambda name: (
+                                lambda: (


use lazy eval so it doesn't fail existing unit tests
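For readers unfamiliar with the nested-lambda pattern in the hunk above, here is a minimal sketch of why it is needed. The elided body of the diff is not reproduced; `columns`, `late_bound` and `early_bound` are hypothetical names used only for illustration.

```python
from typing import Callable, Dict, List


def late_bound(columns: Dict[str, List[str]]) -> List[Callable[[], List[str]]]:
    getters = []
    for name in columns:
        # Pitfall: `name` is looked up when the lambda is called, not when it is
        # created, so every getter sees the final value of the loop variable.
        getters.append(lambda: columns[name])
    return getters


def early_bound(columns: Dict[str, List[str]]) -> List[Callable[[], List[str]]]:
    getters = []
    for name in columns:
        # The outer lambda is called immediately, fixing `name` as its argument;
        # the inner lambda stays lazy and only reads the column when invoked.
        getters.append((lambda name: (lambda: columns[name]))(name))
    return getters


# Hypothetical data standing in for dataset columns.
columns = {"city": ["NYC", "SF"], "state": ["NY", "CA"]}
late_results = [get() for get in late_bound(columns)]
early_results = [get() for get in early_bound(columns)]
```

With the late-bound version every getter returns the `state` column; the early-bound version returns each column in turn, while still deferring the lookup until the getter is called.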

mikeldking

Left some minor thoughts, feel free to come to a good conclusion. Happy to re-visit too.

mikeldking · 2023-02-28T03:25:43Z

src/phoenix/core/dimension.py

@@ -9,3 +14,14 @@ class Dimension:
    name: str
    data_type: DimensionDataType
    type: DimensionType
+    data: Callable[[], List["pd.Series[Any]"]]


I'm actually a bit confused about attaching the data to the Dimension here. It feels more like the model should just keep track of the basic structure (e.g. the schema), and we could resolve the values in the resolver itself. That way this data structure can stay a bit more lightweight. It certainly wouldn't be cacheable using static_property, but I think it would make the code cleaner, and I have a fair amount of control over the caching on the frontend. We can always build a separate caching mechanism for the API itself too if needed.

The string values themselves are cached below, so this part is just a pointer. Dimensions are singletons themselves at least on the conceptual level, so they can cache things just fine. While datasets can come and go, Model and Dimensions are platonic abstractions, so it makes sense (at least to me) to have them hold pointers to "implementations" which are the datasets.

I think where things might get messy is if we need to delineate between the primary and the reference data (e.g. to figure out distinct values): having the data be the merge of the two dataframe columns would make that harder. No need to refactor now; just thinking a bit ahead.
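To make the trade-off in this thread concrete, here is a rough sketch, not the code from this PR, of a dimension that holds a lazy pointer to one series per dataset. The `Dimension` below is a simplified stand-in with most fields omitted, and the `primary` and `reference` frames are made-up data.

```python
from dataclasses import dataclass
from typing import Any, Callable, List

import pandas as pd


@dataclass(frozen=True)
class Dimension:
    # Simplified stand-in for the class in the diff; other fields omitted.
    name: str
    data: Callable[[], List["pd.Series[Any]"]]


# Hypothetical primary and reference datasets.
primary = pd.DataFrame({"city": ["NYC", "SF", None]})
reference = pd.DataFrame({"city": ["SF", "LA"]})

dim = Dimension(
    name="city",
    # Nothing is read from the dataframes until `data()` is called.
    data=lambda: [primary["city"], reference["city"]],
)

# Because the series stay separate, distinct values can be taken per dataset...
per_dataset = [sorted(s.dropna().unique()) for s in dim.data()]
# ...or across both, by concatenating only when needed.
combined = sorted(pd.concat(dim.data()).dropna().unique())
```

Keeping a list of series rather than a single merged column leaves room to distinguish primary from reference values later without changing the shape of `data`.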

mikeldking · 2023-02-28T03:33:50Z

app/schema.graphql

+  """
+  Returns unique string values in lexicographical order. Non-string values and missing values are ignored.
+  """
+  uniqueStringValues: [String!]!
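As an illustration of the contract in that docstring (unique, sorted, non-string and missing values ignored), a resolver could reduce to something like the sketch below. `unique_string_values` is a hypothetical helper, not the function added in this PR.

```python
from typing import Any, Iterable, List


def unique_string_values(values: Iterable[Any]) -> List[str]:
    # Keep only genuine str instances; None, NaN and numbers are skipped.
    # sorted() orders strings by Unicode code point, so uppercase letters
    # sort before lowercase ones.
    return sorted({v for v in values if isinstance(v, str)})


# Hypothetical column with duplicates, a missing value, a NaN and a number.
result = unique_string_values(["b", "a", None, 3, float("nan"), "b"])
```

Since the result never contains nulls and is always a list, it fits the `[String!]!` return type.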


💭 I think the API now uses two terms, "categorical" and "string". Let's pick one for consistency, as it will inform the domain model and make things a bit more intuitive on the data consumption side.

"Categorical" is more generally applicable. Values can be numbers but are still considered categorical (like hurricane categories), and this is in fact how it's understood in pandas (https://pandas.pydata.org/docs/reference/api/pandas.Categorical.html), i.e. a numeric array equipped with a dictionary (it also comes in ordered and unordered varieties).
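A small sketch of the pandas behavior described here, using made-up hurricane data: the values are stored as integer codes that index into a dictionary of categories, and the categories themselves can be numeric and ordered.

```python
import pandas as pd

# Hypothetical hurricane categories: numeric values, but categorical in meaning.
hurricanes = pd.Categorical([3, 1, 5, 3], categories=[1, 2, 3, 4, 5], ordered=True)

codes = hurricanes.codes.tolist()            # positions into `categories`
categories = hurricanes.categories.tolist()  # the "dictionary" of allowed values
is_ordered = hurricanes.ordered
```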

Good callout, though I think we currently conflate "categorical" with "string". Maybe we should revisit the dtype and refactor it to be the underlying pseudo-primitive (string or number), so that we could potentially use "categorical" in the more semantic sense, as you point out.

RogerHYang · 2023-03-01T03:28:39Z

app/schema.graphql

+  """
+  Returns the categories of a categorical dimension (usually a dimension of string values) as a list of unique string labels sorted in lexicographical order. Missing values are excluded. Non-categorical dimensions return an empty list.
+  """
+  categories: [String!]!


Follows the terminology in the pandas API.

The difference here is that we only return observed values, while pandas returns all allowed categories regardless of whether they have any observations.
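The distinction can be seen on a small, made-up series. This is only a sketch of the pandas side of the comparison; it does not show how the resolver in this PR computes its result.

```python
import pandas as pd

# Hypothetical categorical column: "c" is an allowed category with no observations.
s = pd.Series(pd.Categorical(["b", "a", None, "b"], categories=["a", "b", "c"]))

# pandas reports every declared category, observed or not.
declared = s.cat.categories.tolist()

# Dropping unused categories leaves only those that actually occur;
# the missing value is not a category, so it never appears.
observed = sorted(s.cat.remove_unused_categories().cat.categories.tolist())
```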

add method for unique string values

70c7c57

RogerHYang marked this pull request as draft February 27, 2023 17:33

convert to lazy eval to pass tests

26a4f92

RogerHYang changed the title ~~feat: resolver for unique string values~~ feat: resolver for unique values for string dimensions Feb 27, 2023

RogerHYang added 2 commits February 27, 2023 14:02

fix lamba closure

d924e31

use list instead

373786e

RogerHYang commented Feb 27, 2023

RogerHYang marked this pull request as ready for review February 27, 2023 22:10

mikeldking approved these changes Feb 28, 2023

RogerHYang added 2 commits February 27, 2023 23:41

rename variables

1ffc4d6

rename resolver to categories

0abb5a1

RogerHYang commented Mar 1, 2023

clarify observed categories

d6a67c0

RogerHYang merged commit f24324b into main Mar 1, 2023

RogerHYang deleted the uniq_str_vals branch March 1, 2023 03:45
