feat: resolver for unique values for string dimensions #303

Merged · 7 commits · Mar 1, 2023
5 changes: 5 additions & 0 deletions app/schema.graphql
@@ -44,6 +44,11 @@ type Dimension implements Node {
dataType: DimensionDataType!
dataQualityMetric(metric: DataQualityMetric!): Float

"""
Returns the observed categories of a categorical dimension (usually a dimension of string values) as a list of unique string labels sorted in lexicographical order. Missing values are excluded. Non-categorical dimensions return an empty list.
"""
categories: [String!]!
RogerHYang (Contributor, Author) commented on Mar 1, 2023:
Follows the terminology in the pandas API.

The difference here is that we only return observed values, while pandas returns all allowed categories regardless of whether they have any observations.
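
For illustration, a minimal pandas sketch of that distinction (toy data, not from the PR): a categorical Series can allow a label that never occurs, while this resolver reports only the observed ones.

import pandas as pd

# A categorical Series whose dtype allows "c", even though "c" is never observed.
s = pd.Series(["a", "b", "a"], dtype=pd.CategoricalDtype(categories=["a", "b", "c"]))

print(list(s.cat.categories))  # ['a', 'b', 'c'] -> all allowed categories (pandas)
print(sorted(s.unique()))      # ['a', 'b']      -> observed values only (the behavior this PR adopts)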


"""
Returns the time series of the specified metric for data within timeRange. Data points are generated starting at the end time and are separated by the sampling interval. Each data point is labeled by the end instant of, and contains data from, its respective evaluation window.
"""
16 changes: 16 additions & 0 deletions src/phoenix/core/dimension.py
@@ -1,4 +1,9 @@
from dataclasses import dataclass
from functools import cached_property
from itertools import chain
from typing import Any, Callable, List

import pandas as pd

from .dimension_data_type import DimensionDataType
from .dimension_type import DimensionType
@@ -9,3 +14,14 @@ class Dimension:
name: str
data_type: DimensionDataType
type: DimensionType
data: Callable[[], List["pd.Series[Any]"]]
Contributor commented:
Actually a bit confused about having the data be attached to the Dimension here - feels more like the model would just keep track of the basic structure (e.g. the schema) and we could resolve the values in the resolver itself. That way this data structure can stay a bit more lightweight. It certainly wouldn't be cacheable using cached_property, but I think it would make the code cleaner, and I have a fair amount of control over the caching on the frontend. We can always build a separate caching mechanism for the API itself too if needed.

RogerHYang (Contributor, Author) commented on Feb 28, 2023:
The string values themselves are cached below, so this part is just a pointer. Dimensions are singletons themselves, at least on the conceptual level, so they can cache things just fine. While datasets can come and go, Model and Dimensions are platonic abstractions, so it makes sense (at least to me) to have them hold pointers to the "implementations", which are the datasets.

Contributor commented:
I think where things might get messy is if we need to delineate between the primary and the reference data (e.g. figure out distinct values for each) - then having the data be the merge of the two dataframe columns gets a bit awkward. No need to refactor now, just thinking a bit ahead.


@cached_property
def categories(self) -> List[str]:
if self.data_type != DimensionDataType.CATEGORICAL:
return []
return sorted(
value
for value in set(chain.from_iterable(series.unique() for series in self.data()))
if isinstance(value, str)
)
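
As a side note, a self-contained sketch (toy columns, not from the PR) of what the categories computation above does with missing and non-string values:

from itertools import chain

import pandas as pd

# Hypothetical primary and reference columns for the same string dimension.
primary = pd.Series(["cat", "dog", None, "cat"])
reference = pd.Series(["dog", "fish", float("nan")])

categories = sorted(
    value
    for value in set(chain.from_iterable(s.unique() for s in [primary, reference]))
    if isinstance(value, str)  # drops None/NaN and any non-string values
)
print(categories)  # ['cat', 'dog', 'fish']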
12 changes: 12 additions & 0 deletions src/phoenix/core/model.py
@@ -77,6 +77,18 @@ def _get_dimensions(
name=name,
data_type=self._infer_dimension_data_type(name),
type=dimension_type,
data=(
lambda name: (
lambda: (
RogerHYang (Contributor, Author) commented on Feb 27, 2023:
use lazy eval so it doesn't fail existing unit tests

[primary_dataset.dataframe.loc[:, name]]
+ (
[reference_dataset.dataframe.loc[:, name]]
if reference_dataset is not None
else []
)
)
)
)(name),
)
)

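A brief aside on the nested lambdas above: the outer, immediately-invoked lambda binds the current name for each dimension, while the inner lambda keeps the column lookup lazy (per the comment above). A minimal sketch of the pitfall that binding avoids, using placeholder names:

# Closures capture variables, not values: without the extra binding every
# getter would see the loop variable's final value.
getters = [lambda: name for name in ["a", "b", "c"]]
print([g() for g in getters])  # ['c', 'c', 'c']

# Immediately invoking an outer lambda (as in model.py above) captures each
# name at definition time, while the inner lambda stays lazy.
getters = [(lambda name: (lambda: name))(name) for name in ["a", "b", "c"]]
print([g() for g in getters])  # ['a', 'b', 'c']
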
15 changes: 14 additions & 1 deletion src/phoenix/server/api/types/Dimension.py
@@ -1,5 +1,5 @@
from datetime import timedelta
from typing import Optional
from typing import List, Optional

import strawberry
from strawberry.types import Info
@@ -41,6 +41,19 @@ async def dataQualityMetric(
return await info.context.loaders.percent_empty.load(dimension_name)
raise NotImplementedError(f"Metric {metric} is not implemented.")

@strawberry.field(
description=(
"Returns the observed categories of a categorical dimension (usually a dimension of"
" string values) as a list of unique string labels sorted in lexicographical order."
" Missing values are excluded. Non-categorical dimensions return an empty list."
)
) # type: ignore # https://github.com/strawberry-graphql/strawberry/issues/1929
def categories(self, info: Info[Context, None]) -> List[str]:
for dim in info.context.model.dimensions:
if dim.name == self.name:
return dim.categories
return []

@strawberry.field(
description=(
"Returns the time series of the specified metric for data within timeRange. Data points"
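
For readers unfamiliar with strawberry, a minimal, self-contained sketch of the field-with-description pattern used in the resolver above; the names and values are placeholders, and Phoenix's Context/model wiring is omitted:

from typing import List

import strawberry


@strawberry.type
class Dimension:
    name: str

    @strawberry.field(description="Unique observed string labels, sorted.")
    def categories(self) -> List[str]:
        # Placeholder values; Phoenix resolves these from the model's dimensions.
        return ["bar", "foo"]


@strawberry.type
class Query:
    @strawberry.field
    def dimension(self) -> Dimension:
        return Dimension(name="example")


schema = strawberry.Schema(query=Query)
result = schema.execute_sync("{ dimension { name categories } }")
print(result.data)  # {'dimension': {'name': 'example', 'categories': ['bar', 'foo']}}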