feat: dataset Prediction Id and Timestamp normalization #166
Conversation
@@ -49,6 +49,16 @@ def __init__(self, errors: Union[ValidationError, List[ValidationError]]):
        self.errors = errors


class InvalidColumnType(ValidationError):
Why does DatasetError only inherit from the BaseException (above) and not ValidationError as well?
Don't remember the rationale, but I think the thought was to have a set of errors that made it clear which was to blame: the dataframe being malformed or the schema being mis-configured. Feel free to fix the inheritance as it makes sense.
src/phoenix/datasets/dataset.py
Outdated
    schema = dataclasses.replace(schema, prediction_id_column_name="prediction_id")
    cols_to_add["prediction_id"] = lambda _: str(uuid.uuid4())
elif is_numeric_dtype(dataframe.dtypes[schema.prediction_id_column_name]):
    dataframe["prediction_id"] = dataframe["prediction_id"].apply(str)
I guess there's a case where if we trust the user input, we have no guarantee of uniqueness. Thoughts on having a column that is not derived from user provided values so we can guarantee uniqueness?
@mikeldking The thought here is that they already have a prediction_id column with a numeric ID that maps rows to something potentially meaningful on their end. This would just cast the values to strings so that they're normalized on our side. Are you suggesting an additional column that would pair with the user-provided prediction_id as well?
The uniqueness issue with user input can still exist even if they pass in ids with the expected "string" type.
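A minimal sketch of the casting described in this thread, assuming a pandas dataframe (the helper name is hypothetical; the actual logic lives inline in src/phoenix/datasets/dataset.py):

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def normalize_prediction_ids(
    dataframe: pd.DataFrame, column: str = "prediction_id"
) -> pd.DataFrame:
    # Cast a numeric id column to strings so downstream code can rely
    # on a single (string) dtype, while preserving the user's values.
    if is_numeric_dtype(dataframe.dtypes[column]):
        dataframe[column] = dataframe[column].apply(str)
    return dataframe


df = normalize_prediction_ids(pd.DataFrame({"prediction_id": [1, 2, 3]}))
```

As the reply notes, casting preserves whatever meaning the user attached to the ids, but does not make them unique.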
src/phoenix/datasets/dataset.py
Outdated
if schema.prediction_id_column_name is None:
    schema = dataclasses.replace(schema, prediction_id_column_name="prediction_id")
    cols_to_add["prediction_id"] = lambda _: str(uuid.uuid4())
Assuming uuid is pretty light weight but offset of the row could be substituted here? Or is the thought to have a globally unique id system such that exports can easily comb through either dataframe and have a guarantee of not mis-exporting?
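For illustration, the uuid approach in the diff and the row-offset alternative raised here could be compared like this (a sketch, not the project's code):

```python
import uuid

import pandas as pd

df = pd.DataFrame({"feature": [0.1, 0.2, 0.3]})

# As in the diff: a fresh uuid per row, unique across all dataframes,
# so exports can be combed without risk of cross-dataframe collisions.
df["prediction_id"] = [str(uuid.uuid4()) for _ in range(len(df))]

# The cheaper alternative: the row offset, which is unique only within
# this one dataframe, so ids from two dataframes could collide.
df["offset_id"] = df.index.astype(str)
```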
src/phoenix/datasets/validation.py
Outdated
return list(general_checks)


def check_column_type(dataframe: DataFrame, schema: Schema) -> List[err.ValidationError]:
Can you add a docstring for IDE support? Not prescriptive on the docs thing, but since this lives outside of datasets as a util, it might be nice to have it slightly more documented.
👍
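The requested docstring might look something like this (a sketch; the wording and the simplified return type are assumptions, not the merged code):

```python
from typing import Any, List


def check_column_type(dataframe: Any, schema: Any) -> List[str]:
    """
    Check that the columns the schema references have the expected
    dtypes in the dataframe (e.g. a string prediction id column, a
    datetime timestamp column).

    Returns a list of error messages, one per mis-typed column;
    the list is empty when all checks pass.
    """
    errors: List[str] = []
    # ... dtype checks elided in this sketch ...
    return errors
```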
src/phoenix/datasets/validation.py
Outdated
return list(general_checks)


def check_column_type(dataframe: DataFrame, schema: Schema) -> List[err.ValidationError]:
    wrong_type_cols = []
Can you type this list? Best to name it according to what it contains, which is actually error strings, not the columns.
👍
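The suggested change, sketched (the new variable name is one possible reading of the suggestion, not the merged code):

```python
from typing import List

# Before: untyped, and named after columns rather than its contents.
wrong_type_cols = []

# After: annotated, and named for what it actually holds.
wrong_type_error_messages: List[str] = []
wrong_type_error_messages.append(
    "prediction_id must be a string column, found int64"
)
```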
Co-authored-by: Mikyo King <mikyo@arize.com>
src/phoenix/datasets/dataset.py
Outdated
if parsed_schema.timestamp_column_name is None:
    now = Timestamp.utcnow()
    parsed_schema = dataclasses.replace(parsed_schema, timestamp_column_name="timestamp")
replace is directly imported from dataclasses.
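A sketch of the timestamp handling shown in the diff above, with a simplified stand-in for the real Schema dataclass (the helper name is hypothetical):

```python
import dataclasses
from typing import Optional

from pandas import DataFrame, Timestamp, to_datetime


@dataclasses.dataclass(frozen=True)
class Schema:
    # Simplified stand-in for phoenix's Schema.
    timestamp_column_name: Optional[str] = None


def normalize_timestamps(dataframe: DataFrame, schema: Schema) -> Schema:
    if schema.timestamp_column_name is None:
        # No timestamp column configured: add one, defaulting every
        # row to the current UTC time.
        schema = dataclasses.replace(schema, timestamp_column_name="timestamp")
        dataframe["timestamp"] = Timestamp.utcnow()
    else:
        # Normalize e.g. epoch-millisecond integers to datetimes.
        name = schema.timestamp_column_name
        dataframe[name] = dataframe[name].apply(lambda x: to_datetime(x, unit="ms"))
    return schema


df = DataFrame({"x": [1, 2]})
schema = normalize_timestamps(df, Schema())
```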
dictionary = self.__dict__
dictionary = {}
Good find.
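The bug behind this "Good find" is that self.__dict__ is a live reference to the instance's attribute dict, so mutating the supposed copy mutates the instance. A minimal repro with a hypothetical class:

```python
class Config:
    def __init__(self):
        self.name = "original"

    def to_dict_buggy(self):
        # Bug: self.__dict__ is a live reference to the instance's
        # attribute dict, so this mutation adds an attribute to self.
        dictionary = self.__dict__
        dictionary["extra"] = True
        return dictionary

    def to_dict_fixed(self):
        # Fix: build a fresh dict instead, leaving self untouched.
        dictionary = {}
        dictionary.update(self.__dict__)
        dictionary["extra"] = True
        return dictionary
```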
src/phoenix/datasets/dataset.py
Outdated
].apply(lambda x: to_datetime(x, unit="ms"))

if parsed_schema.prediction_id_column_name is None:
    parsed_schema = dataclasses.replace(
replace is directly imported from dataclasses.
@mikeldking I think I see why implementing feature discovery and excludes with pure functions would have been nice + made Nate's life easier 😆
tests/datasets/test_dataset.py
Outdated
@property
def num_records(self):
    return self._NUM_RECORDS

@property
def embedding_dimension(self):
    return self._EMBEDDING_DIMENSION


class TestDataset:
separate dataset creation test
Performs normalization of the input dataframe so that we have a standard set of columns and column types in the resulting dataset (dataframe/parquet file). Specifically, this PR addresses the following: