
feat: dataset Prediction Id and Timestamp normalization #166

Merged · 28 commits · Jan 24, 2023

Conversation

nate-mar (Contributor) commented Jan 13, 2023

Performs normalization of the input data frame so that we have a standard set of columns and column types in the resulting dataset (dataframe/parquet file). Specifically, this PR addresses the following (a sketch follows the list):

  • Normalizes Timestamp: adds the column if omitted, and converts it to a datetime if passed in as a numeric type
  • Normalizes Prediction ID: adds the column if omitted, and converts it to a string if passed in as a numeric type
  • Validates timestamp and prediction ID values
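
A minimal sketch of the described behavior, assuming a Schema dataclass with timestamp_column_name and prediction_id_column_name fields (a hypothetical helper, not the PR's exact code; the real implementation lives in src/phoenix/datasets/dataset.py):

```python
import uuid
from dataclasses import replace
from typing import Tuple

import pandas as pd
from pandas.api.types import is_numeric_dtype


def normalize(dataframe: pd.DataFrame, schema: "Schema") -> Tuple[pd.DataFrame, "Schema"]:
    """Sketch: return a copy of the dataframe with normalized timestamp and prediction ID columns."""
    dataframe = dataframe.copy()

    # Timestamp: add the column if omitted; convert numeric values (epoch ms) to datetimes.
    if schema.timestamp_column_name is None:
        schema = replace(schema, timestamp_column_name="timestamp")
        dataframe["timestamp"] = pd.Timestamp.utcnow()
    elif is_numeric_dtype(dataframe[schema.timestamp_column_name]):
        dataframe[schema.timestamp_column_name] = pd.to_datetime(
            dataframe[schema.timestamp_column_name], unit="ms"
        )

    # Prediction ID: add the column if omitted; cast numeric IDs to strings.
    if schema.prediction_id_column_name is None:
        schema = replace(schema, prediction_id_column_name="prediction_id")
        dataframe["prediction_id"] = [str(uuid.uuid4()) for _ in range(len(dataframe))]
    elif is_numeric_dtype(dataframe[schema.prediction_id_column_name]):
        dataframe[schema.prediction_id_column_name] = dataframe[
            schema.prediction_id_column_name
        ].apply(str)

    return dataframe, schema
```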

@nate-mar changed the title from "Dataset Prediction Id and Timestamp normalization" to "feat: dataset Prediction Id and Timestamp normalization" on Jan 13, 2023
@@ -49,6 +49,16 @@ def __init__(self, errors: Union[ValidationError, List[ValidationError]]):
self.errors = errors


class InvalidColumnType(ValidationError):
Contributor Author

Why does DatasetError only inherit from BaseException (above) and not ValidationError as well?

Contributor

Don't remember the rationale, but I think the thought was to have a set of errors that made it clear which was to blame: the dataframe being malformed or the schema being misconfigured. Feel free to fix the inheritance as it makes sense.

@nate-mar marked this pull request as ready for review on January 18, 2023 14:00
schema = dataclasses.replace(schema, prediction_id_column_name="prediction_id")
cols_to_add["prediction_id"] = lambda _: str(uuid.uuid4())
elif is_numeric_dtype(dataframe.dtypes[schema.prediction_id_column_name]):
dataframe["prediction_id"] = dataframe["prediction_id"].apply(str)
Contributor

I guess there's a case where, if we trust the user input, we have no guarantee of uniqueness. Thoughts on having a column that is not derived from user-provided values so we can guarantee uniqueness?

Contributor Author

@mikeldking The thought here is that they already have a prediction_id column with a numeric ID that maps rows to something potentially meaningful on their end. This would just cast them to strings so that they're normalized on our side. Are you suggesting an additional column that would pair with the user-provided prediction_id as well?

The uniqueness issue with user input can still be there even if they pass in IDs with the expected "string" type.
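
For reference, a sketch of that cast written against the schema-configured column name rather than a hardcoded "prediction_id" (an assumption about the intended shape, not the PR's final code):

```python
from pandas.api.types import is_numeric_dtype

id_col = schema.prediction_id_column_name
if is_numeric_dtype(dataframe.dtypes[id_col]):
    # Cast numeric IDs to strings so downstream code sees a single type.
    # Note: this normalizes the type but cannot make duplicate IDs unique.
    dataframe[id_col] = dataframe[id_col].apply(str)
```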


if schema.prediction_id_column_name is None:
schema = dataclasses.replace(schema, prediction_id_column_name="prediction_id")
cols_to_add["prediction_id"] = lambda _: str(uuid.uuid4())
Contributor

Assuming uuid is pretty lightweight, but could the offset of the row be substituted here? Or is the thought to have a globally unique ID system so that exports can easily comb through either dataframe with a guarantee of not mis-exporting?
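
Both options mentioned here, sketched side by side (hypothetical snippets; the PR takes the UUID route):

```python
import uuid

# Option A (what the PR does): globally unique IDs, safe to correlate across
# dataframes and exports, at the cost of generating a UUID per row.
dataframe["prediction_id"] = [str(uuid.uuid4()) for _ in range(len(dataframe))]

# Option B (row offset): cheaper, but only unique within a single dataframe,
# so two exported datasets could collide on the same ID.
dataframe["prediction_id"] = [str(i) for i in range(len(dataframe))]
```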

src/phoenix/datasets/validation.py: two resolved (outdated) review threads
return list(general_checks)


def check_column_type(dataframe: DataFrame, schema: Schema) -> List[err.ValidationError]:
Contributor

Can you add a docstring for IDE support? Not prescriptive on the docs thing, but since this is outside of datasets as a util, it might be nice to have it slightly more documented.

Contributor Author

👍
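
One possible shape for the requested docstring (wording is illustrative; the exact checks are defined in validation.py):

```python
def check_column_type(dataframe: DataFrame, schema: Schema) -> List[err.ValidationError]:
    """Validates that schema-declared columns have acceptable pandas dtypes.

    For example, the prediction ID column must be a string or numeric column,
    and the timestamp column must be numeric or datetime. Returns one
    ValidationError per mis-typed column, or an empty list if all pass.
    """
    ...
```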

return list(general_checks)


def check_column_type(dataframe: DataFrame, schema: Schema) -> List[err.ValidationError]:
wrong_type_cols = []
Contributor

Can you type this list? Best to name it according to what it contains, which is actually error strings, not the columns.

Contributor Author

👍
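
The requested change might look like this (the name is a suggestion; assumes List is already imported from typing, as elsewhere in the module):

```python
# Typed, and named for what it accumulates: error messages, not columns.
wrong_type_error_messages: List[str] = []
```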


if parsed_schema.timestamp_column_name is None:
now = Timestamp.utcnow()
parsed_schema = dataclasses.replace(parsed_schema, timestamp_column_name="timestamp")
Contributor

replace is directly imported from dataclasses, so this can be written as replace(...) rather than dataclasses.replace(...).
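
For example:

```python
from dataclasses import replace

# With the direct import, dataclasses.replace(...) shortens to:
parsed_schema = replace(parsed_schema, timestamp_column_name="timestamp")
```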

dictionary = self.__dict__
dictionary = {}
Contributor

Good find.

].apply(lambda x: to_datetime(x, unit="ms"))

if parsed_schema.prediction_id_column_name is None:
parsed_schema = dataclasses.replace(
Contributor

Same as above: replace is directly imported from dataclasses, so dataclasses.replace(...) can be shortened to replace(...).

Resolved (outdated) review threads: src/phoenix/datasets/dataset.py (1), tests/datasets/test_dataset.py (6)
@axiomofjoy
Contributor

@mikeldking I think I see why implementing feature discovery and excludes with pure functions would have been nice + made Nate's life easier 😆

@property
def num_records(self):
return self._NUM_RECORDS

@property
def embedding_dimension(self):
return self._EMBEDDING_DIMENSION


class TestDataset:
Contributor Author

Separated out the dataset creation test.

Resolved (outdated) review threads: src/phoenix/datasets/errors.py (1), tests/datasets/test_dataset.py (7)
Successfully merging this pull request may close these issues.

Datasets predictionId and timestamp normalization in parquet