
feat: dataset Prediction Id and Timestamp normalization #166

Merged · 28 commits · Jan 24, 2023

Conversation

nate-mar (Contributor) commented Jan 13, 2023

Performs normalization of the input data frame so that we have a standard set of columns and column types in the resulting dataset (dataframe/parquet file). Specifically, this PR addresses the following (a sketch follows the list):

  • Normalizes Timestamp: adds the column if omitted, and converts it to a datetime if passed in as a numeric type
  • Normalizes Prediction ID: adds the column if omitted, and converts it to a string if passed in as a numeric type
  • Validates timestamp and prediction ID values
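
A minimal sketch of the described behavior, assuming a Schema dataclass with timestamp_column_name and prediction_id_column_name fields (a hypothetical helper, not the PR's exact code; the real implementation lives in src/phoenix/datasets/dataset.py):

```python
import uuid
from dataclasses import replace
from typing import Tuple

import pandas as pd
from pandas.api.types import is_numeric_dtype


def normalize(dataframe: pd.DataFrame, schema: "Schema") -> Tuple[pd.DataFrame, "Schema"]:
    """Sketch: return a copy of the dataframe with normalized timestamp and prediction ID columns."""
    dataframe = dataframe.copy()

    # Timestamp: add the column if omitted; convert numeric values (epoch ms) to datetimes.
    if schema.timestamp_column_name is None:
        schema = replace(schema, timestamp_column_name="timestamp")
        dataframe["timestamp"] = pd.Timestamp.utcnow()
    elif is_numeric_dtype(dataframe[schema.timestamp_column_name]):
        dataframe[schema.timestamp_column_name] = pd.to_datetime(
            dataframe[schema.timestamp_column_name], unit="ms"
        )

    # Prediction ID: add the column if omitted; cast numeric IDs to strings.
    if schema.prediction_id_column_name is None:
        schema = replace(schema, prediction_id_column_name="prediction_id")
        dataframe["prediction_id"] = [str(uuid.uuid4()) for _ in range(len(dataframe))]
    elif is_numeric_dtype(dataframe[schema.prediction_id_column_name]):
        dataframe[schema.prediction_id_column_name] = dataframe[
            schema.prediction_id_column_name
        ].apply(str)

    return dataframe, schema
```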

@nate-mar changed the title from "Dataset Prediction Id and Timestamp normalization" to "feat: dataset Prediction Id and Timestamp normalization" on Jan 13, 2023
@@ -49,6 +49,16 @@ def __init__(self, errors: Union[ValidationError, List[ValidationError]]):
self.errors = errors


class InvalidColumnType(ValidationError):
Contributor Author

Why does DatasetError only inherit from BaseException (above) and not ValidationError as well?

Contributor

Don't remember the rationale, but I think the thought was to have a set of errors that made it clear which was to blame: the dataframe being malformed or the schema being misconfigured. Feel free to fix the inheritance as it makes sense.

@nate-mar marked this pull request as ready for review on January 18, 2023 14:00
schema = dataclasses.replace(schema, prediction_id_column_name="prediction_id")
cols_to_add["prediction_id"] = lambda _: str(uuid.uuid4())
elif is_numeric_dtype(dataframe.dtypes[schema.prediction_id_column_name]):
dataframe["prediction_id"] = dataframe["prediction_id"].apply(str)
Contributor

I guess there's a case where, if we trust the user input, we have no guarantee of uniqueness. Thoughts on having a column that is not derived from user-provided values so we can guarantee uniqueness?

Contributor Author

@mikeldking The thought here is that they already have a prediction_id column with a numeric ID that maps rows to something potentially meaningful on their end. This would just cast them to strings so that they're normalized on our side. Are you suggesting an additional column that would pair with the user-provided prediction_id as well?

The uniqueness issue with user input can still be there even if they pass in IDs with the expected "string" type.
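
For reference, a sketch of that cast written against the schema-configured column name rather than a hardcoded "prediction_id" (an assumption about the intended shape, not the PR's final code):

```python
from pandas.api.types import is_numeric_dtype

id_col = schema.prediction_id_column_name
if is_numeric_dtype(dataframe.dtypes[id_col]):
    # Cast numeric IDs to strings so downstream code sees a single type.
    # Note: this normalizes the type but cannot make duplicate IDs unique.
    dataframe[id_col] = dataframe[id_col].apply(str)
```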


if schema.prediction_id_column_name is None:
schema = dataclasses.replace(schema, prediction_id_column_name="prediction_id")
cols_to_add["prediction_id"] = lambda _: str(uuid.uuid4())
Contributor

Assuming uuid is pretty lightweight, but could the offset of the row be substituted here? Or is the thought to have a globally unique ID system so that exports can easily comb through either dataframe with a guarantee of not mis-exporting?
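
Both options mentioned here, sketched side by side (hypothetical snippets; the PR takes the UUID route):

```python
import uuid

# Option A (what the PR does): globally unique IDs, safe to correlate across
# dataframes and exports, at the cost of generating a UUID per row.
dataframe["prediction_id"] = [str(uuid.uuid4()) for _ in range(len(dataframe))]

# Option B (row offset): cheaper, but only unique within a single dataframe,
# so two exported datasets could collide on the same ID.
dataframe["prediction_id"] = [str(i) for i in range(len(dataframe))]
```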

src/phoenix/datasets/validation.py: two resolved (outdated) review threads
return list(general_checks)


def check_column_type(dataframe: DataFrame, schema: Schema) -> List[err.ValidationError]:
Contributor

Can you add a docstring for IDE support? Not prescriptive on the docs thing, but since this is outside of datasets as a util, it might be nice to have it slightly more documented.

Contributor Author

👍
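
One possible shape for the requested docstring (wording is illustrative; the exact checks are defined in validation.py):

```python
def check_column_type(dataframe: DataFrame, schema: Schema) -> List[err.ValidationError]:
    """Validates that schema-declared columns have acceptable pandas dtypes.

    For example, the prediction ID column must be a string or numeric column,
    and the timestamp column must be numeric or datetime. Returns one
    ValidationError per mis-typed column, or an empty list if all pass.
    """
    ...
```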

return list(general_checks)


def check_column_type(dataframe: DataFrame, schema: Schema) -> List[err.ValidationError]:
wrong_type_cols = []
Contributor

Can you type this list? Best to name it according to what it contains, which is actually error strings, not the columns.

Contributor Author

👍
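
The requested change might look like this (the name is a suggestion; assumes List is already imported from typing, as elsewhere in the module):

```python
# Typed, and named for what it accumulates: error messages, not columns.
wrong_type_error_messages: List[str] = []
```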


if parsed_schema.timestamp_column_name is None:
now = Timestamp.utcnow()
parsed_schema = dataclasses.replace(parsed_schema, timestamp_column_name="timestamp")
Contributor

replace is directly imported from dataclasses, so this can be written as replace(...) rather than dataclasses.replace(...).
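
For example:

```python
from dataclasses import replace

# With the direct import, dataclasses.replace(...) shortens to:
parsed_schema = replace(parsed_schema, timestamp_column_name="timestamp")
```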

dictionary = self.__dict__
dictionary = {}
Contributor

Good find.

].apply(lambda x: to_datetime(x, unit="ms"))

if parsed_schema.prediction_id_column_name is None:
parsed_schema = dataclasses.replace(
Contributor

Same as above: replace is directly imported from dataclasses, so dataclasses.replace(...) can be shortened to replace(...).

Resolved (outdated) review threads: src/phoenix/datasets/dataset.py (1), tests/datasets/test_dataset.py (6)
@axiomofjoy
Contributor

@mikeldking I think I see why implementing feature discovery and excludes with pure functions would have been nice + made Nate's life easier 😆

@property
def num_records(self):
return self._NUM_RECORDS

@property
def embedding_dimension(self):
return self._EMBEDDING_DIMENSION


class TestDataset:
Contributor Author

Separated out the dataset creation test.

Resolved (outdated) review threads: src/phoenix/datasets/errors.py (1), tests/datasets/test_dataset.py (7)
Successfully merging this pull request may close these issues.

Datasets predictionId and timestamp normalization in parquet