
ColumnNameLabeler Setup #635

Merged
merged 26 commits into capitalone:main on Sep 20, 2022

Conversation

Contributor

@taylorfturner taylorfturner commented Sep 16, 2022

  • added resource folder for the new model
  • added new test suite for loading_with_components
  • added label_mapping for the ColumnNameModel as well

@taylorfturner taylorfturner added Work In Progress Solution is being developed High Priority Dramatic improvement, inaccurate calculation(s) or bug / feature making the library unusable New Feature A feature addition not currently in the library labels Sep 16, 2022
@taylorfturner taylorfturner self-assigned this Sep 16, 2022
@JGSweets JGSweets enabled auto-merge (squash) September 19, 2022 16:23
Comment on lines 42 to 43
label_mapping = {"not": "implemented"}
self.set_label_mapping(label_mapping)
Contributor Author

@taylorfturner taylorfturner Sep 19, 2022

Feedback really wanted here: I don't like this; it is very spaghetti code just to force the labeler's load_with_components to run. The API on the labeler requires a label_mapping; however, this model doesn't currently use a label mapping. We could refactor the code to ultimately use one, but at the POC / MVP stage neither the data nor the model is really set up to use this attribute of the API.

- if data[iter_value][0] > self._parameters["positive_threshold_config"]:
+ if labels[iter_value][0] > self._parameters["positive_threshold_config"]:
Contributor Author

Typo fix: `labels` is the parameter that represents the output from the model component.

results[iter_value] = {}
try:
results[iter_value]["pred"] = self._parameters[
"true_positive_dict"
- ][data[iter_value][1]]["label"]
+ ][labels[iter_value][1]]["label"]
Contributor Author

Typo fix, same as above: `labels` is the parameter that represents the output from the model component.

@@ -0,0 +1,69 @@
import os
Contributor Author

Brand new file for testing the preprocessor, model, and postprocessor all together.
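For context, a minimal sketch of what such a combined test can look like, assuming the component classes and the load_with_components entry point named in this PR; the import paths and parameter values are assumptions pieced together from elsewhere in this thread, not the actual test file:

import unittest

from dataprofiler.labelers.base_data_labeler import BaseDataLabeler
from dataprofiler.labelers.column_name_model import ColumnNameModel
from dataprofiler.labelers.data_processing import (
    ColumnNameModelPostprocessor,
    DirectPassPreprocessor,
)


class TestLoadingWithComponents(unittest.TestCase):
    def test_load_with_components(self):
        parameters = {
            "true_positive_dict": [
                {"attribute": "ssn", "label": "ssn"},
                {"attribute": "suffix", "label": "name"},
                {"attribute": "my_home_address", "label": "address"},
            ],
            "false_positive_dict": [
                {"attribute": "contract_number", "label": "ssn"},
                {"attribute": "role", "label": "name"},
                {"attribute": "send_address", "label": "address"},
            ],
            "negative_threshold_config": 50,
            "positive_threshold_config": 85,
            "include_label": True,
        }
        preprocessor = DirectPassPreprocessor()
        model = ColumnNameModel(
            label_mapping={"ssn": 1, "name": 2, "address": 3},
            parameters=parameters,
        )
        postprocessor = ColumnNameModelPostprocessor()

        # wire the three components together through the labeler API
        labeler = BaseDataLabeler.load_with_components(
            preprocessor=preprocessor, model=model, postprocessor=postprocessor
        )
        self.assertIsInstance(labeler.model, ColumnNameModel)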

Comment on lines 44 to 49
logger.info(
    "For MVP implementation, the `ColumnNameModel` "
    "does not implement the `label_mapping` utility. "
    "'prediction -> label' mapping is handled in the "
    "post-processor using the data index / row value."
)
Contributor Author

RE: comment above... adding a logger.info call so the user is aware of the label_mapping situation for the MVP implementation.

Comment on lines 229 to 233
  if show_confidences:
      raise NotImplementedError(
          """`show_confidences` parameter is disabled
-         for Proof of Concept implementation. Confidence
-         values are enabled by default."""
+         for MVP implementation. Note: Confidence
+         values are returned by default."""
Contributor Author

My thought is we look at this in a re-work of the pipeline. It will require some thought about how the data is passed into the labeler flow, and a reformatting of the true_positive_dict and false_positive_dict parameters.

@taylorfturner taylorfturner removed the Work In Progress Solution is being developed label Sep 19, 2022
Comment on lines 176 to 188
self.assertEqual(
    '{"true_positive_dict": [{"attribute": "ssn", "label": "ssn"}, '
    '{"attribute": "suffix", "label": "name"}, {"attribute": "my_home_address", '
    '"label": "address"}], "false_positive_dict": [{"attribute": '
    '"contract_number", "label": "ssn"}, {"attribute": "role", '
    '"label": "name"}, {"attribute": "send_address", "label": "address"}], '
    '"negative_threshold_config": 50, "include_label": true}{"true_positive_dict": '
    '[{"attribute": "ssn", "label": "ssn"}, {"attribute": "suffix", '
    '"label": "name"}, {"attribute": "my_home_address", "label": "address"}], '
    '"false_positive_dict": [{"attribute": "contract_number", "label": "ssn"}, '
    '{"attribute": "role", "label": "name"}, {"attribute": "send_address", "label": '
    '"address"}], "negative_threshold_config": 50, "include_label": true}',
    mock_file.getvalue(),
Contributor Author

Test load of the model (label mapping and model parameters).

@@ -0,0 +1 @@
{}
Contributor Author

DirectPassPreprocessor

micdavis previously approved these changes Sep 19, 2022
Comment on lines 3020 to 3021
with self.assertRaises(TypeError):
    process_output = processor.process(data)
Contributor Author

This raises a TypeError when testing just the post processor because the process() label parameter is None and NoneType is not subscriptable. We could do more elegant error handling here.
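For illustration, a minimal standalone reproduction of that failure mode (not the PR's code):

# `labels` defaults to None when only the post processor is exercised,
# and indexing into None raises the TypeError the test expects
labels = None
try:
    labels[0]
except TypeError as exc:
    print(exc)  # 'NoneType' object is not subscriptable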

Contributor Author

Outdated.

Contributor Author

@taylorfturner taylorfturner left a comment

Would like to have this merged, but it's not 100% following the API paradigm of the models / labelers, in that this model and post processor are not fully interchangeable components.

Comment on lines 221 to 226
  if show_confidences:
-     raise NotImplementedError(
+     raise Warning(
          """`show_confidences` parameter is disabled
-         for Proof of Concept implementation. Confidence
-         values are enabled by default."""
+         for MVP implementation. Due to the requirement
+         of having the data point in the post processor.
+         Note: Confidence values are returned by default."""
Contributor Author

Disabled because confidences must currently be in the output from the model for the post processing. Ideally we move the filtering / labeling around so that this API works as intended.

For now, the confidences will show by default.

Contributor

Should use a logging warning instead of raise Warning.
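A minimal sketch of that suggestion (illustrative only; the logger setup and function name are assumptions):

import logging

logger = logging.getLogger(__name__)


def check_show_confidences(show_confidences: bool) -> None:
    """Warn without interrupting execution, instead of `raise Warning(...)`."""
    if show_confidences:
        logger.warning(
            "`show_confidences` is disabled for the MVP implementation; "
            "confidence values are returned by default."
        )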

Contributor

Should we still add confidences to the output anyway, since the output should have the format dict(pred=..., conf=...)?

Contributor

I think all other models have the same output, no?

Contributor

or dict(pred=...) if show_confidences=False

Contributor Author

or dict(pred=...) if show_confidences=False

This is a good idea to make sure things follow the same paradigm; however, as soon as we do this, the filtering in the post processor would not work because there is no similarity score on which to filter.

Contributor Author

I think all other models have the same output, no?

Yes, they do; that is a concern / hesitation with this labeler. However, where I'm at right now is to make sure the functionalities that don't yet follow the same paradigm are called out in logging, and then we will have a follow-up PR to finalize consistency between this model and the rest of the models already part of the package / model zoo.

Contributor Author

Updated to allow for use of the show_confidences parameter and to return values in dict(pred=np.array(), conf=np.array())
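A hypothetical usage sketch of that change; `model` stands in for a ColumnNameModel constructed as earlier in this thread, and the exact predict() signature is assumed:

import numpy as np

# hypothetical column names to label
data = np.array(["ssn", "suffix", "my_home_address"])

# with confidences: both keys are present
output = model.predict(data, show_confidences=True)
predictions, confidences = output["pred"], output["conf"]

# without confidences: only the predictions key, matching the other models
output = model.predict(data, show_confidences=False)
assert "conf" not in output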

@taylorfturner
Contributor Author

Maybe hold off... I'll see what I can tackle tomorrow AM

@@ -243,7 +245,12 @@ def load_from_disk(cls, dirpath):
        with open(model_param_dirpath, "r") as fp:
            parameters = json.load(fp)

        loaded_model = cls(parameters)
        # load label_mapping
        labels_dirpath = os.path.join(dirpath, "label_mapping.json")
Contributor

nice

Comment on lines 82 to 106
model = ColumnNameModel(parameters=mock_model_parameters)
cls.parameters = {
    "true_positive_dict": [
        {"attribute": "ssn", "label": "ssn"},
        {"attribute": "suffix", "label": "name"},
        {"attribute": "my_home_address", "label": "address"},
    ],
    "false_positive_dict": [
        {
            "attribute": "contract_number",
            "label": "ssn",
        },
        {
            "attribute": "role",
            "label": "name",
        },
        {
            "attribute": "send_address",
            "label": "address",
        },
    ],
    "negative_threshold_config": 50,
    "include_label": True,
}

cls.test_label_mapping = {"ssn": 1, "name": 2, "address": 3}
Contributor

Any reason this over:
model = ColumnNameModel(label_mapping=mock_label_parameters, parameters=mock_model_parameters)
?

@JGSweets
Contributor

I think we are also missing a check in validate_parameters for the model to compare the pos labels against label_mapping, to ensure the pos labels are a subset of label_mapping.

@@ -0,0 +1 @@
{"model": {"class": "ColumnNameModel"}, "preprocessor": {"class": "DirectPassPreprocessor"}, "postprocessor": {"class": "ColumnNamePostprocessor"}}
Contributor

these files look good

Contributor Author

dope

Contributor Author

Typo in here, actually, that I found when the load_from_library test in test_integration was added.

preprocessor=preprocessor, model=model, postprocessor=postprocessor
)

def test_default_model(self):
Contributor

Do we also need a load_from_library test? To ensure the resources work.

Contributor Author

Fair point, yeah; we should test load_from_library for the model for sure.

Contributor Author

Added, and it actually caught a typo in the resources too -- 👍
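A hedged sketch of what that load_from_library test can look like; the labeler name "column_name_labeler" and the specific assertions are assumptions:

import unittest

from dataprofiler.labelers.base_data_labeler import BaseDataLabeler


class TestLoadFromLibrary(unittest.TestCase):
    def test_load_from_library(self):
        # loads the labeler's components from the packaged resources folder
        labeler = BaseDataLabeler.load_from_library("column_name_labeler")
        self.assertIsNotNone(labeler.model)
        self.assertIsNotNone(labeler.preprocessor)
        self.assertIsNotNone(labeler.postprocessor)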

Comment on lines -42 to 45
- # initialize class
+ # validate and set parameters
  self.set_label_mapping(label_mapping)
  self._validate_parameters(parameters)
  self._parameters = parameters
Contributor

Instead of these, could call the super?

BaseModel.__init__(self, label_mapping, parameters)

Contributor

Maybe we can't because regex didn't? Or an oversight.

Contributor Author

I'll test

Contributor Author

I'm seeing some of the other models use BaseModel.__init__, but when I do that or use super, the unit tests are failing... I'll troubleshoot a bit more though.

Contributor Author

Not seeing success with doing BaseModel.__init__(self, label_mapping, parameters) ... maybe I'm doing something wrong, though

Contributor

curious, what were the errors?

@taylorfturner
Contributor Author

I think we are also missing a check in validate_parameters for the model to compare the pos labels against label_mapping, to ensure the pos labels are a subset of label_mapping

Great call. Added a check in the _validate_parameters of column_name_model.py to validate that the true_positive_unique_labels are a subset of the label_mapping values, otherwise raising an error. Pushed, with a test case added to the test_column_name_model.py test suite, @JGSweets

Contributor Author

@taylorfturner taylorfturner left a comment

major overhaul overnight

Comment on lines +75 to +88
if parameters["true_positive_dict"]:
    label_map_dict_keys = set(self.label_mapping.keys())
    true_positive_unique_labels = set(
        parameters["true_positive_dict"][0].values()
    )

    # error if the true positive labels are not a subset
    # (i.e., less than or equal to) of the label mapping keys
    if true_positive_unique_labels > label_map_dict_keys:
        errors.append(
            """`true_positive_dict` must be a subset
            of the `label_mapping` values()"""
        )

Contributor Author

Checking that the true_positive_dict label values together form a subset of (less than or equal to) the label_mapping.
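For reference, a minimal illustration of the set semantics in play (plain Python, not the PR code): `a > b` is True only when `a` is a proper superset of `b`, while `a <= b` tests the subset relation described above.

label_map_keys = {"ssn", "name", "address"}

# a valid configuration: every true-positive label is in the mapping
assert {"ssn", "name"} <= label_map_keys

# an unknown label breaks the subset relation...
assert not {"ssn", "phone"} <= label_map_keys
# ...yet it is also not a proper superset of the mapping keys,
# so a `>` check alone would not catch this particular case
assert not {"ssn", "phone"} > label_map_keys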

Contributor

Ideally we would say what is not supposed to be there, but that can always be done later.

Comment on lines +129 to +134
elif param == "positive_threshold_config" and (
    value is None or not isinstance(value, int)
):
    errors.append(
        "`{}` is a required parameter that must be an int.".format(param)
    )
Contributor Author

moving this back to column_name_model.py to allow for show_confidences in the ColumnNameDataLabeler

Comment on lines +244 to +260
predictions = np.array([])
confidences = np.array([])

# `data` at this point is either filtered or not filtered
# list of column names on which we are predicting
for iter_value, value in enumerate(data):

    if output[iter_value][0] > self._parameters["positive_threshold_config"]:
        predictions = np.append(
            predictions,
            self._parameters["true_positive_dict"][output[iter_value][1]][
                "label"
            ],
        )

        if show_confidences:
            confidences = np.append(confidences, output[iter_value][0])
Contributor Author

Returning in the same way as the other models: dict(pred=np.array(), conf=np.array())

Comment on lines +265 to +267
if show_confidences:
    return {"pred": predictions, "conf": confidences}
return {"pred": predictions}
Contributor Author

implementing same output as other data labelers / models

@@ -2086,15 +2086,9 @@ class ColumnNameModelPostprocessor(
):
    """Subclass of BaseDataPostprocessor for postprocessing regex data."""

    def __init__(self, true_positive_dict=None, positive_threshold_config=None):
Contributor Author

Moving this from the post processor to the model file to allow for show_confidences parameter usage and similar output.

NOTE: this basically becomes a DirectPass post processor at this point.
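A minimal sketch of what a direct-pass postprocessor reduces to (an assumption based on the NOTE above; the process() signature is modeled on the other postprocessors):

class DirectPassPostprocessorSketch:
    """Illustrative only: passes the model results through unchanged."""

    def process(self, data, results, label_mapping):
        # the model now handles thresholding and label assignment itself,
        # so there is no filtering or relabeling left to do here
        return results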

@@ -0,0 +1 @@
{"true_positive_dict": [{"attribute": "ssn", "label": "ssn"}, {"attribute": "suffix", "label": "name"}, {"attribute": "my_home_address", "label": "address"}], "false_positive_dict": [{"attribute": "contract_number", "label": "ssn"}, {"attribute": "role", "label": "name"}, {"attribute": "send_address", "label": "address"}], "negative_threshold_config": 50, "positive_threshold_config": 85, "include_label": true}
Contributor Author

positive_threshold_config back in here

@@ -0,0 +1 @@
{}
Contributor Author

no params here anymore

@JGSweets JGSweets merged commit 62294b6 into capitalone:main Sep 20, 2022
@taylorfturner taylorfturner deleted the new_model/labeler branch October 3, 2022 15:38