
ColumnNameLabeler Setup #635

Merged
merged 26 commits into capitalone:main on Sep 20, 2022

Conversation

Contributor

@taylorfturner taylorfturner commented Sep 16, 2022

  • added resource folder for the new model
  • added new test suite for loading_with_components
  • added label_mapping for the ColumnNameModel as well

@taylorfturner taylorfturner added Work In Progress Solution is being developed High Priority Dramatic improvement, inaccurate calculation(s) or bug / feature making the library unusable New Feature A feature addition not currently in the library labels Sep 16, 2022
@taylorfturner taylorfturner self-assigned this Sep 16, 2022
@JGSweets JGSweets enabled auto-merge (squash) September 19, 2022 16:23
Comment on lines 42 to 43
label_mapping = {"not": "implemented"}
self.set_label_mapping(label_mapping)
Contributor Author

@taylorfturner taylorfturner Sep 19, 2022

Feedback really wanted here: I don't like this; it is very spaghetti code just to force the labeler's load_with_components to run. The API on the labeler requires a label_mapping; however, this model doesn't currently use a label mapping. We could refactor the code to ultimately use one, but at the POC / MVP stage neither the data nor the model is really set up to use this attribute of the API.

- if data[iter_value][0] > self._parameters["positive_threshold_config"]:
+ if labels[iter_value][0] > self._parameters["positive_threshold_config"]:
Contributor Author

Typo fix: `labels` is the parameter that represents the output from the model component.

results[iter_value] = {}
try:
results[iter_value]["pred"] = self._parameters[
"true_positive_dict"
- ][data[iter_value][1]]["label"]
+ ][labels[iter_value][1]]["label"]
Contributor Author

Typo fix, same as above: `labels` is the parameter that represents the output from the model component.

@@ -0,0 +1,69 @@
import os
Contributor Author

Brand new file for testing the preprocessor, model, and postprocessor all together.
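For context, a minimal sketch of what such a combined test can look like, assuming the component classes and the load_with_components entry point named in this PR; the import paths and parameter values are assumptions pieced together from elsewhere in this thread, not the actual test file:

import unittest

from dataprofiler.labelers.base_data_labeler import BaseDataLabeler
from dataprofiler.labelers.column_name_model import ColumnNameModel
from dataprofiler.labelers.data_processing import (
    ColumnNameModelPostprocessor,
    DirectPassPreprocessor,
)


class TestLoadingWithComponents(unittest.TestCase):
    def test_load_with_components(self):
        parameters = {
            "true_positive_dict": [
                {"attribute": "ssn", "label": "ssn"},
                {"attribute": "suffix", "label": "name"},
                {"attribute": "my_home_address", "label": "address"},
            ],
            "false_positive_dict": [
                {"attribute": "contract_number", "label": "ssn"},
                {"attribute": "role", "label": "name"},
                {"attribute": "send_address", "label": "address"},
            ],
            "negative_threshold_config": 50,
            "positive_threshold_config": 85,
            "include_label": True,
        }
        preprocessor = DirectPassPreprocessor()
        model = ColumnNameModel(
            label_mapping={"ssn": 1, "name": 2, "address": 3},
            parameters=parameters,
        )
        postprocessor = ColumnNameModelPostprocessor()

        # wire the three components together through the labeler API
        labeler = BaseDataLabeler.load_with_components(
            preprocessor=preprocessor, model=model, postprocessor=postprocessor
        )
        self.assertIsInstance(labeler.model, ColumnNameModel)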

Comment on lines 44 to 49
logger.info(
    "For MVP implementation, the `ColumnNameModel` "
    "does not implement the `label_mapping` utility. "
    "'prediction -> label' mapping is handled in the "
    "post-processor using the data index / row value."
)
Contributor Author

RE: comment above... adding a logger.info call so the user is aware of the label_mapping situation for the MVP implementation.

Comment on lines 229 to 233
  if show_confidences:
      raise NotImplementedError(
          """`show_confidences` parameter is disabled
-         for Proof of Concept implementation. Confidence
-         values are enabled by default."""
+         for MVP implementation. Note: Confidence
+         values are returned by default."""
Contributor Author

My thought is we look at this in a re-work of the pipeline. It will require some thought about how the data is passed into the labeler flow, and a reformatting of the true_positive_dict and false_positive_dict parameters.

@taylorfturner taylorfturner removed the Work In Progress Solution is being developed label Sep 19, 2022
Comment on lines 176 to 188
self.assertEqual(
    '{"true_positive_dict": [{"attribute": "ssn", "label": "ssn"}, '
    '{"attribute": "suffix", "label": "name"}, {"attribute": "my_home_address", '
    '"label": "address"}], "false_positive_dict": [{"attribute": '
    '"contract_number", "label": "ssn"}, {"attribute": "role", '
    '"label": "name"}, {"attribute": "send_address", "label": "address"}], '
    '"negative_threshold_config": 50, "include_label": true}{"true_positive_dict": '
    '[{"attribute": "ssn", "label": "ssn"}, {"attribute": "suffix", '
    '"label": "name"}, {"attribute": "my_home_address", "label": "address"}], '
    '"false_positive_dict": [{"attribute": "contract_number", "label": "ssn"}, '
    '{"attribute": "role", "label": "name"}, {"attribute": "send_address", "label": '
    '"address"}], "negative_threshold_config": 50, "include_label": true}',
    mock_file.getvalue(),
Contributor Author

Test load of the model (label mapping and model parameters).

@@ -0,0 +1 @@
{}
Contributor Author

DirectPassPreprocessor

micdavis previously approved these changes Sep 19, 2022
Comment on lines 3020 to 3021
with self.assertRaises(TypeError):
    process_output = processor.process(data)
Contributor Author

This raises a TypeError when testing just the post processor because the process() label parameter is None and NoneType is not subscriptable. We could do more elegant error handling here.
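For illustration, a minimal standalone reproduction of that failure mode (not the PR's code):

# `labels` defaults to None when only the post processor is exercised,
# and indexing into None raises the TypeError the test expects
labels = None
try:
    labels[0]
except TypeError as exc:
    print(exc)  # 'NoneType' object is not subscriptable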

Contributor Author

Outdated.

Contributor Author

@taylorfturner taylorfturner left a comment

Would like to have this merged, but it's not 100% following the API paradigm of the models / labelers, in that this model and post processor are not fully interchangeable components.

Comment on lines 221 to 226
  if show_confidences:
-     raise NotImplementedError(
+     raise Warning(
          """`show_confidences` parameter is disabled
-         for Proof of Concept implementation. Confidence
-         values are enabled by default."""
+         for MVP implementation. Due to the requirement
+         of having the data point in the post processor.
+         Note: Confidence values are returned by default."""
Contributor Author

Disabled because confidences must currently be in the output from the model for the post processing. Ideally we move the filtering / labeling around so that this API works as intended.

For now, the confidences will show by default.

Contributor

Should use a logging warning instead of raise Warning.
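A minimal sketch of that suggestion (illustrative only; the logger setup and function name are assumptions):

import logging

logger = logging.getLogger(__name__)


def check_show_confidences(show_confidences: bool) -> None:
    """Warn without interrupting execution, instead of `raise Warning(...)`."""
    if show_confidences:
        logger.warning(
            "`show_confidences` is disabled for the MVP implementation; "
            "confidence values are returned by default."
        )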

Contributor

Should we still add confidences to the output anyway, since the output should have the format dict(pred=..., conf=...)?

Contributor

I think all other models have the same output, no?

Contributor

or dict(pred=...) if show_confidences=False

Contributor Author

or dict(pred=...) if show_confidences=False

This is a good idea to make sure things follow the same paradigm; however, as soon as we do this, the filtering in the post processor would not work because there is no similarity score on which to filter.

Contributor Author

I think all other models have the same output, no?

Yes, they do; that is a concern / hesitation with this labeler. However, where I'm at right now is to make sure the functionalities that don't yet follow the same paradigm are called out in logging, and then we will have a follow-up PR to finalize consistency between this model and the rest of the models already part of the package / model zoo.

Contributor Author

Updated to allow for use of the show_confidences parameter and to return values in dict(pred=np.array(), conf=np.array())
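A hypothetical usage sketch of that change; `model` stands in for a ColumnNameModel constructed as earlier in this thread, and the exact predict() signature is assumed:

import numpy as np

# hypothetical column names to label
data = np.array(["ssn", "suffix", "my_home_address"])

# with confidences: both keys are present
output = model.predict(data, show_confidences=True)
predictions, confidences = output["pred"], output["conf"]

# without confidences: only the predictions key, matching the other models
output = model.predict(data, show_confidences=False)
assert "conf" not in output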

@taylorfturner
Contributor Author

Maybe hold off... I'll see what I can tackle tomorrow AM

@@ -243,7 +245,12 @@ def load_from_disk(cls, dirpath):
        with open(model_param_dirpath, "r") as fp:
            parameters = json.load(fp)

        loaded_model = cls(parameters)
        # load label_mapping
        labels_dirpath = os.path.join(dirpath, "label_mapping.json")
Contributor

nice

Comment on lines 82 to 106
model = ColumnNameModel(parameters=mock_model_parameters)
cls.parameters = {
    "true_positive_dict": [
        {"attribute": "ssn", "label": "ssn"},
        {"attribute": "suffix", "label": "name"},
        {"attribute": "my_home_address", "label": "address"},
    ],
    "false_positive_dict": [
        {
            "attribute": "contract_number",
            "label": "ssn",
        },
        {
            "attribute": "role",
            "label": "name",
        },
        {
            "attribute": "send_address",
            "label": "address",
        },
    ],
    "negative_threshold_config": 50,
    "include_label": True,
}

cls.test_label_mapping = {"ssn": 1, "name": 2, "address": 3}
Contributor

Any reason this over:
model = ColumnNameModel(label_mapping=mock_label_parameters, parameters=mock_model_parameters)
?

@JGSweets
Contributor

I think we are also missing a check in validate_parameters for the model to compare the pos labels against label_mapping, to ensure the pos labels are a subset of label_mapping.

@@ -0,0 +1 @@
{"model": {"class": "ColumnNameModel"}, "preprocessor": {"class": "DirectPassPreprocessor"}, "postprocessor": {"class": "ColumnNamePostprocessor"}}
Contributor

these files look good

Contributor Author

dope

Contributor Author

Typo in here, actually, that I found when the load_from_library test in test_integration was added.

preprocessor=preprocessor, model=model, postprocessor=postprocessor
)

def test_default_model(self):
Contributor

Do we also need a load_from_library test? To ensure the resources work.

Contributor Author

Fair point, yeah; we should test load_from_library for the model for sure.

Contributor Author

Added, and it actually caught a typo in the resources too -- 👍
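A hedged sketch of what that load_from_library test can look like; the labeler name "column_name_labeler" and the specific assertions are assumptions:

import unittest

from dataprofiler.labelers.base_data_labeler import BaseDataLabeler


class TestLoadFromLibrary(unittest.TestCase):
    def test_load_from_library(self):
        # loads the labeler's components from the packaged resources folder
        labeler = BaseDataLabeler.load_from_library("column_name_labeler")
        self.assertIsNotNone(labeler.model)
        self.assertIsNotNone(labeler.preprocessor)
        self.assertIsNotNone(labeler.postprocessor)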

Comment on lines -42 to 45
- # initialize class
+ # validate and set parameters
  self.set_label_mapping(label_mapping)
  self._validate_parameters(parameters)
  self._parameters = parameters
Contributor

Instead of these, could call the super?

BaseModel.__init__(self, label_mapping, parameters)

Contributor

Maybe we can't because regex didn't? Or an oversight.

Contributor Author

I'll test

Contributor Author

I'm seeing some of the other models use BaseModel.__init__, but when I do that or use super, the unit tests are failing... I'll troubleshoot a bit more though.

Contributor Author

Not seeing success with doing BaseModel.__init__(self, label_mapping, parameters) ... maybe I'm doing something wrong, though

Contributor

curious, what were the errors?

@taylorfturner
Contributor Author

I think we are also missing a check in validate_parameters for the model to compare the pos labels against label_mapping, to ensure the pos labels are a subset of label_mapping

Great call. Added a check in the _validate_parameters of column_name_model.py to validate that the true_positive_unique_labels are a subset of the label_mapping values, otherwise raising an error. Pushed, with a test case added to the test_column_name_model.py test suite, @JGSweets

Contributor Author

@taylorfturner taylorfturner left a comment

major overhaul overnight

Comment on lines +75 to +88
if parameters["true_positive_dict"]:
    label_map_dict_keys = set(self.label_mapping.keys())
    true_positive_unique_labels = set(
        parameters["true_positive_dict"][0].values()
    )

    # error if the true positive labels are not a subset
    # (i.e., less than or equal to) of the label mapping keys
    if true_positive_unique_labels > label_map_dict_keys:
        errors.append(
            """`true_positive_dict` must be a subset
            of the `label_mapping` values()"""
        )

Contributor Author

Checking that the true_positive_dict label values together form a subset of (less than or equal to) the label_mapping.
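For reference, a minimal illustration of the set semantics in play (plain Python, not the PR code): `a > b` is True only when `a` is a proper superset of `b`, while `a <= b` tests the subset relation described above.

label_map_keys = {"ssn", "name", "address"}

# a valid configuration: every true-positive label is in the mapping
assert {"ssn", "name"} <= label_map_keys

# an unknown label breaks the subset relation...
assert not {"ssn", "phone"} <= label_map_keys
# ...yet it is also not a proper superset of the mapping keys,
# so a `>` check alone would not catch this particular case
assert not {"ssn", "phone"} > label_map_keys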

Contributor

Ideally we would say what is not supposed to be there, but that can always be done later.

Comment on lines +129 to +134
elif param == "positive_threshold_config" and (
    value is None or not isinstance(value, int)
):
    errors.append(
        "`{}` is a required parameter that must be an int.".format(param)
    )
Contributor Author

moving this back to column_name_model.py to allow for show_confidences in the ColumnNameDataLabeler

Comment on lines +244 to +260
predictions = np.array([])
confidences = np.array([])

# `data` at this point is either filtered or not filtered
# list of column names on which we are predicting
for iter_value, value in enumerate(data):

    if output[iter_value][0] > self._parameters["positive_threshold_config"]:
        predictions = np.append(
            predictions,
            self._parameters["true_positive_dict"][output[iter_value][1]][
                "label"
            ],
        )

        if show_confidences:
            confidences = np.append(confidences, output[iter_value][0])
Contributor Author

Returning in the same way as the other models: dict(pred=np.array(), conf=np.array())

Comment on lines +265 to +267
if show_confidences:
    return {"pred": predictions, "conf": confidences}
return {"pred": predictions}
Contributor Author

implementing same output as other data labelers / models

@@ -2086,15 +2086,9 @@ class ColumnNameModelPostprocessor(
):
    """Subclass of BaseDataPostprocessor for postprocessing regex data."""

    def __init__(self, true_positive_dict=None, positive_threshold_config=None):
Contributor Author

Moving this from the post processor to the model file to allow for show_confidences parameter usage and similar output.

NOTE: this basically becomes a DirectPass post processor at this point.
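A minimal sketch of what a direct-pass postprocessor reduces to (an assumption based on the NOTE above; the process() signature is modeled on the other postprocessors):

class DirectPassPostprocessorSketch:
    """Illustrative only: passes the model results through unchanged."""

    def process(self, data, results, label_mapping):
        # the model now handles thresholding and label assignment itself,
        # so there is no filtering or relabeling left to do here
        return results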

@@ -0,0 +1 @@
{"true_positive_dict": [{"attribute": "ssn", "label": "ssn"}, {"attribute": "suffix", "label": "name"}, {"attribute": "my_home_address", "label": "address"}], "false_positive_dict": [{"attribute": "contract_number", "label": "ssn"}, {"attribute": "role", "label": "name"}, {"attribute": "send_address", "label": "address"}], "negative_threshold_config": 50, "positive_threshold_config": 85, "include_label": true}
Contributor Author

positive_threshold_config back in here

@@ -0,0 +1 @@
{}
Contributor Author

no params here anymore

@JGSweets JGSweets merged commit 62294b6 into capitalone:main Sep 20, 2022
@taylorfturner taylorfturner deleted the new_model/labeler branch October 3, 2022 15:38