ColumnNameLabeler Setup #635
Conversation
label_mapping = {"not": "implemented"} | ||
self.set_label_mapping(label_mapping) |
Feedback really wanted here: I don't like this, and it is very spaghetti code, but it forces the labeler's `load_with_components` to run. The labeler API requires a `label_mapping`; however, this model does not currently use a label mapping. We could refactor the code to ultimately use that attribute, but at the POC / MVP stage neither the data nor the model is really set up to use this part of the API.
- if data[iter_value][0] > self._parameters["positive_threshold_config"]:
+ if labels[iter_value][0] > self._parameters["positive_threshold_config"]:
Typo fix after realizing that `labels` is the parameter that represents the output from the model component.
  results[iter_value] = {}
  try:
      results[iter_value]["pred"] = self._parameters[
          "true_positive_dict"
-     ][data[iter_value][1]]["label"]
+     ][labels[iter_value][1]]["label"]
Typo fix after realizing that `labels` is the parameter that represents the output from the model component.
@@ -0,0 +1,69 @@
import os |
Brand new file for testing the preprocessor, model, and postprocessor all together.
logger.info(
    "For MVP implementation, the `ColumnNameModel` "
    "does not implement the `label_mapping` utility. "
    "'prediction -> label' mapping is handled in the "
    "post-processor using the data index / row value."
)
Re: the comment above... adding a `logger.info` so the user is aware of the `label_mapping` situation for the MVP implementation.
  if show_confidences:
      raise NotImplementedError(
          """`show_confidences` parameter is disabled
-         for Proof of Concept implementation. Confidence
-         values are enabled by default."""
+         for MVP implementation. Note: Confidence
+         values are returned by default."""
My thought is that we look at this in a rework of the pipeline. It will require some thought on how the data is passed into the labeler flow, and reformatting of the `true_positive_dict` and `false_positive_dict` parameters.
self.assertEqual(
    '{"true_positive_dict": [{"attribute": "ssn", "label": "ssn"}, '
    '{"attribute": "suffix", "label": "name"}, {"attribute": "my_home_address", '
    '"label": "address"}], "false_positive_dict": [{"attribute": '
    '"contract_number", "label": "ssn"}, {"attribute": "role", '
    '"label": "name"}, {"attribute": "send_address", "label": "address"}], '
    '"negative_threshold_config": 50, "include_label": true}{"true_positive_dict": '
    '[{"attribute": "ssn", "label": "ssn"}, {"attribute": "suffix", '
    '"label": "name"}, {"attribute": "my_home_address", "label": "address"}], '
    '"false_positive_dict": [{"attribute": "contract_number", "label": "ssn"}, '
    '{"attribute": "role", "label": "name"}, {"attribute": "send_address", "label": '
    '"address"}], "negative_threshold_config": 50, "include_label": true}',
    mock_file.getvalue(),
test load of model (label mapping and model parameters)
@@ -0,0 +1 @@
{} |
DirectPassPreprocessor
with self.assertRaises(TypeError):
    process_output = processor.process(data)
This raises a `TypeError` when testing just the post-processor, because the `process()` label parameter is `None` and `NoneType` is not subscriptable. We could do more elegant error handling here, for example something like the guard below.
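A minimal sketch of that idea (hypothetical; the exact `process()` signature and parameter name are assumed, not confirmed by this PR):

# Hypothetical early guard inside process(); `labels` is the model
# output parameter described above (signature assumed).
if labels is None:
    raise ValueError(
        "`labels` must be provided: the post-processor uses the model "
        "output for 'prediction -> label' mapping."
    )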
Outdated.
I would like to have this merged, but it's not 100% following the API paradigm of the models / labelers, in that this model and post-processor are not fully interchangeable components.
  if show_confidences:
-     raise NotImplementedError(
+     raise Warning(
          """`show_confidences` parameter is disabled
-         for Proof of Concept implementation. Confidence
-         values are enabled by default."""
+         for MVP implementation. Due to the requirement
+         of having the data point in the post processor.
+         Note: Confidence values are returned by default."""
Disabled because confidences must currently be in the model's output for the post-processing to work. Ideally we would move the filtering / labeling around so that this API works as intended. For now, the confidences will show by default.
Should use `logging.warning` instead of `raise Warning`.
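For example, a minimal sketch of the logging approach (logger setup and message wording assumed for illustration):

import logging

logger = logging.getLogger(__name__)

if show_confidences:
    # Warn without interrupting the caller; confidences are returned anyway.
    logger.warning(
        "`show_confidences` is disabled for the MVP implementation; "
        "confidence values are returned by default."
    )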
Should we still add confidences to the output anyway, since the output should have the format `dict(pred=..., conf=...)`?
I think all other models have the same output, no?
Or `dict(pred=...)` if `show_confidences=False`.
> Or `dict(pred=...)` if `show_confidences=False`.

This is a good idea to make sure things follow the same paradigm; however, as soon as we do this, the filtering in the post-processor would not work, because there is no similarity score on which to filter.
> I think all other models have the same output, no?

Yes, they do; that is a concern / hesitation on this labeler. However, where I'm at right now is: make sure the functionalities that don't... yet... follow the same paradigm are called out in logging, and then we will have a follow-up PR to finalize consistency between this model and the rest of the models already part of the package / model zoo.
Updated to allow use of the `show_confidences` parameter and to return values in `dict(pred=np.array(), conf=np.array())`.

Maybe hold off... I'll see what I can tackle tomorrow AM.
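For illustration, hypothetical outputs in the two shapes discussed (values invented; the thresholds in this PR suggest 0-100 similarity scores):

import numpy as np

# With show_confidences=True:
output = {"pred": np.array(["ssn", "address"]), "conf": np.array([91.0, 88.5])}

# With show_confidences=False:
output = {"pred": np.array(["ssn", "address"])}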
@@ -243,7 +245,12 @@ def load_from_disk(cls, dirpath):
with open(model_param_dirpath, "r") as fp:
    parameters = json.load(fp)

loaded_model = cls(parameters)
# load label_mapping
labels_dirpath = os.path.join(dirpath, "label_mapping.json")
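Presumably the load path continues along these lines (a hedged sketch; only the `labels_dirpath` line is shown in the hunk, the rest is assumed):

# Assumed continuation: read the saved mapping and hand it to the model.
with open(labels_dirpath, "r") as fp:
    label_mapping = json.load(fp)
loaded_model = cls(label_mapping=label_mapping, parameters=parameters)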
nice
model = ColumnNameModel(parameters=mock_model_parameters)
cls.parameters = {
    "true_positive_dict": [
        {"attribute": "ssn", "label": "ssn"},
        {"attribute": "suffix", "label": "name"},
        {"attribute": "my_home_address", "label": "address"},
    ],
    "false_positive_dict": [
        {
            "attribute": "contract_number",
            "label": "ssn",
        },
        {
            "attribute": "role",
            "label": "name",
        },
        {
            "attribute": "send_address",
            "label": "address",
        },
    ],
    "negative_threshold_config": 50,
    "include_label": True,
}

cls.test_label_mapping = {"ssn": 1, "name": 2, "address": 3}
Any reason for this over `model = ColumnNameModel(label_mapping=mock_label_parameters, parameters=mock_model_parameters)`? I think we are also missing the check in
@@ -0,0 +1 @@
{"model": {"class": "ColumnNameModel"}, "preprocessor": {"class": "DirectPassPreprocessor"}, "postprocessor": {"class": "ColumnNamePostprocessor"}} |
these files look good
dope
Typo in here, actually, that I found when the `load_from_library` test in `test_integration` was added.
    preprocessor=preprocessor, model=model, postprocessor=postprocessor
)

def test_default_model(self):
Do we also need a `load_from_library` test, to ensure the resources work?
Fair point, yeah, we should test load-from-library for the model for sure.
added and actually caught a typo in the resources too -- 👍
# initialize class
# validate and set parameters
self.set_label_mapping(label_mapping)
self._validate_parameters(parameters)
self._parameters = parameters
Instead of these, could we call the super?

BaseModel.__init__(self, label_mapping, parameters)
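For reference, a minimal sketch of what that suggestion would look like (assuming `BaseModel.__init__` performs the same validate-and-set steps):

class ColumnNameModel(BaseModel):
    def __init__(self, label_mapping=None, parameters=None):
        # Assumed equivalent of the three explicit lines above, if the
        # base class sets the mapping and validates/stores the parameters.
        BaseModel.__init__(self, label_mapping, parameters)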
Maybe we can't, because regex didn't? Or an oversight.
I'll test
I'm seeing some of the other models use `BaseModel.__init__`, but when I do that (or the super call) the unit tests fail... I'll troubleshoot a bit more, though.
Not seeing success with `BaseModel.__init__(self, label_mapping, parameters)`... maybe I'm doing something wrong, though.
curious, what were the errors?
Great call. Added in a check in the
major overhaul overnight
if parameters["true_positive_dict"]: | ||
label_map_dict_keys = set(self.label_mapping.keys()) | ||
true_positive_unique_labels = set( | ||
parameters["true_positive_dict"][0].values() | ||
) | ||
|
||
# if not a subset that is less than or equal to | ||
# label mapping dict | ||
if true_positive_unique_labels > label_map_dict_keys: | ||
errors.append( | ||
"""`true_positive_dict` must be a subset | ||
of the `label_mapping` values()""" | ||
) | ||
|
Checking that the `true_positive_dict` label values are together an equal or lesser subset of `label_mapping`.
Ideally we would say what is not supposed to be there, but that can always be done later.
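A quick sketch of that idea, reusing the variable names from the snippet above (illustrative only):

# Report exactly which labels are unexpected via a set difference.
unexpected_labels = true_positive_unique_labels - label_map_dict_keys
if unexpected_labels:
    errors.append(
        "`true_positive_dict` contains labels not in `label_mapping`: "
        + ", ".join(sorted(unexpected_labels))
    )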
elif param == "positive_threshold_config" and (
    value is None or not isinstance(value, int)
):
    errors.append(
        "`{}` is a required parameter that must be an int.".format(param)
    )
Moving this back to `column_name_model.py` to allow for `show_confidences` in the `ColumnNameDataLabeler`.
predictions = np.array([])
confidences = np.array([])

# `data` at this point is either filtered or not filtered
# list of column names on which we are predicting
for iter_value, value in enumerate(data):

    if output[iter_value][0] > self._parameters["positive_threshold_config"]:
        predictions = np.append(
            predictions,
            self._parameters["true_positive_dict"][output[iter_value][1]][
                "label"
            ],
        )

        if show_confidences:
            confidences = np.append(confidences, output[iter_value][0])
Returning in the same way as the other models: `dict(pred=np.array(), conf=np.array())`.
if show_confidences:
    return {"pred": predictions, "conf": confidences}
return {"pred": predictions}
Implementing the same output as the other data labelers / models.
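A hypothetical call illustrating that shape (inputs, values, and exact signature assumed for the example):

# Hypothetical usage of the model's predict with the unified output shape.
output = model.predict(["ssn", "my_home_address"], show_confidences=True)
print(output["pred"])  # e.g. np.array(["ssn", "address"])
print(output["conf"])  # e.g. np.array([91.0, 88.5])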
@@ -2086,15 +2086,9 @@ class ColumnNameModelPostprocessor(
):
    """Subclass of BaseDataPostprocessor for postprocessing regex data."""

    def __init__(self, true_positive_dict=None, positive_threshold_config=None):
Removing this from the post-processor into the model file to allow for `show_confidences` parameter usage and similar output. NOTE: this basically becomes a `DirectPass` post-processor at this point.
@@ -0,0 +1 @@
{"true_positive_dict": [{"attribute": "ssn", "label": "ssn"}, {"attribute": "suffix", "label": "name"}, {"attribute": "my_home_address", "label": "address"}], "false_positive_dict": [{"attribute": "contract_number", "label": "ssn"}, {"attribute": "role", "label": "name"}, {"attribute": "send_address", "label": "address"}], "negative_threshold_config": 50, "positive_threshold_config": 85, "include_label": true} |
`positive_threshold_config` is back in here.
@@ -0,0 +1 @@
{} |
No params in here anymore; `ColumnNameModel` as well.