Make Binary Classification metrics more robust (ValueError: Found unknown categories...)
#515
Labels
feature request
Problem Description
The Binary Classification metrics are designed to train a classifier on the synthetic data and then test it on the real data.
The classifier only works if all the possible category values are available during the training phase. In practice, the synthetic data may be missing some categories.
For example, consider an exceedingly rare category in the real data: `credit_fraud` occurs <1% of the time. This case may never be covered by the synthetic data due to sheer luck. If you had data like this, the classifier would fail with a `ValueError` because `status='credit_fraud'` is an unknown category at the time of testing.
Expected behavior
We expect the metric to be more robust, meaning that it should not crash if it encounters this case. At a bare minimum, it may just skip over any rows with unknown values. So these rows would never even factor into the final F1 score that is returned.
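The skip-rows fallback described above could look roughly like the sketch below. It is plain Python and not the library's implementation: the helper name `split_known_rows`, the `amount` column, and the `approved`/`declined` values are all hypothetical, chosen only to mirror the `status='credit_fraud'` example.

```python
from typing import Dict, Iterable, List, Tuple

Row = Dict[str, object]


def split_known_rows(
    train_rows: Iterable[Row],
    test_rows: Iterable[Row],
    categorical_columns: List[str],
) -> Tuple[List[Row], List[Row]]:
    """Separate test rows into those safe to score and those to skip.

    A test row is skipped when any categorical column holds a value
    that never appeared in the training rows.
    """
    # Collect the categories seen during training, per column.
    seen = {column: set() for column in categorical_columns}
    for row in train_rows:
        for column in categorical_columns:
            seen[column].add(row[column])

    kept, skipped = [], []
    for row in test_rows:
        if all(row[column] in seen[column] for column in categorical_columns):
            kept.append(row)
        else:
            skipped.append(row)
    return kept, skipped


# Hypothetical synthetic data (training side): 'credit_fraud' is absent.
synthetic = [
    {"status": "approved", "amount": 120.0},
    {"status": "declined", "amount": 75.5},
]
# Hypothetical real data (testing side): contains the rare category.
real = [
    {"status": "approved", "amount": 98.0},
    {"status": "credit_fraud", "amount": 4300.0},
    {"status": "declined", "amount": 12.0},
]

kept, skipped = split_known_rows(synthetic, real, ["status"])
print(len(kept), len(skipped))  # 2 1
```

With a filter like this, only the `kept` rows would factor into the F1 score. Reporting the size of `skipped` alongside the score might also help users notice when rare categories were dropped.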