-
Notifications
You must be signed in to change notification settings - Fork 169
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
gini_impurity function property and test cases for this #308
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some comments
profile["statistics"].update( | ||
dict(gini_impurity=self.gini_impurity) | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe you could just add the gini_impurity key to the call to update above, with the categories.
df_categorical = pd.Series(["y", "y", "y", "y", "n", "n", "n"]) | ||
profile = CategoricalColumn(df_categorical.name) | ||
profile.update(df_categorical) | ||
expected_val = ((4 / 7) * (3/7)) + ((4 / 7) * (3/7)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make the spacing spacing on (4 / 7) consistent with (3/7)
profile["statistics"].update( | ||
dict(categories=self.categories) | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not your issue, but if you have to make another change in this pr, could you change this to match the format below:
profile["statistics"]["categories"] = self.categories
def gini_impurity(self): | ||
""" | ||
Property for gini impurity |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lets put a reference to the formula or show the formula used here. Also need a space between the description and the return.
return None | ||
summation = 0 | ||
total = sum(self._categories.values()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this value may already exist as self.sample_size
expected_gini = .914600550 | ||
self.assertAlmostEqual(report_gini, expected_gini) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it possible to get the exact math like above or do we have to do almost equal? if we can get it then we won't need to pop it either.
print(self.sample_size) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove the print and comment
if self.sample_size == 0: | ||
return None | ||
summation = 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe just call this gini_impurity
? thoughts?
* added gini_impurity function property and test cases for this * fixed documentation for gini impurity * fixed syntax and test cases related to gini_impurity * edited test cases and code related to gini_impurity * deleted extra code and simplified variable names
I have added gini impurity into categorical column, basically likelihood of an incorrect classification of a new instance of a random variable. Please review.