Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gini_impurity function property and test cases for this #308

Merged
merged 7 commits into from
Jul 7, 2021

Conversation

az85252
Copy link
Contributor

@az85252 az85252 commented Jul 6, 2021

I have added gini impurity into categorical column, basically likelihood of an incorrect classification of a new instance of a random variable. Please review.

Copy link
Contributor

@AnhTruong AnhTruong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some comments

AnhTruong
AnhTruong previously approved these changes Jul 6, 2021
Comment on lines 84 to 86
profile["statistics"].update(
dict(gini_impurity=self.gini_impurity)
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe you could just add the gini_impurity key to the call to update above, with the categories.

df_categorical = pd.Series(["y", "y", "y", "y", "n", "n", "n"])
profile = CategoricalColumn(df_categorical.name)
profile.update(df_categorical)
expected_val = ((4 / 7) * (3/7)) + ((4 / 7) * (3/7))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make the spacing spacing on (4 / 7) consistent with (3/7)

Comment on lines 81 to 83
profile["statistics"].update(
dict(categories=self.categories)
)
Copy link
Contributor

@JGSweets JGSweets Jul 7, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not your issue, but if you have to make another change in this pr, could you change this to match the format below:
profile["statistics"]["categories"] = self.categories

def gini_impurity(self):
"""
Property for gini impurity
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets put a reference to the formula or show the formula used here. Also need a space between the description and the return.

return None
summation = 0
total = sum(self._categories.values())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this value may already exist as self.sample_size

Comment on lines 273 to 274
expected_gini = .914600550
self.assertAlmostEqual(report_gini, expected_gini)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it possible to get the exact math like above or do we have to do almost equal? if we can get it then we won't need to pop it either.

Comment on lines 192 to 193
print(self.sample_size)
Copy link
Contributor

@JGSweets JGSweets Jul 7, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove the print and comment

if self.sample_size == 0:
return None
summation = 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe just call this gini_impurity? thoughts?

@JGSweets JGSweets enabled auto-merge (squash) July 7, 2021 18:00
@JGSweets JGSweets merged commit 553f623 into capitalone:main Jul 7, 2021
stevensecreti pushed a commit to stevensecreti/DataProfiler that referenced this pull request Jun 15, 2022
* added gini_impurity function property and test cases for this

* fixed documentation for gini impurity

* fixed syntax and test cases related to gini_impurity

* edited test cases and code related to gini_impurity

* deleted extra code and simplified variable names
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants