gini_impurity function property and test cases for this #308

az85252 · 2021-07-06T19:32:57Z

I have added Gini impurity to the categorical column profile. It is the likelihood of incorrectly classifying a new instance of a random variable if that instance were labeled at random according to the observed class distribution. Please review.

dataprofiler/profilers/categorical_column_profile.py

AnhTruong

some comments

dataprofiler/profilers/categorical_column_profile.py

ChrisWallace2020 · 2021-07-07T14:05:24Z

dataprofiler/profilers/categorical_column_profile.py

+            profile["statistics"].update(
+                dict(gini_impurity=self.gini_impurity)
+            )


I believe you could just add the gini_impurity key to the call to update above, with the categories.

ChrisWallace2020 · 2021-07-07T14:07:51Z

dataprofiler/tests/profilers/test_categorical_column_profile.py

+        df_categorical = pd.Series(["y", "y", "y", "y", "n", "n", "n"])
+        profile = CategoricalColumn(df_categorical.name)
+        profile.update(df_categorical)
+        expected_val = ((4 / 7) * (3/7)) + ((4 / 7) * (3/7))


Make the spacing on (4 / 7) consistent with (3/7).
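For reference, the arithmetic behind that expected value can be checked with exact fractions. This is only a sketch of the math for the 4 "y" / 3 "n" series in the test; the names `p_y`, `p_n`, and `expected_exact` are illustrative and not part of the PR.

```python
from fractions import Fraction

p_y = Fraction(4, 7)  # four "y" values out of seven
p_n = Fraction(3, 7)  # three "n" values out of seven

# Each class contributes p * (1 - p); with two classes both terms are equal.
expected_exact = p_y * (1 - p_y) + p_n * (1 - p_n)

# The same value written with floats and consistent spacing.
expected_val = (4 / 7) * (3 / 7) + (4 / 7) * (3 / 7)

print(expected_exact)  # exact rational form of the expected impurity
```

Both terms reduce to 12/49, so the exact expected impurity is 24/49.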

JGSweets · 2021-07-07T15:21:52Z

dataprofiler/profilers/categorical_column_profile.py

            profile["statistics"].update(
                dict(categories=self.categories)
            )


Not your issue, but if you have to make another change in this PR, could you change this to match the format below:
profile["statistics"]["categories"] = self.categories
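A quick sketch of the two forms discussed in these comments, using a plain dict in place of the real profile object. The values assigned to `categories` and `gini_impurity` are placeholders for illustration only.

```python
categories = ["y", "n"]      # placeholder values, not taken from the PR
gini_impurity = 24 / 49      # placeholder value

# Form 1: a single update() call carrying both keys.
profile_a = {"statistics": {}}
profile_a["statistics"].update(
    dict(categories=categories, gini_impurity=gini_impurity)
)

# Form 2: direct key assignment, one line per statistic.
profile_b = {"statistics": {}}
profile_b["statistics"]["categories"] = categories
profile_b["statistics"]["gini_impurity"] = gini_impurity

# Both forms leave the dict in the same state.
assert profile_a == profile_b
```

Either form produces the same dictionary, so the choice is purely about keeping the file stylistically consistent.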

JGSweets · 2021-07-07T15:23:48Z

dataprofiler/profilers/categorical_column_profile.py

+    def gini_impurity(self):
+        """
+        Property for gini impurity


Let's add a reference to the formula, or show the formula used, here. We also need a space between the description and the return.
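One way the docstring could carry the formula is sketched below as a standalone function. This is not the code merged in this PR: the function takes a plain mapping of category counts (an assumption made so the example runs on its own) rather than reading `self._categories`.

```python
def gini_impurity(category_counts):
    """
    Gini impurity of a categorical distribution.

    G = sum_i p_i * (1 - p_i) = 1 - sum_i p_i ** 2,
    where p_i = count_i / total.

    :param category_counts: mapping of category -> observed count
    :return: impurity as a float, or None when there are no samples
    """
    total = sum(category_counts.values())
    if total == 0:
        return None
    impurity = 0.0
    for count in category_counts.values():
        p = count / total           # probability of this category
        impurity += p * (1 - p)     # chance of drawing it and mislabeling it
    return impurity
```

A column with a single category has impurity 0, and the value grows as the counts spread more evenly across categories.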

JGSweets · 2021-07-07T15:24:57Z

dataprofiler/profilers/categorical_column_profile.py

+            return None
+        summation = 0
+        total = sum(self._categories.values())


this value may already exist as self.sample_size

JGSweets · 2021-07-07T15:29:34Z

dataprofiler/tests/profilers/test_categorical_column_profile.py

+        expected_gini = .914600550
+        self.assertAlmostEqual(report_gini, expected_gini)


Is it possible to write the expected value as exact math, like the test above, or do we have to use almost equal? If we can get it, then we won't need to pop it either.
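A minimal sketch of the idea, assuming a hypothetical helper `gini_from_counts` and hypothetical counts (these are not the data behind `.914600550`). Writing the expected value as fractions documents where it comes from. Exact float equality only holds when the expected expression performs the same operations in the same order as the implementation, so the sketch keeps `assertAlmostEqual`, which by default compares to 7 decimal places.

```python
import unittest


def gini_from_counts(counts):
    """Hypothetical helper: impurity from a list of category counts."""
    total = sum(counts)
    return sum((c / total) * (1 - c / total) for c in counts)


class TestGini(unittest.TestCase):
    def test_expected_written_as_fractions(self):
        counts = [4, 3]  # hypothetical counts for illustration
        # Expected value spelled out from the counts, not a rounded decimal.
        expected = (4 / 7) * (3 / 7) + (3 / 7) * (4 / 7)
        # 1 - 4/7 may differ from 3/7 in the last bits, so compare approximately.
        self.assertAlmostEqual(gini_from_counts(counts), expected)
```

If the expected expression is changed to mirror the implementation exactly (for example `(4 / 7) * (1 - 4 / 7)`), `assertEqual` becomes an option.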

JGSweets · 2021-07-07T15:59:58Z

dataprofiler/profilers/categorical_column_profile.py

remove the print and comment

JGSweets · 2021-07-07T16:01:39Z

dataprofiler/profilers/categorical_column_profile.py

+        if self.sample_size == 0:
+            return None
+        summation = 0


Maybe just call this gini_impurity? thoughts?

* added gini_impurity function property and test cases for this
* fixed documentation for gini impurity
* fixed syntax and test cases related to gini_impurity
* edited test cases and code related to gini_impurity
* deleted extra code and simplified variable names

added gini_impurity function property and test cases for this

4d1b1e1

az85252 requested review from AnhTruong, ChrisWallace2020, grant-eden, JGSweets and lettergram as code owners July 6, 2021 19:32

AnhTruong reviewed Jul 6, 2021

dataprofiler/profilers/categorical_column_profile.py (resolved)

AnhTruong reviewed Jul 6, 2021

dataprofiler/profilers/categorical_column_profile.py (outdated, resolved)

fixed documentation for gini impurity

12df928

AnhTruong previously approved these changes Jul 6, 2021

ChrisWallace2020 reviewed Jul 7, 2021

fixed syntax and test cases related to gini_impurity

32aaf7e

az85252 dismissed AnhTruong’s stale review via 32aaf7e July 7, 2021 15:11

Merge branch 'main' into gini_impurity

c7e798e

JGSweets reviewed Jul 7, 2021

az85252 added 2 commits July 7, 2021 10:43

edited test cases and code related to gini_impurity

dbe2b02

Merge remote-tracking branch 'origin/gini_impurity' into gini_impurity

3eaf0cf

JGSweets reviewed Jul 7, 2021

deleted extra code and simplified variable names

67d1a9f

JGSweets approved these changes Jul 7, 2021

JGSweets enabled auto-merge (squash) July 7, 2021 18:00

AnhTruong approved these changes Jul 7, 2021

JGSweets merged commit 553f623 into capitalone:main Jul 7, 2021
