Showing Low Variance columns #192

reza1615 · 2020-04-27T15:29:30Z

Sometimes a dataset has a categorical feature with multiple levels whose distribution is skewed, so that one level dominates the others. Such a feature carries little variation, so it may add little information to an ML model and can often be ignored for modeling.
source

Count of unique values in a feature / sample size < 10%
Count of most common value / Count of second most common value > 20 times.
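
As a rough illustration of the two rules above, here is a minimal sketch in plain Python. The function name `is_low_variance` and its parameters are hypothetical, and it assumes both conditions must hold at the same time, which the rules above do not state explicitly. Treating a column with a single distinct value as low variance is also an assumption.

```python
from collections import Counter


def is_low_variance(values, unique_ratio_max=0.10, dominance_min=20):
    """Hypothetical helper: apply the two rules above to one column."""
    # Ignore missing entries (assumption: None marks a missing value).
    values = [v for v in values if v is not None]
    if not values:
        return False

    counts = Counter(values)
    top_two = counts.most_common(2)
    if len(top_two) < 2:
        # Only one distinct value: the second ratio is undefined, so the
        # column is treated as low variance (assumption).
        return True

    # Rule 1: count of unique values / sample size < 10%
    unique_ratio = len(counts) / len(values)
    # Rule 2: count of most common value / count of second most common > 20
    dominance = top_two[0][1] / top_two[1][1]

    # Assumption: both rules must hold together.
    return unique_ratio < unique_ratio_max and dominance > dominance_min


skewed = ["a"] * 95 + ["b"] * 4 + ["c"]
balanced = ["a"] * 50 + ["b"] * 50
print(is_low_variance(skewed), is_low_variance(balanced))
```

Both thresholds are keyword arguments, so they can be tuned per dataset rather than fixed at 10% and 20.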

I suggest adding a field to the column description that shows whether the column is low variance.
(e.g. Low_Variance = True/False)
An icon such as * or ! could also be placed before the column's name to mark low-variance columns:
⑉ or 𝄇 or 🔰 or 🚩

To make it more visual, we could also have a section at the top of the column description with flags showing:

whether a categorical column is sparse
whether the column has a Gaussian distribution
whether a categorical column is low variance
whether it has a high ratio of missing values
whether it has many outliers
...

These flags could also be shown at the top of the drop-down menu.

aschonfeld · 2020-06-02T01:19:06Z

@reza, I really like this feature. A couple of questions:

Can I just display the count of a value in parenthesis next to each value in the "Unique Values" section of the "Describe" popup?
What is the actual formula to determine if low variance? I see the following:

(count of unique values / # rows in dataframe)
(count of most common value) / (count of second most common)

Looks like I'm cutting off the count of outliers at 100 (I'll fix this)
How to determine sparseness?
How to determine Gaussian distribution? (skew = 0 & kurtosis = 3)?
What would be considered a "high" ratio of missing values?

FedererKK · 2020-07-13T21:24:05Z

Not referring specifically to categorical features, but my 2 cents

Jarque-Bera test for normality, or maybe even better a flag to show moments of different orders.
It is very much data-dependent, so maybe an input box where the user specifies a threshold?
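
For reference, the Jarque-Bera statistic combines sample skewness S and kurtosis K as JB = n/6 * (S^2 + (K - 3)^2 / 4) and is compared against a chi-square distribution with 2 degrees of freedom. The sketch below uses only the Python standard library. The names `jarque_bera` and `looks_gaussian` are hypothetical, and `alpha` stands in for the user-specified threshold suggested above. This is an illustration, not what any particular tool computes.

```python
import math


def jarque_bera(values):
    """Hypothetical helper: return (skewness, kurtosis, JB statistic, p-value)."""
    n = len(values)
    mean = sum(values) / n
    # Central moments of order 2, 3 and 4 (population form, divided by n).
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    if m2 == 0:
        raise ValueError("constant column: skewness and kurtosis are undefined")

    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2  # equals 3 for a Gaussian distribution
    jb = n / 6 * (skew ** 2 + (kurt - 3) ** 2 / 4)
    # Survival function of a chi-square with 2 degrees of freedom is exp(-x/2).
    p_value = math.exp(-jb / 2)
    return skew, kurt, jb, p_value


def looks_gaussian(values, alpha=0.05):
    """True when normality is not rejected at the user-chosen level alpha."""
    return jarque_bera(values)[3] >= alpha


print(looks_gaussian([1.0] * 50 + [100.0]))
```

The chi-square approximation is only asymptotic, so on small samples the p-value should be read as a rough guide.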

aschonfeld · 2020-07-13T21:38:43Z

Thanks for the feedback @FedererKK, I was going to start on this soon.

aschonfeld · 2020-07-16T03:11:17Z

@FedererKK this is what I have so far: https://youtu.be/quco79Val4w

Can you give me some more information on what you meant by "a flag to show moments of different orders"?

Please let me know what else you think might be helpful to determine variance (charts, calculations, etc.).

Thanks

aschonfeld · 2020-07-17T01:52:54Z

@FedererKK I also found this on Stack Overflow

aschonfeld · 2020-07-18T20:34:40Z

@FedererKK @reza1615 I think this will be the final version

It will only be available for columns of numeric data (ints, floats). Please let me know if you think there's anything else I should add.

aschonfeld · 2020-07-22T02:41:57Z

added in v1.10.0

reza1615 changed the title ~~Showing Low Variance~~ Showing Low Variance columns Apr 27, 2020

aschonfeld added a commit that referenced this issue Jul 15, 2020

#192: initial changes

f8100da

aschonfeld added a commit that referenced this issue Jul 16, 2020

#192: initial changes

c40ccd7

aschonfeld added a commit that referenced this issue Jul 18, 2020

#192: Variance Report & Low Variance Flag

5173d83

aschonfeld added a commit that referenced this issue Jul 19, 2020

#192: Variance Report & Low Variance Flag

fb0e07a

aschonfeld added a commit that referenced this issue Jul 19, 2020

#192: Variance Report & Low Variance Flag

8fba305

aschonfeld added a commit that referenced this issue Jul 19, 2020

#192: Variance Report & Low Variance Flag

1215dab

aschonfeld closed this as completed Jul 22, 2020
