Showing Low Variance columns #192

Closed
reza1615 opened this issue Apr 27, 2020 · 7 comments

Comments

@reza1615

reza1615 commented Apr 27, 2020

Sometimes a dataset has a categorical feature with multiple levels whose distribution is skewed, so that one level dominates the others. Such a feature carries little variation, so it may not add much information to an ML model and can often be ignored for modeling.
source

  • Count of unique values in a feature / sample size < 10%
  • Count of most common value / Count of second most common value > 20 times.
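For discussion, here is a minimal pandas sketch of the two rules above. The helper name `is_low_variance`, its parameter names, and the choice to require both rules at once are my assumptions, not something the source specifies.

```python
import pandas as pd


def is_low_variance(col: pd.Series,
                    unique_ratio_max: float = 0.10,
                    top_ratio_min: float = 20.0) -> bool:
    """Hypothetical helper: flag a column that satisfies both rules above."""
    counts = col.value_counts()  # sorted by frequency, NaN excluded
    if len(counts) == 0:
        return False  # empty or all-NaN column: nothing to judge

    # Rule 1: count of unique values / sample size < 10%
    unique_ratio = len(counts) / len(col)

    # Rule 2: count of most common value / count of second most common > 20
    if len(counts) == 1:
        top_ratio = float("inf")  # assumption: a single level dominates fully
    else:
        top_ratio = counts.iloc[0] / counts.iloc[1]

    return bool(unique_ratio < unique_ratio_max and top_ratio > top_ratio_min)


skewed = pd.Series(["a"] * 95 + ["b"] * 4 + ["c"])
balanced = pd.Series(["a"] * 50 + ["b"] * 50)
print(is_low_variance(skewed), is_low_variance(balanced))
```

Both cut-offs are exposed as parameters because the right values are likely to depend on the data.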

I suggest adding a field to the column description that shows whether the column is low variance
(e.g. Low_Variance = True/False).
An icon such as * or ! before the column's name could also mark which columns are low variance:
⑉ or 𝄇 or 🔰 or 🚩

To make it more visual, we could also have a section at the top of the column description with flags that show:

  • whether a categorical column is sparse
  • whether the column has a Gaussian distribution
  • whether a categorical column is low variance
  • whether the column has a high ratio of missing values
  • whether the column has many outliers
    ...

These flags could also be shown at the top of the drop-down menu.
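Two of the flags in that list could be computed roughly as follows. This is only a sketch: `column_flags`, both thresholds, and the 1.5 × IQR rule for outliers are assumptions picked for illustration, and the issue does not define any of them.

```python
import pandas as pd


def column_flags(col: pd.Series,
                 missing_ratio_max: float = 0.5,
                 outlier_ratio_max: float = 0.05) -> dict:
    """Hypothetical helper: compute a few True/False flags for one column."""
    # Share of missing values compared against an assumed threshold
    flags = {"high_missing": bool(col.isna().mean() > missing_ratio_max)}

    if pd.api.types.is_numeric_dtype(col):
        # Assumed outlier rule: values beyond 1.5 * IQR from the quartiles
        q1, q3 = col.quantile(0.25), col.quantile(0.75)
        iqr = q3 - q1
        outliers = (col < q1 - 1.5 * iqr) | (col > q3 + 1.5 * iqr)
        flags["many_outliers"] = bool(outliers.mean() > outlier_ratio_max)

    return flags


print(column_flags(pd.Series(list(range(10)) + [1000])))
print(column_flags(pd.Series([1.0, None, None, None])))
```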

@reza1615 reza1615 changed the title Showing Low Variance Showing Low Variance columns Apr 27, 2020
@aschonfeld
Collaborator

@reza, I really like this feature. A couple of questions:

  1. Can I just display the count of each value in parentheses next to it in the "Unique Values" section of the "Describe" popup?
  2. What is the actual formula to determine whether a column is low variance? I see the following:
  • (count of unique values / # rows in dataframe)
  • (count of most common value) / (count of second most common)
  3. Looks like I'm cutting off the count of outliers at 100 (I'll fix this)
  4. How to determine sparseness?
  5. How to determine Gaussian distribution? (skew = 0 & kurtosis = 3)?
  6. What would be considered a "high" ratio of missing values?

@FedererKK

Not referring specifically to categorical features, but my 2 cents:

  1. Jarque-Bera test for normality, or maybe even better a flag to show moments of different orders.
  2. It is very much data-dependent, so maybe an input box where the user specifies a threshold?
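To make the normality suggestion concrete, here is a rough NumPy sketch that reports sample skewness, kurtosis, and the Jarque-Bera statistic. The function name `normality_summary` and the `alpha` cut-off are assumptions; the p-value uses the fact that the statistic is asymptotically chi-square with 2 degrees of freedom, whose survival function is exp(-x/2). `scipy.stats.jarque_bera` provides the same test if SciPy is available.

```python
import math

import numpy as np


def normality_summary(values, alpha: float = 0.05) -> dict:
    """Hypothetical helper: moments plus a Jarque-Bera normality check."""
    x = np.asarray(values, dtype=float)
    x = x[~np.isnan(x)]  # ignore missing values
    n = x.size

    centered = x - x.mean()
    m2 = np.mean(centered ** 2)  # assumes the column is not constant (m2 > 0)
    skew = np.mean(centered ** 3) / m2 ** 1.5
    kurt = np.mean(centered ** 4) / m2 ** 2  # equals 3 for a normal distribution

    # Jarque-Bera statistic: n/6 * (S^2 + (K - 3)^2 / 4)
    jb = n / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)
    p_value = math.exp(-jb / 2.0)  # chi-square(2) survival function

    return {
        "skew": float(skew),
        "kurtosis": float(kurt),
        "jarque_bera": float(jb),
        "p_value": p_value,
        "looks_gaussian": p_value > alpha,  # alpha is a user-chosen threshold
    }


print(normality_summary(np.arange(1000)))
```

Keeping `alpha` as a parameter matches the second point: the threshold is left to the user instead of being hard-coded.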

@aschonfeld
Collaborator

Thanks for the feedback @FedererKK, I was going to start on this soon.

aschonfeld added a commit that referenced this issue Jul 15, 2020
aschonfeld added a commit that referenced this issue Jul 16, 2020
@aschonfeld
Collaborator

@FedererKK this is what I have so far: https://youtu.be/quco79Val4w

Can you give me some more information on what you meant by "a flag to show moments of different orders"?

Please let me know what else you think might be helpful for determining variance (charts, calculations, etc.).

Thanks

@aschonfeld
Collaborator

@FedererKK I also found this on Stack Overflow

@aschonfeld
Collaborator

@FedererKK @reza1615 I think this will be the final version

It will only be available for columns of numeric data (ints, floats). Please let me know if you think there's anything else I should add.

@aschonfeld
Collaborator

added in v1.10.0
