Added histograms and boxplots for outlier variables #35
I really like this feature, especially the boxplots. They are super useful for identifying group outliers, and I think they add a lot of value.
Observations
- For example, when looking at the income variable grouped, I noticed an outlier in inc_1.
- However, when comparing the histogram with and without outliers, both look the same. The axis in the "without outliers" view still shows values up to 10,000,000, which is the identified outlier.
Suggestions
- I am unsure whether we should remove these outliers altogether.
- Winsorization might be a great alternative. It could be helpful to show:
  - The histogram of inc_01.
  - The histogram with winsorized inc_01, to see the impact on the distribution.
  - Instead of the filtered_hfc, we would create a dataset with the winsorized variable, maybe at 95%. Let me know if you want to discuss this.
This is a great addition to the tool—thanks for all the work on it!
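A minimal sketch of the winsorization idea suggested above, assuming a two-sided cap at the 5th and 95th percentiles (the thread does not say whether "95%" is meant as one-sided or two-sided). The helper name `winsorize_var` and the data are made up for illustration; this is not the function used in the PR.

```r
# Hypothetical helper: cap values outside the chosen quantiles.
# Missing values stay missing; no rows are dropped.
winsorize_var <- function(x, lower = 0.05, upper = 0.95) {
  bounds <- stats::quantile(x, probs = c(lower, upper),
                            na.rm = TRUE, names = FALSE)
  pmin(pmax(x, bounds[1]), bounds[2])
}

# Made-up data: 99 plausible values plus one extreme value,
# mimicking the 10,000,000 outlier described for inc_01.
inc_01 <- c(seq(1000, 5900, by = 50), 10000000)
inc_01_w <- winsorize_var(inc_01)

# The raw range is dominated by the outlier, which stretches the
# histogram axis; the winsorized range is not.
range(inc_01)
range(inc_01_w)

# Histogram breaks computed without drawing, to compare axis extents.
max(hist(inc_01, plot = FALSE)$breaks)
max(hist(inc_01_w, plot = FALSE)$breaks)
```

Unlike filtering, this keeps every observation, so the winsorized histogram has the same number of rows as the original one.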
Good idea! Will fix.
I reviewed this.
Winsorization is now applied in the outliers check in lieu of filtering. The parameters are adjusted depending on the method and multiplier selected by the user in the Setup Tab.
@@ -7,7 +7,7 @@
 if(!require(pacman)) install.packages("pacman")

 pacman::p_load(
-  shiny, dplyr, tidyr, stringr, lubridate, purrr, ggplot2, janitor, data.table, DT, remotes, bsicons,
+  shiny, dplyr, tidyr, stringr, lubridate, purrr, ggplot2, janitor, data.table, DescTools, DT, remotes, bsicons,
I actually ended up using a custom function so perhaps we don't need DescTools
Great work on this! @m-visintini I have a few comments:
- I changed the name to winsorized instead of "without outliers" to be more precise.
- I think we should include the level in the title (e.g., 95%). I left it as xx%.
- The winsorization should be applied at the 95% level regardless of the method chosen.
- For the boxplots, I believe it’s fine to keep only the version with outliers. Winsorization changes only the selected variable (inc_01), not the entire group (inc), so the boxes for inc_02 and inc_03 stay the same. The boxplots look clear to me, but I’m happy to discuss this if needed.
Once these changes are applied, this will be ready to merge!
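To make the requested changes concrete, here is a rough sketch under stated assumptions: the level is fixed at 95% regardless of the selected method, only the selected variable is winsorized, and the title states the level instead of "xx%". The helper names `winsorize_column` and `winsor_title`, the two-sided 5%/95% caps, and the example data are all hypothetical and not taken from the PR code.

```r
# Fixed level, independent of the method chosen in the Setup tab.
# Assumption: two-sided caps at the (1 - level) and level quantiles.
winsor_level <- 0.95

# Hypothetical helper: return a copy of `data` in which only `var`
# is winsorized; every other column is left untouched.
winsorize_column <- function(data, var, level = winsor_level) {
  bounds <- stats::quantile(data[[var]], probs = c(1 - level, level),
                            na.rm = TRUE, names = FALSE)
  data[[var]] <- pmin(pmax(data[[var]], bounds[1]), bounds[2])
  data
}

# Hypothetical helper: build a plot title that states the level.
winsor_title <- function(var, level = winsor_level) {
  sprintf("Histogram of %s (winsorized at %d%%)",
          var, as.integer(round(level * 100)))
}

# Made-up data standing in for the inc group.
hfc <- data.frame(
  inc_01 = c(seq(1000, 5900, by = 50), 10000000),
  inc_02 = seq(2000, 6950, by = 50)
)

hfc_winsorized <- winsorize_column(hfc, "inc_01")
winsor_title("inc_01")
```

Because inc_02 is unchanged in `hfc_winsorized`, a grouped boxplot would differ only in the inc_01 box, which is consistent with keeping a single set of boxplots.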
* Removed dynamic winsorization
* Added note to explain winsorization vs selected method
* Removed winsorized boxplots
Please test it and let me know the outcome!
One thing to bear in mind is that the rendering of graphs without outliers only works if the correct ID variable is set up in the Setup tab. This is a bit error-prone at the moment. Once I am done with issue #24, this should no longer be a problem, since the ID variable will be enforced globally.