Added histograms and boxplots for outlier variables #35

Merged
m-visintini merged 5 commits into main from 23-add-graph-to-visualize-outliers-checks on Oct 31, 2024

Conversation

m-visintini
Member

@m-visintini m-visintini commented Oct 22, 2024

Please test it and let me know the outcome!
One thing to bear in mind is that the graphs without outliers only render correctly if the right ID variable is selected in the Setup tab. This is a bit error-prone at the moment. Once I am done with issue #24, this should no longer be a problem, since the ID variable will be enforced globally.

@m-visintini m-visintini linked an issue Oct 22, 2024 that may be closed by this pull request
Member

@mariarrt94 mariarrt94 left a comment

I really like this feature, especially the boxplots. They are super useful for identifying group outliers, and I think they add a lot of value.

Observations

  • For example, when looking at the grouped income variable, I noticed an outlier in inc_01.
  • However, when comparing the histogram with and without outliers, both look the same. The axis in the "without outliers" view still shows values up to 10,000,000, which is the identified outlier.
    Screenshot 2024-10-24 100939

Suggestions

  • I am unsure whether we should remove these outliers altogether.
  • Winsorization might be a great alternative. It could be helpful to show:
    1. The histogram of inc_01.
    2. The histogram with winsorized inc_01 to see the impact on the distribution.
    3. Instead of filtered_hfc, we would create a dataset with the winsorized variable, perhaps at 95%. Let me know if you want to discuss this.
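
To make the suggestion above concrete, here is a minimal base R sketch of percentile winsorization. The helper name `winsorize_pct`, the symmetric 5%/95% cut-offs, and the income values are assumptions for illustration; they are not taken from the app's code.

```r
# Hypothetical helper: cap values outside the [1 - level, level] quantiles.
# NA values are left as NA; no rows are dropped.
winsorize_pct <- function(x, level = 0.95) {
  bounds <- quantile(x, probs = c(1 - level, level), na.rm = TRUE, names = FALSE)
  pmin(pmax(x, bounds[1]), bounds[2])
}

# Made-up income values with one extreme entry, for illustration only.
inc_01 <- c(seq(1000, 4900, by = 100), 10000000)
inc_01_w <- winsorize_pct(inc_01)

range(inc_01)    # the raw range still reaches the extreme value
range(inc_01_w)  # the winsorized range is capped at the chosen quantiles
```

Because the capped vector no longer contains the extreme value, a histogram drawn from `inc_01_w` would have an x-axis that ends near the upper cut-off rather than at 10,000,000, while the number of observations stays the same.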

This is a great addition to the tool—thanks for all the work on it!

@m-visintini
Member Author

good idea! Will fix.

@m-visintini
Member Author

I reviewed this.
I suspect the two histograms looked the same because of the ID variable selected in the Setup tab; when I tried it, I obtained different histograms. Nevertheless, winsorization is better practice, so I implemented it as suggested and am committing now.

Winsorization is applied to the outlier checks in lieu of filtering. Parameters are adjusted depending on the method and multiplier selected by the user in the Setup tab.
@@ -7,7 +7,7 @@
if(!require(pacman)) install.packages("pacman")

pacman::p_load(
- shiny, dplyr, tidyr, stringr, lubridate, purrr, ggplot2, janitor, data.table, DT, remotes, bsicons,
+ shiny, dplyr, tidyr, stringr, lubridate, purrr, ggplot2, janitor, data.table, DescTools, DT, remotes, bsicons,
Member Author

I actually ended up using a custom function, so perhaps we don't need DescTools.
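
For reference, a custom winsorization helper that needs no extra package could look roughly like the sketch below. It is not the function committed in this PR: the name `winsorize_bounds`, the method labels `"iqr"` and `"sd"`, and the default multiplier are assumptions, chosen only to illustrate bounds that depend on a method and a multiplier.

```r
# Hypothetical helper: derive lower/upper bounds from a method and multiplier,
# then cap values at those bounds instead of dropping rows.
winsorize_bounds <- function(x, method = c("iqr", "sd"), multiplier = 1.5) {
  method <- match.arg(method)
  if (method == "iqr") {
    # Bounds at multiplier * IQR beyond the first and third quartiles.
    q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
    spread <- q[2] - q[1]
    lower <- q[1] - multiplier * spread
    upper <- q[2] + multiplier * spread
  } else {
    # Bounds at multiplier * SD around the mean.
    centre <- mean(x, na.rm = TRUE)
    spread <- sd(x, na.rm = TRUE)
    lower <- centre - multiplier * spread
    upper <- centre + multiplier * spread
  }
  pmin(pmax(x, lower), upper)
}

# Made-up values with one extreme entry, for illustration only.
x <- c(seq(1000, 4900, by = 100), 10000000)
w_iqr <- winsorize_bounds(x, method = "iqr", multiplier = 1.5)
w_sd  <- winsorize_bounds(x, method = "sd", multiplier = 3)
```

With bounds tied to the method, the cap moves whenever the user changes the method or multiplier, which is the behaviour the later review comments ask to replace with a fixed level.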

- change title of histogram graph.
Member

@mariarrt94 mariarrt94 left a comment

Great work on this! @m-visintini I have a few comments:

  1. I changed the label from "without outliers" to "winsorized" to be more precise.
  2. I think we should include the level in the title (e.g., 95%). I left it as xx%.
  3. The winsorization should be applied at the 95% level regardless of the method chosen.
  4. For the boxplots, I believe it’s fine to keep only the "with outliers" version. Winsorization only changes the selected variable (inc_01), not the entire group (inc), so the boxes for inc_02 and inc_03 stay the same. The boxplots look clear to me, but I’m happy to discuss this if needed.

Once these changes are applied, this will be ready to merge!

* Removed dynamic winsorization
* Added note to explain winsorization vs selected method
* Removed winsorized boxplots
@m-visintini m-visintini merged commit 99a0603 into main Oct 31, 2024
@m-visintini m-visintini deleted the 23-add-graph-to-visualize-outliers-checks branch October 31, 2024 18:03
Development

Successfully merging this pull request may close these issues.

Add graph to visualize Outliers checks
2 participants