Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Winsorize based on the MAD #179

Merged
merged 7 commits into from
Jun 27, 2022
Merged

Winsorize based on the MAD #179

merged 7 commits into from
Jun 27, 2022

Conversation

rempsyc
Copy link
Member

@rempsyc rempsyc commented Jun 25, 2022

First attempt(!) at adding the possibility to winsorize based on the MAD, related to #177 & #49 & #47.

R/winsorize.R Outdated Show resolved Hide resolved
@mattansb
Copy link
Member

There are two methods now:

  1. Percentile based (robust=FALSE)
  2. Median ± MAD (robust=TRUE)

However, both method can be considered "robust".
Additionally, the classic Z-score (Mean ± SD) based method.

Maybe we can have

  • method = "percentile": x is between [threshold / 2, 1 - threshold /2]%
  • method = "zscore"
    • robust = FALSE: x is between Mean ± threshold * SD
    • robust = TRUE: x is between Median ± threshold * MAD

What do you think?

@rempsyc
Copy link
Member Author

rempsyc commented Jun 26, 2022

My first (inexperienced) intuition was to set method to a specific option (e.g., method = "percentile", method = "mean", and method = "median" or method = "mad"). In our stats class, we are taught to use the MAD and it wasn't immediately clear to me that the trick to get that was to set robust = TRUE in other functions, though I was able to figure it out from the documentation. I'm just thinking that many of my peers don't have the habit to check the documentation like I do.

That said, I think we need to do as you suggest to stay consistent with the other datawizard functions. Also, it makes a lot of sense when you know a bit of stats. So I changed it accordingly. In this case, the robust argument only applies when method = zscore since as you mention method = percentile is robust already.

Another thing. In order to not let your previous work on #47 go to waste, I incorporated them in my PR before adding my changes. However, checks were already failing on that, so necessarily checks are failing on my modified version as well. I'm not familiar with automatic tests (yet) unfortunately, but I had a look at tests/testthat/test-winsorization.R. From what I can understand, the failed checks seem to be due to an error with one of the tests (i.e., the third one). So actually I just tried updating the expected message in the regexp call (from 1 to 0.5 since that's the modification you made to the function earlier) and it seems to have fixed it! :O

added raw method

made the code easier to maintain by modularizing it

made doc more explicit about the methods

updated examples to visualize the effect

update NEWS
@mattansb
Copy link
Member

Okay, I've made some updates - I've made the code more modular (so now each method computes the thresholds and then those are treated).
I've also added a "raw" method as per #47.

library(datawizard)
hist(iris$Sepal.Length, main = "Original data")

hist(winsorize(iris$Sepal.Length, threshold = 0.2),
     xlim = c(4, 8), main = "Percentile Winz")

hist(winsorize(iris$Sepal.Length, threshold = 1.5, method = "zscore"),
     xlim = c(4, 8), main = "Mean+-SD Winz")

hist(winsorize(iris$Sepal.Length, threshold = 1.5, method = "zscore", robust = TRUE),
     xlim = c(4, 8), main = "Median+-MAD Winz")

hist(winsorize(iris$Sepal.Length, threshold = c(5, 7.5), method = "raw"),
     xlim = c(4, 8), main = "Raw Thresholds")

Created on 2022-06-26 by the reprex package (v2.0.1)


Note that as we currently have it set up, the percentile, the threshold argument defines the amount to windzorize from each tail. Is this intended? Desired? @IndrajeetPatil @DominiqueMakowski

@mattansb
Copy link
Member

Also, of course - @rempsyc great work!

@codecov-commenter
Copy link

codecov-commenter commented Jun 26, 2022

Codecov Report

Merging #179 (1f2c277) into master (e986a8d) will decrease coverage by 0.09%.
The diff coverage is 81.25%.

❗ Current head 1f2c277 differs from pull request most recent head 03f85bf. Consider uploading reports for the commit 03f85bf to get more accurate results

@@            Coverage Diff             @@
##           master     #179      +/-   ##
==========================================
- Coverage   83.52%   83.42%   -0.10%     
==========================================
  Files          52       52              
  Lines        3053     3071      +18     
==========================================
+ Hits         2550     2562      +12     
- Misses        503      509       +6     
Impacted Files Coverage Δ
R/winsorize.R 82.35% <81.25%> (-17.65%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e986a8d...03f85bf. Read the comment docs.

@rempsyc
Copy link
Member Author

rempsyc commented Jun 26, 2022

The only thing failing at this point is the pkgdown check from the standardize_data.Rmd vignette because tidyr is called but not part of the suggested packages, so I added it to Suggests in DESCRIPTION (plus made minor modifications to the documentation).

Great work too @mattansb!!

@mattansb
Copy link
Member

@IndrajeetPatil how should we avoid the tidyr issue?

@IndrajeetPatil
Copy link
Member

Great work, both of you! Thanks for that.

As for tidyr, that's strange. The builds are not failing in the default branch without it, so how can they fail in the PR?!

Which tidyr functions are we using in the vignette?

@rempsyc
Copy link
Member Author

rempsyc commented Jun 26, 2022

pkgdown test doesn't fail anymore after adding tydir to Suggests in DESCRIPTION (https://github.com/easystats/datawizard/actions/runs/2563988526). So is the issue solved, or was the idea not to add it there for some reason?

@IndrajeetPatil
Copy link
Member

Not sure. I am currently traveling and don't have a laptop on me. I will have a look later.

@mattansb Feel free to squash and merge whenever you think this is ready for a merge.

@rempsyc
Copy link
Member Author

rempsyc commented Jun 26, 2022

I think the only tidyr function from the standardize_data.Rmd vignette (lines 76:86) is tidyr::pivot_longer. Do we have a datawizard equivalent? YES! And it's used just a bit further down in the vignette: datawizard::data_to_long. So what I did is just remove tidyr from Suggests and simply replaced tidyr::pivot_longer with datawizard::data_to_long in the vignette.

@IndrajeetPatil
Copy link
Member

Awesome! Thanks. ☺️

R/winsorize.R Outdated
winsorize.data.frame <- function(data, threshold = 0.2, method = "percentile", robust = FALSE,
verbose = TRUE, ...) {
data <- lapply(data, winsorize, threshold = threshold, method = method, robust = robust, verbose = verbose)
as.data.frame(data)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest using data[] <- lapply... and remove the next line with as.data.frame().

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, but in that way it didn't return an output, so I had to add a line to return the dataframe anyway.

if (length(threshold) != 2L) {
if (isTRUE(verbose)) {
warning("threshold must be of length 2 for lower and upper bound. Did not winsorize data.", call. = FALSE)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest wrapping in insight::format_message().

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

if (threshold < 0 || threshold > 0.5) {
if (isTRUE(verbose)) {
warning("'threshold' for winsorization must be a scalar between 0 and 0.5. Did not winsorize data.", call. = FALSE)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, use warning(insight::format_message("..."), call.= FALSE).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

if (threshold <= 0) {
if (isTRUE(verbose)) {
warning("'threshold' for winsorization must be a scalar greater than 0. Did not winsorize data.", call. = FALSE)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -11,11 +11,11 @@ test_that("with missing values", {
test_that("winsorize: threshold must be between 0 and 1", {
expect_warning(
winsorize(sample(1:10, 5), threshold = -0.1),
regexp = "must be a scalar between 0 and 1"
regexp = "must be a scalar between 0 and 0.5"
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be that when insight::format_message() is used, a line break will by coincident just be inside this string, so matching does no longer work. Maybe just reduce the pattern to 0 and 0.5 or so?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually this has been solved (essentially, we had modified the warning message within the function but not within the test checks. Harmonizing them fixed it)

@IndrajeetPatil IndrajeetPatil merged commit 3d2992a into easystats:master Jun 27, 2022
@rempsyc
Copy link
Member Author

rempsyc commented Jun 30, 2022

@mattansb , after working on performance::check_outliers, I noticed that instead of using "zscore" along with "robust = TRUE", they simply use "zscore_robust".

The intention of going with the robust = TRUE argument was to stay consistent with datawizard::standardize. However, for the former, there is only two methods: robust (median MAD) or not robust (mean SD), so it kind of makes sense.

However, now that we have several methods with datawizard::winsorize and that we had some confusion regarding robust = TRUE (since the argument currently applies to only one method), I wonder if we should go the performance way and just use zscore_robust. Perhaps it would be less confusing? By losing one parameter, I think we gain simplicity. What's the procedure in this case (if you agree), open a new PR?

@bwiernik
Copy link
Contributor

bwiernik commented Jul 1, 2022

I say to keep robust and method as it is now

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants