Winsorize based on the MAD #179
There are two methods now:
However, both methods can be considered "robust". Maybe we can have
What do you think? |
My first (inexperienced) intuition was to set method to a specific option (e.g.,

That said, I think we need to do as you suggest to stay consistent with the other

Another thing. In order to not let your previous work on #47 go to waste, I incorporated them in my PR before adding my changes. However, checks were already failing on that, so necessarily checks are failing on my modified version as well. I'm not familiar with automatic tests (yet) unfortunately, but I had a look at |
added raw method
made the code easier to maintain by modularizing it
made doc more explicit about the methods
updated examples to visualize the effect
update NEWS
Okay, I've made some updates - I've made the code more modular (so now each method computes the thresholds and then those are treated).

library(datawizard)

hist(iris$Sepal.Length, main = "Original data")

hist(winsorize(iris$Sepal.Length, threshold = 0.2),
  xlim = c(4, 8), main = "Percentile Winz")

hist(winsorize(iris$Sepal.Length, threshold = 1.5, method = "zscore"),
  xlim = c(4, 8), main = "Mean+-SD Winz")

hist(winsorize(iris$Sepal.Length, threshold = 1.5, method = "zscore", robust = TRUE),
  xlim = c(4, 8), main = "Median+-MAD Winz")

hist(winsorize(iris$Sepal.Length, threshold = c(5, 7.5), method = "raw"),
  xlim = c(4, 8), main = "Raw Thresholds")

Created on 2022-06-26 by the reprex package (v2.0.1)

Note that as we currently have it set up, for the percentile method the threshold argument defines the amount to winsorize from each tail. Is this intended? Desired? @IndrajeetPatil @DominiqueMakowski |
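For readers following the thread, the "each method computes the thresholds and then those are treated" design can be sketched roughly as below. This is a base R illustration only, not the datawizard implementation: `sketch_thresholds()` and `sketch_winsorize()` are hypothetical names, and clamping values to the limits with `pmin()`/`pmax()` is an assumption about how the thresholds are applied.

```r
# Hypothetical helpers for illustration; not the datawizard internals.
sketch_thresholds <- function(x, threshold, method = "percentile", robust = FALSE) {
  switch(method,
    # percentile: 'threshold' is the proportion cut from EACH tail
    percentile = stats::quantile(x,
      probs = c(threshold, 1 - threshold),
      na.rm = TRUE, names = FALSE
    ),
    # zscore: centre +/- threshold * spread; the robust variant uses median and MAD
    zscore = {
      centre <- if (robust) stats::median(x, na.rm = TRUE) else mean(x, na.rm = TRUE)
      spread <- if (robust) stats::mad(x, na.rm = TRUE) else stats::sd(x, na.rm = TRUE)
      centre + c(-1, 1) * threshold * spread
    },
    # raw: 'threshold' already holds the lower and upper bound
    raw = threshold,
    stop("Unknown method.", call. = FALSE)
  )
}

sketch_winsorize <- function(x, threshold = 0.2, method = "percentile", robust = FALSE) {
  limits <- sketch_thresholds(x, threshold, method, robust)
  # assumption: values outside the limits are clamped to the limits
  pmin(pmax(x, limits[1]), limits[2])
}

x <- c(1:9, 100)
out_raw <- sketch_winsorize(x, threshold = c(2, 8), method = "raw")
out_sd <- sketch_winsorize(x, threshold = 1.5, method = "zscore")
out_mad <- sketch_winsorize(x, threshold = 1.5, method = "zscore", robust = TRUE)
```

With `threshold = 0.2` and the percentile method, this sketch limits values to the 20th and 80th percentiles, so up to 40% of the observations are affected in total. On the toy vector, the MAD-based upper limit is less influenced by the single extreme value than the SD-based one.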
Also, of course - @rempsyc great work! |
Codecov Report
@@ Coverage Diff @@
## master #179 +/- ##
==========================================
- Coverage 83.52% 83.42% -0.10%
==========================================
Files 52 52
Lines 3053 3071 +18
==========================================
+ Hits 2550 2562 +12
- Misses 503 509 +6
The only thing failing at this point is the Great work too @mattansb!! |
@IndrajeetPatil how should we avoid the |
Great work, both of you! Thanks for that. As for tidyr, that's strange. The builds are not failing in the default branch without it, so how can they fail in the PR?! Which tidyr functions are we using in the vignette? |
Not sure. I am currently traveling and don't have a laptop on me. I will have a look later. @mattansb Feel free to squash and merge whenever you think this is ready for a merge. |
…awizard::data_to_long` in vignette
I think the only |
Awesome! Thanks. |
R/winsorize.R
Outdated
winsorize.data.frame <- function(data, threshold = 0.2, method = "percentile", robust = FALSE,
                                 verbose = TRUE, ...) {
  data <- lapply(data, winsorize, threshold = threshold, method = method, robust = robust, verbose = verbose)
  as.data.frame(data)
I suggest using data[] <- lapply... and removing the next line with as.data.frame().
Done, but in that way it didn't return an output, so I had to add a line to return the dataframe anyway.
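A minimal sketch of the pattern discussed in this thread, assuming a hypothetical per-column helper `clamp_column()` in place of the real `winsorize()` method. Assigning into `data[]` keeps the data frame structure, but an assignment used as the last expression returns the assigned value invisibly rather than the updated data frame, which is presumably why the explicit final line was still needed.

```r
# Hypothetical per-column helper, standing in for the real winsorize() method.
clamp_column <- function(x, lower, upper) {
  pmin(pmax(x, lower), upper)
}

sketch_winsorize_df <- function(data, lower, upper) {
  # data[] <- ... replaces the columns but keeps class, names and attributes
  data[] <- lapply(data, clamp_column, lower = lower, upper = upper)
  # return the modified data frame explicitly
  data
}

out_df <- sketch_winsorize_df(iris[1:4], lower = 1, upper = 5)
```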
if (length(threshold) != 2L) {
  if (isTRUE(verbose)) {
    warning("threshold must be of length 2 for lower and upper bound. Did not winsorize data.", call. = FALSE)
  }
I suggest wrapping in insight::format_message().
Done
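The suggested change might look roughly like the following. This is a sketch that assumes the insight package is installed; `sketch_check_raw_threshold()` is a hypothetical wrapper used only to make the snippet self-contained, and the `FALSE`/`TRUE` return values are an illustrative convention, not taken from the PR.

```r
# Hypothetical wrapper around the check shown in the review snippet.
sketch_check_raw_threshold <- function(threshold, verbose = TRUE) {
  if (length(threshold) != 2L) {
    if (isTRUE(verbose)) {
      # insight::format_message() wraps long messages to the console width
      warning(insight::format_message(
        "threshold must be of length 2 for lower and upper bound. Did not winsorize data."
      ), call. = FALSE)
    }
    return(FALSE)
  }
  TRUE
}

ok <- sketch_check_raw_threshold(c(5, 7.5))
```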
if (threshold < 0 || threshold > 0.5) {
  if (isTRUE(verbose)) {
    warning("'threshold' for winsorization must be a scalar between 0 and 0.5. Did not winsorize data.", call. = FALSE)
  }
Same here, use warning(insight::format_message("..."), call. = FALSE).
Done
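As a side note on the 0 to 0.5 range checked above: because the percentile threshold is taken from each tail, a value above 0.5 would put the lower limit above the upper limit. A small base R sketch, with `is_valid_percentile_threshold()` as a hypothetical name:

```r
# Hypothetical validity check mirroring the condition in the snippet above.
is_valid_percentile_threshold <- function(threshold) {
  is.numeric(threshold) && length(threshold) == 1L && !is.na(threshold) &&
    threshold >= 0 && threshold <= 0.5
}

x <- c(1:9, 100)
# threshold = 0.2: lower limit at the 20th, upper limit at the 80th percentile
lims_ok <- stats::quantile(x, probs = c(0.2, 1 - 0.2), names = FALSE)
# threshold = 0.7: the "lower" limit (70th) ends up above the "upper" one (30th)
lims_bad <- stats::quantile(x, probs = c(0.7, 1 - 0.7), names = FALSE)
```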
if (threshold <= 0) {
  if (isTRUE(verbose)) {
    warning("'threshold' for winsorization must be a scalar greater than 0. Did not winsorize data.", call. = FALSE)
  }
See above.
Done
@@ -11,11 +11,11 @@ test_that("with missing values", {
 test_that("winsorize: threshold must be between 0 and 1", {
   expect_warning(
     winsorize(sample(1:10, 5), threshold = -0.1),
-    regexp = "must be a scalar between 0 and 1"
+    regexp = "must be a scalar between 0 and 0.5"
   )
It might be that when insight::format_message() is used, a line break will by coincidence fall inside this string, so matching no longer works. Maybe just reduce the pattern to 0 and 0.5 or so?
Actually this has been solved. Essentially, we had modified the warning message within the function but not within the test checks. Harmonizing them fixed it.
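The line-break concern raised above can be illustrated in base R. The wrapped message here is constructed by hand (the position of the break is an assumption chosen for illustration, not the actual output of insight::format_message()):

```r
long_pattern <- "must be a scalar between 0 and 0.5"
short_pattern <- "0 and 0.5"

# Hand-made wrapped message; the break position is chosen for illustration.
wrapped_msg <- paste0(
  "'threshold' for winsorization must be a scalar\n",
  "between 0 and 0.5. Did not winsorize data."
)

# The long pattern straddles the line break, so a literal match fails
long_matches <- grepl(long_pattern, wrapped_msg, fixed = TRUE)
# The shorter pattern happens to sit on one line, so it still matches
short_matches <- grepl(short_pattern, wrapped_msg, fixed = TRUE)
# Collapsing whitespace before matching makes the long pattern work again
normalized_matches <- grepl(long_pattern, gsub("\\s+", " ", wrapped_msg), fixed = TRUE)
```

A short pattern only helps as long as the break does not fall inside it, so collapsing whitespace before matching is the more defensive option in this sketch.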
…e(), data[] <- lapply...
@mattansb , after working on The intention of going with the However, now that we have several methods with |
I say we keep robust and method as they are now |
First attempt(!) at adding the possibility to winsorize based on the MAD, related to #177 & #49 & #47.