Winsorize based on the MAD #179

rempsyc · 2022-06-25T03:39:54Z

First attempt(!) at adding the possibility to winsorize based on the MAD, related to #177 & #49 & #47.

… based on the MAD

R/winsorize.R

mattansb · 2022-06-25T08:36:53Z

There are two methods now:

Percentile based (robust=FALSE)
Median ± MAD (robust=TRUE)

However, both method can be considered "robust".
Additionally, the classic Z-score (Mean ± SD) based method.

Maybe we can have

method = "percentile": x is between [threshold / 2, 1 - threshold /2]%
method = "zscore"
- robust = FALSE: x is between Mean ± threshold * SD
- robust = TRUE: x is between Median ± threshold * MAD

What do you think?

rempsyc · 2022-06-26T00:07:30Z

My first (inexperienced) intuition was to set method to a specific option (e.g., method = "percentile", method = "mean", and method = "median" or method = "mad"). In our stats class, we are taught to use the MAD and it wasn't immediately clear to me that the trick to get that was to set robust = TRUE in other functions, though I was able to figure it out from the documentation. I'm just thinking that many of my peers don't have the habit to check the documentation like I do.

That said, I think we need to do as you suggest to stay consistent with the other datawizard functions. Also, it makes a lot of sense when you know a bit of stats. So I changed it accordingly. In this case, the robust argument only applies when method = zscore since as you mention method = percentile is robust already.

Another thing. In order to not let your previous work on #47 go to waste, I incorporated them in my PR before adding my changes. However, checks were already failing on that, so necessarily checks are failing on my modified version as well. I'm not familiar with automatic tests (yet) unfortunately, but I had a look at tests/testthat/test-winsorization.R. From what I can understand, the failed checks seem to be due to an error with one of the tests (i.e., the third one). So actually I just tried updating the expected message in the regexp call (from 1 to 0.5 since that's the modification you made to the function earlier) and it seems to have fixed it! :O

added raw method made the code easier to maintain by modularizing it made doc more explicit about the methods updated examples to visualize the effect update NEWS

mattansb · 2022-06-26T08:18:59Z

Okay, I've made some updates - I've made the code more modular (so now each method computes the thresholds and then those are treated).
I've also added a "raw" method as per #47.

library(datawizard)
hist(iris$Sepal.Length, main = "Original data")

hist(winsorize(iris$Sepal.Length, threshold = 0.2),
     xlim = c(4, 8), main = "Percentile Winz")

hist(winsorize(iris$Sepal.Length, threshold = 1.5, method = "zscore"),
     xlim = c(4, 8), main = "Mean+-SD Winz")

hist(winsorize(iris$Sepal.Length, threshold = 1.5, method = "zscore", robust = TRUE),
     xlim = c(4, 8), main = "Median+-MAD Winz")

hist(winsorize(iris$Sepal.Length, threshold = c(5, 7.5), method = "raw"),
     xlim = c(4, 8), main = "Raw Thresholds")

^{Created on 2022-06-26 by the reprex package (v2.0.1)}

Note that as we currently have it set up, the percentile, the threshold argument defines the amount to windzorize from each tail. Is this intended? Desired? @IndrajeetPatil @DominiqueMakowski

mattansb · 2022-06-26T08:19:24Z

Also, of course - @rempsyc great work!

codecov-commenter · 2022-06-26T08:22:14Z

Codecov Report

Merging #179 (1f2c277) into master (e986a8d) will decrease coverage by 0.09%.
The diff coverage is 81.25%.

❗ Current head 1f2c277 differs from pull request most recent head 03f85bf. Consider uploading reports for the commit 03f85bf to get more accurate results

@@            Coverage Diff             @@
##           master     #179      +/-   ##
==========================================
- Coverage   83.52%   83.42%   -0.10%     
==========================================
  Files          52       52              
  Lines        3053     3071      +18     
==========================================
+ Hits         2550     2562      +12     
- Misses        503      509       +6

Impacted Files	Coverage Δ
R/winsorize.R	`82.35% <81.25%> (-17.65%)`	⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e986a8d...03f85bf. Read the comment docs.

rempsyc · 2022-06-26T11:23:09Z

The only thing failing at this point is the pkgdown check from the standardize_data.Rmd vignette because tidyr is called but not part of the suggested packages, so I added it to Suggests in DESCRIPTION (plus made minor modifications to the documentation).

Great work too @mattansb!!

mattansb · 2022-06-26T12:01:05Z

@IndrajeetPatil how should we avoid the tidyr issue?

IndrajeetPatil · 2022-06-26T12:09:34Z

Great work, both of you! Thanks for that.

As for tidyr, that's strange. The builds are not failing in the default branch without it, so how can they fail in the PR?!

Which tidyr functions are we using in the vignette?

rempsyc · 2022-06-26T12:12:48Z

pkgdown test doesn't fail anymore after adding tydir to Suggests in DESCRIPTION (https://github.com/easystats/datawizard/actions/runs/2563988526). So is the issue solved, or was the idea not to add it there for some reason?

IndrajeetPatil · 2022-06-26T12:14:36Z

Not sure. I am currently traveling and don't have a laptop on me. I will have a look later.

@mattansb Feel free to squash and merge whenever you think this is ready for a merge.

…awizard::data_to_long` in vignette

rempsyc · 2022-06-26T12:41:45Z

I think the only tidyr function from the standardize_data.Rmd vignette (lines 76:86) is tidyr::pivot_longer. Do we have a datawizard equivalent? YES! And it's used just a bit further down in the vignette: datawizard::data_to_long. So what I did is just remove tidyr from Suggests and simply replaced tidyr::pivot_longer with datawizard::data_to_long in the vignette.

IndrajeetPatil · 2022-06-26T12:43:36Z

Awesome! Thanks. ☺️

strengejacke · 2022-06-26T13:36:30Z

R/winsorize.R

+winsorize.data.frame <- function(data, threshold = 0.2, method = "percentile", robust = FALSE,
+                                 verbose = TRUE, ...) {
+  data <- lapply(data, winsorize, threshold = threshold, method = method, robust = robust, verbose = verbose)
+  as.data.frame(data)


I suggest using data[] <- lapply... and remove the next line with as.data.frame().

Done, but in that way it didn't return an output, so I had to add a line to return the dataframe anyway.

strengejacke · 2022-06-26T13:36:51Z

R/winsorize.R

+    if (length(threshold) != 2L) {
+      if (isTRUE(verbose)) {
+        warning("threshold must be of length 2 for lower and upper bound. Did not winsorize data.", call. = FALSE)
+      }


I suggest wrapping in insight::format_message().

strengejacke · 2022-06-26T13:37:24Z

R/winsorize.R

+    if (threshold < 0 || threshold > 0.5) {
+      if (isTRUE(verbose)) {
+        warning("'threshold' for winsorization must be a scalar between 0 and 0.5. Did not winsorize data.", call. = FALSE)
+      }


Same here, use warning(insight::format_message("..."), call.= FALSE).

strengejacke · 2022-06-26T13:37:35Z

R/winsorize.R

+    if (threshold <= 0) {
+      if (isTRUE(verbose)) {
+        warning("'threshold' for winsorization must be a scalar greater than 0. Did not winsorize data.", call. = FALSE)
+      }


strengejacke · 2022-06-26T13:38:45Z

tests/testthat/test-winsorization.R

@@ -11,11 +11,11 @@ test_that("with missing values", {
 test_that("winsorize: threshold must be between 0 and 1", {
  expect_warning(
    winsorize(sample(1:10, 5), threshold = -0.1),
-    regexp = "must be a scalar between 0 and 1"
+    regexp = "must be a scalar between 0 and 0.5"
  )


It might be that when insight::format_message() is used, a line break will by coincident just be inside this string, so matching does no longer work. Maybe just reduce the pattern to 0 and 0.5 or so?

Actually this has been solved (essentially, we had modified the warning message within the function but not within the test checks. Harmonizing them fixed it)

…e(), data[] <- lapply...

rempsyc · 2022-06-30T22:04:05Z

@mattansb , after working on performance::check_outliers, I noticed that instead of using "zscore" along with "robust = TRUE", they simply use "zscore_robust".

The intention of going with the robust = TRUE argument was to stay consistent with datawizard::standardize. However, for the former, there is only two methods: robust (median MAD) or not robust (mean SD), so it kind of makes sense.

However, now that we have several methods with datawizard::winsorize and that we had some confusion regarding robust = TRUE (since the argument currently applies to only one method), I wonder if we should go the performance way and just use zscore_robust. Perhaps it would be less confusing? By losing one parameter, I think we gain simplicity. What's the procedure in this case (if you agree), open a new PR?

bwiernik · 2022-07-01T13:48:35Z

I say to keep robust and method as it is now

rempsyc added 2 commits June 24, 2022 23:23

addresses easystats#177 & easystats#49 & easystats#47 for winsorizing…

3cb5a12

… based on the MAD

forgot to push updated documentation

807b8fe

IndrajeetPatil requested review from bwiernik and mattansb June 25, 2022 04:45

mattansb reviewed Jun 25, 2022

View reviewed changes

R/winsorize.R Outdated Show resolved Hide resolved

new argument "method", updated NEWS, resolved failed test, easystats#179

0a7cc4e

update winsorize.numeric

658b2b4

added raw method made the code easier to maintain by modularizing it made doc more explicit about the methods updated examples to visualize the effect update NEWS

minor modifications to docs

c626eca

removed tidyr from Suggests, replaced tidyr::pivot_longer with `dat…

d45ca3f

…awizard::data_to_long` in vignette

strengejacke reviewed Jun 26, 2022

View reviewed changes

added new tests for new winsorization methods, insight::format_messag…

03f85bf

…e(), data[] <- lapply...

strengejacke approved these changes Jun 26, 2022

View reviewed changes

IndrajeetPatil merged commit 3d2992a into easystats:master Jun 27, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Winsorize based on the MAD #179

Winsorize based on the MAD #179

rempsyc commented Jun 25, 2022

mattansb commented Jun 25, 2022

rempsyc commented Jun 26, 2022

mattansb commented Jun 26, 2022

mattansb commented Jun 26, 2022

codecov-commenter commented Jun 26, 2022 •

edited

Loading

rempsyc commented Jun 26, 2022 •

edited

Loading

mattansb commented Jun 26, 2022

IndrajeetPatil commented Jun 26, 2022

rempsyc commented Jun 26, 2022

IndrajeetPatil commented Jun 26, 2022

rempsyc commented Jun 26, 2022 •

edited

Loading

IndrajeetPatil commented Jun 26, 2022

strengejacke Jun 26, 2022

rempsyc Jun 26, 2022

strengejacke Jun 26, 2022

rempsyc Jun 26, 2022

strengejacke Jun 26, 2022

rempsyc Jun 26, 2022

strengejacke Jun 26, 2022

rempsyc Jun 26, 2022

strengejacke Jun 26, 2022

rempsyc Jun 26, 2022

rempsyc commented Jun 30, 2022

bwiernik commented Jul 1, 2022

Winsorize based on the MAD #179

Winsorize based on the MAD #179

Conversation

rempsyc commented Jun 25, 2022

mattansb commented Jun 25, 2022

rempsyc commented Jun 26, 2022

mattansb commented Jun 26, 2022

mattansb commented Jun 26, 2022

codecov-commenter commented Jun 26, 2022 • edited Loading

Codecov Report

rempsyc commented Jun 26, 2022 • edited Loading

mattansb commented Jun 26, 2022

IndrajeetPatil commented Jun 26, 2022

rempsyc commented Jun 26, 2022

IndrajeetPatil commented Jun 26, 2022

rempsyc commented Jun 26, 2022 • edited Loading

IndrajeetPatil commented Jun 26, 2022

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rempsyc commented Jun 30, 2022

bwiernik commented Jul 1, 2022

codecov-commenter commented Jun 26, 2022 •

edited

Loading

rempsyc commented Jun 26, 2022 •

edited

Loading

rempsyc commented Jun 26, 2022 •

edited

Loading