Inconsistent behavior for violin plot for zero variance vs near-zero variance #27

Open
stefaneng opened this issue Feb 1, 2023 · 5 comments

Comments

@stefaneng
Contributor

Violin plots behave completely differently when the variance is exactly zero than when it is near zero. In the zero-variance case, the kernel density extends far beyond the single value. In the near-zero case, the violin is confined to an extremely small region and appears as a line. It seems the desired behavior is for the zero-variance case to be drawn as a line as well.

library(lattice)

simple.data <- data.frame(
    values = rep(100, 5),
    values2 = rep(100, 5) + rnorm(5, mean = 0.0001, sd = 0.0001),
    variable = rep('a', 5)
    )

print(sd(simple.data$values))
#> [1] 0

print(sd(simple.data$values2))
#> [1] 7.982134e-05

bwplot(values ~ variable,
       data = simple.data,
       ylim = c(-20, 200),
       panel = function(...) {
         # Default kernel = 'gaussian'
         panel.violin(..., kernel = 'gaussian')
       })

bwplot(values2 ~ variable,
       data = simple.data,
       ylim = c(-20, 200),
       panel = function(...) {
         panel.violin(..., kernel = 'gaussian')
       })

Created on 2023-01-23 with reprex v2.0.2

@deepayan
Owner

deepayan commented Feb 8, 2023

This ultimately comes from

> range(density(simple.data$values)$x)
[1] -95.69051 295.69051
> range(density(simple.data$values2)$x)
[1]  99.99981 100.00048

It may be better to try and fix it there. I don't see any obvious "nice" solution, but I'll think about it some more.
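The two ranges above follow from how `density()` lays out its evaluation grid: by default it runs from `min(x) - cut * bw` to `max(x) + cut * bw`, with `cut = 3`. A minimal sketch of that relationship, assuming the default Gaussian kernel; the helper name `grid_range` and the seed are illustrative and not part of lattice or the original report:

```r
set.seed(42)  # arbitrary seed, only so the sketch is reproducible
constant <- rep(100, 5)
near_constant <- constant + rnorm(5, mean = 0.0001, sd = 0.0001)

# Hypothetical helper: compare the grid density() returns with the
# range implied by the selected bandwidth and the `cut` argument.
grid_range <- function(x, cut = 3) {
  d <- density(x, cut = cut)
  list(
    bw = d$bw,
    observed = range(d$x),
    expected = range(x) + c(-1, 1) * cut * d$bw
  )
}

zero_case <- grid_range(constant)
near_case <- grid_range(near_constant)

print(zero_case$bw)        # large bandwidth for constant data
print(zero_case$observed)  # grid extends far beyond 100
print(near_case$observed)  # grid stays within a tiny interval around 100
```

The width of the violin is therefore driven entirely by the bandwidth that gets selected, which is why the zero-variance and near-zero-variance cases look so different.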

@stefaneng
Contributor Author

@deepayan Thanks for the info on density.

A couple options: I am happy to implement one or the other or both if you like either suggestion.

  1. Check whether all values within each group are constant, and default to drawing a line in the no-variance case. The example below shows why this is useful when one of the groups has no variance:
library(lattice)

simple.data <- data.frame(
  # Zero variance (constant values) in group a
  # Normal data in group b
  values = c(rep(1, 5), rnorm(50, sd = 2)),
  variable = c(rep("a", 5), rep("b", 50))
  )

bwplot(values ~ variable,
  data = simple.data,
  ylim = c(-5, 10),
  main = 'Which has the zero variance?',
  panel = function(...) {
    panel.violin(...)
  })

  2. Allow vector values for the parameters passed to density, so the user can specify from, to, cut, etc. per group and fix problematic cases. This could also be useful to anyone who wants more control over each group's density options. In the current implementation, supplying vectors triggers an error that is silently caught, and a nonsense plot is generated: https://github.com/deepayan/lattice/blob/master/R/bwplot.R#L792-L805
# Not correct at all
# Error is caught silently and nonsense values returned
bwplot(values ~ variable,
  data = simple.data,
  ylim = c(-5, 10),
  main = 'Nonsense values returned',
  panel = function(...) {
    panel.violin(..., cut = c(0, 1))
  })

Created on 2023-02-15 with reprex v2.0.2

Both of these could be worth adding. Implementing only the vector-argument option won't change the default, but it will let users supply specific parameters for each group's density. Drawing a line for constant groups would, I think, be a good default, since it is unintuitive when a user draws violin plots for several categories and one of them has no variance.
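One way the vector-argument idea could look outside of lattice is sketched below. This is only an illustration: `density_by_group` is a hypothetical helper, not an existing lattice function, and the choice to recycle arguments by group position is an assumption of this sketch.

```r
# Hypothetical helper (not part of lattice): one density per group, with
# vector-valued arguments such as `cut`, `from`, or `to` recycled by position.
density_by_group <- function(values, groups, ...) {
  pieces <- split(values, groups)
  dargs <- list(...)
  out <- lapply(seq_along(pieces), function(i) {
    # Take the i-th element of every argument, recycling shorter vectors
    args_i <- lapply(dargs, function(a) a[[(i - 1L) %% length(a) + 1L]])
    do.call(stats::density, c(list(x = pieces[[i]]), args_i))
  })
  names(out) <- names(pieces)
  out
}

set.seed(1)  # arbitrary seed for reproducibility
values <- c(rnorm(20), rnorm(20, mean = 5))
groups <- rep(c("a", "b"), each = 20)

# cut = 0 for group "a", cut = 1 for group "b"
dens <- density_by_group(values, groups, cut = c(0, 1))
print(range(dens$a$x))
print(range(dens$b$x))
```

Because arguments are matched to groups by position, the result depends on which groups are present and in what order, so any real implementation would need a rule for empty groups.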

@deepayan
Owner

Yes, I mostly agree with your analysis.

The blanket try() call is definitely not ideal, but the fact that on failure we just draw a line at x[1] suggests that it was intended to catch the zero-variance case. This sort of works for your original problem if we add bw = "nrd", or indeed any bw.* function other than bw.nrd0().

The problem with bw.nrd0() is that it always tries to return a non-zero result, and there's simply no good way for that to work.
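A small sketch of the difference between the two selectors on constant data, using only functions from the stats package. For constant input, `bw.nrd0()` falls back to `abs(x[1])` as its scale, whereas `bw.nrd()` returns zero, which `density()` then rejects:

```r
x <- rep(100, 5)

# bw.nrd0() falls back to abs(x[1]) when both sd(x) and IQR(x) are zero
bw0 <- bw.nrd0(x)
print(bw0)
print(0.9 * abs(x[1]) * length(x)^(-0.2))  # same value, computed by hand

# bw.nrd() has no such fallback and returns zero
bw1 <- bw.nrd(x)
print(bw1)

# density() refuses a non-positive bandwidth, so this call fails
res <- try(density(x, bw = "nrd"), silent = TRUE)
print(inherits(res, "try-error"))
```

That failure is what the existing try() call ends up catching when bw = "nrd" is supplied, which is why the line at x[1] gets drawn in that case.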

So some version of your suggestion 1 should definitely be implemented. Probably just change my.density() to:

    my.density <- function(x) {
        if (sd(x) > 0)
            do.call(stats::density, c(list(x = x), darg))
        else
            list(x = rep(x[1], 3), y = c(0, 1, 0))
    }
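For trying the guard outside of panel.violin(), here is a stand-alone sketch. It is not the code that was committed: `safe_density` is a hypothetical name, `darg` is passed explicitly instead of being taken from the enclosing function, and the check uses `diff(range(x)) > 0` rather than `sd(x) > 0` so that a single observation (where `sd()` returns NA) also falls through to the line case.

```r
# Hypothetical stand-alone variant of the guarded density helper
safe_density <- function(x, darg = list()) {
  x <- x[is.finite(x)]  # drop NA / Inf before testing for spread
  if (length(x) > 1 && diff(range(x)) > 0) {
    do.call(stats::density, c(list(x = x), darg))
  } else {
    # Degenerate case: three points that draw as a line at x[1]
    list(x = rep(x[1], 3), y = c(0, 1, 0))
  }
}

flat <- safe_density(rep(100, 5))
single <- safe_density(42)
spread <- safe_density(c(1, 2, 3), darg = list(kernel = "gaussian"))

print(flat$x)
print(class(spread))
```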

I will have to think a bit more about the second suggestion, but please do send a patch if you come up with one easily. I am worried about situations where different panels might have a different set of non-empty levels for the grouping variable.

A similar problem will happen with densityplot(), but that will require a different solution.

@deepayan
Owner

@stefaneng I see that you have already submitted a PR for 2. Thanks, will take a look tomorrow.

@deepayan
Owner

deepayan commented Apr 4, 2023

Fixed by commit 130b7cd
