Inconsistent behavior for violin plot for zero variance vs near-zero variance #27

stefaneng · 2023-02-01T17:39:05Z

Violin plots behave completely differently when the variance is exactly zero versus when it is near zero. In the zero-variance case, the kernel density extends far beyond the data value. In the near-zero case, the violin is confined to an extremely small region and appears as a line. It seems the desired behavior is for the zero-variance case to be drawn as a line as well.

library(lattice)

simple.data <- data.frame(
    values = rep(100, 5),
    values2 = rep(100, 5) + rnorm(5, mean = 0.0001, sd = 0.0001),
    variable = rep('a', 5)
    )

print(sd(simple.data$values))
#> [1] 0

print(sd(simple.data$values2))
#> [1] 7.982134e-05

bwplot(values ~ variable,
       data = simple.data,
       ylim = c(-20, 200),
       panel = function(...) {
         # Default kernel = 'gaussian'
         panel.violin(..., kernel = 'gaussian')
       })

bwplot(values2 ~ variable,
       data = simple.data,
       ylim = c(-20, 200),
       panel = function(...) {
         panel.violin(..., kernel = 'gaussian')
       })

Created on 2023-01-23 with reprex v2.0.2

deepayan · 2023-02-08T18:55:51Z

This ultimately comes from

> range(density(simple.data$values)$x)
[1] -95.69051 295.69051
> range(density(simple.data$values2)$x)
[1]  99.99981 100.00048

It may be better to try and fix it there. I don't see any obvious "nice" solution, but I'll think about it some more.
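
The gap between the two ranges can be traced to the selected bandwidth. The following is a minimal sketch, not part of the original comment: it uses a deterministic near-constant vector `x1` as a stand-in for `values2` (an assumption, since `values2` is random) and relies on the default `cut = 3` of `density()`, under which the evaluation grid spans three bandwidths on either side of the data.

```r
x0 <- rep(100, 5)           # zero variance, as in `values`
x1 <- 100 + (1:5) * 1e-5    # near-zero variance (hypothetical stand-in for `values2`)

d0 <- density(x0)
d1 <- density(x1)

# The automatically selected bandwidths differ by many orders of magnitude
d0$bw
d1$bw

# With the default cut = 3, the grid runs from min(x) - 3 * bw to max(x) + 3 * bw
all.equal(range(d0$x), c(min(x0) - 3 * d0$bw, max(x0) + 3 * d0$bw))
all.equal(range(d1$x), c(min(x1) - 3 * d1$bw, max(x1) + 3 * d1$bw))
```

Both comparisons should hold, so the width of each violin is driven entirely by the bandwidth that `density()` picks.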

stefaneng · 2023-02-16T00:12:33Z

@deepayan Thanks for the info on density.

A couple of options; I am happy to implement either one or both if you like either suggestion.

1. Check whether all values are constant within each group and default to drawing a line in the zero-variance case. The problematic example below shows why this is useful when one of the groups has no variance:

library(lattice)

simple.data <- data.frame(
  # Zero variance (constant values) in group a
  # Normal data in group b
  values = c(rep(1, 5), rnorm(50, sd = 2)),
  variable = c(rep("a", 5), rep("b", 50))
  )

bwplot(values ~ variable,
  data = simple.data,
  ylim = c(-5, 10),
  main = 'Which has the zero variance?',
  panel = function(...) {
    panel.violin(...)
  })

2. Allow vector values for the arguments passed to density(), so the user can specify from, to, cut, etc. per group to fix problematic cases. This could be useful to anyone who wants more control over each group's density options. As currently implemented, supplying vectors raises an error that is silently caught, and a nonsense plot is generated: https://github.com/deepayan/lattice/blob/master/R/bwplot.R#L792-L805

# Not correct at all
# Error is caught silently and nonsense values returned
bwplot(values ~ variable,
  data = simple.data,
  ylim = c(-5, 10),
  main = 'Nonsense values returned',
  panel = function(...) {
    panel.violin(..., cut = c(0, 1))
  })

Created on 2023-02-15 with reprex v2.0.2

Both of these could be worth adding. Vector support for the density arguments (option 2) on its own won't change the default, but it will allow the user to supply specific parameters for each group's density. Drawing a line for zero variance (option 1) would, I think, be a good default, as the current result is unintuitive when a user plots violins for multiple categories and one of them has no variance.
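
To make the vector-argument option concrete, here is a minimal sketch of how per-group density arguments might be selected. The helper name `pick.darg` and the recycling rule are assumptions made for illustration; this is not the code in lattice or in any pull request.

```r
# Hypothetical helper: take the i-th element of every density argument,
# recycling shorter arguments. Assumes every argument has length >= 1.
pick.darg <- function(darg, i) {
    lapply(darg, function(a) a[[(i - 1) %% length(a) + 1]])
}

# One `cut` value per group, one shared kernel
darg <- list(cut = c(0, 1), kernel = "gaussian")

groups <- list(a = c(1, 2, 4, 7, 11), b = c(3, 3.5, 5, 8, 9))

# Each group gets its own density() call with its own arguments
dens <- lapply(seq_along(groups), function(i)
    do.call(stats::density, c(list(x = groups[[i]]), pick.darg(darg, i))))

# With cut = 0 the first density is confined to the range of group a
range(dens[[1]]$x)
range(dens[[2]]$x)
```

Resolving the arguments before calling `density()` means each call sees only scalars, which avoids the silently caught error described above.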

deepayan · 2023-02-16T19:08:16Z

Yes, I mostly agree with your analysis.

The blanket try() call is definitely not ideal, but the fact that on failure we just draw a line at x[1] suggests that it was intended to catch the zero-variance case. This sort of works for your original problem if we add bw = "nrd", or indeed any bw.* function other than bw.nrd0().

The problem with bw.nrd0() is that it always tries to return a non-zero result, and there's simply no good way for that to work.
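
A short sketch of this difference, using only functions from the stats package; the vector `x` mirrors the `values` column from the original report:

```r
x <- rep(100, 5)

# bw.nrd0() falls back to abs(x[1]) when both sd(x) and IQR(x) are zero,
# so the bandwidth scales with the magnitude of the data: 0.9 * 100 * 5^(-0.2)
bw.nrd0(x)

# bw.nrd() has no such fallback and returns 0 for constant data ...
bw.nrd(x)

# ... which density() rejects as a non-positive bandwidth, so a try()
# wrapper around the call ends up in its failure branch
res <- try(density(x, bw = "nrd"), silent = TRUE)
inherits(res, "try-error")
```

This is why the zero-variance violin spans roughly 100 ± 196 under the default selector, yet collapses to a line once a selector that can return zero is used.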

So, definitely some version of your suggestion 1 should be implemented. Probably just change the try() call to

    my.density <- function(x) {
        if (sd(x) > 0)
            do.call(stats::density, c(list(x = x), darg))
        else
            list(x = rep(x[1], 3), y = c(0, 1, 0))
    }
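
For reference, the proposed function can be exercised outside the panel function with a stand-in for `darg`. This is a sketch under stated assumptions: the `darg` list below is hypothetical (inside the panel function it is collected from the arguments), and the extra `length(x) > 1` guard is an addition for illustration, since `sd()` returns `NA` for a single observation and `if (NA > 0)` would fail.

```r
# Hypothetical stand-in for the density arguments collected by the panel function
darg <- list(bw = "nrd0", kernel = "gaussian")

my.density <- function(x) {
    # Assumption: a single observation is also treated as degenerate
    if (length(x) > 1 && sd(x) > 0)
        do.call(stats::density, c(list(x = x), darg))
    else
        list(x = rep(x[1], 3), y = c(0, 1, 0))
}

flat <- my.density(rep(100, 5))          # degenerate: three points at x[1]
str(flat)

spread <- my.density(c(1, 2, 4, 7, 11))  # ordinary kernel density estimate
class(spread)
```

The degenerate branch returns a zero-width shape at `x[1]`, matching the existing fallback of drawing a line at `x[1]`.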

I will have to think a bit more about the second suggestion, but please do send a patch if you come up with one easily. I am worried about situations where different panels might have a different set of non-empty levels for the grouping variable.

A similar problem will happen with densityplot(), but that will require a different solution.

deepayan · 2023-02-16T19:10:26Z

@stefaneng I see that you have already submitted a PR for 2. Thanks, will take a look tomorrow.

deepayan · 2023-04-04T11:37:18Z

Fixed by commit 130b7cd

stefaneng mentioned this issue Feb 15, 2023

Inconsistent behavior in violinplot when near-zero variance vs zero variance uclahs-cds/package-BoutrosLab-plotting-general#98

Open

stefaneng mentioned this issue Feb 16, 2023

Add vector argument support to panel.violinplot #28

Merged

deepayan added a commit that referenced this issue Apr 4, 2023

better handling of degenerate data in panel.violin (#27)

130b7cd
