Inconsistent behavior in violinplot when near-zero variance vs zero variance #98

Open
stefaneng opened this issue Feb 15, 2023 · 0 comments

Description

Converting this internal discussion into an issue for better visibility: https://github.com/uclahs-cds/project-cohort-PrecisionHealth-WXPH-000086-150KPH/pull/12

Lattice issue here: deepayan/lattice#27

Violin plots look different for near-zero variance than for zero variance.
This happens because lattice takes the limits from range(density(data)$x), which returns a very wide interval when the data have no variance:

> range(density(simple.data$values)$x)
[1] -95.69051 295.69051
> range(density(simple.data$values2)$x)
[1]  99.99981 100.00048

Originally posted in deepayan/lattice#27 (comment)
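
The wide interval in the zero-variance case can be traced to the default bandwidth selector. Below is a minimal base-R sketch, assuming density() is called with its defaults (bw = "nrd0", cut = 3), which is consistent with the console output above. The names x, bw and manual.range are illustrative only.

```r
# Zero-variance sample, matching `values` in the example data
x <- rep(100, 5)

# bw.nrd0() uses min(sd, IQR / 1.34) as its spread estimate; when that is
# zero it falls back to sd(x), then abs(x[1]), then 1. Here sd and IQR are
# both zero, so the spread estimate becomes abs(x[1]) = 100.
bw <- bw.nrd0(x)
print(bw)

# density() evaluates on [min(x) - cut * bw, max(x) + cut * bw], cut = 3
manual.range <- c(min(x) - 3 * bw, max(x) + 3 * bw)
print(manual.range)
print(range(density(x)$x))
```

With five values the fallback gives a bandwidth of roughly 65, which reproduces the interval of about -95.69 to 295.69 shown above.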

Example

library(BoutrosLab.plotting.general)

set.seed(13)
simple.data <- data.frame(
    values = rep(100, 5),
    values2 = rep(100, 5) + rnorm(5, mean = 0.0001, sd = 0.0001),
    variable = rep('a', 5)
    )

print(sd(simple.data$values))
#> [1] 0

print(sd(simple.data$values2))
#> [1] 8.052756e-05

create.violinplot(
    formula = values ~ variable,
    data = simple.data,
    ylim = c(-20, 200),
    main = 'Zero variance'
    )

create.violinplot(
    formula = values2 ~ variable,
    data = simple.data,
    ylim = c(-20, 200),
    main = 'Near zero variance'
    )

The most dangerous case is when we have multiple categories and we plot without knowing that we have zero variance in one of the categories.

library(BoutrosLab.plotting.general)

set.seed(13)
simple.data <- data.frame(
    values = c(rep(5, 5), rnorm(25, sd = 3)),
    variable = c(rep('a', 5), rep('b', 25))
    )

create.violinplot(
    formula = values ~ variable,
    data = simple.data,
    ylim = c(-10, 15)
    )

From the plot alone it is not clear that a has zero variance, which could lead to incorrect interpretation if published. The zero variance would be clearer if a single line were drawn, representing a point mass with probability one.
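
Until the plot itself makes a point mass visible, one way to avoid the misreading is to check each category for zero variance before plotting. Below is a minimal sketch in base R that reuses the data from the example above. The helper is.constant and the variable zero.variance.groups are hypothetical names, not part of BoutrosLab.plotting.general.

```r
set.seed(13)
simple.data <- data.frame(
    values = c(rep(5, 5), rnorm(25, sd = 3)),
    variable = c(rep('a', 5), rep('b', 25))
    )

# Hypothetical helper: TRUE when every value in the group is identical.
# Comparing min and max avoids rounding noise in sd().
is.constant <- function(v) {
    max(v) == min(v)
    }

# Apply the check to each category
group.constant <- vapply(
    split(simple.data$values, simple.data$variable),
    is.constant,
    logical(1)
    )

zero.variance.groups <- names(group.constant)[group.constant]

if (length(zero.variance.groups) > 0) {
    message(
        'Zero variance in: ',
        paste(zero.variance.groups, collapse = ', ')
        )
    }
```

This only flags the affected categories. How to draw them (for example as a single horizontal line) is left to the plotting code.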

Created on 2023-02-15 with reprex v2.0.2
