
(Generalized) Linear Mixed Model Tutorial #1518

Closed
rokapre opened this issue Jan 9, 2021 · 28 comments

@rokapre

rokapre commented Jan 9, 2021

Hello,

I am relatively new to Turing.jl and to Bayesian modeling in general. I was wondering how I could go about implementing an LMM/GLMM with a random intercept/slope; I couldn't find a tutorial for this in the documentation here: https://turing.ml/dev/tutorials/

I know there is some Stan stuff out there, but I have never used that either, so it's been a bit difficult to translate. Turing.jl is actually the first time I am trying a probabilistic programming language. The syntax I have seen so far is really nice, but I found nothing for multilevel models. Since multilevel models are a big use case for Bayesian methods, it would be nice to have a tutorial on them. Then users could refer to that for their use cases.

Thanks!

@luiarthur
Contributor

luiarthur commented Jan 10, 2021

Thanks for the inquiry.

Linear mixed models in a Bayesian setting are simply multi-level models. In a Frequentist setting, an LMM with a random intercept requires specifying a distribution not only for the error terms, but for the random intercepts as well. In addition, other model parameters, e.g. a global intercept, are assumed to be fixed effects. The implication is that an LMM, which usually has both fixed and random effects, requires special considerations for estimating the effects; whereas in a model with only fixed effects, you can estimate parameters in the "usual" way using, for example, optimization algorithms like gradient descent. (Or, in the case of linear regression, the best linear unbiased estimator is available in closed form.)
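
Concretely, a random-intercept model for observation i in group j can be written as

y_ij = intercept + x_ij' β + u_j + ε_ij,   with u_j ~ Normal(0, s²) and ε_ij ~ Normal(0, σ²),

and in the Bayesian version the intercept, β, s, and σ all receive priors as well.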

In a Bayesian setting, you can specify a multi-level model with intercepts for each group like this (in Turing):

using Turing
using Distributions
import Random

@model multilevel_with_random_intercept(y, X, z) = begin
  # number of unique groups 
  num_random_intercepts = length(unique(z))

  # number of predictors
  num_predictors = size(X, 2) 

  ### NOTE: Carefully chosen priors should be used for a particular application. ###

  # Prior for standard deviation for errors.
  sigma ~ LogNormal()

  # Prior for coefficients for predictors.
  beta ~ filldist(Normal(), num_predictors)

  # Prior for intercept.
  intercept ~ Normal()

  # Prior for variance of random intercepts. Usually requires thoughtful specification.
  s2 ~ InverseGamma(3, 2)
  s = sqrt(s2)

  # Prior for random intercepts.
  random_intercepts ~ filldist(Normal(0, s), num_random_intercepts)
  
  # likelihood.
  y .~ Normal.(intercept .+ X * beta + random_intercepts[z], sigma)
end

# Generate data.
Random.seed!(0)
N = 200
X = randn(N, 3)
z = [fill(1, 50); fill(2, 50); fill(3, 50); fill(4, 50)]
beta = [-2, 0, 2]
intercept = 1
random_intercepts = [-1, 1, -1, 1]
sigma = .1
y = intercept .+ X * beta + random_intercepts[z] + randn(N) * sigma

# Sample via NUTS.
chain = sample(multilevel_with_random_intercept(y, X, z), NUTS(), 2000)

# Print the posterior mean of model parameters.
mean(chain)

# Mean                             
#             parameters      mean 
#                 Symbol   Float64 
#                                  
#                beta[1]   -1.9875 
#                beta[2]   -0.0015 
#                beta[3]    1.9902 
#              intercept    0.7918 
#   random_intercepts[1]   -0.7694 
#   random_intercepts[2]    1.2085 
#   random_intercepts[3]   -0.7999 
#   random_intercepts[4]    1.2024 
#                     s2    1.1104 
#                  sigma    0.0997 
#                                  

Note that above, the priors usually require at least some degree of consideration and should not be too vague. Also note that in a Bayesian model, all parameters are random. (i.e. The global intercept would not be termed a "fixed effect".)

You can do something similar for a model with slopes for each group ("random slopes"). Let me know if you need more clarification.

A GLMM would replace the Gaussian likelihood with something else (e.g. Bernoulli, Binomial, Poisson, etc.) for when the response does not have support on the entire real line.
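
For instance, a Poisson version with a log link might look something like this (an untested sketch mirroring the model above; the priors are placeholders and should be chosen for the application):

using Turing
using Distributions

@model poisson_multilevel(y, X, z) = begin
  # number of unique groups and predictors
  num_groups = length(unique(z))
  num_predictors = size(X, 2)

  beta ~ filldist(Normal(), num_predictors)
  intercept ~ Normal()

  # Prior for variance of random intercepts. Usually requires thoughtful specification.
  s2 ~ InverseGamma(3, 2)
  s = sqrt(s2)
  random_intercepts ~ filldist(Normal(0, s), num_groups)

  # Log link: exponentiate the linear predictor to get the Poisson rate.
  lambda = exp.(intercept .+ X * beta + random_intercepts[z])
  y .~ Poisson.(lambda)
end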

@rokapre
Author

rokapre commented Jan 11, 2021

Thanks! This helps a lot. I am having a bit of trouble extending it to the random-slope case, mainly since it's hard to debug dimensions. I modified the above by adding this to the end, but I get a DimensionMismatch error:

  #Prior for var of random slope 
  s2_slope ~ InverseGamma(3,2)
  s_slope = sqrt(s2_slope)

  #Prior for random slopes 

  random_slopes ~ filldist(Normal(0,s_slope),num_random_intercepts)
  
  # likelihood.
  y .~ Normal.(intercept .+ (X * (beta .+ random_slopes[z])) + random_intercepts[z], sigma)
end

Units-wise, I need to add it to beta before the left-multiplication by X, but I guess the issue is that beta is p x 1, so I need random_slopes[z] to be p x 1 as well. That means the original random_slopes should be a vector of vectors, I think, but I don't know how to write that.
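
Would something like this (untested) be the right direction, making random_slopes a matrix with one column of slope deviations per group?

  # Prior for random slopes: a (num_predictors x num_groups) matrix,
  # where column j holds the slope deviations for group j.
  random_slopes ~ filldist(Normal(0, s_slope), num_predictors, num_random_intercepts)

  # likelihood: each observation picks up its group's slope deviations.
  for i in eachindex(y)
    y[i] ~ Normal(intercept + sum(X[i, :] .* (beta .+ random_slopes[:, z[i]])) + random_intercepts[z[i]], sigma)
  end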

@luiarthur
Contributor

luiarthur commented Jan 11, 2021

Let me write this model a different way.

y = X β + Z u + ϵ

where

  • y and ϵ are vectors of length N (the number of observations), with y being the response and ϵ the errors.
  • X is a (known) design matrix of dimensions N x K, with K being the number of predictors (including a global intercept if desired).
  • β is a vector of coefficients of length K.
  • Z is a (known) design matrix for the "random effects" (including slopes and intercepts), of dimension N x J.
  • u is a vector of "random effects" (slopes and intercepts) of length J.

If you want a random-intercept-only model, then Z would be a matrix of ones and zeros. If there are only 3 groups and N=12, with the first 4 observations being in the first group, the next 4 being in the second group, and the last 4 being in the third group, then Z would be

1 0 0
1 0 0
1 0 0
1 0 0
0 1 0
0 1 0
0 1 0
0 1 0
0 0 1
0 0 1
0 0 1
0 0 1

If you want random slopes and intercepts, and you only have one covariate (so K=2, with first column being 1's and second column being the values of the predictors) then Z could be

1 0 0 x1,1 0 0
1 0 0 x1,2 0 0
1 0 0 x1,3 0 0
1 0 0 x1,4 0 0
0 1 0 0 x2,1 0
0 1 0 0 x2,2 0
0 1 0 0 x2,3 0
0 1 0 0 x2,4 0
0 0 1 0 0 x3,1
0 0 1 0 0 x3,2
0 0 1 0 0 x3,3
0 0 1 0 0 x3,4

where xi,j is the j-th measurement of the predictor for group i. If you have more covariates, you just add more columns to Z. (i.e. add another three columns to Z for another predictor.) The predictors that appear in Z don't have to be the same ones that appear in X.

Note that if you include a global intercept, X should contain (exactly) one column of ones.

Under this form, you can avoid the index (z) used in the previous model, and just supply X, y, and Z as data.

I haven't tested the following code, but something like this should work:

@model function multilevel_model(y, X, Z)
  # number of predictors
  num_predictors_X = size(X, 2)  # including intercept (column of 1s) if desired
  num_predictors_Z = size(Z, 2)

  ### NOTE: Carefully chosen priors should be used for the particular application. ###

  # Prior for standard deviation for errors.
  sigma ~ LogNormal()

  # Prior for coefficients for predictors. The supplied prior can be a little vague.
  beta ~ filldist(Normal(0, 10), num_predictors_X)

  # Prior for variance of group effects. Usually requires thoughtful specification.
  s2 ~ InverseGamma(3, 2)
  s = sqrt(s2)

  # Prior for group effects.
  u ~ filldist(Normal(0, s), num_predictors_Z)
  
  # likelihood.
  y .~ Normal.(X * beta + Z * u, sigma)
end
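
Also untested, but constructing Z and fitting the model for the toy setup above (3 groups, one predictor, random intercepts and slopes) could look something like this; the numbers are made up for illustration:

using Turing
import Random

Random.seed!(0)

# Toy data: 3 groups with 4 observations each, one predictor plus a global intercept.
n_groups, n_per_group = 3, 4
N = n_groups * n_per_group
z = repeat(1:n_groups, inner=n_per_group)  # group index of each observation
x = randn(N)

X = [ones(N) x]  # fixed-effects design: global intercept + slope

# Random-effects design: one intercept column and one slope column per group.
Z = zeros(N, 2 * n_groups)
for i in 1:N
  Z[i, z[i]] = 1.0               # random-intercept column for this observation's group
  Z[i, n_groups + z[i]] = x[i]   # random-slope column for this observation's group
end

# Simulate a response and sample.
y = X * [1.0, -2.0] + Z * [0.5, -0.5, 0.3, 0.2, -0.2, 0.1] + 0.1 * randn(N)
chain = sample(multilevel_model(y, X, Z), NUTS(), 2000)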

@rokapre
Author

rokapre commented Jan 11, 2021

Thanks so much! I'm sure this will also help a lot of others looking to do Bayesian Multilevel Models, and the syntax here looks intuitive. This matrix representation helps since it is easier to generalize. I just checked on my data and it works (which is just a simple 1D case of Y vs X but 4 groups for the REs).

Just one more thing: in my case, I want the fixed-effects intercept and slope to have different prior values. I tried this:

beta .~ Normal.([0.5,100],[0.2,20])

but it didn't seem to work. However, when I did:

beta0 ~ Normal(0.5,0.2)
beta1 ~ Normal(100,20)
beta = vcat(beta0,beta1)

It worked. So I wonder, is there a way to broadcast the Normal prior here all in one go when you want a different prior for each coefficient? Or do you have to specify them manually like this, even if they come from the same distribution family?

@luiarthur
Contributor

Just one more thing is in my case, I want the fixed effects intercept and slope to have different prior values. I tried this:
beta .~ Normal.([0.5,100],[0.2,20])
but it didn't seem to work.

Try this:

beta ~ arraydist(Normal.([0.5, 100], [0.2, 20]))

The documentation on Turing is constantly changing, so I'm going to quote a bit from the current doc:

Avoiding loops can be done using filldist(dist, N) and arraydist(dists). filldist(dist, N) creates a multivariate distribution that is composed of N identical and independent copies of the univariate distribution dist if dist is univariate, or it creates a matrix-variate distribution composed of N identical and independent copies of the multivariate distribution dist if dist is multivariate. filldist(dist, N, M) can also be used to create a matrix-variate distribution from a univariate distribution dist. arraydist(dists) is similar to filldist but it takes an array of distributions dists as input. Writing a custom distribution with a custom adjoint is another option to avoid loops.

From: https://turing.ml/dev/docs/using-turing/performancetips
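
As a small illustration of the difference (either line would go inside a model):

# filldist: three coefficients, all with the same Normal(0, 10) prior
beta ~ filldist(Normal(0, 10), 3)

# arraydist: two coefficients with different priors
beta ~ arraydist([Normal(0.5, 0.2), Normal(100, 20)])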

@rokapre
Author

rokapre commented Jan 12, 2021

Thanks, using arraydist seems to work. I'm experimenting with something a bit more advanced now. I noticed that in the above model the random effects all have the same variance (and presumably the random intercept and slope are independent of each other). What I mean is that, for example, for a frequentist LMM fit with R's lme4, VarCorr() gives you the variance of the RE intercept, the variance of the RE slope, and their covariance.

How do I get those variance components here? I assume it would involve a prior on the covariance matrix. I tried:

I = 1/40 .* Matrix(LinearAlgebra.I(num_predictors_Z))
  Σ ~ Wishart(num_predictors_Z+5,I)
  # Prior for group effects.
  #u ~ filldist(Normal(0, s), num_predictors_Z)
  u ~ MultivariateNormal(fill(0,num_predictors_Z),Σ)
  # likelihood.
  y .~ Normal.(X * beta + Z * u, sigma)

Basically I am trying to use a Wishart as the prior for the random-effect covariance matrix (since I think it generates positive semidefinite matrices?) with a 1/40 * identity scale (I got the 1/40 from the 1/n here: https://en.wikipedia.org/wiki/Wishart_distribution#Use_in_Bayesian_statistics). The covariance matrix should have num_predictors_Z dimensions, one for each of the random intercept and slope coefficients, I think?

I get the error PosDefException: matrix is not Hermitian. Cholesky factorization failed

But I thought the Wishart distribution generates positive definite matrices, so this shouldn't be an issue? Am I doing something wrong? How can I get the variance components?

@cpfiffer
Member

Can you try enforcing symmetry with

using LinearAlgebra
MultivariateNormal(fill(0,num_predictors_Z), Symmetric(Σ))

@rokapre
Author

rokapre commented Jan 12, 2021

I get the same error. It seems to be occurring (based on the Atom IDE red highlighting) in the step before that, where Σ is defined. So I'm not sure whether it's numerical issues (which Symmetric() should theoretically prevent, to my knowledge) or something else with the syntax.

@luiarthur
Contributor

luiarthur commented Jan 13, 2021

I'm not very experienced with multilevel / mixed effects models. But for mixed effects models, specifying a covariance for the u vector can be tricky and people tend to put structure into it when possible, as opposed to using a completely unstructured covariance. See this for example:

https://support.sas.com/resources/papers/proceedings/proceedings/sugi30/198-30.pdf

My guess would be that adding small values (e.g. 1e-4) to the diagonal might help with the positive definiteness in this case.
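
Something along these lines (untested) is what I mean; note that your snippet binds the 1/40 * identity matrix to the name I, which shadows LinearAlgebra.I, so it is worth renaming that variable:

using LinearAlgebra

# Nudge the sampled covariance toward positive definiteness before using it.
Σ_pd = Symmetric(Σ + 1e-4 * I)
u ~ MvNormal(zeros(num_predictors_Z), Σ_pd)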

If that doesn't work, could you provide a minimal example to reproduce the errors?

@luiarthur
Contributor

I'll have to think about the Wishart prior specifications.

@jfb-h

jfb-h commented Jan 13, 2021

The Stan documentation is a good resource on multilevel models. They argue in favour of an LKJ prior over the Wishart. Using the LKJ doesn't solve the numerical precision problem, though.
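
A rough (untested) sketch of an LKJ-based prior in Turing, separating per-component scales from the correlation matrix, might look like:

using LinearAlgebra

# Scales and an LKJ-distributed correlation matrix for the random effects.
τ ~ filldist(Truncated(Cauchy(0, 2), 0, Inf), num_predictors_Z)
Ω ~ LKJ(num_predictors_Z, 2.0)
Σ = Symmetric(Diagonal(τ) * Ω * Diagonal(τ))

u ~ MvNormal(zeros(num_predictors_Z), Σ)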

@itsdfish
Contributor

itsdfish commented Jan 14, 2021

Thanks for providing examples of multilevel models. I have been trying to scale up your example to a more common amount of data. Unfortunately, it is running very slowly and the progress meter is not displaying an estimated time. It's not clear to me whether memoization is applicable in this case. Do you have any recommendations?

using Turing, ReverseDiff
using Distributions
import Random

Turing.setadbackend(:reversediff)

@model multilevel_with_random_intercept(y, X, z) = begin
  # number of unique groups 
  num_random_intercepts = length(unique(z))

  # number of predictors
  num_predictors = size(X, 2) 

  ### NOTE: Carefully chosen priors should be used for a particular application. ###

  # Prior for standard deviation for errors.
  sigma ~ LogNormal()

  # Prior for coefficients for predictors.
  beta ~ filldist(Normal(), num_predictors)

  # Prior for intercept.
  intercept ~ Normal()

  # Prior for variance of random intercepts. Usually requires thoughtful specification.
  s2 ~ InverseGamma(3, 2)
  s = sqrt(s2)

  # Prior for random intercepts.
  random_intercepts ~ filldist(Normal(0, s), num_random_intercepts)
  
  # likelihood.
  y .~ Normal.(intercept .+ X * beta + random_intercepts[z], sigma)
end

# Generate data.
Random.seed!(0)
N = 30
n = 50
X = randn(N*n, 3)
z = repeat(1:N, inner=n)
beta = [-2, 0, 2]
intercept = 1
random_intercepts = rand(Normal(0,1), N)
sigma = .1
y = intercept .+ X * beta + random_intercepts[z] + randn(N*n) * sigma

# Sample via NUTS.
chain = sample(multilevel_with_random_intercept(y, X, z), NUTS(), 2000, progress=true)

# Print the posterior mean of model parameters.
mean(chain)

@luiarthur
Contributor

My guess is that NUTS is using a large tree depth.

Could you try replacing NUTS() with NUTS(200, 0.8; max_depth=5), i.e. 200 adaptation steps, a target acceptance ratio of 0.8, and a smaller maximum tree depth? This will probably make the sampler run faster, but I cannot speak to the quality of the samples in advance.

@itsdfish
Contributor

itsdfish commented Feb 2, 2021

Thanks for getting back to me. I added NUTS(200, .8; max_depth=5) to the code above. It was much faster, but the effective sample size was around 6 for the random intercepts. It was low in the previous version as well. In addition, I tried increasing the number of adaptation steps to 1000, but the results were similar. If you don't have any more ideas, I might do a comparison with Stan and submit a new issue.

@luiarthur
Contributor

Try a combination of gradually increasing max_depth and lowering the target acceptance ratio. For the target acceptance ratio, I wouldn't go below 0.65. For max_depth, the default is 10, IIRC; that's why you got the speed-up. The defaults might be different for Stan.

@storopoli
Member

The hyperprior for the intercept SDs is way too low:

# Prior for variance of random intercepts. Usually requires thoughtful specification.
  s2 ~ InverseGamma(3, 2)
  s = sqrt(s2)

try this:

s ~ Truncated(Cauchy(0, 2), 0, Inf)
random_intercepts ~ filldist(Normal(0, s), num_random_intercepts)

and NUTS(num_warmup, 0.65), i.e. NUTS with an acceptance ratio of 0.65 (in my tests I didn't need to touch max_depth, and I set num_warmup = 1_000).

@storopoli
Member

My go-to varying intercept model looks like this:

# Model
@model varying_intercept(X, idx, y) = begin
    n_gr = length(unique(idx))
    predictors = size(X, 2)

    # priors
    μ ~ Normal(mean(y), 2.5 * std(y))       # population-level intercept
    σ ~ Exponential(1 / std(y))             # residual SD
    # Coefficients Student-t(ν = 3)
    β ~ filldist(TDist(3), size(X, 2))
    # Prior for variance of random intercepts. Usually requires thoughtful specification.
    σᵢ ~ Truncated(Cauchy(0, 2), 0, Inf)
    # s = sqrt(s2)
    μᵢ ~ filldist(Normal(0, σᵢ), n_gr)      # group-level intercepts

    # likelihood
    ŷ = μ .+ X * β .+ μᵢ[idx]
    y ~ MvNormal(ŷ, σ)
end
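
I typically sample it with something like:

chn = sample(varying_intercept(X, idx, y), NUTS(1_000, 0.65), MCMCThreads(), 2_000, 4)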

@devmotion
Member

Usually it should be more efficient to fix n_gr by making it an argument of the model (possibly with default value length(unique(idx))). Currently, it is recomputed every time you sample from the model or compute (the gradient of) the log pdf. Alternatively, maybe you could use an array type such as CategoricalArray that already encodes the (number of) levels.

@storopoli
Member

Are CategoricalArrays supported by Turing? I am just trying to use one in a Bernoulli model:

using Turing, RDatasets, DataFrames
# We need a logistic function, which is provided by StatsFuns.
using StatsFuns:logistic

default = RDatasets.dataset("ISLR", "Default")

X = Matrix(select(default, [:Student, :Balance, :Income]))
y = default[:, :Default]

# original formula: default ~ student + balance + income, family

@model logreg(X, y) = begin
	d = size(X, 2)
	μ ~ Normal(0, 2.5 * std(y))
	β ~ filldist(TDist(3), d)
	v = logistic.(μ .+ X * β)
	y .~ Bernoulli.(v)
end

model = logreg(X, y)

chn = sample(model, NUTS(1_000, 0.65), MCMCThreads(), 2_000, 4)

The problem is that :Student and :Default are CategoricalArrays and I get an error

ERROR: LoadError: TaskFailedException:
MethodError: no method matching /(::CategoricalValue{String,UInt8}, ::Int64)

@devmotion
Member

y is an array of strings but Bernoulli only supports Reals. This is different from the example above with idx: idx is not an observation of a probability distribution. However, you have to ensure that the elements of idx can be used for indexing, i.e., are Ints.
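
For example, something like this (untested, and assuming the levels are "Yes"/"No") before constructing the model inputs:

using DataFrames

# Convert the categorical response and predictor to integers.
default.DefaultNum = Int.(default.Default .== "Yes")
default.StudentNum = Int.(default.Student .== "Yes")

y = default.DefaultNum
X = Matrix(select(default, [:StudentNum, :Balance, :Income]))

# For a grouping variable, CategoricalArrays.levelcode.(col) gives Ints usable for indexing.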

@storopoli
Member

storopoli commented Feb 23, 2021

@devmotion you are right! Thanks!

Going from 26s to 25s (1s improvement but that's something)

the model now is:

# Model
@model varying_intercept(X, idx, y; n_gr = length(unique(idx))) = begin
    predictors = size(X, 2)

    # priors
    μ ~ Normal(mean(y), 2.5 * std(y))       # population-level intercept
    σ ~ Exponential(1 / std(y))             # residual SD
    # Coefficients Student-t(ν = 3)
    β ~ filldist(TDist(3), predictors)
    # Prior for variance of random intercepts. Usually requires thoughtful specification.
    σᵢ ~ Truncated(Cauchy(0, 2), 0, Inf)
    # s = sqrt(s2)
    μᵢ ~ filldist(Normal(0, σᵢ), n_gr)      # group-level intercepts

    # likelihood
    ŷ = μ .+ X * β .+ μᵢ[idx]
    y ~ MvNormal(ŷ, σ)
end

Or even more efficient:

# Model
@model varying_intercept(X, idx, y; n_gr = length(unique(idx)), predictors = size(X, 2)) = begin
    # priors
    μ ~ Normal(mean(y), 2.5 * std(y))       # population-level intercept
    σ ~ Exponential(1 / std(y))             # residual SD
    # Coefficients Student-t(ν = 3)
    β ~ filldist(TDist(3), predictors)
    # Prior for variance of random intercepts. Usually requires thoughtful specification.
    σᵢ ~ Truncated(Cauchy(0, 2), 0, Inf)
    # s = sqrt(s2)
    μᵢ ~ filldist(Normal(0, σᵢ), n_gr)      # group-level intercepts

    # likelihood
    ŷ = μ .+ X * β .+ μᵢ[idx]
    y ~ MvNormal(ŷ, σ)
end

@storopoli
Member

storopoli commented Feb 23, 2021

Also, while we are at it, could you guys help me out? I'm trying to implement a multivariate varying-slope model that uses a Cholesky decomposition, but I keep getting bad ESS and Rhat:

using Turing, RDatasets, DataFrames
using Random: seed!
using Statistics: mean, std, cor
using LinearAlgebra: cholesky

seed!(1)

mtcars = RDatasets.dataset("datasets", "mtcars")

# Data prep
y = mtcars[:, :MPG]
idx = mtcars[:, :Cyl] # vector of group indices
idx = map(idx) do i
    i == 4 ? 1 :
    i == 6 ? 2 :
    i == 8 ? 3 : missing
end
X = Matrix(select(mtcars, [:HP, :WT])) # the model matrix

# Model
@model varying_intercept_multi(X, idx, y; n_gr=length(unique(idx)), predictors=size(X, 2), 
                               L=cholesky(cor(X)).L, mus=transpose(mean(X, dims=1))) = begin
    # priors
    μ ~ Normal(mean(y), 2.5 * std(y))                         # population-level intercept
    # Prior for variance of random intercepts. Usually requires thoughtful specification.
    σᵢ ~ Truncated(Cauchy(0, 2), 0, Inf)
    μᵢ ~ filldist(Normal(0, σᵢ), n_gr)                        # group-level intercepts
    # Non-Centered Parameterization
    β_raw ~ filldist(Normal(), predictors)                    # NCP hyperprior for slopes
    σⱼ ~ filldist(Truncated(Cauchy(0, 5), 0, Inf), predictors) # SD hyperprior for slopes
    α ~ filldist(Normal(), predictors)                        # mean hyperprior for slopes
    β = mus + σⱼ .* (L * α)
    σ ~ Exponential(1 / std(y))                               # residual SD

    # likelihood
    ŷ = vec(μ .+ X * β .+ μᵢ[idx])
    y ~ MvNormal(ŷ, σ)

    # generated quantities
    return β
end

@devmotion
Member

mus is not defined, is it? BTW be careful with the default arguments - their values won't be updated when X changes.

@storopoli
Member

storopoli commented Feb 23, 2021

Sorry, my bad. Yes, it is defined; it was a mistake while copying and pasting. mu = transpose(mean(X, dims=1)) is now mus = transpose(mean(X, dims=1)).

@sjwild

sjwild commented Feb 23, 2021

If it helps, I have some code in this issue that you can adapt.

I'm trying to work out which AD backends are better in which circumstances. Using reverse-mode AD can save a lot of time with bigger models. Using the Hsb82 dataset, for example, Zygote (when it worked) let me run the model in 15 minutes or so. With ForwardDiff, it took something like 6 hours.
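
For reference, switching the backend (with the relevant package loaded) is just:

using Turing, ReverseDiff
Turing.setadbackend(:reversediff)  # the default is :forwarddiff; Zygote is :zygote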

@storopoli
Member

By default, Turing runs on ForwardDiff?

@cpfiffer
Member

Yeah, that's correct.

@yebai
Member

yebai closed this as completed Dec 16, 2021