Don't tune DEMetropolis by default #3743

Merged: 2 commits into pymc-devs:master from the demcmc-lambda-tuning branch on Dec 21, 2019

Conversation

michaelosthege (Member)

Why

Until now, our implementation tunes scaling, which is largely pointless: as soon as the population spreads much wider than the scaled proposal distribution, the scaling no longer makes a real difference. Only in situations where n_chains < n_dim+1 should tuning scaling have a real impact, but that setup is not recommended anyway, and since #3719 we emit a warning about it.

As I mentioned in #3720, Nelson et al. (2013), section 4.1.2, describe a procedure for tuning lambda instead. For non-Normal densities, where the 2.38/sqrt(2*ndim) rule of thumb is not necessarily the best setting, this sounds like a great idea. However, tuning lambda can lead to swing-in effects if convergence to the typical set takes up a significant part of the tuning phase.
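
To illustrate where the two quantities enter, here is a rough sketch of a differential-evolution Metropolis proposal; the function and variable names are illustrative, not the exact PyMC3 implementation:

import numpy as np

def de_propose(x, x_r1, x_r2, lamb=None, scaling=0.001, rng=np.random):
    """Sketch of a DE-MC proposal (illustrative only)."""
    ndim = x.size
    if lamb is None:
        # the 2.38/sqrt(2*ndim) rule of thumb mentioned above
        lamb = 2.38 / np.sqrt(2 * ndim)
    # small jitter whose width is controlled by `scaling`
    epsilon = rng.normal(scale=scaling, size=x.shape)
    # once the population spread (x_r1 - x_r2) is much wider than epsilon,
    # tuning `scaling` has almost no effect on the proposal
    return x + lamb * (x_r1 - x_r2) + epsilon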

I compared all three settings (None, scaling, lambda) on a 50-dimensional MvNormal, but there's no clear winner.

As a compromise, I propose tuning neither scaling nor lambda by default, while still allowing the user to tune either of them if they wish.

Changes

+ tune argument is now one of None, scaling, or lambda (see the usage sketch below)
+ support for tuning lambda (closes pymc-devs#3720)
+ added a test that checks the validation of the tune setting
+ both scaling and lambda are recorded in the sampler stats
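
A minimal usage sketch of the new argument (the model here is a stand-in and the tuning/draw lengths are arbitrary):

import pymc3 as pm

with pm.Model():
    x = pm.Normal("x", mu=0.0, sigma=1.0, shape=10)
    # new default: tune neither scaling nor lambda
    step = pm.DEMetropolis(tune=None)
    # or opt in to tuning one of them instead:
    # step = pm.DEMetropolis(tune="scaling")
    # step = pm.DEMetropolis(tune="lambda")
    trace = pm.sample(tune=2000, draws=5000, step=step, chains=10, cores=1)
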
michaelosthege marked this pull request as ready for review on December 19, 2019.
codecov bot commented on Dec 19, 2019

Codecov Report

Merging #3743 into master will decrease coverage by 0.32%.
The diff coverage is 77.77%.


@@            Coverage Diff             @@
##           master    #3743      +/-   ##
==========================================
- Coverage   90.41%   90.08%   -0.33%     
==========================================
  Files         133      133              
  Lines       20344    20361      +17     
==========================================
- Hits        18394    18343      -51     
- Misses       1950     2018      +68
Impacted Files Coverage Δ
pymc3/tests/test_step.py 90.98% <100%> (-9.02%) ⬇️
pymc3/step_methods/metropolis.py 86.06% <33.33%> (-0.73%) ⬇️
pymc3/tests/models.py 70.24% <0%> (-15.71%) ⬇️
pymc3/step_methods/hmc/base_hmc.py 93.45% <0%> (-1.87%) ⬇️
pymc3/tests/test_distributions.py 96.42% <0%> (-1.73%) ⬇️
pymc3/tests/test_transforms.py 98.52% <0%> (-1.48%) ⬇️
pymc3/tests/test_mixture.py 98.95% <0%> (-0.7%) ⬇️
pymc3/distributions/continuous.py 80.45% <0%> (-0.41%) ⬇️

aloctavodia (Member)

Just out of curiosity, have you tried testing whether the three tune settings make sense for a multimodal posterior (like the two-Gaussians example for SMC) and for a hierarchical model?

michaelosthege (Member, Author) commented on Dec 20, 2019

I had to modify the two-Gaussians example a bit, but I ran all three settings 5 times each:

import numpy as np
import pandas
import arviz
import pymc3 as pm
import theano.tensor as tt

n = 4

mu1 = np.ones(n) * (1. / 2)
mu2 = -mu1

stdev = 0.3
sigma = np.power(stdev, 2) * np.eye(n)
isigma = np.linalg.inv(sigma)
dsigma = np.linalg.det(sigma)

w1 = 0.1
w2 = (1 - w1)

# log-density of a mixture of two Gaussians with weights w1 and w2
def two_gaussians(x):
    log_like1 = - 0.5 * n * tt.log(2 * np.pi) \
                - 0.5 * tt.log(dsigma) \
                - 0.5 * (x - mu1).T.dot(isigma).dot(x - mu1)
    log_like2 = - 0.5 * n * tt.log(2 * np.pi) \
                - 0.5 * tt.log(dsigma) \
                - 0.5 * (x - mu2).T.dot(isigma).dot(x - mu2)
    return tt.log(w1 * tt.exp(log_like1) + w2 * tt.exp(log_like2))


with pm.Model() as model:
    X = pm.Uniform('X',
                   shape=n,
                   lower=-2. * np.ones_like(mu1),
                   upper=2. * np.ones_like(mu1),
                   testval=-1. * np.ones_like(mu1))
    llk = pm.Potential('llk', two_gaussians(X))

df_many = pandas.DataFrame(columns=['setting', 'r', 'ess']).set_index(['setting', 'r'])
with model:
    for setting in [None, 'scaling', 'lambda']:
        for r in range(5):
            n_tune = 8000
            n_draws = 8000
            trace = pm.sample(
                tune=n_tune, draws=n_draws, step=pm.DEMetropolis(tune=setting),
                chains=10, cores=1, discard_tuned_samples=False)
            # ESS computed on the post-tuning draws only
            ess = arviz.ess(trace[-n_draws:])
            df_many.at[(str(setting), r), 'ess'] = float(np.mean(ess.X).values)
df_many.head()

The effective sample sizes:

setting run ess
None 0 2289.04
None 1 2549.26
None 2 2707.09
None 3 2112.21
None 4 2354.95
scaling 0 2585.26
scaling 1 2570.08
scaling 2 2261.88
scaling 3 2685.28
scaling 4 2372.64
lambda 0 1948.31
lambda 1 1688.59
lambda 2 2077.51
lambda 3 1731.80
lambda 4 1768.45

Summarized:

setting   mean     std. dev.   std. err.
None      2402.51  231.14      84.40
scaling   2495.03  172.70      63.06
lambda    1842.93  164.20      59.95

Error bars indicate standard errors:
[figure "summarized": mean ESS per tune setting, with standard-error bars]
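
The summary table above can be reproduced from df_many roughly like this (a sketch; the exact aggregation code isn't shown in the thread):

summary = (
    df_many["ess"].astype(float)
    .groupby(level="setting")
    .agg(["mean", "std", "count"])
)
# standard error of the mean over the 5 runs per setting
summary["stderr"] = summary["std"] / summary["count"] ** 0.5
print(summary)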

With 20000+5000 iterations, here is also the progression of the sampler stats (the accepted stat is shown as a rolling mean with a window of 500):

[figure "3x3": progression of the sampler stats for the three tune settings]
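
For reference, the rolling acceptance curve in such a figure can be computed from the recorded sampler stats along these lines (a sketch; the exact plotting code is not part of this PR):

import pandas
import matplotlib.pyplot as plt

# `trace` comes from pm.sample(..., step=pm.DEMetropolis(...), discard_tuned_samples=False)
accepted = trace.get_sampler_stats("accepted", chains=[0])  # accept flags of one chain
rolling = pandas.Series(accepted.astype(float)).rolling(window=500).mean()

plt.plot(rolling)
plt.xlabel("iteration")
plt.ylabel("acceptance rate (rolling mean, window=500)")
plt.show()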

tl;dr: tune=None looks to be a reasonable default.

aloctavodia (Member)

LGTM!

junpenglao (Member) left a comment

LGTM. I am wondering whether it would make sense / be possible to do both?

(A review comment on pymc3/step_methods/metropolis.py was resolved.)
michaelosthege merged commit 1b49e5e into pymc-devs:master on Dec 21, 2019.
michaelosthege deleted the demcmc-lambda-tuning branch on August 7, 2021.
Development

Successfully merging this pull request may close these issues.

DEMetropolis: tune lambda instead of epsilon (#3720)