
create_rng with chain=0 appears to return biased first draw #3167

Closed
cescalara opened this issue Feb 23, 2023 · 28 comments · Fixed by #3168

@cescalara

Summary:

When using poisson_rng inside the transformed data block, the results do not follow the expected Poisson distribution. The issue only seems to appear for rate values lambda < 10, e.g. poisson_rng(9.9). It only appears inside the transformed data block; using poisson_rng inside the generated quantities block is fine.

Description:

If I compile and run the Stan code shown below as a simulation (i.e. fixed_param=True) via cmdstanpy, I get very different output depending on the block in which the function is called. When calling poisson_rng in the transformed data block, the output doesn't make sense. I also include the Python code I use and a plot demonstrating this.

Stan code: poisson.stan

data {
  real Nex;
}
transformed data {
  int N_td;
  N_td = poisson_rng(Nex);
}
generated quantities {
  int N_out_td;
  int N_gq;
  N_out_td = N_td;
  N_gq = poisson_rng(Nex);
}


Python code:

import numpy as np
from scipy import stats
from matplotlib import pyplot as plt
from cmdstanpy import CmdStanModel

poisson_model = "poisson.stan"
model = CmdStanModel(stan_file=poisson_model)

data = {"Nex": 9.9}
output_N_td = []
output_N_gq = []

for _ in range(100):
    seed = np.random.randint(1, 10000)
    sim = model.sample(data=data, iter_sampling=1, chains=1, fixed_param=True, seed=seed)
    N_td = sim.stan_variable("N_out_td")[0]
    N_gq = sim.stan_variable("N_gq")[0]
    output_N_td.append(N_td)
    output_N_gq.append(N_gq)

fig, ax = plt.subplots()
max_x = 30
x = np.arange(max_x)
bins = np.linspace(0, max_x, max_x+1)-0.5
ax.hist(output_N_td, bins=bins, density=True, label="Output from transformed data", alpha=0.5)
ax.hist(output_N_gq, bins=bins, density=True, label="Output from generated quantities", alpha=0.5)
ax.scatter(x, stats.poisson(data["Nex"]).pmf(x), color="k", label="Poisson PMF")
ax.set_title("Histograms of 100 samples, Nex=9.9")
ax.legend()
fig.savefig("figures/stan_poisson_db.png")

Figure:
[Figure: histograms of 100 samples with Nex = 9.9 — output from transformed data vs. output from generated quantities vs. the Poisson PMF]

Additional Information:

I often use poisson_rng in the transformed data block to implement a simulation of mock data from a Poisson process.

Current Version:

INSTALLED VERSIONS
---------------------
python: 3.9.12 (main, Jun  1 2022, 11:38:51) 
[GCC 7.5.0]
python-bits: 64
OS: Linux
OS-release: 5.4.0-135-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
cmdstan_folder: /home/iwsatlas1/capel/.cmdstan/cmdstan-2.31.0
cmdstan: (2, 31)
cmdstanpy: 1.1.0
pandas: 1.4.3
xarray: 2022.6.0
tqdm: 4.64.0
numpy: 1.21.5

I'm curious whether others are able to reproduce this, or whether I am missing something. In case this is the wrong place to post the issue, I can also move it elsewhere. Thanks a lot!

@WardBrian
Member

Very odd. I can confirm I can reproduce the behavior you see.

Investigating now

@WardBrian
Member

Hm, inserting an extra rng call in the transformed data block changes the behavior:

data {
  real Nex;
}
transformed data {
  real ignored = uniform_rng(0,1);

  int N_td;

  N_td = poisson_rng(Nex);
}
generated quantities {
  int N_out_td;
  int N_gq;

  N_out_td = N_td;
  N_gq = poisson_rng(Nex);
}

I suspect the bug you've found is not in cmdstanpy, but somewhere deeper in Stan. I will transfer this issue when I figure out where it should go!

@WardBrian
Member

WardBrian commented Feb 23, 2023

Looking at the code now, it seems we

  1. use a different RNG (with the same seed) inside the model constructor as everywhere else
  2. do not supply the chain ID to advance the RNG in the constructor, so all chains would have the same RNG in the transformed data block (we just pass chain id 0).

Both of those seem like potential issues, but neither really explain whatever is going on here. (Edit: 2 is intentional: https://discourse.mc-stan.org/t/transformed-data-rng-varies-by-chain/569)

I've also confirmed that setting the RNG to any other chain (by editing poisson.hpp) fixes the problem, and setting id=0 breaks the RNG for generated quantities as well.

Very odd!

@WardBrian
Member

Oh also, it has nothing to do with poisson - here's the same observation using uniform_rng(0,1):

[Figure: the same comparison as above, using uniform_rng(0,1) instead of poisson_rng]

@cescalara
Author

cescalara commented Feb 23, 2023

Thanks for looking into this! Happy to help out if I can. I thought it was particularly weird that setting lambda>10 seemed to fix things...

@WardBrian
Member

It seems like this issue may be even below Stan, but it's possible we can fix it ourselves.

Here's the equivalent of what we're doing in pure C++ with Boost: https://godbolt.org/z/h98Gbdf9j

If you change out seeds, you'll see that the first value printed is always very high, but the following ones look normal.

Very odd! We're still looking into this but we will definitely make sure something is changed for the next version of Stan! For now, inserting a dummy real = uniform_rng(0,1) call seems to fix subsequent things.

@WardBrian WardBrian transferred this issue from stan-dev/cmdstanpy Feb 23, 2023
@WardBrian
Member

I've transferred this to the Stan repository because this is most likely where it will be fixed, by editing https://github.com/stan-dev/stan/blob/develop/src/stan/services/util/create_rng.hpp
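The scheme create_rng.hpp implements, as described in this thread, can be sketched like this (a toy Python LCG, not Stan's actual C++ code; ToyLCG and the stride value are illustrative): every chain gets the same seed, then discards chain_id * stride draws so the chains use disjoint parts of a single long stream.

```python
class ToyLCG:
    """Minimal linear congruential generator (minstd parameters)."""
    M = 2**31 - 1
    A = 48271

    def __init__(self, seed):
        self.state = seed % self.M or 1

    def next(self):
        self.state = (self.A * self.state) % self.M
        return self.state

    def discard(self, n):
        for _ in range(n):
            self.next()

def create_rng(seed, chain_id, stride=1000):
    # toy stride; the thread says Stan's real stride is on the order of 2^50
    rng = ToyLCG(seed)
    rng.discard(chain_id * stride)  # note: chain_id == 0 discards NOTHING
    return rng
```

Because chain 0 discards nothing, it is the only chain that consumes the raw, freshly seeded state — which is exactly where the biased first draw shows up.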

@WardBrian WardBrian changed the title Unexpected behaviour when using poisson_rng in transformed data create_rng with chain=0 appears to return biased first draw Feb 23, 2023
@ahartikainen

If you change out seeds, you'll see that the first value printed is always very high, but the following ones look normal.

In Stan or in C++?

@WardBrian
Member

Just in C++. The godbolt link above shows what I'm describing using only the boost parts we use in Stan.

@ahartikainen

Just in C++. The godbolt link above shows what I'm describing using only the boost parts we use in Stan.

Are you sure? I did not see anything peculiar?

@WardBrian
Member

WardBrian commented Feb 23, 2023

If you vary the seed, you'll see things like:

Seed: 233

Output:

12
2
2

The first number printed is usually >10 regardless of seed, which is quite unlikely for poisson(3). Furthermore, if I change the code so that there is always an rng.discard(1) call, this behavior vanishes.

@WardBrian
Member

Hmm, the behavior is even weirder: it also seems to depend on the seed being small.

The original python code used a random number between 1 and 10,000 as a seed. If you make this much bigger (I used seed = np.random.randint(10000, 10000000)) then the graphs start looking correct.
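One plausible mechanism for the small-seed sensitivity, sketched with a plain minstd-style LCG (boost::ecuyer1988 combines two such generators and its bias goes in the other direction, so this is an analogy, not the exact Boost behavior): with no warm-up, the first output is an almost-linear function of the seed, so small seeds can only reach a narrow slice of the output range.

```python
# With no discard, the LCG's first output is A * seed mod M. For small seeds
# the product never wraps past M, so the first normalized draw is confined
# to a narrow band near zero.
M, A = 2**31 - 1, 48271  # minstd parameters, used as a stand-in

def first_uniform(seed):
    """First normalized draw, no warm-up."""
    return (A * seed % M) / M

def first_uniform_after_discard(seed):
    """Same, but with one draw discarded first (the workaround discussed)."""
    state = A * seed % M
    return (A * state % M) / M

# every seed in 1..9999 gives a first draw below ~0.23:
no_warmup = [first_uniform(s) for s in range(1, 10000)]
# a single discarded draw restores full coverage of [0, 1):
warmup = [first_uniform_after_discard(s) for s in range(1, 10000)]
```

This matches the observation that a single rng.discard(1), or a much larger seed, makes the symptom vanish.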

@ahartikainen

Ok, now I see it

@WardBrian
Member

Here's another godbolt which shows the issue more precisely:
https://godbolt.org/z/e9veEn5T5

@bob-carpenter
Contributor

Could you also file an issue with Boost? I know we'll need to work around this in the short term, but it's probably something they want to know, too, as giving things small seeds by hand is a very common practice. My guess is that it's a bad interaction between the linear congruential pRNG we use and the implementation of the Poisson RNG.

@WardBrian
Member

It doesn't seem to be unique to Poisson, here is the same code but with a uniform(0,1) rng: https://godbolt.org/z/zfT579dox
The first draw is always close to 1, the second draw appears to be from the correct distribution.

That said, it also isn't universal - beta and normal both look correct in both situations.

@bob-carpenter
Contributor

That makes sense. For our pRNG, there's an underlying 256-bit representation of the state of the RNG. I don't know how state is initialized from a 32-bit uint seed, but my guess is that a small number will somehow fail to get a good 256-bit initialization for the way the pRNGs are implemented for Poisson, uniform, etc.

Also, I had no idea C++11 had a lot of this implemented. For example,

https://en.cppreference.com/w/cpp/numeric/random/poisson_distribution

@WardBrian
Member

Boost issue opened.

The easiest work-around for us seems to be just always discarding at least 1 draw when we create the RNG. If we want to maximize backwards compatibility of seeds, we can add 1 to the discard amount if and only if the chain ID is 0, otherwise we can just unconditionally add 1 and state in the release notes that random behavior will have changed this version for seeded runs.

@bob-carpenter
Contributor

@mitzimorris said this issue may have been introduced relatively recently when we changed the default ID of the first chain from 0 to 1 (to avoid having to advance when not required, but it looks like it was required!).

@WardBrian
Member

I think changing the default ID to 1 actually made this bug less common. Now it is only apparent in two scenarios:

  1. The user manually sets the ID to 0 and then runs something like standalone generated quantities with a poisson or uniform variable
  2. The constructor of the model class hard-codes the chain ID to 0, so the transformed data block will see this for those distributions

@bob-carpenter
Contributor

Sorry, I got that backwards. It changed from 1 to 0.

@bob-carpenter
Contributor

Also, thanks much for reporting, @cescalara. We're all still marveling at how you managed to discover this bug given how subtle and seed-dependent it is.

@WardBrian
Member

Looking through the history, it used to discard DISCARD_STRIDE * (chain - 1), until this PR in 2017: #2313, when it just discarded chain as it does now. The multi-chain PR #3033 edited this file while in progress but ultimately kept this behavior.

I believe the model class has used chain id 0 since the era of that 2017 PR, so the behavior would have been noticeable as long ago as that.

@rok-cesnovar
Member

Until we get a boost bugfix, my vote would be to discard std::max(1, DISCARD_STRIDE * chain) and announce it in release notes and everywhere else. Once/if boost does a bugfix, we then switch back to what we currently have.

The only drawback (at least that I see) is backward compatibility for anyone using chain ID = 0. But anyone using that would have been using a biased rng, and there is no reason to leave a bug in for backward compatibility.
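The suggestion above amounts to a one-line change to the discard count; a sketch in Python pseudocode mirroring the proposed C++ (names are illustrative):

```python
DISCARD_STRIDE = 2**50  # per-chain stride mentioned earlier in the thread

def discard_count_current(chain_id):
    # current behavior: chain 0 discards nothing, so it sees the raw seed state
    return DISCARD_STRIDE * chain_id

def discard_count_proposed(chain_id):
    # std::max(1, DISCARD_STRIDE * chain): chain 0 now skips the poorly
    # mixed first draw; all other chains are unchanged
    return max(1, DISCARD_STRIDE * chain_id)
```

Only seeded runs with chain ID 0 would observe different draws after this change, which is why it can be announced in the release notes rather than treated as a breaking change.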

@kotika

kotika commented Feb 24, 2023

No one should be using the ecuyer1988 RNG, as it is obsolete. The replacement
boost::ecuyer1988 -> boost::random::mixmax
should solve your issues.
MIXMAX is a modern, industrial strength RNG. It has been made the default in major physics Monte-Carlo packages, GEANT4 and CMS-SW software.
https://github.com/Geant4/geant4/blob/master/source/externals/clhep/src/MixMaxRng.cc
cms-sw/cmssw@e79b467

Along with sequences statistically indistinguishable from true random numbers, attention has been paid to seeding, such that the sequence obtained from seed(1) starts out perfectly randomized and is completely independent from seed(2), seed(3), etc., including the case where you want to test the quality by interleaving two or more streams.

@WardBrian
Member

It appears MIXMAX defines discard as

void discard(boost::uint64_t nsteps) { for(boost::uint64_t j = 0; j < nsteps; ++j)  (*this)(); } ///< discard n steps, required in boost::random

We regularly call discard with an argument on the order of 2^50, so this wouldn't exactly be a drop-in replacement for our use case and would require reconsidering how we use the random number generators.

@kotika

kotika commented Feb 24, 2023

@WardBrian , @bob-carpenter
Most RNG work on a single trajectory, where seeding can place the state of the RNG at a "random" point.
The RNG called ecuyer1988 has a period of about 10^18, so advancing it by 2^50 steps per chain would exhaust the period and create the danger that random numbers simply repeat between different draws. Moreover, it is certain that a generator like ecuyer1988 will produce identical streams of random numbers from pairs of completely different seeds (offset by a lag of, e.g., 1).

In MIXMAX, the trajectory is astronomically long, and the state of seed(k+1) is the state of seed(k) advanced by exactly 2^512 steps. Perhaps if I understood what task you accomplish with discard, I could find some solution for you.

@WardBrian
Member

The RNG called ecuyer1988 has the period of about 10^18

Table 32.6 in the Boost documentation claims ecuyer1988 has a period of approximately 2^61

Your comment boostorg/random#92 (comment) is correct about why we discard, it is for parallel MCMC chains. We currently create each chain with the same seed but discard (2^50 * chain_id) elements.

The behavior you're describing for MIXMAX sounds like something we could reasonably use. If I'm understanding correctly, we could then seed each chain with (user_supplied_seed + chain_id).
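If the generator guarantees independent streams for consecutive seeds (as claimed for MIXMAX above — and exactly what ecuyer1988 does not guarantee), per-chain seeding could replace the discard stride entirely. A sketch using Python's random.Random purely as a stand-in generator:

```python
import random

def create_chain_rng(user_seed, chain_id):
    # stand-in for a generator whose streams for seeds s and s+1 are
    # independent; MIXMAX is claimed above to advance 2^512 steps per seed
    return random.Random(user_seed + chain_id)

chains = [create_chain_rng(1234, i) for i in range(4)]
first_draws = [rng.random() for rng in chains]
```

The appeal is that seeding is O(1) per chain, versus the current scheme where each chain pays for discarding chain_id * 2^50 draws unless the engine supports O(log n) stream jumps.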
