
Random states (C++) #22

Closed
scarrazza opened this issue Feb 8, 2018 · 8 comments

@scarrazza (Member)

Issue by Zaharid
Friday Jun 23, 2017 at 09:49 GMT
Originally opened as https://github.com/NNPDF/libnnpdf/issues/9


We clearly need more localized random states. I think we should have high-level functions (similar to the one in #7) that take a random_state as an argument. This is what sklearn does (e.g. http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC) and it works well enough.

These "high level functions" are things like Minimize, GeneratePseudorreplicas or TrainValidSplit.

Ideas?
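A minimal sketch of what such a high-level function could look like, assuming an sklearn-style interface where the caller owns and passes the engine explicitly. The name `generate_pseudoreplica` and the Gaussian fluctuation model are illustrative assumptions, not the actual libnnpdf API:

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical high-level function in the style of sklearn's random_state
// parameter: the random engine is an explicit argument, so two calls with
// identically seeded engines produce identical replicas.
std::vector<double> generate_pseudoreplica(const std::vector<double>& central,
                                           const std::vector<double>& sigma,
                                           std::mt19937& rng) {
    std::vector<double> replica(central.size());
    for (std::size_t i = 0; i < central.size(); ++i) {
        // Gaussian fluctuation around each central value (illustrative model).
        std::normal_distribution<double> gauss(central[i], sigma[i]);
        replica[i] = gauss(rng);
    }
    return replica;
}
```

With this shape, reproducibility is the caller's decision: seed two engines identically and the replicas match; use one shared engine and the streams stay uncorrelated.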

@nhartland nhartland changed the title Random states Random states (C++) Feb 20, 2018
@Zaharid Zaharid closed this as completed May 3, 2018
@Zaharid Zaharid reopened this May 3, 2018
@Zaharid (Contributor) commented Jan 18, 2019

@wilsonmr @tgiani we should have a look at this about now. In particular, I am not sure the current random states for data generation work consistently. We should be able to specify the same data seed for two fits and be sure that the replicas are the same.

@wilsonmr (Contributor)

Ok, so the point is essentially to work out each independent process that uses random numbers, have a separate seed for each random generator, and pass the relevant random state to the relevant function?

@Zaharid (Contributor) commented Jan 18, 2019

We did a half-baked thing for the alpha_s studies. The idea is to make it not half-baked, i.e. to make sure that given the same data ordering and cuts, and the same seed, one gets the same fluctuations.

@Zaharid (Contributor) commented Jan 18, 2019

Do `git grep dataseed` to see the current implementation.

@Zaharid (Contributor) commented Jan 18, 2019

I think we essentially want the same RandomState interface that numpy has, and we want it to interact with the rest of the code in the same way it does for sklearn.
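For illustration, sklearn's `check_random_state` idiom (resolve an optional seed into either a reproducible local engine or a shared default one) could translate to C++ roughly as follows. This is a sketch under assumed names, not existing nnpdf code:

```cpp
#include <cstdint>
#include <optional>
#include <random>

// Shared default engine, analogous to numpy's implicit global RandomState
// (assumption: nothing like this exists in libnnpdf yet).
inline std::mt19937& default_rng() {
    static std::mt19937 engine{std::random_device{}()};
    return engine;
}

// Mirror of sklearn's check_random_state: an explicit seed yields a
// reproducible engine (stored in the caller-provided `local`); no seed
// falls back to the shared default state.
inline std::mt19937& check_random_state(std::optional<std::uint32_t> seed,
                                        std::mt19937& local) {
    if (seed) {
        local.seed(*seed);
        return local;
    }
    return default_rng();
}
```

High-level functions would then take an optional seed argument and call this helper once at the top, exactly as sklearn estimators do.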

@wilsonmr (Contributor) commented Jan 21, 2019

Do we want to be able to produce the same sequence of numbers as numpy given the same seed? I guess this would be nice with regards to closure tests etc.

I was playing with the gsl_rng library, and if I use the gsl_rng_mt19937 generator then every other number matches the numpy one for the same seed up to the 8th decimal place. Would that be sufficient? I'm surprised they don't match up to the 16th place, since they're both doubles.

I think in theory the rng state which is already implemented will work in the same way as the numpy RandomState, although it's just a single long int. I tried using the get_state method on the numpy random state, but this returns something like 624 long ints, and none of them appear to correspond to the one output by gsl, which I find confusing.

EDIT: never mind, I wasn't outputting the state of the gsl rng.
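The 624 integers are expected: the Mersenne Twister state is 624 32-bit words, not a single long int (the seed is only used to initialize those words). A small C++ sketch showing the documented state size and a state round-trip through the standard engine's textual serialization; the helper name `roundtrip` is illustrative:

```cpp
#include <random>
#include <sstream>

// Serialize a Mersenne Twister engine to text and read it back into a fresh
// engine. The serialized state is the same ~624-word array that numpy's
// get_state() exposes; a bare seed value will not appear in it directly.
std::mt19937 roundtrip(const std::mt19937& engine) {
    std::stringstream ss;
    ss << engine;          // write the full internal state as text
    std::mt19937 restored;
    ss >> restored;        // read it back into a fresh engine
    return restored;
}
```

After the round-trip the two engines compare equal and produce identical output streams, which is the property one would want for checkpointing a fit.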

@wilsonmr (Contributor)

So when we run a fit, am I right in thinking there is just one instance of the RNG, which we just reseed at various points in the fit?

I was thinking initially that this wouldn't be so hard, because we could basically have a different instance of the RNG for each independent stream of random numbers. But then I was messing around in python, and it seems the rng instance is created once, which entangles all the different calls of getRNG()->SetSeed(), for example.

@Zaharid (Contributor) commented Jan 21, 2019

Recent versions of the C++ standard have the Mersenne Twister, so we should use that (and we do in some places of buildmaster).

There are some subtleties when you think about it, particularly regarding vp. See #77, where I didn't really come up with a good idea. I believe that how we organize the fit is easier, in that there are only a limited number of configurations we care about (essentially, correlated data streams with uncorrelated GA is the interesting one, and correlated everything for point-by-point reproducibility).
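The "correlated data streams, uncorrelated GA" configuration falls out naturally if each independent stream gets its own engine, so reseeding one never perturbs the other. A sketch under assumed names (`FitRandomState` and its members are illustrative, not existing code):

```cpp
#include <random>

// One engine per independent stream of randomness in a fit (illustrative).
// Two fits sharing data_seed but differing in ga_seed draw identical
// pseudodata fluctuations while their GA mutations stay uncorrelated.
struct FitRandomState {
    std::mt19937 data_rng;  // pseudodata fluctuations: shared seed across fits
    std::mt19937 ga_rng;    // genetic-algorithm mutations: per-replica seed
    FitRandomState(unsigned data_seed, unsigned ga_seed)
        : data_rng(data_seed), ga_rng(ga_seed) {}
};
```

"Correlated everything" for point-by-point reproducibility is then just the special case where both seeds are fixed.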

I suppose that numpy and friends have a global random instance that they use by default if nothing else is specified. The numpy random functions random.XX are really globalState.XX. This is useful to avoid seeding the state too frequently, or encountering unexpected correlations. I am not sure how much we care, given that the behaviour is going to be hardcoded.

@Zaharid Zaharid closed this as completed Mar 26, 2021