-
Notifications
You must be signed in to change notification settings - Fork 110
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add method to set random seed on copulas models #313
Conversation
Codecov Report
@@ Coverage Diff @@
## master #313 +/- ##
==========================================
+ Coverage 87.16% 87.31% +0.15%
==========================================
Files 27 27
Lines 1706 1727 +21
==========================================
+ Hits 1487 1508 +21
Misses 219 219
Continue to review full report at Codecov.
|
0d0f6ed
to
b58edc2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good @katxiao ! I want to have another look before approving, but the changes seem to be right on spot so far.
One note is: maybe we should consider also addressing issue #113 on this PR. This would basically fix the current random_state
wrapper, which currently sets the numpy
seed globally instead of just within the current operation, but also allow setting the random seed once and calling sample
multiple times obtaining different results at each call (Notice that this would be necessary in order to allow reject_sampling
strategies with a fixed seed!)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me, but I think @csala has a good point. Maybe we can add that change too
@csala I updated the PR to address the issue you linked. I'm a little confused, because even if we switch to setting the numpy state instead of the seed, aren't we still setting it globally? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this looks good!
4cb2e62
to
5152fd6
Compare
Sorry, my original statement was not precise enough. It is true that every time we set the random state we do it globally, but what I actually meant was that the change was permanent, meaning that operations that come after ours would also be affected by that change. But this was actually wrong, because the original In any case, the real advantage of using the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that there are a few changes required to achieve the desired functionality.
Here is an ipython
snippet that explains the expected behavior a bit more precisely:
In [1]: import numpy as np
In [2]: # Background: We simulate an external seed of 42, which we
...: # do not want to alter, and we set our model seed to 0.
...: # For reference, these are the sequences of random numbers
...: # that each seed produces:
...:
...: np.random.seed(42)
...: np.random.random(size=4)
Out[2]: array([0.37454012, 0.95071431, 0.73199394, 0.59865848])
In [3]: np.random.seed(0)
...: np.random.random(size=4)
Out[3]: array([0.5488135 , 0.71518937, 0.60276338, 0.54488318])
In [4]: # EXTERNAL: We simulate an external seed that we do not want to alter
...: # and certify that the random numbers are the expected ones
...: np.random.seed(42)
...: np.random.random(size=2)
Out[4]: array([0.37454012, 0.95071431])
In [5]: # FIRST CALL: Inside our decorator we capture the original state
...: # and set a new one
...: original_state = np.random.get_state()
...: new_state = np.random.RandomState(seed=0).get_state()
...: np.random.set_state(new_state)
...:
...: # Certify the random numbers are the expected ones
...: np.random.random(size=2)
Out[5]: array([0.5488135 , 0.71518937])
In [6]: # We capture the state AFTER the call and restore the original one
...: post_state = np.random.get_state()
...: np.random.set_state(original_state)
...:
...: # Certify that the original state is restored and the random
...: # sequence can continue as expected (sequence continues)
...: np.random.random(size=2)
Out[6]: array([0.73199394, 0.59865848])
In [7]: # SECOND CALL: Inside the decorator again, we restore the previous state
...: np.random.set_state(post_state)
...:
...: # We certify that the sequence of random numbers with seed = 0 continues
...: # as expected
...: np.random.random(size=2)
Out[7]: array([0.60276338, 0.54488318])
Additionally, we should create an integration test that reproduces a sequence similar to the one
shown above and that certifies that multiple calls after setting the seed produce different results, but
always following the expected sequence.
copulas/__init__.py
Outdated
try: | ||
yield | ||
finally: | ||
np.random.set_state(state) | ||
set_model_random_state(desired_state) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should not be caputring the desired_state
, but rather the current state of numpy
as returned by np.random.get_state()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! Made the fix.
@csala I added the integration test and addressed the other two comments! |
b2eadca
to
0aef6ba
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a comment about an edge case bug. Other than that, this looks ready
copulas/__init__.py
Outdated
raise TypeError(f'RandomState {random_state} is an unexpected type. ' | ||
'Expected to be int, np.random.RandomState, or tuple.') | ||
|
||
np.random.set_state(desired_state) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bug: If random_state
is a tuple
, desired_state
is never assigned any value. I think that it would be simpler to just re-use the random_state
variable name instead of desired_state
0aef6ba
to
9292677
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this looks all correct so far. The only comment that I would add here is that it would be interesting to add one integration test per model, which certifies that the set_random_state
is working as expected.
The tests I'm thinking about would be something like this (for each model!):
# Fit the model on some random data
fit_data = np.random.whatever...
model = Model()
model.fit(fit_data)
# Sample truly random data
random = model.sample(10)
# Set the seed to a fixed value and sample TWICE
model.set_random_seed(0)
seeded_0_0 = model,sample(10)
seeded_0_1 = model,sample(10)
# Set the seed again to the same value and sample TWICE again
model.set_random_seed(0)
seeded_1_0 = model,sample(10)
seeded_1_1 = model,sample(10)
# assert that the random data is not equal to the data with fixed seed
np.testing.assert_not_equals(random, sampled_0_0)
# assert that the two sample calls after setting the seed generated different outputs
np.testing.assert_not_equals(sampled_0_0, sampled_0_1)
# assert that setting the seed sampling once always produces the same results
np.testing.assert_equals(sampled_0_0, sampled_1_0)
# assert that the second call after setting the seed continues to produce the same results
np.testing.assert_equals(sampled_0_1, sampled_1_1)
51df96a
to
bc87087
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good now!
0383db2
to
f06210e
Compare
f06210e
to
95ad1fe
Compare
As part of sdv-dev/SDV#690, we want to add a fixed seed when sampling. This requires setting the seed at sample time (after the model has already been created).
Currently, we can only pass in the
random_seed
at initialization of the model. In this PR, I add a setter method forrandom_seed
, to enable to following flow:Resolves #113