Add method to set random seed on copulas models #313

Merged: 12 commits, Feb 18, 2022

@katxiao katxiao commented Feb 4, 2022

As part of sdv-dev/SDV#690, we want to add a fixed seed when sampling. This requires setting the seed at sample time (after the model has already been created).

Currently, we can only pass in the random_seed when the model is initialized. In this PR, I add a setter method for random_seed to enable the following flow:

  1. create model
  2. set seed
  3. sample
  4. repeat steps 2-3 as needed.
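
The flow above can be sketched with a toy stand-in for a model. `ToySampler` and its `set_random_seed` method are hypothetical names used only for illustration; they are not the copulas API. The sketch keeps a private `np.random.RandomState` instead of touching numpy's global generator.

```python
import numpy as np


class ToySampler:
    """Hypothetical stand-in for a fitted model (not a copulas class)."""

    def __init__(self, random_seed=None):
        # The seed can still be given at creation time.
        self._rng = np.random.RandomState(random_seed)

    def set_random_seed(self, random_seed):
        # Setter that allows (re)seeding after the model already exists.
        self._rng = np.random.RandomState(random_seed)

    def sample(self, num_rows):
        return self._rng.normal(size=num_rows)


model = ToySampler()          # 1. create model
model.set_random_seed(0)      # 2. set seed
first = model.sample(5)       # 3. sample
model.set_random_seed(0)      # 4. repeat steps 2-3
second = model.sample(5)
```

Here `first` and `second` hold the same values, because the seed is reset to the same value before each `sample` call.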

Resolves #113

codecov-commenter commented Feb 4, 2022

Codecov Report

Merging #313 (95ad1fe) into master (350888e) will increase coverage by 0.15%.
The diff coverage is 88.13%.

@@            Coverage Diff             @@
##           master     #313      +/-   ##
==========================================
+ Coverage   87.16%   87.31%   +0.15%     
==========================================
  Files          27       27              
  Lines        1706     1727      +21     
==========================================
+ Hits         1487     1508      +21     
  Misses        219      219              
Impacted Files                              Coverage Δ
copulas/datasets.py                         52.27% <41.66%> (+2.27%) ⬆️
copulas/__init__.py                         97.95% <100.00%> (+0.25%) ⬆️
copulas/bivariate/base.py                   87.16% <100.00%> (+0.17%) ⬆️
copulas/multivariate/base.py                61.53% <100.00%> (+2.07%) ⬆️
copulas/multivariate/gaussian.py            91.47% <100.00%> (ø)
copulas/multivariate/vine.py                99.32% <100.00%> (ø)
copulas/univariate/base.py                  82.82% <100.00%> (+0.32%) ⬆️
copulas/univariate/gaussian_kde.py          96.25% <100.00%> (+0.04%) ⬆️
copulas/univariate/truncated_gaussian.py    100.00% <100.00%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@katxiao katxiao requested a review from csala February 4, 2022 19:58
@katxiao katxiao marked this pull request as ready for review February 4, 2022 20:12
@katxiao katxiao requested a review from a team as a code owner February 4, 2022 20:12
@katxiao katxiao requested review from amontanez24 and removed request for a team February 4, 2022 20:13

@csala csala left a comment

This looks good @katxiao! I want to have another look before approving, but the changes seem to be spot on so far.

One note: maybe we should also address issue #113 in this PR. That would fix the current random_state wrapper, which sets the numpy seed globally instead of only within the current operation, and it would also allow setting the random seed once and then calling sample multiple times, obtaining different results on each call. (Note that this is necessary to allow reject_sampling strategies with a fixed seed!)

@amontanez24 amontanez24 left a comment

This looks good to me, but I think @csala has a good point. Maybe we can add that change too

katxiao commented Feb 7, 2022

> This looks good @katxiao! I want to have another look before approving, but the changes seem to be spot on so far.
>
> One note: maybe we should also address issue #113 in this PR. That would fix the current random_state wrapper, which sets the numpy seed globally instead of only within the current operation, and it would also allow setting the random seed once and then calling sample multiple times, obtaining different results on each call. (Note that this is necessary to allow reject_sampling strategies with a fixed seed!)

@csala I updated the PR to address the issue you linked.

I'm a little confused, because even if we switch to setting the numpy state instead of the seed, aren't we still setting it globally?

@amontanez24 amontanez24 left a comment

I think this looks good!

@katxiao katxiao force-pushed the set-random-seed branch 3 times, most recently from 4cb2e62 to 5152fd6 Compare February 8, 2022 19:29

csala commented Feb 9, 2022

> I'm a little confused, because even if we switch to setting the numpy state instead of the seed, aren't we still setting it globally?

Sorry, my original statement was not precise enough. It is true that every time we set the random state we do it globally, but what I actually meant was that the change was permanent, meaning that operations that come after ours would also be affected by it. But this was actually wrong, because the original random_seed function already had a try/finally block that restored the original state once the operation finished.

In any case, the real advantage of using the random_state instead of the random_seed is the other part: now this is a state which changes over time as calls happen, rather than a fixed seed that is always the same on every call.
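
To make the difference concrete, here is a minimal sketch that contrasts the two approaches. The helper names (`fixed_seed`, `sample_with_seed`, `sample_with_state`) are made up for this illustration and are not the functions in `copulas/__init__.py`.

```python
import contextlib

import numpy as np


@contextlib.contextmanager
def fixed_seed(seed):
    """Seed numpy's global generator, then restore the caller's state."""
    original_state = np.random.get_state()
    np.random.seed(seed)
    try:
        yield
    finally:
        # Operations after the block see the state they had before it.
        np.random.set_state(original_state)


def sample_with_seed(seed, size):
    # Seed-based approach: the sequence restarts on every call.
    with fixed_seed(seed):
        return np.random.random(size=size)


def sample_with_state(rng, size):
    # State-based approach: rng is an np.random.RandomState that
    # advances every time numbers are drawn from it.
    return rng.random_sample(size=size)


seed_a = sample_with_seed(0, 2)
seed_b = sample_with_seed(0, 2)      # same numbers as seed_a

rng = np.random.RandomState(0)
state_a = sample_with_state(rng, 2)  # same numbers as seed_a
state_b = sample_with_state(rng, 2)  # the next two numbers in the sequence
```

With a fixed seed every call returns the same values, whereas the state-based version continues the sequence from where the previous call stopped.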

@csala csala left a comment

I think that there are a few changes required to achieve the desired functionality.

Here is an ipython snippet that explains the expected behavior a bit more precisely:

In [1]: import numpy as np

In [2]: # Background: We simulate an external seed of 42, which we
   ...: # do not want to alter, and we set our model seed to 0.
   ...: # For reference, these are the sequences of random numbers
   ...: # that each seed produces:
   ...: 
   ...: np.random.seed(42)
   ...: np.random.random(size=4)
Out[2]: array([0.37454012, 0.95071431, 0.73199394, 0.59865848])

In [3]: np.random.seed(0)
   ...: np.random.random(size=4)
Out[3]: array([0.5488135 , 0.71518937, 0.60276338, 0.54488318])

In [4]: # EXTERNAL: We simulate an external seed that we do not want to alter
   ...: # and certify that the random numbers are the expected ones
   ...: np.random.seed(42)
   ...: np.random.random(size=2)
Out[4]: array([0.37454012, 0.95071431])

In [5]: # FIRST CALL: Inside our decorator we capture the original state
   ...: # and set a new one
   ...: original_state = np.random.get_state()
   ...: new_state = np.random.RandomState(seed=0).get_state()
   ...: np.random.set_state(new_state)
   ...: 
   ...: # Certify the random numbers are the expected ones
   ...: np.random.random(size=2)
Out[5]: array([0.5488135 , 0.71518937])

In [6]: # We capture the state AFTER the call and restore the original one
   ...: post_state = np.random.get_state()
   ...: np.random.set_state(original_state)
   ...: 
   ...: # Certify that the original state is restored and the random
   ...: # sequence can continue as expected (sequence continues)
   ...: np.random.random(size=2)
Out[6]: array([0.73199394, 0.59865848])

In [7]: # SECOND CALL: Inside the decorator again, we restore the previous state
   ...: np.random.set_state(post_state)
   ...: 
   ...: # We certify that the sequence of random numbers with seed = 0 continues
   ...: # as expected
   ...: np.random.random(size=2)
Out[7]: array([0.60276338, 0.54488318])

Additionally, we should create an integration test that reproduces a sequence similar to the one
shown above and that certifies that multiple calls after setting the seed produce different results, but
always following the expected sequence.
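
As a rough sketch of how that sequence could be wrapped in a context manager and checked: `StateHolder` and `use_random_state` are hypothetical names for this example, not the code in this PR, and the checks compare against reference sequences generated by separate `RandomState` instances.

```python
import contextlib

import numpy as np


class StateHolder:
    """Hypothetical object that remembers a model's random state."""

    def __init__(self, seed):
        self.state = np.random.RandomState(seed=seed).get_state()


@contextlib.contextmanager
def use_random_state(holder):
    original_state = np.random.get_state()    # capture the external state
    np.random.set_state(holder.state)         # switch to the model's state
    try:
        yield
    finally:
        holder.state = np.random.get_state()  # remember where we stopped
        np.random.set_state(original_state)   # restore the external state


np.random.seed(42)                            # EXTERNAL seed
external_first = np.random.random(size=2)

holder = StateHolder(seed=0)
with use_random_state(holder):                # FIRST CALL
    model_first = np.random.random(size=2)

external_second = np.random.random(size=2)    # external sequence continues

with use_random_state(holder):                # SECOND CALL
    model_second = np.random.random(size=2)

# Reference sequences for comparison.
expected_external = np.random.RandomState(42).random_sample(size=4)
expected_model = np.random.RandomState(0).random_sample(size=4)

assert np.array_equal(
    np.concatenate([external_first, external_second]), expected_external)
assert np.array_equal(
    np.concatenate([model_first, model_second]), expected_model)
```

The key detail is the `finally` block: it saves numpy's current state back onto the holder before restoring the external one, so the next call resumes the seeded sequence instead of restarting it.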

copulas/__init__.py
    try:
        yield
    finally:
        np.random.set_state(state)
        set_model_random_state(desired_state)
Contributor

We should not be capturing the desired_state here, but rather the current state of numpy as returned by np.random.get_state().

Contributor Author

Good catch! Made the fix.

katxiao commented Feb 9, 2022

np.random.random(size=2)

@csala I added the integration test and addressed the other two comments!

@csala csala left a comment

I added a comment about an edge case bug. Other than that, this looks ready

raise TypeError(f'RandomState {random_state} is an unexpected type. '
                'Expected to be int, np.random.RandomState, or tuple.')

np.random.set_state(desired_state)
Contributor

Bug: If random_state is a tuple, desired_state is never assigned any value. I think that it would be simpler to just re-use the random_state variable name instead of desired_state
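
One way to avoid this class of bug is to normalize every accepted input type into a single return value, so that no branch can leave a variable unassigned. The following is only a sketch under that assumption; `to_random_state` is a hypothetical helper name, not the function in this PR.

```python
import numpy as np


def to_random_state(random_state):
    """Hypothetical helper: always return an np.random.RandomState."""
    if isinstance(random_state, int):
        return np.random.RandomState(seed=random_state)
    if isinstance(random_state, np.random.RandomState):
        return random_state
    if isinstance(random_state, tuple):
        # A tuple is assumed to be a state from RandomState.get_state().
        rng = np.random.RandomState()
        rng.set_state(random_state)
        return rng
    raise TypeError(
        f'RandomState {random_state} is an unexpected type. '
        'Expected to be int, np.random.RandomState, or tuple.'
    )


from_int = to_random_state(0)
from_tuple = to_random_state(np.random.RandomState(0).get_state())
```

Both `from_int` and `from_tuple` start from the same state, so they produce the same sequence of random numbers.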

@csala csala left a comment

I think this all looks correct so far. The only comment I would add is that it would be good to add one integration test per model that certifies that set_random_state works as expected.

The tests I'm thinking about would be something like this (for each model!):

import numpy as np

# Fit the model on some random data
fit_data = np.random.whatever...
model = Model()
model.fit(fit_data)

# Sample truly random data
random = model.sample(10)

# Set the seed to a fixed value and sample TWICE
model.set_random_seed(0)
seeded_0_0 = model.sample(10)
seeded_0_1 = model.sample(10)

# Set the seed again to the same value and sample TWICE again
model.set_random_seed(0)
seeded_1_0 = model.sample(10)
seeded_1_1 = model.sample(10)

# assert that the random data is not equal to the data with fixed seed
assert not np.array_equal(random, seeded_0_0)
# assert that the two sample calls after setting the seed generated different outputs
assert not np.array_equal(seeded_0_0, seeded_0_1)
# assert that the first sample after setting the seed is always the same
np.testing.assert_array_equal(seeded_0_0, seeded_1_0)
# assert that the second sample after setting the seed is also reproducible
np.testing.assert_array_equal(seeded_0_1, seeded_1_1)

@csala csala left a comment

Looks good now!

@katxiao katxiao merged commit f831e23 into master Feb 18, 2022
@katxiao katxiao deleted the set-random-seed branch February 18, 2022 03:48

Successfully merging this pull request may close these issues.

Use random_state instead of random_seed