Provide external data #264
Conversation
How about introducing a method called … ? We could either expose the new method directly or make it a hidden/underscore method. In the first case, we would only change …
Yes, I agree with @jan-matthis. Another point: when we discussed this problem with @ppjgoncalves some time ago, I thought we converged on the view that it does not make much sense to do both, i.e., provide external data and also run new simulations from the simulator, didn't we? In the scenario where one provides externally simulated data, one is always in full control of how many simulations one wants to pass. This applies to both the single-round and the multi-round case: the user can always pass exactly as many external simulations as needed for the given round. Thus, I think we don't want to have / implement this hybrid case.
Cool, thanks! I implemented a first version as a protected function, have a look (I know, no docstrings etc. yet ;)). @janfb if I remember correctly, we made this decision under the assumption that it would be messy to code up - but I do not think it is; have a look and let me know what you think.
Implementation for SNPE is ready, so now would be a great time to review (before I also implement it for SRE and SNL).
I reviewed and left a few comments
We should decide whether we want to 1) add `external_data` as an argument to all inference classes, changing the API (as proposed now), or 2) add a public method `append_external_data` that users call with external data when using the advanced interface (`infer` might still get the keyword).
I'm fine with both, but am leaning more towards 2). 1) might seem convenient right now, but could develop in a direction where we add more and more keywords to `__call__`, duplicated across all inference classes. What would be your argument for 1)?
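For illustration, a minimal sketch of how the two options could look from the user's side (the names `external_data`, `append_external_data`, and the `inference` object are only the ones floated in this thread, not a settled API):

```python
# Option 1): pass presimulated data as a keyword argument of __call__.
posterior = inference(
    num_rounds=1,
    num_simulations_per_round=1000,
    external_data=(theta_pre, x_pre),  # hypothetical keyword
)

# Option 2): a public method that is called before running inference.
inference.append_external_data(theta_pre, x_pre)  # hypothetical method name
posterior = inference(num_rounds=1, num_simulations_per_round=1000)
```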
sbi/inference/base.py
Outdated
@@ -7,7 +7,7 @@
 from copy import deepcopy
 from datetime import datetime
 from pathlib import Path
-from typing import Callable, Dict, List, Optional, Union, cast
+from typing import Callable, Dict, List, Optional, Union, cast, Tuple
Run `isort` for ordering those imports.
I think this is resolved, it'd be good to mark it as such ;)
sbi/inference/base.py
Outdated
is_valid_x, num_nans, num_infs = handle_invalid_x(x, exclude_invalid_x)
warn_on_invalid_x(num_nans, num_infs, exclude_invalid_x)

# XXX Rename bank -> rounds/roundwise.
Could be a good point in time to address this
Any suggestions here? `theta_rounds` would be an option, but not sure if I like it more than `theta_bank`.
How about `_theta`, `_theta_round`, or `_theta_round_data` (same for `x`)?
`_roundwise_thetas`. I think `bank` is not very clear and I find `data` too uninformative -- everything is data.
sbi/inference/base.py
Outdated
x: Tensor,
external_data: Tuple[Tensor, Tensor],
round: int,
exclude_invalid_x: bool,
Could default to True
sbi/inference/base.py
Outdated
self,
theta: Tensor,
x: Tensor,
external_data: Tuple[Tensor, Tensor],
I would remove `external_data` as a keyword argument for this method and handle the splitting elsewhere.
I tend to agree here.
sbi/inference/base.py
Outdated
# In the first round, the user can externally provide data.
if external_data is not None and round == 0:
    theta = torch.cat((external_data[0], theta))
    x = torch.cat((external_data[1], x))
Would remove this here and potentially introduce a new method, e.g., `append_external_data` (private or public). The advantage is that `_append_training_data` keeps a clean signature, always requiring `theta` and `x`.
I am still a bit uneasy with the whole thing, i.e. it looks like the most general approach would be that you can always get simulated data together with the prior (possibly a posterior) it was simulated from, and each round is a separate unit for which we always provide an output, so that multi-round would look like
posterior = [prior]
for i in range(1, rounds):
    posterior.append(round(simulator, posterior[i - 1]))
These are way too informal and scattered thoughts, only my recollection of something I thought back then when we discussed this. I understand that it might not be practical to do this now.
sbi/inference/base.py
Outdated
# In the first round, the user can externally provide data.
if external_data is not None and round == 0:
    theta = torch.cat((external_data[0], theta))
This is technically a prepend, isn't it?
sbi/inference/snpe/snpe_base.py
Outdated
x_shape = x_shape_from_simulation(x)

# Curate the data by excluding NaNs. Then append to the data banks.
self._append_training_data(
I am looking for better names - `offline_simulations` or `prior_simulations` come to mind (`prior` is ambiguous...).
sbi/inference/base.py
Outdated
self,
theta: Tensor,
x: Tensor,
external_data: Tuple[Tensor, Tensor],
I tend to agree here.
sbi/inference/snpe/snpe_base.py
Outdated
@@ -93,7 +93,7 @@ def __call__(
         num_rounds: int,
         num_simulations_per_round: OneOrMore[int],
         x_o: Optional[Tensor] = None,
-        external_data: Optional[Tensor] = None,
+        external_data: Optional[Tuple[Tensor, Tensor]] = None,
A workable default would be a tuple of empty tensors, wouldn't it? Then no more specific case handling at prepend time.
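For example, concatenation with an empty tensor of matching trailing shape is a no-op, so the prepend would not need an `if` (a quick check, assuming 2-D `theta`):

```python
import torch

theta = torch.randn(100, 3)
empty = torch.empty(0, 3)  # default: "no external data"
assert torch.cat((empty, theta)).shape == (100, 3)
```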
`external...` --> `presimulated`? Another idea for names.
Looks good so far. Nevertheless, I would suggest some changes.
At the moment, the logic for combining offline and online data happens in `_append_training_data` in `base`. Wouldn't it make more sense to have it happen in `run_simulations` (maybe with a slightly different name and moved to `base`)? This function would then take as much data as possible from the external data (if there is any) and simulate the rest.
Afterwards, a function like `_append_training_data` would just take care of checking for NaNs etc. and appending the data to the banks.
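A rough sketch of what such a merged simulation step could look like (attribute names such as `self._external_data`, `self._prior`, and `self._batched_simulator` are assumptions for illustration only, and the module's existing imports of `torch`, `Tensor`, and `Tuple` are assumed):

```python
def _run_simulations(self, round_: int, num_sims: int) -> Tuple[Tensor, Tensor]:
    """Take as much data as possible from external data, then simulate the rest."""
    if round_ == 0 and self._external_data is not None:
        theta_ext, x_ext = self._external_data
        num_sims = max(num_sims - theta_ext.shape[0], 0)
    else:
        theta_ext, x_ext = None, None

    # For simplicity, sample from the prior here; later rounds would sample
    # from the current proposal instead.
    theta_new = self._prior.sample((num_sims,))
    x_new = self._batched_simulator(theta_new)

    if theta_ext is None:
        return theta_new, x_new
    return torch.cat((theta_ext, theta_new)), torch.cat((x_ext, x_new))
```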
sbi/inference/snpe/snpe_base.py
Outdated
theta, x = self._run_simulations(round_, num_sims)

# Append data to self._theta_bank, self._x_bank, and self._prior_masks.
self._append_training_data(
Maybe remove the comment and change the function signature to something like `_append_to_banks(theta, x, external_data, round_, exclude_invalid_x)`.
sbi/inference/snpe/snpe_base.py
Outdated
# Append data to self._theta_bank, self._x_bank, and self._prior_masks.
self._append_training_data(
    theta, x, external_data, round_, exclude_invalid_x
Why do we need to pass the external data here? Couldn't we handle this internally, e.g., in `run_simulations`? E.g., introduce it to all methods (-> to `base.py`), rename it to `get_simulations` or so, and use it to merge potential presimulated data?
sbi/inference/snpe/snpe_base.py
Outdated
-def _run_simulations(
-    self, round_: int, num_sims: int,
-) -> Tuple[Tensor, Tensor, Tensor]:
+def _run_simulations(self, round_: int, num_sims: int,) -> Tuple[Tensor, Tensor]:
We don't use this function in the other methods, and it kind of repeats the if-else on the rounds that happens afterwards for defining the neural posterior. Thus, if we don't change this function, e.g., to do the merging with external data, we could just as well remove it and move its if-else into the if-else in the caller code.
Couldn't this function call `_append_to_round_bank` with the new simulations?
Also, wouldn't it make sense to move this into the inference base class and use it for all algorithms?
c520a77 to f30186a
Thanks for all your feedback! It's quite tough to incorporate all of it (especially feedback coming from different people), but I tried my best.
I've left some comments on the latest version
sbi/inference/base.py
Outdated
self._external_data = (theta, x)
self._presimulated_current_round = True

def _prepend_presimulated(self, theta: Tensor, x: Tensor) -> Tuple[Tensor, Tensor]:
Maybe you could rename `x` and `theta` here, because it looks like we are prepending them, while we are actually prepending the external data that was saved in `self._external_data` somewhere else. E.g., you could name the args `{theta, x}_simulated` to distinguish them from the presimulated theta and x.
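Something like the following sketch, for instance (the body is only a guess at the intended logic, based on the attributes visible in the diff above; the module's existing `torch`, `Tensor`, and `Tuple` imports are assumed):

```python
def _prepend_presimulated(
    self, theta_simulated: Tensor, x_simulated: Tensor
) -> Tuple[Tensor, Tensor]:
    """Prepend presimulated data (if provided) to freshly simulated data."""
    if self._presimulated_current_round:
        theta_pre, x_pre = self._external_data
        theta_simulated = torch.cat((theta_pre, theta_simulated))
        x_simulated = torch.cat((x_pre, x_simulated))
    return theta_simulated, x_simulated
```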
OK, lots of new changes, most importantly:
Variable renaming is in progress.
@@ -157,16 +157,21 @@ def __call__(

        for round_, num_sims in enumerate(num_sims_per_round):
Not 100% sure, but shouldn't we start counting from `round_=1` now?
The convention in the code right now is that we start from round 0. Should we change it to round 1?
I would leave it with an initial round index of zero. Is there a specific reason to start counting from one?
Thanks! Everything is addressed from my side. Before merging, you should rebase on master (currently there is a merge conflict)
1607f2c to 4cd745e
Great! I will have a look within the next few hours. Then we can merge it today, hopefully!
Great! Only small comments. Good to go once they are addressed (or not).
@@ -152,6 +156,108 @@ def __init__(
        median_observation_distances=[], epochs=[], best_validation_log_probs=[],
    )

    def provide_presimulated(
        self, theta: Tensor, x: Tensor, from_round: int = 0
Use the same argument name as in `_append_to_round_bank`, i.e., `round_`, or maybe even `round_idx`?
Calling all of them `from_round`.
sbi/utils/sbiutils.py
Outdated
@@ -216,3 +216,34 @@ def warn_on_invalid_x(num_nans: int, num_infs: int, exclude_invalid_x: bool) ->
        f"Found {num_nans} NaN simulations and {num_infs} Inf simulations. "
        "Training might fail. Consider setting `exclude_invalid_x=True`."
    )


def get_data_after_round(
Ah, now I get it. Maybe we can find a more descriptive name here; what about changing the signature to `get_data_since_round(starting_idx, data, data_round_indices)`?
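A minimal, self-contained sketch of that helper (the masking logic is an assumption based only on the proposed signature):

```python
import torch
from torch import Tensor


def get_data_since_round(
    starting_idx: int, data: Tensor, data_round_indices: Tensor
) -> Tensor:
    """Return only the rows of `data` simulated in round `starting_idx` or later."""
    return data[data_round_indices >= starting_idx]


# Example: keep everything from round 1 onwards.
x = torch.randn(4, 5)
x_recent = get_data_since_round(1, x, torch.tensor([0, 0, 1, 2]))  # rows 2 and 3
```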
Thanks for all the feedback! I put in Jan's suggestions and am now waiting for tests. Will merge in <30 minutes (hopefully).
Goal
We should support users who provide external data. Importantly, this should also be possible in multi-round inference (e.g. the user could run simulations on a cluster and run inference on another computer), see #163.
API suggestion
For single round:
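A minimal sketch of what this could look like, based on the `external_data` keyword added to `__call__` in this PR (the `inference` object, the hypothetical `load_presimulated` loader, and the presimulated tensors are placeholders):

```python
theta_pre, x_pre = load_presimulated()  # e.g. simulations run on a cluster

posterior = inference(
    num_rounds=1,
    num_simulations_per_round=1000,  # simulations on top of the external data
    external_data=(theta_pre, x_pre),
)
```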
For multi-round, one can then do the following (not sure if I will implement this in this PR, but this would be the idea):
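One possible reading of this idea, as a hedged sketch (argument names follow the explanation below; nothing here is final API):

```python
# Round 1: train on presimulated prior samples only.
posterior = inference(
    num_simulations=0,
    external_data=(theta_round1, x_round1),
    simulations_from_prior=True,
)

# ... simulate new theta ~ posterior externally, e.g. on a cluster ...

# Round 2: continue with simulations drawn from the latest posterior.
posterior = inference(
    num_simulations=0,
    external_data=(theta_round2, x_round2),
    simulations_from_prior=False,
)
```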
Here, the `simulations_from_prior` argument tells the inference algorithm whether the new simulations came from the latest posterior or from the prior; e.g., SNPE-B needs this information to use the correct importance weights. `num_simulations` indicates the number of simulations run once `external_data` is exceeded.
Implementation
At the beginning of inference, we wrap the simulator to return `external_data` until it runs out (at which point it falls back to actually simulating); see the sketch after the Q&A below.
Why not let the user do the wrapping themselves (and then just provide it as the simulator)?
a) I think it's unintuitive that the `simulator` does not actually simulate.
b) After each round, the user would need to define a new "simulator" - which would run the same simulations, but with different `external_data`. I find this not very elegant.
c) For the docs, it might be helpful to have an argument called `external_data`. It makes it easy for the user to figure out how to provide external data. If we made it a decorator, I fear we would have to write a dedicated tutorial showing how to do this.
Why wrap it at all and not just set `theta_bank=external_theta_data` and `x_bank=external_x_data`?
This works just fine for the core functionality, but many advanced features (e.g. handling NaNs, z-scoring, retrain_from_scratch, ...) would require extra care, making things unnecessarily complicated.
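As mentioned above, the wrapping could conceptually look like the following sketch. It wraps the sample-and-simulate step rather than the raw simulator, so that theta and x stay paired; all names here are made up for illustration and are not the actual implementation:

```python
import torch


def presimulated_first(theta_pre, x_pre, prior, simulator):
    """Return a callable that serves presimulated (theta, x) pairs first,
    and only simulates once the presimulated pool is exhausted."""
    used = 0

    def run(num_sims: int):
        nonlocal used
        num_from_pool = min(num_sims, theta_pre.shape[0] - used)
        theta = theta_pre[used : used + num_from_pool]
        x = x_pre[used : used + num_from_pool]
        used += num_from_pool
        if num_from_pool < num_sims:
            # Pool is exhausted: simulate the remaining parameters.
            theta_new = prior.sample((num_sims - num_from_pool,))
            theta = torch.cat((theta, theta_new))
            x = torch.cat((x, simulator(theta_new)))
        return theta, x

    return run
```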
Disadvantages / concerns
__call__()