
Provide external data #264

Merged · 9 commits · Jul 13, 2020
Conversation

michaeldeistler
Contributor

@michaeldeistler michaeldeistler commented Jul 6, 2020

Goal

We should support users who provide external data. Importantly, this should also be possible in multi-round inference (e.g. the user could run simulations on a cluster and inference on another computer), see #163

API suggestion

For single round:

infer = SNPE(simulator, prior)
posterior = infer(num_rounds=1, num_simulations=0, external_data=(theta, x))

For multi-round, one can then do the following (not sure if I will implement this in this PR, but this would be the idea):

posterior_r2 = infer.continue(num_rounds=1, num_simulations=0, external_data=(theta2, x2), simulations_from_prior=False)

Here, the simulations_from_prior argument tells the inference algorithm whether the new simulations came from the latest posterior or from the prior; e.g., SNPE-B needs this information to use the correct importance weights. num_simulations indicates the number of additional simulations to run once external_data is exhausted.

Implementation

At the beginning of inference, we wrap the simulator to return external_data until it runs out of it (at which point it falls back onto simulating).
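For illustration, a minimal sketch of this wrapping idea. The class name and the use of plain Python lists in place of tensors are hypothetical, not the actual sbi implementation:

```python
class ExternalDataSimulator:
    """Hypothetical sketch of the wrapping idea: serve pre-simulated
    (theta, x) pairs until they run out, then fall back to the true
    simulator. Plain lists stand in for tensors."""

    def __init__(self, simulator, external_theta, external_x):
        self._simulator = simulator
        self._stock = list(zip(external_theta, external_x))

    def __call__(self, theta):
        if self._stock:
            # External data left: ignore the proposed theta and return a
            # stored pair instead of simulating.
            return self._stock.pop(0)
        # External data exhausted: actually simulate.
        return theta, self._simulator(theta)
```

Note the sketch returns (theta, x) pairs rather than just x, so the caller can keep the externally provided theta; the rest of the inference loop then stays unchanged regardless of where the data came from.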

Why not let the user do the wrapping themselves (and then just provide it as simulator)?

a) I think it's unintuitive that the simulator does not actually simulate
b) after each round, the user would need to define a new "simulator" which would run the same simulations but hold different external_data. I find this not very elegant.
c) For the docs, it might be helpful to have an argument called external_data. It makes it easy for the user to figure out how to provide external data. If we made it a decorator, I fear that we would have to write a dedicated tutorial which shows how to do this.

Why wrap it at all and not just set theta_bank=external_theta_data and x_bank=external_x_data?

This works just fine for the core functionality, but for many advanced features (e.g. handling NaN, z-scoring, retrain_from_scratch, ...) it requires extra care, making things unnecessarily complicated.

Disadvantages / concerns

  • we add an additional argument to __call__()
  • any more concerns?

@michaeldeistler michaeldeistler added the enhancement New feature or request label Jul 6, 2020
@michaeldeistler michaeldeistler self-assigned this Jul 6, 2020
@michaeldeistler michaeldeistler requested a review from ojwenzel July 6, 2020 12:17
@janfb janfb marked this pull request as draft July 6, 2020 13:08
@jan-matthis
Contributor

How about introducing a method called append_training_data or similar to the base class that all inference methods share? It would receive thetas and xs as inputs, as well as sampled_from_prior (bool). Strictly speaking, sampled_from_prior is only sometimes needed, but I guess it's good info to have in any case. The method would take care of NaN handling and appending to theta_bank/x_bank. Since this logic is shared across all inference algorithms, it would avoid code duplication. This method could then cover the external data case as well.

We could either expose the new method directly or make it a hidden/underscore method. In the first case, we would only change infer to take an external_data argument, in the second case, all inference classes would get new keyword arguments.
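A rough sketch of what such a shared method might look like (hypothetical names; plain lists stand in for tensors, and a scalar NaN check stands in for proper invalid-x handling):

```python
import math

class NeuralInference:
    """Hypothetical base class shared by all inference methods."""

    def __init__(self):
        self._theta_bank = []
        self._x_bank = []
        self._prior_masks = []

    def append_training_data(self, thetas, xs, sampled_from_prior):
        """Shared logic: drop invalid simulations, append to the banks."""
        for theta, x in zip(thetas, xs):
            if isinstance(x, float) and math.isnan(x):
                continue  # NaN handling shared across all algorithms
            self._theta_bank.append(theta)
            self._x_bank.append(x)
            self._prior_masks.append(sampled_from_prior)
```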

@janfb
Contributor

janfb commented Jul 6, 2020

Yes, I agree with @jan-matthis .

Another point: when we discussed this problem with @ppjgoncalves some time ago, I thought we converged on the view that it does not make much sense to do both, i.e. provide external data and run new simulations from the simulator, didn't we?

In the scenario where one provides external data from the simulator, one is always in full control of how many simulations one wants to pass. This applies to both the single-round and the multi-round case. The user can always pass exactly as many external simulations as needed for the given round. Thus, I think we don't want to implement this hybrid case.

@michaeldeistler
Contributor Author

michaeldeistler commented Jul 6, 2020

Cool, thanks! I implemented a first version as a protected function, have a look (I know, no docstrings etc yet ;) ).

@janfb if I remember correctly, we made this decision under the assumption that it would be messy to code up - but I do not think it is, have a look and let me know what you think.

@michaeldeistler
Contributor Author

Implementation for SNPE is ready, so now would be a great time to review (before I implement it also for SRE and SNL)

@michaeldeistler michaeldeistler marked this pull request as ready for review July 7, 2020 10:08
Contributor

@jan-matthis jan-matthis left a comment

I reviewed and left a few comments

We should decide whether we want to 1) add external_data as an argument to all inference classes, changing the API (as proposed now), or 2) add a public method append_external_data that users call with external data when using the advanced interface (infer might still get the keyword).

I'm fine with both, but am leaning more towards 2). 1) might seem convenient right now, but could develop in a direction where we add more and more keywords to __call__, duplicated across all inference classes. What would be your argument for 1)?

Tagging @Meteore, @janfb for opinions as well

@@ -7,7 +7,7 @@
from copy import deepcopy
from datetime import datetime
from pathlib import Path
from typing import Callable, Dict, List, Optional, Union, cast
from typing import Callable, Dict, List, Optional, Union, cast, Tuple
Contributor

Run isort for ordering those imports

Contributor

I think this is resolved, it'd be good to mark it as such ;)

is_valid_x, num_nans, num_infs = handle_invalid_x(x, exclude_invalid_x)
warn_on_invalid_x(num_nans, num_infs, exclude_invalid_x)

# XXX Rename bank -> rounds/roundwise.
Contributor

Could be a good point in time to address this

Contributor Author

Any suggestions here? theta_rounds would be an option, but I'm not sure I like it more than theta_bank.

Contributor

@jan-matthis jan-matthis Jul 8, 2020

How about _theta, _theta_round, or _theta_round_data (same for x)?

Contributor

_roundwise_thetas. I think bank is not very clear and I find data too uninformative -- everything is data.

x: Tensor,
external_data: Tuple[Tensor, Tensor],
round: int,
exclude_invalid_x: bool,
Contributor

Could default to True

self,
theta: Tensor,
x: Tensor,
external_data: Tuple[Tensor, Tensor],
Contributor

@jan-matthis jan-matthis Jul 7, 2020

I would remove external_data as keyword argument for this method and handle the splitting elsewhere

Contributor

I tend to agree here.

Comment on lines 176 to 179
# In the first round, the user can externally provide data.
if external_data is not None and round == 0:
theta = torch.cat((external_data[0], theta))
x = torch.cat((external_data[1], x))
Contributor

@jan-matthis jan-matthis Jul 7, 2020

Would remove this here and potentially introduce a new method, e.g., append_external_data (private or public). The advantage is that _append_training_data keeps a clean signature, always requiring theta and x.

Contributor

@alvorithm alvorithm left a comment

I am still a bit uneasy with the whole thing. It looks like the most general approach would be that you can always get simulated data together with the prior (or possibly a posterior) it was simulated from, and each round would be a separate unit that always produces an output, so that multi-round would look like:

posterior = [prior]
for i in range(1, rounds):
    posterior.append(run_round(simulator, posterior[i - 1]))

These are way too informal and scattered thoughts, just my recollection of something I thought back when we discussed this. I understand that it might not be practical to do this now.


# In the first round, the user can externally provide data.
if external_data is not None and round == 0:
theta = torch.cat((external_data[0], theta))
Contributor

This is technically a prepend, isn't it?

x_shape = x_shape_from_simulation(x)

# Curate the data by excluding NaNs. Then append to the data banks.
self._append_training_data(
Contributor

I am looking for better names - offline_simulations or prior_simulations come to mind (though prior is ambiguous...)

self,
theta: Tensor,
x: Tensor,
external_data: Tuple[Tensor, Tensor],
Contributor

I tend to agree here.

@@ -93,7 +93,7 @@ def __call__(
num_rounds: int,
num_simulations_per_round: OneOrMore[int],
x_o: Optional[Tensor] = None,
external_data: Optional[Tensor] = None,
external_data: Optional[Tuple[Tensor, Tensor]] = None,
Contributor

A workable default would be a tuple of empty tensors, wouldn't it? Then there would be no more special-case handling at prepend time.

Contributor

external... --> presimulated? Another idea for names

Contributor

@janfb janfb left a comment

Looks good so far. Nevertheless, I would suggest some changes.

At the moment the logic for combining offline and online data happens in _append_training_data in base. Wouldn't it make more sense to have it happen in run_simulations (maybe with a slightly different name, moved to base)? This function would then take as much data as possible from the external data (if there is any) and simulate the rest.

Afterwards, a function like _append_training_data would just take care of checking for NaNs etc. and appending to the banks.
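The suggested split could look roughly like this (hypothetical signature; plain lists stand in for tensors):

```python
def run_simulations(num_sims, simulator, propose_theta, external=None):
    """Take as many (theta, x) pairs as possible from pre-simulated
    external data, then simulate only the remainder."""
    thetas, xs = [], []
    if external is not None:
        ext_theta, ext_x = external
        n_take = min(num_sims, len(ext_theta))
        thetas.extend(ext_theta[:n_take])
        xs.extend(ext_x[:n_take])
        num_sims -= n_take
    # Simulate whatever external data did not cover.
    for _ in range(num_sims):
        theta = propose_theta()
        thetas.append(theta)
        xs.append(simulator(theta))
    return thetas, xs
```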

theta, x = self._run_simulations(round_, num_sims)

# Append data to self._theta_bank, self._x_bank, and self._prior_masks.
self._append_training_data(
Contributor

maybe remove the comment and change the function signature to something like _append_to_banks(theta, x, external_data, round_, exclude_invalid_x)


# Append data to self._theta_bank, self._x_bank, and self._prior_masks.
self._append_training_data(
theta, x, external_data, round_, exclude_invalid_x
Contributor

Why do we need to pass the external data here? Couldn't we handle this internally, e.g., in run_simulations? E.g., introduce it to all methods (i.e., move it to base.py), rename it to get_simulations or so, and use it to merge potential presimulated data?

def _run_simulations(
self, round_: int, num_sims: int,
) -> Tuple[Tensor, Tensor, Tensor]:
def _run_simulations(self, round_: int, num_sims: int,) -> Tuple[Tensor, Tensor]:
Contributor

We don't use this function in the other methods, and it kind of repeats the if-else on the rounds that happens afterwards for defining the neural posterior. Thus, if we don't change this function, e.g., to do the merging with external data, we might as well remove it and move the if-else into the if-else in the caller code.

Contributor

Couldn't this function call _append_to_round_bank with the new simulations?

Contributor

Also, wouldn't it make sense to move this into the inference base class and use it for all algorithms?

@michaeldeistler
Contributor Author

Thanks for all your feedback! It's quite tough to incorporate all of it (especially feedback coming from different people), but I tried my best.

Contributor

@jan-matthis jan-matthis left a comment

I've left some comments on the latest version


self._external_data = (theta, x)
self._presimulated_current_round = True

def _prepend_presimulated(self, theta: Tensor, x: Tensor) -> Tuple[Tensor, Tensor]:
Contributor

Maybe you could rename x and theta here because it looks like we are prepending them, while we are actually prepending the external data that was saved in self._external_data somewhere else.

E.g., you could name the args {theta, x}_simulated to distinguish them from the pre-simulated theta and x.

@michaeldeistler
Contributor Author

michaeldeistler commented Jul 8, 2020

Ok, lots of new changes, most importantly:

  • we now have a "setter" to store data in the banks called _append_to_round_bank.
  • to get the data from the banks, we have a "getter" called get_from_round_bank
  • theta_bank and x_bank now hold all simulations (not just the valid ones). The "getter" then filters out the valid simulations.

Variable renaming is in progress.
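In list form (not the actual tensor-based implementation), the setter/getter pair described above might look like this, with a scalar NaN check standing in for proper invalid-x handling:

```python
import math

class RoundBanks:
    """Hypothetical sketch of the bank setter/getter design: the setter
    stores every simulation, the getter filters out invalid (NaN) ones."""

    def __init__(self):
        self._theta_bank = []
        self._x_bank = []

    def _append_to_round_bank(self, theta, x):
        # Store all simulations, valid or not.
        self._theta_bank.append(theta)
        self._x_bank.append(x)

    def get_from_round_bank(self):
        # Return only the valid simulations.
        valid = [
            (t, v)
            for t, v in zip(self._theta_bank, self._x_bank)
            if not (isinstance(v, float) and math.isnan(v))
        ]
        thetas = [t for t, _ in valid]
        xs = [v for _, v in valid]
        return thetas, xs
```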

@@ -157,16 +157,21 @@ def __call__(

for round_, num_sims in enumerate(num_sims_per_round):
Contributor

Not 100% sure, but shouldn't we start counting from round_=1 now?

Contributor Author

The convention in the code right now is that we start from round 0. Should we change it to round 1?

Contributor

I would leave the initial round index at zero. Is there a specific reason for changing to start counting from one?

Contributor

@jan-matthis jan-matthis left a comment

Thanks! Everything is addressed from my side. Before merging, you should rebase on master (currently there is a merge conflict)

@janfb
Contributor

janfb commented Jul 13, 2020

Great! I will have a look within the next hours. Then we can merge it today, hopefully!

Contributor

@janfb janfb left a comment

Great! Only small comments. Good to go once they are addressed (or not).

@@ -152,6 +156,108 @@ def __init__(
median_observation_distances=[], epochs=[], best_validation_log_probs=[],
)

def provide_presimulated(
self, theta: Tensor, x: Tensor, from_round: int = 0
Contributor

Use the same argument name as in _append_to_round_bank, i.e., round_, or maybe even round_idx?

Contributor Author

Calling all of them from_round.


@@ -216,3 +216,34 @@ def warn_on_invalid_x(num_nans: int, num_infs: int, exclude_invalid_x: bool) ->
f"Found {num_nans} NaN simulations and {num_infs} Inf simulations. "
"Training might fail. Consider setting `exclude_invalid_x=True`."
)


def get_data_after_round(
Contributor

Ah, now I get it. Maybe we can find a more descriptive name here. What about changing the signature to

get_data_since_round(starting_idx, data, data_round_indices)

?
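For illustration, a list-based sketch of that suggested signature (hypothetical, not the sbi implementation):

```python
def get_data_since_round(starting_idx, data, data_round_indices):
    """Return all entries recorded in round starting_idx or later."""
    return [
        d for d, r in zip(data, data_round_indices) if r >= starting_idx
    ]
```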

@michaeldeistler
Contributor Author

michaeldeistler commented Jul 13, 2020

Thanks for all the feedback! I put in Jan's suggestions and am now waiting for tests. Will merge in <30 minutes (hopefully).
