
Provide external data #264

Merged · 9 commits · Jul 13, 2020
Conversation

michaeldeistler
Contributor

@michaeldeistler michaeldeistler commented Jul 6, 2020

Goal

We should support users who provide external data. Importantly, this should also be possible in multi-round inference (e.g. the user could run simulations on a cluster and inference on another computer), see #163

API suggestion

For single round:

infer = SNPE(simulator, prior)
posterior = infer(num_rounds=1, num_simulations=0, external_data=(theta, x))

For multi-round, one can then do the following (not sure if I will implement this in this PR, but this would be the idea):

posterior_r2 = infer.continue(num_rounds=1, num_simulations=0, external_data=(theta2, x2), simulations_from_prior=False)

Here, the simulations_from_prior argument tells the inference algorithm whether the new simulations came from the latest posterior or from the prior; e.g., SNPE-B needs this information to use the correct importance weights. num_simulations indicates the number of additional simulations to run once external_data is exhausted.

Implementation

At the beginning of inference, we wrap the simulator to return external_data until it runs out of it (at which point it falls back onto simulating).
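For illustration, a minimal sketch of this wrapping idea. The class name and the use of plain Python lists in place of tensors are hypothetical, not the actual sbi implementation:

```python
class ExternalDataSimulator:
    """Hypothetical sketch of the wrapping idea: serve pre-simulated
    (theta, x) pairs until they run out, then fall back to the true
    simulator. Plain lists stand in for tensors."""

    def __init__(self, simulator, external_theta, external_x):
        self._simulator = simulator
        self._stock = list(zip(external_theta, external_x))

    def __call__(self, theta):
        if self._stock:
            # External data left: ignore the proposed theta and return a
            # stored pair instead of simulating.
            return self._stock.pop(0)
        # External data exhausted: actually simulate.
        return theta, self._simulator(theta)
```

Note the sketch returns (theta, x) pairs rather than just x, so the caller can keep the externally provided theta; the rest of the inference loop then stays unchanged regardless of where the data came from.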

Why not let the user do the wrapping themselves (and then just provide it as simulator)?

a) I think it's unintuitive that the simulator does not actually simulate
b) after each round, the user would need to define a new "simulator" which would run the same simulations but hold different external_data. I find this not very elegant.
c) For the docs, it might be helpful to have an argument called external_data. It makes it easy for the user to figure out how to provide external data. If we made it a decorator, I fear that we would have to write a dedicated tutorial which shows how to do this.

Why wrap it at all and not just set theta_bank=external_theta_data and x_bank=external_x_data?

This works just fine for the core functionality, but for many advanced features (e.g. handling NaN, z-scoring, retrain_from_scratch, ...) it requires extra care, making things unnecessarily complicated.

Disadvantages / concerns

  • we add an additional argument to __call__()
  • any more concerns?

@michaeldeistler michaeldeistler added the enhancement New feature or request label Jul 6, 2020
@michaeldeistler michaeldeistler self-assigned this Jul 6, 2020
@michaeldeistler michaeldeistler requested a review from ojwenzel July 6, 2020 12:17
@janfb janfb marked this pull request as draft July 6, 2020 13:08
@jan-matthis
Contributor

How about introducing a method called append_training_data or similar to the base class that all inference methods share? It would receive thetas and xs as inputs, as well as sampled_from_prior (bool). Strictly speaking, sampled_from_prior is only sometimes needed, but I guess it's good info to have in any case. The method would take care of NaN handling and appending to theta_bank/x_bank. Since this logic is shared across all inference algorithms, it would avoid code duplication. This method could then cover the external data case as well.

We could either expose the new method directly or make it a hidden/underscore method. In the first case, we would only change infer to take an external_data argument, in the second case, all inference classes would get new keyword arguments.
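A rough sketch of what such a shared method might look like (hypothetical names; plain lists stand in for tensors, and a scalar NaN check stands in for proper invalid-x handling):

```python
import math

class NeuralInference:
    """Hypothetical base class shared by all inference methods."""

    def __init__(self):
        self._theta_bank = []
        self._x_bank = []
        self._prior_masks = []

    def append_training_data(self, thetas, xs, sampled_from_prior):
        """Shared logic: drop invalid simulations, append to the banks."""
        for theta, x in zip(thetas, xs):
            if isinstance(x, float) and math.isnan(x):
                continue  # NaN handling shared across all algorithms
            self._theta_bank.append(theta)
            self._x_bank.append(x)
            self._prior_masks.append(sampled_from_prior)
```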

@janfb
Contributor

janfb commented Jul 6, 2020

Yes, I agree with @jan-matthis .

Another point: when we discussed this problem with @ppjgoncalves some time ago, I thought we converged on the view that it does not make much sense to do both, i.e. provide external data and run new simulations from the simulator, didn't we?

In the scenario where one provides external data from the simulator, one is always in full control of how many simulations one wants to pass. This applies to both the single-round and the multi-round case. The user can always pass exactly as many external simulations as needed for the given round. Thus, I think we don't want to implement this hybrid case.

@michaeldeistler
Contributor Author

michaeldeistler commented Jul 6, 2020

Cool, thanks! I implemented a first version as a protected function, have a look (I know, no docstrings etc yet ;) ).

@janfb if I remember correctly, we made this decision under the assumption that it would be messy to code up - but I do not think it is, have a look and let me know what you think.

@michaeldeistler
Contributor Author

Implementation for SNPE is ready, so now would be a great time to review (before I implement it also for SRE and SNL)

@michaeldeistler michaeldeistler marked this pull request as ready for review July 7, 2020 10:08
Contributor

@jan-matthis jan-matthis left a comment

I reviewed and left a few comments

We should decide whether we want to 1) add external_data as an argument to all inference classes, changing the API (as proposed now), or 2) add a public method append_external_data that users call with external data when using the advanced interface (infer might still get the keyword).

I'm fine with both, but am leaning more towards 2). 1) might seem convenient right now, but could develop in a direction where we add more and more keywords to __call__, duplicated across all inference classes. What would be your argument for 1)?

Tagging @Meteore, @janfb for opinions as well

@@ -7,7 +7,7 @@
from copy import deepcopy
from datetime import datetime
from pathlib import Path
from typing import Callable, Dict, List, Optional, Union, cast
from typing import Callable, Dict, List, Optional, Union, cast, Tuple
Contributor

Run isort for ordering those imports

Contributor

I think this is resolved, it'd be good to mark it as such ;)

is_valid_x, num_nans, num_infs = handle_invalid_x(x, exclude_invalid_x)
warn_on_invalid_x(num_nans, num_infs, exclude_invalid_x)

# XXX Rename bank -> rounds/roundwise.
Contributor

Could be a good point in time to address this

Contributor Author

Any suggestions here? theta_rounds would be an option, but I'm not sure I like it more than theta_bank.

Contributor

@jan-matthis jan-matthis Jul 8, 2020

How about _theta, _theta_round, or _theta_round_data (same for x)?

Contributor

_roundwise_thetas. I think bank is not very clear and I find data too uninformative -- everything is data.

x: Tensor,
external_data: Tuple[Tensor, Tensor],
round: int,
exclude_invalid_x: bool,
Contributor

Could default to True

self,
theta: Tensor,
x: Tensor,
external_data: Tuple[Tensor, Tensor],
Contributor

@jan-matthis jan-matthis Jul 7, 2020

I would remove external_data as keyword argument for this method and handle the splitting elsewhere

Contributor

I tend to agree here.

Comment on lines 176 to 179
# In the first round, the user can externally provide data.
if external_data is not None and round == 0:
theta = torch.cat((external_data[0], theta))
x = torch.cat((external_data[1], x))
Contributor

@jan-matthis jan-matthis Jul 7, 2020

Would remove this here and potentially introduce a new method, e.g., append_external_data (private or public). The advantage is that _append_training_data keeps a clean signature, always requiring theta and x.

Contributor

@alvorithm alvorithm left a comment

I am still a bit uneasy with the whole thing. It looks like the most general approach would be that you can always get simulated data together with the prior (or possibly a posterior) it was simulated from, and each round would be a separate unit that always produces an output, so that multi-round would look like:

posterior = [prior]
for i in range(1, rounds):
    posterior.append(run_round(simulator, posterior[i - 1]))

These are way too informal and scattered thoughts, just my recollection of something I thought back when we discussed this. I understand that it might not be practical to do this now.


# In the first round, the user can externally provide data.
if external_data is not None and round == 0:
theta = torch.cat((external_data[0], theta))
Contributor

This is technically a prepend, isn't it?

x_shape = x_shape_from_simulation(x)

# Curate the data by excluding NaNs. Then append to the data banks.
self._append_training_data(
Contributor

I am looking for better names - offline_simulations or prior_simulations come to mind (though prior is ambiguous...)

self,
theta: Tensor,
x: Tensor,
external_data: Tuple[Tensor, Tensor],
Contributor

I tend to agree here.

@@ -93,7 +93,7 @@ def __call__(
num_rounds: int,
num_simulations_per_round: OneOrMore[int],
x_o: Optional[Tensor] = None,
external_data: Optional[Tensor] = None,
external_data: Optional[Tuple[Tensor, Tensor]] = None,
Contributor

A workable default would be a tuple of empty tensors, wouldn't it? Then there would be no more special-case handling at prepend time.

Contributor

external... --> presimulated? Another idea for names

Contributor

@janfb janfb left a comment

Looks good so far. Nevertheless, I would suggest some changes.

At the moment the logic for combining offline and online data happens in _append_training_data in base. Wouldn't it make more sense to have it happen in run_simulations (maybe with a slightly different name, moved to base)? This function would then take as much data as possible from the external data (if there is any) and simulate the rest.

Afterwards, a function like _append_training_data would just take care of checking for NaNs etc. and appending to the banks.
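The suggested split could look roughly like this (hypothetical signature; plain lists stand in for tensors):

```python
def run_simulations(num_sims, simulator, propose_theta, external=None):
    """Take as many (theta, x) pairs as possible from pre-simulated
    external data, then simulate only the remainder."""
    thetas, xs = [], []
    if external is not None:
        ext_theta, ext_x = external
        n_take = min(num_sims, len(ext_theta))
        thetas.extend(ext_theta[:n_take])
        xs.extend(ext_x[:n_take])
        num_sims -= n_take
    # Simulate whatever external data did not cover.
    for _ in range(num_sims):
        theta = propose_theta()
        thetas.append(theta)
        xs.append(simulator(theta))
    return thetas, xs
```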

theta, x = self._run_simulations(round_, num_sims)

# Append data to self._theta_bank, self._x_bank, and self._prior_masks.
self._append_training_data(
Contributor

maybe remove the comment and change the function signature to something like _append_to_banks(theta, x, external_data, round_, exclude_invalid_x)


# Append data to self._theta_bank, self._x_bank, and self._prior_masks.
self._append_training_data(
theta, x, external_data, round_, exclude_invalid_x
Contributor

Why do we need to pass the external data here? Couldn't we handle this internally, e.g., in run_simulations? E.g., introduce it to all methods (i.e., move it to base.py), rename it to get_simulations or so, and use it to merge potential presimulated data?

def _run_simulations(
self, round_: int, num_sims: int,
) -> Tuple[Tensor, Tensor, Tensor]:
def _run_simulations(self, round_: int, num_sims: int,) -> Tuple[Tensor, Tensor]:
Contributor

We don't use this function in the other methods, and it kind of repeats the if-else on the rounds that happens afterwards for defining the neural posterior. Thus, if we don't change this function, e.g., to do the merging with external data, we might as well remove it and move the if-else into the if-else in the caller code.

Contributor

Couldn't this function call _append_to_round_bank with the new simulations?

Contributor

Also, wouldn't it make sense to move this into the inference base class and use it for all algorithms?

@michaeldeistler
Contributor Author

Thanks for all your feedback! It's quite tough to incorporate all of it (especially feedback coming from different people), but I tried my best.

Contributor

@jan-matthis jan-matthis left a comment

I've left some comments on the latest version


self._external_data = (theta, x)
self._presimulated_current_round = True

def _prepend_presimulated(self, theta: Tensor, x: Tensor) -> Tuple[Tensor, Tensor]:
Contributor

Maybe you could rename x and theta here because it looks like we are prepending them, while we are actually prepending the external data that was saved in self._external_data somewhere else.

E.g., you could name the args {theta, x}_simulated to distinguish them from the pre-simulated theta and x.

@michaeldeistler
Contributor Author

michaeldeistler commented Jul 8, 2020

Ok, lots of new changes, most importantly:

  • we now have a "setter" to store data in the banks called _append_to_round_bank.
  • to get the data from the banks, we have a "getter" called get_from_round_bank
  • theta_bank and x_bank now hold all simulations (not just the valid ones). The "getter" then filters out the valid simulations.

Variable renaming is in progress.
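In list form (not the actual tensor-based implementation), the setter/getter pair described above might look like this, with a scalar NaN check standing in for proper invalid-x handling:

```python
import math

class RoundBanks:
    """Hypothetical sketch of the bank setter/getter design: the setter
    stores every simulation, the getter filters out invalid (NaN) ones."""

    def __init__(self):
        self._theta_bank = []
        self._x_bank = []

    def _append_to_round_bank(self, theta, x):
        # Store all simulations, valid or not.
        self._theta_bank.append(theta)
        self._x_bank.append(x)

    def get_from_round_bank(self):
        # Return only the valid simulations.
        valid = [
            (t, v)
            for t, v in zip(self._theta_bank, self._x_bank)
            if not (isinstance(v, float) and math.isnan(v))
        ]
        thetas = [t for t, _ in valid]
        xs = [v for _, v in valid]
        return thetas, xs
```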

@@ -157,16 +157,21 @@ def __call__(

for round_, num_sims in enumerate(num_sims_per_round):
Contributor

Not 100% sure, but shouldn't we start counting from round_=1 now?

Contributor Author

The convention in the code right now is that we start from round 0. Should we change it to round 1?

Contributor

I would leave the initial round index at zero. Is there a specific reason for changing to start counting from one?

Contributor

@jan-matthis jan-matthis left a comment

Thanks! Everything is addressed from my side. Before merging, you should rebase on master (currently there is a merge conflict)

@janfb
Contributor

janfb commented Jul 13, 2020

Great! I will have a look within the next hours. Then we can merge it today, hopefully!

Contributor

@janfb janfb left a comment

Great! Only small comments. Good to go once they are addressed (or not).

@@ -152,6 +156,108 @@ def __init__(
median_observation_distances=[], epochs=[], best_validation_log_probs=[],
)

def provide_presimulated(
self, theta: Tensor, x: Tensor, from_round: int = 0
Contributor

Use the same argument name as in _append_to_round_bank, i.e., round_, or maybe even round_idx?

Contributor Author

Calling all of them from_round.


@@ -216,3 +216,34 @@ def warn_on_invalid_x(num_nans: int, num_infs: int, exclude_invalid_x: bool) ->
f"Found {num_nans} NaN simulations and {num_infs} Inf simulations. "
"Training might fail. Consider setting `exclude_invalid_x=True`."
)


def get_data_after_round(
Contributor

Ah, now I get it. Maybe we can find a more descriptive name here. What about changing the signature to

get_data_since_round(starting_idx, data, data_round_indices)

?
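For illustration, a list-based sketch of that suggested signature (hypothetical, not the sbi implementation):

```python
def get_data_since_round(starting_idx, data, data_round_indices):
    """Return all entries recorded in round starting_idx or later."""
    return [
        d for d, r in zip(data, data_round_indices) if r >= starting_idx
    ]
```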

@michaeldeistler
Contributor Author

michaeldeistler commented Jul 13, 2020

Thanks for all the feedback! I put in Jan's suggestions and am now waiting for tests. Will merge in <30 minutes (hopefully).
