
Changes to how data is stored #685

Merged
merged 18 commits into sbi-dev:main on Jun 29, 2022

Conversation

tbmiller-astro
Contributor

Changes to how data is stored within the inference class, discussed in #678. By default, all behavior is the same as before, but the changes reduce the overall memory overhead and add extra flexibility when dealing with large datasets.

Summary of major changes:

  • Simulated data is now stored in a torch Dataset rather than in lists as before. In train(), only a single call to get_dataloaders() is needed.
  • get_dataloaders() now creates the DataLoader from the saved Dataset and takes a start_round argument to define which round(s) of simulations to load.
  • All data checks have been moved to append_simulations(), and three arguments have been added: return_self, controlling whether the method returns the class instance; data_device, controlling where the data lives, independent of the device used for training; and warn_if_zscoring, controlling whether the z-scoring check is run.
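A rough sketch of the intended flow (a simplified mock; names follow this PR's description, but the class name `InferenceSketch` and the exact signatures are illustrative, not sbi's actual code):

```python
import torch
from torch.utils.data import DataLoader, Subset, TensorDataset


class InferenceSketch:
    """Hypothetical mock of the storage scheme described above."""

    def __init__(self):
        self._dataset = None
        self._num_sims_per_round = []

    def append_simulations(self, theta, x, data_device="cpu", return_self=True):
        # All data checks would run here; data lives on `data_device`,
        # independent of the device used for training.
        theta, x = theta.to(data_device), x.to(data_device)
        if self._dataset is None:
            self._dataset = TensorDataset(theta, x)
        else:
            old_theta, old_x = self._dataset.tensors
            self._dataset = TensorDataset(
                torch.cat([old_theta, theta]), torch.cat([old_x, x])
            )
        self._num_sims_per_round.append(len(theta))
        return self if return_self else None

    def get_dataloaders(self, start_round=0, batch_size=50):
        # Build the DataLoader directly from the saved Dataset,
        # restricted to simulations from `start_round` onward.
        offset = sum(self._num_sims_per_round[:start_round])
        subset = Subset(self._dataset, range(offset, len(self._dataset)))
        return DataLoader(subset, batch_size=batch_size, shuffle=True)
```

With this shape, train() would only need `self.get_dataloaders(start_round=...)` instead of fetching simulations, building a TensorDataset, and constructing loaders in separate steps.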

@janfb
Contributor

janfb commented May 24, 2022

Hi @tbmiller-astro,

thanks a lot for this PR, and sorry for the delay in the review--it seems we are all very busy at the moment. But I plan to have a look at this soon!

Best,
Jan

@michaeldeistler
Contributor

michaeldeistler commented Jun 21, 2022

Hi @tbmiller-astro,

thanks a lot for the PR. Your code is very clean and easy to follow, and your PR description also helped a lot! Thanks!

I like that you introduced a data_device and that all data checks are moved to append_simulations().

I have two questions, though, regarding your rationale behind the other changes:

  1. Why is it preferable to store all data in a torch.TensorDataset instead of in lists of Tensors? One could also have each Tensor in the list lie on the data_device, no? Is there a strong reason to directly concatenate the Tensors into a single torch.TensorDataset instead of doing this during .train()?

  2. In what scenario would one use append_simulations(..., return_self=False)? You write that it controls whether the method returns a copy of the class. However, since there is no deepcopy(self), I am not sure that return_self=True actually doubles the memory footprint.
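To make the comparison in question 1 concrete, the two storage schemes could be sketched like this (a simplified illustration, not sbi's actual code):

```python
import torch
from torch.utils.data import TensorDataset

theta1, x1 = torch.randn(100, 2), torch.randn(100, 3)  # round 1
theta2, x2 = torch.randn(50, 2), torch.randn(50, 3)    # round 2

# Option A (this PR): concatenate immediately into one TensorDataset.
dataset = TensorDataset(torch.cat([theta1, theta2]), torch.cat([x1, x2]))

# Option B (suggested): keep per-round lists, each Tensor on data_device,
# and concatenate only when .train() actually needs a Dataset.
theta_roundwise = [theta1, theta2]
x_roundwise = [x1, x2]
dataset_lazy = TensorDataset(torch.cat(theta_roundwise), torch.cat(x_roundwise))

# Both yield the same data; B preserves round boundaries until training time.
```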

Thanks a lot for the PR!
Michael

@tbmiller-astro
Contributor Author

Hey @michaeldeistler, thanks a lot and these are great questions!

  1. Yes, I don't think there would be a functional difference between what I implemented and what you are describing. It seemed like a cleaner implementation since the data would be put into a torch.TensorDataset eventually anyway. I also like that in my PR, the torch.DataLoader in .train() is built from a single function call, rather than calling .get_simulations(), then making a separate torch.TensorDataset, and then using .get_data_loaders() as before. That said, this could also be implemented with the data stored in a list of torch.Tensors as you describe.

  2. This was an attempt to further reduce the memory footprint if needed; I thought that not returning self would help achieve this. As you mention, though, this is likely a misunderstanding on my part of how memory is managed in Python. If it is not actually helping, then this option should probably be removed.

Happy to continue making changes if requested!

Tim

@michaeldeistler (Contributor) left a comment

Hi Tim,

thanks a lot for your answer, this makes sense! I left a few minor comments below.

I think that I would prefer to store the data as lists of Tensors instead of as torch.TensorDataset. Here's why:

  1. It does not require get_simulations_indices(), but we can instead use the simpler get_simulations_since_round()
  2. Not all operations (e.g., z-scoring) require the torch.TensorDataset. So, I would prefer to generate the torch.TensorDataset only when it is really needed (i.e., in .train()).
  3. My main reason is that lists provide a nice way of structuring data that was passed over several rounds. This structure is lost when all simulations are concatenated into a single torch.TensorDataset. Sure, one can potentially recover the structure via _num_sims_per_round, but it's just less easy to see what is happening.

So, to be clear, I would only revert these changes in get_simulations() and those in append_simulations(). I am perfectly fine with your suggestion that get_dataloaders() should generate the dataloader in a single call. However, I would suggest generating the dataloaders from the lists instead of from the torch.TensorDataset.
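For illustration, loading all simulations from a given round onward from such lists could look roughly like this (a sketch; the actual sbi helper may differ in signature and details):

```python
import torch


def get_simulations_since_round(theta_roundwise, x_roundwise, starting_round=0):
    """Concatenate all simulations from `starting_round` onward.

    The roundwise lists keep the per-round structure intact; the
    concatenation happens only when the data is actually requested.
    """
    theta = torch.cat(theta_roundwise[starting_round:])
    x = torch.cat(x_roundwise[starting_round:])
    return theta, x
```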

Please let me know if you are up for implementing these changes! If you are busy, we can also merge your PR as it is right now and I'll make the changes myself.

Thanks again!

Best wishes
Michael

Resolved (outdated) review threads:
  • sbi/inference/base.py (2)
  • sbi/inference/snle/snle_base.py (2)
  • sbi/inference/snpe/snpe_base.py
  • sbi/inference/snre/snre_base.py
  • tests/base_test.py
  • sbi/examples/minimal.py
@tbmiller-astro
Contributor Author

Hey @michaeldeistler, sounds good! Happy to make the changes suggested.

For the data checks, I've found they have a big effect on the memory footprint: both warn_if_zscoring_changes_data and handle_invalid_x create copies of the entire dataset. I could change it so that they are all under one option, like run_data_checks (or something similar), to reduce the number of arguments?
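For reference, simplified versions of those checks illustrate why they copy the data (hypothetical sketches with `_sketch` suffixes, not sbi's actual implementations):

```python
import torch


def warn_if_zscoring_changes_data_sketch(x, min_std=1e-14):
    # Standardizing materializes a full-size copy of x just to inspect it.
    z = (x - x.mean(dim=0)) / x.std(dim=0).clamp(min=min_std)
    if torch.unique(z, dim=0).shape[0] < torch.unique(x, dim=0).shape[0]:
        print("Warning: z-scoring might change the effective data.")


def handle_invalid_x_sketch(x):
    # The boolean mask is cheap, but fancy indexing materializes a
    # filtered copy of the dataset.
    is_valid = torch.isfinite(x).all(dim=1)
    return x[is_valid], is_valid
```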

@michaeldeistler
Contributor

Hi Tim,

awesome, thanks!

Hmmm, I was not aware of this. However, handle_invalid_x is performed regardless of warn_on_invalid (see here). I do not think that there is a way around this, since invalid simulations must be detected and removed; otherwise, training will fail.

So, since we already have to copy the dataset once (in handle_invalid_x), it should also be fine to copy it during warn_if_zscoring_changes_data(). Or am I missing something?

I agree that we should ensure that we have as few copies of the whole dataset as possible. I created an issue #695 to track this.

For now, I would suggest that we remove the option from append_simulations(). We might come back to this in a future PR, but I'd like to think about this a bit more.

@tbmiller-astro
Contributor Author

Just addressed these requests. I also changed validate_theta_and_x() to take data_device as an additional argument and to produce more meaningful messages with these changes. - Tim

@michaeldeistler (Contributor) left a comment

This looks great! Thanks so much for taking the time to fix these things and go through the review process! I left very minor comments (but feel free to ignore them if you are busy). I'll merge the PR on Monday.

All the best and thanks again!
Michael

@@ -176,10 +188,14 @@ def train(
# This is passed into NeuralPosterior, to create a neural posterior which
# can `sample()` and `log_prob()`. The network is accessible via `.net`.
if self._neural_net is None or retrain_from_scratch:

# Get theta,x from dataset to initialize NN
theta, x, _ = self.get_simulations()

I think this should be

theta, x, _ = self.get_simulations(starting_round=start_idx)

x = self._x_roundwise[0][:training_batch_size]
theta = self._theta_roundwise[0][:training_batch_size]
self._neural_net = self._build_neural_net(theta.to("cpu"), x.to("cpu"))
self._x_shape = x_shape_from_simulation(x.to("cpu"))

Could you replace these lines with the code that you also used in snle_base.py and snre_base.py (i.e. use get_simulations)? Or is there a reason to not use get_simulations() here?

@@ -183,11 +198,14 @@ def train(
# This is passed into NeuralPosterior, to create a neural posterior which
# can `sample()` and `log_prob()`. The network is accessible via `.net`.
if self._neural_net is None or retrain_from_scratch:

# Get theta,x from dataset to initialize NN
theta, x, _ = self.get_simulations()

I think this should be

theta, x, _ = self.get_simulations(starting_round=start_idx)

@michaeldeistler linked an issue on Jun 24, 2022 that may be closed by this pull request
@janfb (Contributor) left a comment

Overall great! Thanks a lot for working on this!
see comments below.

  self._neural_net = self._build_neural_net(
-     theta[self.train_indices], x[self.train_indices]
+     theta[:training_batch_size].to("cpu"), x[:training_batch_size].to("cpu")

The theta and x batches are used by the neural net builder to build the standardizing net using the sample mean and std. I am wondering whether using only the first training_batch_size data points might affect the accuracy of the standardizing transform...
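For instance (an illustrative sketch with synthetic data, not sbi code), the z-scoring statistics from one batch are a noisier estimate than those from the full training set:

```python
import torch

torch.manual_seed(0)
theta = torch.randn(10_000, 2) * 5.0 + 3.0   # full simulated dataset
batch = theta[:50]                           # only the first training batch

# Statistics the standardizing net would be built from:
full_mean, full_std = theta.mean(dim=0), theta.std(dim=0)
batch_mean, batch_std = batch.mean(dim=0), batch.std(dim=0)

# The batch estimate is noisier than the full-data estimate, so the
# standardizing transform baked into the network can be slightly off.
print("full: ", full_mean, full_std)
print("batch:", batch_mean, batch_std)
```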

-     theta[self.train_indices], x[self.train_indices]
+     # Get theta,x from dataset to initialize NN
+     x = self._x_roundwise[0][:training_batch_size]

The same as above applies here: we are taking only the first training_batch_size samples, instead of the entire training data set, to estimate the standardizing transform, no?

self._neural_net = self._build_neural_net(
theta[self.train_indices], x[self.train_indices]
theta[:training_batch_size].to("cpu"), x[:training_batch_size].to("cpu")

same comment as above.

@@ -647,7 +647,7 @@ def check_estimator_arg(estimator: Union[str, Callable]) -> None:


  def validate_theta_and_x(
-     theta: Any, x: Any, training_device: str = "cpu"
+     theta: Any, x: Any, data_device: str = "cpu", training_device: str = "cpu"
  ) -> Tuple[Tensor, Tensor]:
r"""
Checks if the passed $(\theta, x)$ are valid.

I think it would be great to update this docstring and be more explicit about what this function is checking. What's the difference between the training and the data device, and what's the overall goal of this function?
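For example, one possible shape for the updated signature and docstring (an illustrative sketch; the actual sbi implementation may differ):

```python
from typing import Any, Tuple

import torch
from torch import Tensor


def validate_theta_and_x(
    theta: Any, x: Any, data_device: str = "cpu", training_device: str = "cpu"
) -> Tuple[Tensor, Tensor]:
    r"""Check that the passed $(\theta, x)$ are valid and move them to `data_device`.

    Ensures both are torch.Tensors with a matching batch dimension and
    casts them to float32. `data_device` is where the simulations are
    stored, which may differ from `training_device` (where the network is
    trained), e.g. data kept on CPU while training runs on a GPU.
    """
    assert isinstance(theta, Tensor), "theta must be a torch.Tensor."
    assert isinstance(x, Tensor), "x must be a torch.Tensor."
    assert theta.shape[0] == x.shape[0], (
        f"Number of parameter sets ({theta.shape[0]}) must match "
        f"number of simulation outputs ({x.shape[0]})."
    )
    return theta.float().to(data_device), x.float().to(data_device)
```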

@tbmiller-astro
Contributor Author

Thanks both! I should be able to make these final changes tomorrow! - Tim

@michaeldeistler
Contributor

Thanks again! This is really fantastic work!

All the best
Michael

@michaeldeistler merged commit a4e4b2f into sbi-dev:main on Jun 29, 2022

Successfully merging this pull request may close these issues.

Cannot train using a dataset exceeding GPU size