
Changes to how data is stored #685

Merged
merged 18 commits into sbi-dev:main on Jun 29, 2022

Conversation

tbmiller-astro
Contributor

Changes to how data is stored within the inference class, discussed in #678. By default, all behavior is the same as before, but the changes reduce the overall memory overhead and add extra flexibility when dealing with large datasets.

Summary of major changes:

  • Simulated data is now stored in a torch Dataset rather than in lists as before. In train(), only a single call to get_dataloaders() is needed.
  • get_dataloaders() now creates the DataLoader from the saved Dataset and takes a start_round argument to define which round(s) of simulations to load.
  • All data checks have been moved to append_simulations(), and three arguments have been added: return_self, controlling whether the method returns the class instance; data_device, controlling where the data lives, independent of the device used for training; and warn_if_zscoring, controlling whether the z-scoring check is run.
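A rough sketch of the intended flow (a simplified mock; names follow this PR's description, but the class name `InferenceSketch` and the exact signatures are illustrative, not sbi's actual code):

```python
import torch
from torch.utils.data import DataLoader, Subset, TensorDataset


class InferenceSketch:
    """Hypothetical mock of the storage scheme described above."""

    def __init__(self):
        self._dataset = None
        self._num_sims_per_round = []

    def append_simulations(self, theta, x, data_device="cpu", return_self=True):
        # All data checks would run here; data lives on `data_device`,
        # independent of the device used for training.
        theta, x = theta.to(data_device), x.to(data_device)
        if self._dataset is None:
            self._dataset = TensorDataset(theta, x)
        else:
            old_theta, old_x = self._dataset.tensors
            self._dataset = TensorDataset(
                torch.cat([old_theta, theta]), torch.cat([old_x, x])
            )
        self._num_sims_per_round.append(len(theta))
        return self if return_self else None

    def get_dataloaders(self, start_round=0, batch_size=50):
        # Build the DataLoader directly from the saved Dataset,
        # restricted to simulations from `start_round` onward.
        offset = sum(self._num_sims_per_round[:start_round])
        subset = Subset(self._dataset, range(offset, len(self._dataset)))
        return DataLoader(subset, batch_size=batch_size, shuffle=True)
```

With this shape, train() would only need `self.get_dataloaders(start_round=...)` instead of fetching simulations, building a TensorDataset, and constructing loaders in separate steps.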

@janfb
Contributor

janfb commented May 24, 2022

Hi @tbmiller-astro,

thanks a lot for this PR, and sorry for the delay in the review--it seems we are all very busy at the moment. But I plan to have a look at this soon!

Best,
Jan

@michaeldeistler
Contributor

michaeldeistler commented Jun 21, 2022

Hi @tbmiller-astro,

thanks a lot for the PR. Your code is very clean and easy to follow, and your PR description also helped a lot! Thanks!

I like that you introduced a data_device and that all data checks are moved to append_simulations().

I have two questions, though, regarding your rationale behind the other changes:

  1. Why is it preferable to store all data in a torch.TensorDataset instead of in lists of Tensors? One could also have each Tensor in the list lie on the data_device, no? Is there a strong reason to directly concatenate the Tensors into a single torch.TensorDataset instead of doing this during .train()?

  2. In what scenario would one use append_simulations(..., return_self=False)? You write that it controls whether the method returns a copy of the class. However, since there is no deepcopy(self), I am not sure that return_self=True actually doubles the memory footprint.
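To make the comparison in question 1 concrete, the two storage schemes could be sketched like this (a simplified illustration, not sbi's actual code):

```python
import torch
from torch.utils.data import TensorDataset

theta1, x1 = torch.randn(100, 2), torch.randn(100, 3)  # round 1
theta2, x2 = torch.randn(50, 2), torch.randn(50, 3)    # round 2

# Option A (this PR): concatenate immediately into one TensorDataset.
dataset = TensorDataset(torch.cat([theta1, theta2]), torch.cat([x1, x2]))

# Option B (suggested): keep per-round lists, each Tensor on data_device,
# and concatenate only when .train() actually needs a Dataset.
theta_roundwise = [theta1, theta2]
x_roundwise = [x1, x2]
dataset_lazy = TensorDataset(torch.cat(theta_roundwise), torch.cat(x_roundwise))

# Both yield the same data; B preserves round boundaries until training time.
```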

Thanks a lot for the PR!
Michael

@tbmiller-astro
Contributor Author

Hey @michaeldeistler, thanks a lot and these are great questions!

  1. Yes, I don't think there would be a functional difference between what I implemented and what you are describing. It seemed like a cleaner implementation since the data would be put into a torch.TensorDataset eventually anyway. I also like that in my PR, the torch.DataLoader in .train() is built from a single function call, rather than calling .get_simulations(), then making a separate torch.TensorDataset, and then using .get_data_loaders() as before. That said, this could also be implemented with the data stored in a list of torch.Tensors as you describe.

  2. This was an attempt to further reduce the memory footprint if needed; I thought that not returning self would help achieve this. As you mention, though, this is likely a misunderstanding on my part of how memory is managed in Python. If it is not actually helping, then this option should probably be removed.

Happy to continue making changes if requested!

Tim

@michaeldeistler (Contributor) left a comment

Hi Tim,

thanks a lot for your answer, this makes sense! I left a few minor comments below.

I think that I would prefer to store the data as lists of Tensors instead of as torch.TensorDataset. Here's why:

  1. It does not require get_simulations_indices(), but we can instead use the simpler get_simulations_since_round()
  2. Not all operations (e.g., z-scoring) require the torch.TensorDataset. So, I would prefer to generate the torch.TensorDataset only when it is really needed (i.e., in .train()).
  3. My main reason is that lists provide a nice way of structuring data that was passed over several rounds. This structure is lost when all simulations are concatenated into a single torch.TensorDataset. Sure, one can potentially recover the structure via _num_sims_per_round, but it's just less easy to see what is happening.

So, to be clear, I would only revert these changes in get_simulations() and those in append_simulations(). I am perfectly fine with your suggestion that get_dataloaders() should generate the dataloader in a single call. However, I would suggest generating the dataloaders from the lists instead of from the torch.TensorDataset.
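For illustration, loading all simulations from a given round onward from such lists could look roughly like this (a sketch; the actual sbi helper may differ in signature and details):

```python
import torch


def get_simulations_since_round(theta_roundwise, x_roundwise, starting_round=0):
    """Concatenate all simulations from `starting_round` onward.

    The roundwise lists keep the per-round structure intact; the
    concatenation happens only when the data is actually requested.
    """
    theta = torch.cat(theta_roundwise[starting_round:])
    x = torch.cat(x_roundwise[starting_round:])
    return theta, x
```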

Please let me know if you are up for implementing these changes! If you are busy, we can also merge your PR as it is right now and I'll make the changes myself.

Thanks again!

Best wishes
Michael

Resolved (outdated) review threads:
  • sbi/inference/base.py (2)
  • sbi/inference/snle/snle_base.py (2)
  • sbi/inference/snpe/snpe_base.py
  • sbi/inference/snre/snre_base.py
  • tests/base_test.py
  • sbi/examples/minimal.py
@tbmiller-astro
Contributor Author

Hey @michaeldeistler, sounds good! Happy to make the changes suggested.

For the data checks, I've found they have a big effect on the memory footprint: both warn_if_zscoring_changes_data and handle_invalid_x create copies of the entire dataset. I could change it so that they are all under one option, like run_data_checks (or something similar), to reduce the number of arguments?
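For reference, simplified versions of those checks illustrate why they copy the data (hypothetical sketches with `_sketch` suffixes, not sbi's actual implementations):

```python
import torch


def warn_if_zscoring_changes_data_sketch(x, min_std=1e-14):
    # Standardizing materializes a full-size copy of x just to inspect it.
    z = (x - x.mean(dim=0)) / x.std(dim=0).clamp(min=min_std)
    if torch.unique(z, dim=0).shape[0] < torch.unique(x, dim=0).shape[0]:
        print("Warning: z-scoring might change the effective data.")


def handle_invalid_x_sketch(x):
    # The boolean mask is cheap, but fancy indexing materializes a
    # filtered copy of the dataset.
    is_valid = torch.isfinite(x).all(dim=1)
    return x[is_valid], is_valid
```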

@michaeldeistler
Contributor

Hi Tim,

awesome, thanks!

Hmmm, I was not aware of this. However, handle_invalid_x is performed regardless of warn_on_invalid (see here). I do not think that there is a way around this, since invalid simulations must be detected and removed; otherwise, training will fail.

So, since we already have to copy the dataset once (in handle_invalid_x), it should also be fine to copy it during warn_if_zscoring_changes_data(). Or am I missing something?

I agree that we should ensure that we have as few copies of the whole dataset as possible. I created an issue #695 to track this.

For now, I would suggest that we remove the option from append_simulations(). We might come back to this in a future PR, but I'd like to think about this a bit more.

@tbmiller-astro
Contributor Author

Just addressed these requests. I also changed validate_theta_and_x() to take data_device as an additional argument and to produce more meaningful messages with these changes. - Tim

@michaeldeistler (Contributor) left a comment

This looks great! Thanks so much for taking the time to fix these things and go through the review process! I left very minor comments (but feel free to ignore them if you are busy). I'll merge the PR on Monday.

All the best and thanks again!
Michael

@@ -176,10 +188,14 @@ def train(
# This is passed into NeuralPosterior, to create a neural posterior which
# can `sample()` and `log_prob()`. The network is accessible via `.net`.
if self._neural_net is None or retrain_from_scratch:

# Get theta,x from dataset to initialize NN
theta, x, _ = self.get_simulations()

I think this should be

theta, x, _ = self.get_simulations(starting_round=start_idx)

x = self._x_roundwise[0][:training_batch_size]
theta = self._theta_roundwise[0][:training_batch_size]
self._neural_net = self._build_neural_net(theta.to("cpu"), x.to("cpu"))
self._x_shape = x_shape_from_simulation(x.to("cpu"))

Could you replace these lines with the code that you also used in snle_base.py and snre_base.py (i.e. use get_simulations)? Or is there a reason to not use get_simulations() here?

@@ -183,11 +198,14 @@ def train(
# This is passed into NeuralPosterior, to create a neural posterior which
# can `sample()` and `log_prob()`. The network is accessible via `.net`.
if self._neural_net is None or retrain_from_scratch:

# Get theta,x from dataset to initialize NN
theta, x, _ = self.get_simulations()

I think this should be

theta, x, _ = self.get_simulations(starting_round=start_idx)

@michaeldeistler linked an issue on Jun 24, 2022 that may be closed by this pull request
@janfb (Contributor) left a comment

Overall great! Thanks a lot for working on this!
see comments below.

  self._neural_net = self._build_neural_net(
-     theta[self.train_indices], x[self.train_indices]
+     theta[:training_batch_size].to("cpu"), x[:training_batch_size].to("cpu")

The theta and x batches are used by the neural net builder to build the standardizing net using the sample mean and std. I am wondering whether using only the first training_batch_size data points might affect the accuracy of the standardizing transform...
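For instance (an illustrative sketch with synthetic data, not sbi code), the z-scoring statistics from one batch are a noisier estimate than those from the full training set:

```python
import torch

torch.manual_seed(0)
theta = torch.randn(10_000, 2) * 5.0 + 3.0   # full simulated dataset
batch = theta[:50]                           # only the first training batch

# Statistics the standardizing net would be built from:
full_mean, full_std = theta.mean(dim=0), theta.std(dim=0)
batch_mean, batch_std = batch.mean(dim=0), batch.std(dim=0)

# The batch estimate is noisier than the full-data estimate, so the
# standardizing transform baked into the network can be slightly off.
print("full: ", full_mean, full_std)
print("batch:", batch_mean, batch_std)
```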

-     theta[self.train_indices], x[self.train_indices]
+     # Get theta,x from dataset to initialize NN
+     x = self._x_roundwise[0][:training_batch_size]

The same as above applies here: we are taking only the first training_batch_size samples, instead of the entire training data set, to estimate the standardizing transform, no?

self._neural_net = self._build_neural_net(
theta[self.train_indices], x[self.train_indices]
theta[:training_batch_size].to("cpu"), x[:training_batch_size].to("cpu")

same comment as above.

@@ -647,7 +647,7 @@ def check_estimator_arg(estimator: Union[str, Callable]) -> None:


  def validate_theta_and_x(
-     theta: Any, x: Any, training_device: str = "cpu"
+     theta: Any, x: Any, data_device: str = "cpu", training_device: str = "cpu"
  ) -> Tuple[Tensor, Tensor]:
r"""
Checks if the passed $(\theta, x)$ are valid.

I think it would be great to update this docstring and be more explicit about what this function is checking. What's the difference between the training and the data device, and what's the overall goal of this function?
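For example, one possible shape for the updated signature and docstring (an illustrative sketch; the actual sbi implementation may differ):

```python
from typing import Any, Tuple

import torch
from torch import Tensor


def validate_theta_and_x(
    theta: Any, x: Any, data_device: str = "cpu", training_device: str = "cpu"
) -> Tuple[Tensor, Tensor]:
    r"""Check that the passed $(\theta, x)$ are valid and move them to `data_device`.

    Ensures both are torch.Tensors with a matching batch dimension and
    casts them to float32. `data_device` is where the simulations are
    stored, which may differ from `training_device` (where the network is
    trained), e.g. data kept on CPU while training runs on a GPU.
    """
    assert isinstance(theta, Tensor), "theta must be a torch.Tensor."
    assert isinstance(x, Tensor), "x must be a torch.Tensor."
    assert theta.shape[0] == x.shape[0], (
        f"Number of parameter sets ({theta.shape[0]}) must match "
        f"number of simulation outputs ({x.shape[0]})."
    )
    return theta.float().to(data_device), x.float().to(data_device)
```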

@tbmiller-astro
Contributor Author

Thanks both! I should be able to make these final changes tomorrow! - Tim

@michaeldeistler
Contributor

Thanks again! This is really fantastic work!

All the best
Michael

@michaeldeistler merged commit a4e4b2f into sbi-dev:main on Jun 29, 2022

Successfully merging this pull request may close these issues.

Cannot train using a dataset exceeding GPU size