Dealing with custom DataLoader in PosteriorEstimator.train #415

Closed
narendramukherjee opened this issue Jan 25, 2021 · 7 comments

@narendramukherjee

Hi,

First of all, thanks a ton to the Macke lab team for building this toolbox. It has made it really easy to get started with using SBI in so many application domains! :)

I wanted to check in because right now the NN data-loading process is fixed inside PosteriorEstimator.train. I have a situation where I am using a 1D ConvNet as an embedding network for time-series data of unequal lengths, so I need to pass a collate_fn to a custom DataLoader that pads the unequal-length sequences in each batch before feeding them into the embedding network. The output from the embedding net into the density estimator will be fixed-length, but the input into the embedding net can be variable-length. The only way I can think of doing this right now is to write my own custom implementation of PosteriorEstimator.train (where only the data-loading part changes), so I wanted to ask:
a) Is there any other way of doing this?
b) If not, would there be interest in a PR that lets one pass a custom Dataset/DataLoader to the train method?

Finally, I have a very basic question about the train method: I am unable to understand the utility of this line, given that theta and x get re-written right afterwards here. I am probably missing something very basic, and would love a clarification from you :)

janfb commented Jan 25, 2021

Hi @narendramukherjee and thanks for creating the issue!

Yes, our API is a bit rigid here. It would be great if you made a PR to fix it! We think an additional argument to train would make the most sense. This argument could be dataloader_kwargs and would contain just the kwargs passed to data.DataLoader in train(), e.g., here https://github.com/mackelab/sbi/blob/f6dd229c3804d135c5689c1745d1ad095621b9b9/sbi/inference/snpe/snpe_base.py#L221-L226 and here https://github.com/mackelab/sbi/blob/f6dd229c3804d135c5689c1745d1ad095621b9b9/sbi/inference/snpe/snpe_base.py#L228-L233. We would need that change in the train functions for SNLE and SNRE as well.
Then one could use a custom collate_fn without having to rewrite the train() method. Does that make sense to you and would that fix your problem?
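To make the idea concrete, a rough sketch of what forwarding such an argument could look like; the helper function and its parameters below are purely illustrative and not part of the current sbi API:

    from typing import Any, Dict, Optional

    import torch
    from torch.utils.data import DataLoader, Dataset


    def make_train_loader(
        dataset: Dataset,
        batch_size: int,
        train_indices,
        dataloader_kwargs: Optional[Dict[str, Any]] = None,
    ) -> DataLoader:
        # Any user-supplied kwargs (e.g. collate_fn, num_workers, pin_memory)
        # are forwarded straight to the DataLoader constructor.
        dataloader_kwargs = dataloader_kwargs or {}
        return DataLoader(
            dataset,
            batch_size=batch_size,
            sampler=torch.utils.data.SubsetRandomSampler(train_indices),
            **dataloader_kwargs,
        )

A caller could then pass something like dataloader_kwargs=dict(collate_fn=pad_collate) to train() without touching the rest of the training loop.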

Regarding your other point -- you are absolutely right, this seems to be a leftover from a refactoring we did recently. We will remove it, thanks!

@narendramukherjee

Hi @janfb, absolutely, your suggestion makes sense and I will work on it and file a PR. It might take me a few days to get to it though, and I shall tag you as soon as it is ready :D

janfb commented Jan 28, 2021

Great, @narendramukherjee! No worries, take your time and let us know if you need help.

@narendramukherjee

@janfb Thanks for your previous comments - I looked into this a bit more today, and it seems like the API expects the simulation outputs (x) to be a multi-dimensional tensor, which is causing several issues (e.g., while validating theta and x).

Let me explain my problem scenario a bit more - my simulations generate time-series of variable length (right now, they vary from 20 to 3000). Each time-series is itself a torch Tensor (of shape (length x 5)), but since the length varies, I cannot stack them into a 3D tensor of shape (num_simulations x length x 5). That is exactly why I wanted to use a collate_fn in the DataLoader during training in the first place: I was going to pad the unequal-length sequences to the maximum length in the batch inside the collate_fn.

So, right now, I have the simulations in a list, where each element of the list is a torch Tensor of shape (length x 5) (because the time-series have 5-dimensional observations). I was planning to leave the simulated data in this form (i.e., as a list of Tensors) and convert each batch of x into a tensor of fixed dimensions that could enter the density estimator by using my collate_fn and embedding_net.

I wrote a custom dataset like this:

import torch


class SbiDataset(torch.utils.data.Dataset):
    "Characterizes a dataset for PyTorch"

    def __init__(self, theta, x, prior_masks):
        "Initialization"
        # theta is a tensor of parameters, x is a list of variable-length
        # (length x 5) tensors, prior_masks has one mask per simulation.
        self.theta = theta
        self.x = x
        self.prior_masks = prior_masks

    def __len__(self):
        "Denotes the total number of samples"
        return len(self.theta)

    def __getitem__(self, index):
        "Generates one sample of data"
        return self.theta[index], self.x[index], self.prior_masks[index]

And a collate_fn like this (it pads using the last observation in each time series):

def pad_collate(batch):
    (theta, x, prior_masks) = zip(*batch)
    x_lens = [len(simulation) for simulation in x]
    max_x_len = max(x_lens)

    # Allocate an output tensor padded to the longest sequence in this batch.
    out_x = x[0].data.new_empty((len(x), max_x_len, 5))
    for i, simulation in enumerate(x):
        out_x[i, : x_lens[i], :] = simulation
        # Pad the tail with the last observation of the time series.
        if x_lens[i] < max_x_len:
            out_x[i, x_lens[i] :, :] = simulation[-1, :].repeat(
                (max_x_len - x_lens[i], 1)
            )

    # Permute to (batch, 5, max_x_len) so the 1D ConvNet sees channels first.
    return theta, out_x.permute(0, 2, 1), prior_masks
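
For illustration, these would be wired together like this (the batch size and variable names are just examples):

    from torch.utils.data import DataLoader

    # Illustrative names: theta is a (num_simulations x num_params) tensor,
    # x is the list of variable-length (length x 5) tensors, and prior_masks
    # has one entry per simulation.
    dataset = SbiDataset(theta, x, prior_masks)
    loader = DataLoader(dataset, batch_size=50, shuffle=True, collate_fn=pad_collate)

    for theta_batch, x_batch, mask_batch in loader:
        # theta_batch and mask_batch come back as tuples of per-sample tensors,
        # while x_batch is a padded tensor of shape (batch_size, 5, max_len_in_batch).
        pass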

So each batch would yield a (theta, x, prior_masks) tuple, as SNPE.train expects; the only difference is that x starts out as a list of tensors and each batch is converted into a properly shaped tensor during padding. When I try to append_simulations with such an x, I get errors when validation happens here: https://github.com/narendramukherjee/sbi/blob/main/sbi/inference/snpe/snpe_base.py#L109
If I switch off that validation, there are still a number of places where x is treated as a tensor (e.g., calling its .size()), which doesn't work in my case because x is a list until it is yielded by my custom DataLoader.

I am wondering what your thoughts are on this issue and how I could approach it. Is there something different I can do (I am happy to work on any API changes that could accommodate this)? I don't really want to pad all my sequences to the length of the longest one across all simulations, as the lengths span a couple of orders of magnitude, which is why I thought of padding on a batch-by-batch level.

Thanks again for all your help, and for putting sbi together in the first place :)

janfb commented Feb 1, 2021

I see the problem.

Changing the requirement that x is a Tensor would be a major change which we would prefer to avoid.

My first approach would indeed be to pad all your simulations to the same length using, e.g., NaNs, and then to use the collate_fn to deal with that. I think this will not really be a memory issue unless you have millions of simulations, will it?
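
A rough sketch of the idea (illustrative only, reusing the 5-column shape from your example; I have not checked it against sbi's input validation):

    import torch


    def pad_with_nans(simulations):
        # simulations: list of (length x 5) tensors with varying lengths.
        max_len = max(s.shape[0] for s in simulations)
        out = torch.full((len(simulations), max_len, 5), float("nan"))
        for i, s in enumerate(simulations):
            out[i, : s.shape[0], :] = s
        return out  # shape: (num_simulations, max_len, 5), NaN-padded at the end


    def nan_collate(batch):
        theta, x, masks = zip(*batch)
        x = torch.stack(x)
        # Padding is contiguous at the end, so count non-NaN rows per sample,
        # crop to the longest sequence in this batch, and fill the remaining
        # NaNs with each sample's last valid observation.
        lengths = (~torch.isnan(x[..., 0])).sum(dim=1)
        x = x[:, : int(lengths.max()), :]
        for i, l in enumerate(lengths.tolist()):
            if l < x.shape[1]:
                x[i, l:, :] = x[i, l - 1, :]
        return torch.stack(theta), x.permute(0, 2, 1), torch.stack(masks)

That way x stays a regular tensor end-to-end, and the per-batch padding logic lives entirely in the collate function.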

@narendramukherjee

@janfb That's a good idea - I didn't think of padding using NaNs and then replacing those inside the collate_fn. Then I also probably don't need a custom Dataset and can use TensorDataset as is the case in train right now. Will try it out and keep you posted!

@narendramukherjee

@janfb I have made an initial PR for this issue: #435. I have run into some problems because (I think) we need to add **dataloader_kwargs to the train method of both snpe_base and snpe_c (and the same for snre and snle), so I wanted to get advice on how to go about doing this first; a simplified sketch of what I mean is below. Let me know what you think!
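
To illustrate the kind of change I mean (a heavily simplified, hypothetical sketch; the real train() signatures in sbi have many more parameters):

    from typing import Any, Dict, Optional

    from torch.utils.data import DataLoader, Dataset


    class SnpeBase:  # stand-in for snpe_base
        def train(
            self,
            dataset: Dataset,
            training_batch_size: int = 50,
            dataloader_kwargs: Optional[Dict[str, Any]] = None,
        ):
            dataloader_kwargs = dataloader_kwargs or {}
            # Forward the user-supplied kwargs (e.g. collate_fn) to the DataLoader;
            # the actual training loop would iterate over train_loader.
            train_loader = DataLoader(
                dataset, batch_size=training_batch_size, **dataloader_kwargs
            )
            ...


    class SnpeC(SnpeBase):  # stand-in for snpe_c; its train() just forwards the kwarg
        def train(
            self,
            dataset: Dataset,
            training_batch_size: int = 50,
            dataloader_kwargs: Optional[Dict[str, Any]] = None,
        ):
            return super().train(
                dataset,
                training_batch_size=training_batch_size,
                dataloader_kwargs=dataloader_kwargs,
            )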
