This repository has been archived by the owner on Feb 12, 2022. It is now read-only.

RuntimeError: shape '[5290000, 1]' is invalid for input of size 4600 #86

Open

beneyal opened this issue Dec 11, 2018 · 15 comments

@beneyal

beneyal commented Dec 11, 2018

Hi,

When running python main.py --batch_size 20 --data data/penn --dropouti 0.4 --dropouth 0.25 --seed 141 --epoch 500 --save PTB.pt I get the following error:

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/rnn.py:179: RuntimeWarning: RNN module weights are not part of single contiguous chunk of memory. This means they need to be compacted at every call, possibly greatly increasing memory usage. To compact weights again call flatten_parameters().
  self.dropout, self.training, self.bidirectional, self.batch_first)
Traceback (most recent call last):
  File "main.py", line 240, in <module>
    train()
  File "main.py", line 196, in train
    output, hidden, rnn_hs, dropped_rnn_hs = model(data, hidden, return_h=True)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/content/awd-lstm-lm/model.py", line 81, in forward
    raw_output, new_h = rnn(raw_output, hidden[l])
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/content/awd-lstm-lm/weight_drop.py", line 47, in forward
    return self.module.forward(*args)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/rnn.py", line 179, in forward
    self.dropout, self.training, self.bidirectional, self.batch_first)
RuntimeError: shape '[5290000, 1]' is invalid for input of size 4600

I'm using PyTorch 1.0. Any idea why this is happening?

Thanks!

@sdraper-CS

Also hitting this. It's related to the magic around flatten_parameters, which has apparently changed in PyTorch 1.0.

I have not yet had time to look into this in detail, but I will probably try to dig deeper.

@sdraper-CS

sdraper-CS commented Dec 17, 2018

[EDIT - corrected non-train-time behavior (oops!)]

Here's my (very) hacky solution, which I think is working OK (and should work with both PyTorch 1.0 and earlier versions). It does a little more tensor copying, but in practice these tend not to be huge tensors and it's a once-per-minibatch thing, so there's not much overall impact (at least in my usage):

import torch
from torch.nn import Parameter


class BackHook(torch.nn.Module):
    def __init__(self, hook):
        super(BackHook, self).__init__()
        self._hook = hook
        self.register_backward_hook(self._backward)

    def forward(self, *inp):
        return inp

    @staticmethod
    def _backward(self, grad_in, grad_out):
        self._hook()
        return None


class WeightDrop(torch.nn.Module):
    """
    Implements drop-connect, as per Merity et al https://arxiv.org/abs/1708.02182
    """
    def __init__(self, module, weights, dropout=0, variational=False):
        super(WeightDrop, self).__init__()
        self.module = module
        self.weights = weights
        self.dropout = dropout
        self.variational = variational
        self._setup()
        self.hooker = BackHook(lambda: self._backward())

    def _setup(self):
        for name_w in self.weights:
            print('Applying weight drop of {} to {}'.format(self.dropout, name_w))
            w = getattr(self.module, name_w)
            self.register_parameter(name_w + '_raw', Parameter(w.data))

    def _setweights(self):
        for name_w in self.weights:
            raw_w = getattr(self, name_w + '_raw')
            if self.training:
                mask = raw_w.new_ones((raw_w.size(0), 1))
                mask = torch.nn.functional.dropout(mask, p=self.dropout, training=True)
                w = mask.expand_as(raw_w) * raw_w
                setattr(self, name_w + "_mask", mask)
            else:
                w = raw_w
            rnn_w = getattr(self.module, name_w)
            rnn_w.data.copy_(w)

    def _backward(self):
        # transfer gradients from embeddedRNN to raw params
        for name_w in self.weights:
            raw_w = getattr(self, name_w + '_raw')
            rnn_w = getattr(self.module, name_w)
            raw_w.grad = rnn_w.grad * getattr(self, name_w + "_mask")

    def forward(self, *args):
        self._setweights()
        return self.module(*self.hooker(*args))
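
In case it's useful, here is a hypothetical usage sketch (the layer sizes are arbitrary, and 'weight_hh_l0' is the standard name of the hidden-to-hidden weights in torch.nn.LSTM):

# Hypothetical usage of the WeightDrop class above; sizes are illustrative only.
lstm = torch.nn.LSTM(400, 1150)
wd_lstm = WeightDrop(lstm, ['weight_hh_l0'], dropout=0.5)

x = torch.randn(35, 20, 400)     # (seq_len, batch, input_size)
output, hidden = wd_lstm(x)      # masked weights are copied in before the LSTM runs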

@DavidNemeskey

DavidNemeskey commented Jan 10, 2019

@sdraper-CS I am very curious as to what this magic actually does and why it is needed. Could you elaborate on that?

@DavidNemeskey

@sdraper-CS Thanks for your solution, BTW; it seems to work (I am still running it, waiting for the results to see if they match the paper). However, I got an error on this line, because pickle didn't like the lambda:

self.hooker = BackHook(lambda: self._backward())

Changing the line to

self.hooker = BackHook(self._backward)

solved the error.
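
For what it's worth, the reason is that torch.save goes through pickle, which serializes functions by their importable name; a bound method like self._backward can be pickled, but a lambda has no name to look up. A standalone illustration (not from the repo):

import pickle

def named_function():
    pass

pickle.dumps(named_function)   # fine: pickled by reference to its module-level name
pickle.dumps(lambda: None)     # raises pickle.PicklingError: a lambda has no importable name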

@ink-pad

ink-pad commented Jan 10, 2019

Encountered the following issue with the #86 (comment) solution.

AttributeError: 'WeightDrop' object has no attribute 'weight_hh_l0_mask'

However, this PR pytorch/pytorch#15766 seems to be working perfectly. I haven't tested it completely though.

@sdraper-CS

"@sdraper-CS I am very curious as to what this magic actually does and why it is needed. Could you elaborate on that?"

The issue is that the changes in PyTorch 1.0 make it difficult to emplace a new tensor for the weights on each batch, so instead the idea is to mask the elements of the existing weights tensor in situ. However, this means that the gradients also need to be masked on the backward pass (because we didn't actually forward through F.dropout), so the BackHook hooks the backward pass and does similar masking on the gradient tensors. I'm still not 100% sure I got this completely right, so I'd be interested in your eventual results (it seems to behave correctly for me, and in the way you'd expect from a regularizer, but that's weak evidence of actual correctness!).
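
If it helps, here is a minimal standalone sketch of that idea (illustrative names only, not the class above): mask the weight tensor in place before the forward pass, then apply the same mask to its gradient after backward.

import torch

w = torch.nn.Parameter(torch.randn(4, 4))   # stands in for an LSTM weight such as weight_hh_l0
mask = torch.nn.functional.dropout(w.new_ones(w.size(0), 1), p=0.5, training=True)

with torch.no_grad():
    w.mul_(mask)                 # in-situ masking of the existing weights tensor

loss = w.sum() ** 2              # stand-in forward pass / loss
loss.backward()

with torch.no_grad():
    w.grad.mul_(mask)            # mirror the masking on the gradient in the back-pass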

@DavidNemeskey

I have run the word-PTB LSTM model, and reached 74.54 PPL at the point where the code changes the optimizer to ASGD (and then it broke with KeyError: 'ax' on prm.data = optimizer.state[prm]['ax'].clone()). That is close to what I got earlier (around 70-72), though it does not really agree with the ablation analysis in the paper, which reports 66 without ASGD (but maybe with fine-tuning, so who knows).

BTW QRNN stops at around 770 PPL, so that also needs to be properly updated to 1.0...

I guess I'll just go back to 0.4 for now to be on the safe side.

@sdraper-CS

@DavidNemeskey I am now pretty confident that the approach is working correctly. I have retrained an NER model based on the Lample paper from 2017 with my modified version of this class, and am able to recover the same model performance as before.

@DavidNemeskey

@sdraper-CS I ran both the original and your code under Pytorch 0.4, and found the following:

  • you should modify the loops on lines 244 and 260 in main.py to enumerate named_parameters() and exclude everything that has _raw in the name (see the sketch at the end of this comment), because apparently those parameters are not part of optimizer.state. Do you know why? Shouldn't everything that is returned by parameters() (and has a grad) be optimized?
  • the scores are not exactly the same. With the hyperparameter setting listed in README.md, I get
    • SGD: 67, ASGD (final): 61.4 (dev) / 59.1 (test)
    • SGD: 72.8, ASGD (final): 64.2 (dev) / 61.63 (test)

So I guess it works, it's just that the hyperparameters might need recalibration.
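
For reference, a hedged sketch of the filtering described in the first bullet, assuming the loop body from the repo's ASGD averaging code in main.py:

# Enumerate named_parameters() and skip the '_raw' copies, which apparently
# have no entry in optimizer.state.
for name, prm in model.named_parameters():
    if '_raw' in name:
        continue
    tmp[prm] = prm.data.clone()
    prm.data = optimizer.state[prm]['ax'].clone()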

@sdraper-CS

@DavidNemeskey That's odd. I'm not sure why the _raw parameters would not be in the optimizer (as you say, anything that model.parameters() enumerates should be in the optimized set); however, they will always receive 0 gradient [directly, anyway] since they are not part of the forward-prop'd graph (that's why we have to copy values from them on forward, and to their gradients on backward). However, because we perform this copying to the gradient in the back hook, they SHOULD have gradients by the time the optimizer sees them (and it should have them in its optimization set - the presence of the 'real' parameters there is also redundant, since any updates to THOSE weights are discarded by the value copy on the forward pass [removing those from the optimized set is an optimization I really should make sometime]).
I'll run my code through and take a look at the optimizer set I see in the debugger (at least for CPU, though at that level it shouldn't matter) to see if I can reproduce the same issue as you.
It's POSSIBLE that this may manifest with some optimizers but not others - I have not experimented widely (my model just uses SGD with Nesterov momentum + an annealing schedule). I'll see what I can find and get back to you when I have more information.

@DavidNemeskey

@sdraper-CS I did another experiment and replaced the line

self.register_parameter(name_w + '_raw', Parameter(w.data))

with just

setattr(self, name_w + '_raw', w.data)

i.e. the _raw things are now not parameters at all. Consequently, ASGD doesn't blow up (as the _raw tensors are not returned by parameters()), AND the code works (i.e. I get similar results to when I manually excluded _raw parameters in main.py). I am still trying to understand why...
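
A small standalone illustration of the difference (just to show what parameters() reports in each case):

import torch

m = torch.nn.Module()
m.register_parameter('w_raw', torch.nn.Parameter(torch.zeros(3)))
print([n for n, _ in m.named_parameters()])   # ['w_raw'] -> seen by the optimizer (and by ASGD)

m2 = torch.nn.Module()
m2.w_raw = torch.zeros(3)                     # plain tensor attribute, not a Parameter
print([n for n, _ in m2.named_parameters()])  # [] -> invisible to the optimizer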

@sdraper-CS

@DavidNemeskey That really doesn't make sense to me! Stepping through in the debugger, I AM getting the _raw variants in the optimizer params (for both SGD and Adam), and it SHOULD be necessary to register the raw variants as Parameters (so I cannot explain your observations).
To provide some framework for analysis, here is a description of exactly how it is intended to work during the forward and backward training passes:

Setup:

  1. Raw parameters and underlying RNN parameters are included in the overall model parameters (provided we register the raw parameters)
  2. Optimizer is initialized with the model parameters

Forward pass:

  3. Dropout mask is constructed (and preserved) as the forward pass goes through the DropConnect wrapper
  4. Raw parameter values are multiplied by the mask and the result (the masked values) is copied into the underlying LSTM's weights tensor
  5. Forward pass through the underlying LSTM (which now has masked weights) occurs

Backward pass:

  6. Back hook is invoked; it copies the gradients from the underlying LSTM weights, masks them according to the dropout mask, and copies the result to the raw parameter gradients
  7. Optimizer step updates the parameters according to the gradients. This will update the raw parameters according to the copied gradients. It will also (redundantly) update the underlying LSTM weights according to their (unmasked) gradients, but this is actually irrelevant (apart from optimizer performance, so we could improve by removing these from the model parameters reported to the optimizer) because on the next forward pass we will anyway overwrite the LSTM weights with the raw weights

It is thus critical that the raw parameters are part of the optimized set. If they were not the expected behavior would be that we never learn anything, since the raw weights would not be updated and we'd continue to copy whatever value they were initialized with into the underlying LSTM on each forward pass.
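
Put as code, the intended per-batch flow looks roughly like this (a hedged sketch using the WeightDrop class from my earlier comment; the sizes, the stand-in encoder and the stand-in loss are illustrative only):

import torch

lstm = torch.nn.LSTM(10, 20)
wd_lstm = WeightDrop(lstm, ['weight_hh_l0'], dropout=0.5)
encoder = torch.nn.Linear(10, 10)                          # stands in for the embedding layer
optimizer = torch.optim.SGD(wd_lstm.parameters(), lr=0.1)  # steps 1-2: includes weight_hh_l0_raw

x = encoder(torch.randn(5, 3, 10))   # (seq_len, batch, input_size), produced by an earlier layer
optimizer.zero_grad()
out, _ = wd_lstm(x)                  # steps 3-5: mask built, masked weights copied into the LSTM
loss = out.pow(2).mean()             # stand-in loss
loss.backward()                      # step 6: BackHook masks the LSTM gradient onto weight_hh_l0_raw
optimizer.step()                     # step 7: weight_hh_l0_raw is updated from the copied gradient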

The above analysis does highlight one subtle point, which is that any weight initialization you intend to apply to the LSTM needs to be applied BEFORE the LSTM is wrapped inside a WeightDrop wrapper (else you'll be initializing weights that end up not actually being used, and the effective initialization will be zeros). I also think I might have a bug in the gradient normalization, since the mask produced by dropout is weighted (the surviving elements are scaled so that the mean is preserved), but because I reuse the same mask to mask the gradients on the back-pass I'm probably double-counting that normalization (I'll need to look into that more).

Sorry I cannot explain your exact findings, but hopefully the above explanation will help your analysis of what is happening in your case.

@NProkoptsev

Has anyone checked the fastai implementation for PyTorch 1.0?
https://github.com/fastai/fastai/blob/master/fastai/text/models/awd_lstm.py

@daemon

daemon commented Jun 30, 2019

@NProkoptsev You probably already know this by now, but just for everyone else who sees this: the fastai implementation works for PyTorch 1.0.

@DavidNemeskey

@daemon You are right, it works, but it cannot reproduce the numbers in the paper either. I think that boat has sailed with PyTorch 0.4; at least until someone does a full hyperparameter search for 1.0.
