
ValueError: Wrong shape for input_ids (shape torch.Size([18])) or attention_mask (shape torch.Size([18])) #10

Closed
youssefavx opened this issue Sep 13, 2020 · 14 comments

youssefavx commented Sep 13, 2020

After running the example code provided I get this error:

>>> import simalign
>>> 
>>> source_sentence = "Sir Nils Olav III. was knighted by the norwegian king ."
>>> target_sentence = "Nils Olav der Dritte wurde vom norwegischen König zum Ritter geschlagen ."
>>> model = simalign.SentenceAligner()
2020-09-13 18:02:40,806 - simalign.simalign - INFO - Initialized the EmbeddingLoader with model: bert-base-multilingual-cased
I0913 18:02:40.806071 4394976704 simalign.py:47] Initialized the EmbeddingLoader with model: bert-base-multilingual-cased
>>> result = model.get_word_aligns(source_sentence.split(), target_sentence.split())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/simalign/simalign.py", line 181, in get_word_aligns
    vectors = self.embed_loader.get_embed_list(list(bpe_lists))
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/simalign/simalign.py", line 65, in get_embed_list
    outputs = [self.emb_model(in_ids.to(self.device)) for in_ids in inputs]
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/simalign/simalign.py", line 65, in <listcomp>
    outputs = [self.emb_model(in_ids.to(self.device)) for in_ids in inputs]
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/transformers/modeling_bert.py", line 806, in forward
    extended_attention_mask: torch.Tensor = self.get_extended_attention_mask(attention_mask, input_shape, device)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/transformers/modeling_utils.py", line 248, in get_extended_attention_mask
    input_shape, attention_mask.shape
ValueError: Wrong shape for input_ids (shape torch.Size([18])) or attention_mask (shape torch.Size([18]))

I wonder if this is due to my recent update of transformers. If so, that will be difficult for me to work around, because the newest version of transformers has a fill-mask feature that was not available in previous versions and that I need in conjunction with simalign's invaluable functionality.

Hopefully this is unrelated. I did cancel the model download and then restart it (it seemed to restart from a fresh file, though I could be wrong).

youssefavx commented Sep 13, 2020

Is it possible to use custom models with simalign? (I'm mostly interested in alignment from English to English, not other languages.)

youssefavx commented Sep 13, 2020

Maybe this is the issue?

huggingface/transformers#20 (comment)

pdufter commented Sep 14, 2020

Hi @youssefavx, thanks for pointing this issue out. Custom models should work with simalign (just pass the path to the model when instantiating SentenceAligner). As for your error message: which version of transformers are you using?
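
For instance, something along these lines should work (a minimal sketch; the exact keyword for the model argument may differ between versions, so check the SentenceAligner constructor):

import simalign

# Example with a monolingual English model instead of the default
# multilingual BERT; "bert-base-cased" is only an illustrative name, and a
# local path to a fine-tuned model directory should work the same way.
aligner = simalign.SentenceAligner(model="bert-base-cased")

src = "The quick brown fox jumps over the lazy dog .".split()
tgt = "A fast brown fox leaps over a lazy dog .".split()
print(aligner.get_word_aligns(src, tgt))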

youssefavx commented Sep 14, 2020

Hey @pdufter, awesome! I will experiment with custom models (I assume I could just use the model name as with the transformers library, or do I have to find an actual path to those?).

I'm running version 3.1.0.

pdufter commented Sep 16, 2020

At the moment we have only tested on transformers==2.3.0.

@youssefavx
Author

Unfortunately I can’t really downgrade because there’s new functionality in the new transformers that is essential.

Do you know if there’s a way to run both versions of a package at the same time in the same application?

If not, then I guess I’ll try to debug this one and report back.

pdufter commented Sep 16, 2020

I do not know whether you can run two versions at the same time. But we plan to make simalign usable with newer transformers versions and to add new features soon anyway, if that helps. In the meantime, if you find the issue, any pull request is obviously highly appreciated.

@youssefavx
Author

@pdufter Will do if I solve it!

youssefavx commented Sep 17, 2020

Okay, I think I fixed this (or at least found the problem), but my fix breaks simalign for earlier versions of transformers. I really don't think compatibility with earlier versions is impossible; it's more likely due to my ignorance.

I should note that I have:

  1. Zero experience with PyTorch
  2. Very little experience with transformers

Perhaps you could add an if statement in the code like "if the version is earlier than X". A much better check would obviously be one that detects whether the tensor is already nested (i.e. has a batch dimension), since we don't know what Hugging Face will change in their package at any point in time. It's probably also less tedious to maintain the latter; a rough sketch follows.
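
Something along these lines is what I mean (just a sketch; the helper name is made up and I'm not sure it covers every case):

import torch

def ensure_batch_dim(in_ids: torch.Tensor) -> torch.Tensor:
    # Newer transformers versions hand back a 1-D tensor of token ids here,
    # while older ones return a [1, seq_len] tensor; normalize to [1, seq_len].
    if in_ids.dim() == 1:
        return in_ids.unsqueeze(0)
    return in_ids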

So here's the problem:

In this function:

def get_embed_list(self, sent_pair):
    if self.emb_model is not None:
        sent_ids = [self.tokenizer.convert_tokens_to_ids(x) for x in sent_pair]

        # Guilty variable!! 20 years in prison for uuu variable!
        inputs = [self.tokenizer.prepare_for_model(sent, return_token_type_ids=True, return_tensors='pt')['input_ids'] for sent in sent_ids]
        # ^ This right here

        outputs = [self.emb_model(in_ids.to(self.device)) for in_ids in inputs]
        # use vectors from layer 8
        vectors = [x[2][self.layer].cpu().detach().numpy()[0][1:-1] for x in outputs]

        return vectors
    else:
        return None

When I print the "inputs" variable (after updating transformers to 3.1.0):

inputs= [tensor([  101, 12852, 33288, 46495, 10652,   119, 10134, 96820, 27521, 10336,
        10155, 10105, 31515, 16997, 11630, 20636,   119,   102]), tensor([  101, 33288, 46495, 10118, 11612, 81898, 10283, 11036, 31515, 16997,
        11611, 17260, 10580, 32017, 95023,   119,   102])]

The tensors you get back are 1-dimensional rather than 2-dimensional, which I assume is why we get this error: ValueError: Wrong shape for input_ids (shape torch.Size([18])) or attention_mask (shape torch.Size([18]))

Whereas when I downgrade transformers, and I print inputs again:

inputs= [tensor([[  101, 12852, 33288, 46495, 10652,   119, 10134, 96820, 27521, 10336,
         10155, 10105, 31515, 16997, 11630, 20636,   119,   102]]), tensor([[  101, 33288, 46495, 10118, 11612, 81898, 10283, 11036, 31515, 16997,
         11611, 17260, 10580, 32017, 95023,   119,   102]])]

So all I had to do was add another dimension (array?) around it. Keep in mind I have no clue whatsoever how to do this appropriately, nor do I have any clue what I'm doing.

I searched online, and came across this solution.

So here's the edit I made:

for in_ids in inputs:
    in_ids.resize_(1, len(in_ids))

In context, the function now looks like this:

def get_embed_list(self, sent_pair):
    if self.emb_model is not None:
        sent_ids = [self.tokenizer.convert_tokens_to_ids(x) for x in sent_pair]

        inputs = [self.tokenizer.prepare_for_model(sent, return_token_type_ids=True, return_tensors='pt')['input_ids'] for sent in sent_ids]

        # reshape each 1-D token-id tensor to [1, seq_len] so the model
        # receives a batch dimension
        for in_ids in inputs:
            in_ids.resize_(1, len(in_ids))

        outputs = [self.emb_model(in_ids.to(self.device)) for in_ids in inputs]

        # use vectors from layer 8
        vectors = [x[2][self.layer].cpu().detach().numpy()[0][1:-1] for x in outputs]

        return vectors
    else:
        return None

So you may have better ideas as to what the implications of this edit are and how to better implement it.
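
For comparison, an equivalent edit that avoids the in-place resize_ (again just a sketch, not tested any further) would build new [1, seq_len] tensors instead:

# Non-in-place alternative: add a leading batch dimension to each tensor.
inputs = [in_ids.unsqueeze(0) for in_ids in inputs]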

@youssefavx
Author

And testing to make sure that the in_ids are the same before and after the resize:

in_ids before resize tensor([  101, 12852, 33288, 46495, 10652,   119, 10134, 96820, 27521, 10336,
        10155, 10105, 31515, 16997, 11630, 20636,   119,   102])
in_ids after resize tensor([[  101, 12852, 33288, 46495, 10652,   119, 10134, 96820, 27521, 10336,
         10155, 10105, 31515, 16997, 11630, 20636,   119,   102]])

@masoudjs
Member

@youssefavx
Thank you for putting in the time to fix this.
I think we should add the attention matrix as the new list.
I am updating the model to Transformers 3. I will finish it today or tomorrow.

youssefavx commented Sep 17, 2020

@masoudjs Thank you for making such a useful and essential tool

Lukecn1 commented Nov 16, 2020

I had the same issue, but it was resolved by wrapping my data in a torch DataLoader. I am not sure why that solved the problem, but solve it, it did.

@ZhuoerFeng

> I had the same issue, but it was resolved by wrapping my data in a torch DataLoader. I am not sure why that solved the problem, but solve it, it did.

Modules in torch expect inputs of shape [batch_size, ...], so performing .unsqueeze(0) on the input_ids/attention_mask tensors would help; this is effectively what torch.utils.data.DataLoader does.
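
For example (a toy sketch with made-up token ids; model stands for any BERT-style module):

import torch

input_ids = torch.tensor([101, 12852, 33288, 102])    # shape: [4]
attention_mask = torch.ones_like(input_ids)           # shape: [4]

# torch modules expect [batch_size, seq_len], so add the batch dimension.
input_ids = input_ids.unsqueeze(0)                     # shape: [1, 4]
attention_mask = attention_mask.unsqueeze(0)           # shape: [1, 4]

# outputs = model(input_ids=input_ids, attention_mask=attention_mask)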

pdufter closed this as completed on Jan 11, 2023