
TFLongformerForMaskedMLM example throws ValueError "shapes are incompatible" #11488

Closed
fredo838 opened this issue Apr 28, 2021 · 8 comments · Fixed by #11559

Comments

@fredo838
Contributor

fredo838 commented Apr 28, 2021

An official example on the TFLongformerForMaskedLM documentation page does not work.

Environment info

  • transformers version: 2.4.1
  • Platform: ubuntu 20.04
  • Python version: python3.8
  • PyTorch version (GPU?): N/A
  • Tensorflow version (GPU?): 2.4.1
  • Using GPU in script?: no
  • Using distributed or parallel set-up in script?: no

Who can help

@patrickvonplaten (Longformer)
@Rocketknight1 (tensorflow)
@sgugger (maintained examples )

Information

Model I am using: Longformer

The problem arises when using:

  • [x] the official example scripts: (give details below)
  • [ ] my own modified scripts: (give details below)

To reproduce

Steps to reproduce the behavior:

  1. docker run -it --rm python:3.8 bash (no gpus attached)
  2. python3 -m pip install pip --upgrade
  3. python3 -m pip install transformers tensorflow
  4. python3 -> launch interactive shell
  5. run following lines:
from transformers import LongformerTokenizer, TFLongformerForMaskedLM
import tensorflow as tf
tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096')
model = TFLongformerForMaskedLM.from_pretrained('allenai/longformer-base-4096')
inputs = tokenizer("The capital of France is [MASK].", return_tensors="tf")
inputs["labels"] = tokenizer("The capital of France is Paris.", return_tensors="tf")["input_ids"]
outputs = model(inputs)
# loss = outputs.loss
# logits = outputs.logits

This throws following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.8/site-packages/tensorflow/python/keras/engine/base_layer.py", line 1012, in __call__
    outputs = call_fn(inputs, *args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/transformers/models/longformer/modeling_tf_longformer.py", line 2140, in call
    loss = None if inputs["labels"] is None else self.compute_loss(inputs["labels"], prediction_scores)
  File "/usr/local/lib/python3.8/site-packages/transformers/modeling_tf_utils.py", line 158, in compute_loss
    reduced_logits = tf.boolean_mask(tf.reshape(logits, (-1, shape_list(logits)[2])), active_loss)
  File "/usr/local/lib/python3.8/site-packages/tensorflow/python/util/dispatch.py", line 201, in wrapper
    return target(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/tensorflow/python/ops/array_ops.py", line 1831, in boolean_mask_v2
    return boolean_mask(tensor, mask, name, axis)
  File "/usr/local/lib/python3.8/site-packages/tensorflow/python/util/dispatch.py", line 201, in wrapper
    return target(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/tensorflow/python/ops/array_ops.py", line 1751, in boolean_mask
    shape_tensor[axis:axis + ndims_mask].assert_is_compatible_with(shape_mask)
  File "/usr/local/lib/python3.8/site-packages/tensorflow/python/framework/tensor_shape.py", line 1134, in assert_is_compatible_with
    raise ValueError("Shapes %s and %s are incompatible" % (self, other))
ValueError: Shapes (11,) and (9,) are incompatible
@fredo838 changed the title from “simple TFLongformerForMaskedMLM throws ValueError "shapes are incompatible"” to “TFLongformerForMaskedMLM example throws ValueError "shapes are incompatible"” on Apr 28, 2021
@Rocketknight1
Member

Hi! The model is working fine here, but the problem is that "[MASK]" and "Paris" are being tokenized as different numbers of tokens, which is where your shape error is coming from. Can you link me to the exact script you got this example from?

@fredo838
Contributor Author

It's under this headline, here's the permalink: https://huggingface.co/transformers/model_doc/longformer.html#tflongformerformaskedlm

@fredo838
Contributor Author

Ah, so it's probably just a matter of updating inputs["labels"] = tokenizer("The capital of France is Paris.", return_tensors="tf")["input_ids"] to inputs["labels"] = tokenizer("The capital of [MASK] is Paris.", return_tensors="tf")["input_ids"], no?

@Rocketknight1
Member

Rocketknight1 commented Apr 29, 2021

I checked and you're absolutely right: the example as written does not work. I did some digging, and the problem is that the mask token for this model is actually '<mask>', not '[MASK]'. 'Paris' does get correctly tokenized as one token, but '[MASK]' is not recognized as a special token and instead gets 'spelled out' with three sub-word tokens. (You can see which splits the tokenizer chose by calling tokenizer.convert_ids_to_tokens() on the tokenized inputs.)

The example should work if you replace '[MASK]' with '<mask>'. Can you try that and let me know? If it works, we can make a PR to fix this example!

@fredo838
Contributor Author

fredo838 commented Apr 30, 2021

So now the following example:

from transformers import LongformerTokenizer, TFLongformerForMaskedLM
import tensorflow as tf
tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096')
model = TFLongformerForMaskedLM.from_pretrained('allenai/longformer-base-4096')
inputs = tokenizer("The capital of France is <mask>.", return_tensors="tf")
inputs["labels"] = tokenizer("The capital of France is Paris.", return_tensors="tf")["input_ids"]
outputs = model(inputs)
loss = outputs.loss
logits = outputs.logits
preds = tf.argmax(logits, axis=2)
predicted_tokens = tokenizer.convert_ids_to_tokens(tf.squeeze(preds))
print("predicted_tokens: ", predicted_tokens)

yields:

['<s>', 'The', 'Ġcapital', 'Ġof', 'ĠFrance', 'Ġis', 'ĠParis', '.', '</s>']

So at least we're doing something right, but there's still this weird Ġ character on every non-first token.

@Rocketknight1
Member

Ah, yes! The Ġ character is used to indicate word breaks. If you want to see the pure string output without it, try using the decode() method instead of convert_ids_to_tokens().

Other than that, though, your example looks good! I talked with the team and, annoyingly, we can't use it directly: the examples are all built from the same template, so we can't easily change just one. Still, we can pass some arguments to make sure our example works for Longformer in the future.

The relevant bit is here. If you'd like to try it yourself, you can submit a PR to add the argument mask='<mask>' to the add_code_sample_docstrings decorator. If that sounds like a lot of work, just let me know and I'll make the PR and credit you for spotting it!

@fredo838
Contributor Author

fredo838 commented May 3, 2021

@Rocketknight1 I added a PR (#11559)

@Rocketknight1
Member

Closing this because we have the PR now!
