Alignment between spaCy tokens and Transformer outputs #6563
Unanswered
thiippal asked this question in Help: Coding & Implementations
Replies: 1 comment 5 replies
Hey everyone,

If I've understood correctly, the Transformer-based models contain alignment information between spaCy tokens and Transformer outputs, which is available under `Doc._.trf_data.align`.
If I use the Transformer-based model for English (`en_core_web_trf`) to process the sentence "They robbed a bank.", the tokens under `Doc._.trf_data.tokens['input_texts']` are the following:

`[['<s>', 'They', 'Ġrobbed', 'Ġa', 'Ġbank', '.', '</s>']]`
Based on the alignment information under `Doc._.trf_data.align`, I assume that the tokens `<s>` and `</s>` indicate the beginning and the end of the input sequence: if I use the index of the first spaCy token to retrieve the alignment data from `Doc._.trf_data.align[0].data`, this returns the following output: `array([[1]], dtype=int32)`. In other words, the token `<s>` is passed over.
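For context, a minimal sketch along these lines (assuming spaCy v3 with `spacy-transformers` and the `en_core_web_trf` model installed, and using the attributes mentioned above) should reproduce both outputs:

```python
import spacy

# Load the Transformer-based English pipeline (RoBERTa-based).
nlp = spacy.load("en_core_web_trf")
doc = nlp("They robbed a bank.")

# Wordpiece strings produced by the transformer tokenizer, grouped per span.
print(doc._.trf_data.tokens["input_texts"])
# [['<s>', 'They', 'Ġrobbed', 'Ġa', 'Ġbank', '.', '</s>']]

# Alignment entry for the first spaCy token, "They": indices into the
# flattened wordpiece sequence shown above.
print(doc._.trf_data.align[0].data)
# [[1]] -> only wordpiece index 1 ('They'); '<s>' at index 0 is not referenced
```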
However, if I feed a sentence that contains a presumably out-of-vocabulary word such as "shibainu" to the model, this results in the following tokenization:

`[['<s>', 'sh', 'ib', 'ain', 'u', 'Ġis', 'Ġa', 'Ġdog', '</s>']]`
What confuses me is that when I access the alignment information under `Doc._.trf_data.align` for this sentence, the token `<s>` is included in the alignment information for the spaCy Token "shibainu".

What might explain this behaviour?