Alignment between spaCy tokens and Transformer outputs #6563
Unanswered
thiippal asked this question in Help: Coding & Implementations
Replies: 1 comment 5 replies
Hey everyone,

If I've understood correctly, the Transformer-based models contain alignment information between spaCy tokens and Transformer outputs, which is available under `Doc._.trf_data.align`.
If I use the Transformer-based model for English (`en_core_web_trf`) to process the sentence "They robbed a bank.", the tokens under `Doc._.trf_data.tokens['input_texts']` are the following:

`[['<s>', 'They', 'Ġrobbed', 'Ġa', 'Ġbank', '.', '</s>']]`
Based on the alignment information under `Doc._.trf_data.align`, I assume that the tokens `<s>` and `</s>` indicate the beginning and the end of the input sequence: if I use the index of the first spaCy token to retrieve the alignment data from `Doc._.trf_data.align[0].data`, this returns the following output: `array([[1]], dtype=int32)`. In other words, the token `<s>` is passed over.
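For context, a minimal sketch along these lines (assuming spaCy v3 with `spacy-transformers` and the `en_core_web_trf` model installed, and using the attributes mentioned above) should reproduce both outputs:

```python
import spacy

# Load the Transformer-based English pipeline (RoBERTa-based).
nlp = spacy.load("en_core_web_trf")
doc = nlp("They robbed a bank.")

# Wordpiece strings produced by the transformer tokenizer, grouped per span.
print(doc._.trf_data.tokens["input_texts"])
# [['<s>', 'They', 'Ġrobbed', 'Ġa', 'Ġbank', '.', '</s>']]

# Alignment entry for the first spaCy token, "They": indices into the
# flattened wordpiece sequence shown above.
print(doc._.trf_data.align[0].data)
# [[1]] -> only wordpiece index 1 ('They'); '<s>' at index 0 is not referenced
```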
However, if I feed a sentence that contains a presumably out-of-vocabulary word such as "shibainu" to the model, this results in the following tokenization:

`[['<s>', 'sh', 'ib', 'ain', 'u', 'Ġis', 'Ġa', 'Ġdog', '</s>']]`
What confuses me is that when I access the alignment information under `Doc._.trf_data.align` for this sentence, the token `<s>` is included in the alignment information for the spaCy Token "shibainu".

What might explain this behaviour?