How to use torchtext for sequence labelling with wordpiece tokenizers #619
Comments
@mttk any ideas?
This is an oversight on our side. Replace LABEL = data.LabelField() with LABEL = data.Field(is_target=True, unk_token=None). You will get misaligned (sequence-length-wise) batches, but that's fine since you know how to align them.
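A minimal sketch of that replacement, assuming the legacy torchtext data API used elsewhere in this thread:

```python
from torchtext import data

# Define the label column as a sequential Field rather than a LabelField,
# so each token keeps its own tag. unk_token=None keeps <unk> out of the
# label vocabulary; is_target=True marks the field as a prediction target.
LABEL = data.Field(is_target=True, unk_token=None)
```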
Hi John! Did you get this problem solved?
@haorannlp Sorry! Not using torchtext lately and don't remember the details of this problem.
@haorannlp Try AllenNLP! |
❓ Questions and Help
Description
Hi,
In a previous issue (#609), I asked how to use the tokenizer from the Transformers library with torchtext.
I would now like to use this tokenizer with torchtext to load sequence labelling datasets. The issue I am facing is that the tokenizer introduces wordpiece tokens, which breaks the alignment between tokens and labels.
Ignoring labels, I am able to load a sequence labelling dataset with a Transformers tokenizer like so:
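(The original snippet was not preserved in this copy; the code below is a reconstruction, assuming the legacy torchtext data/datasets API — the model name, file path, and field arguments are illustrative.)

```python
from torchtext import data, datasets
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

# Tokenize with the wordpiece tokenizer and map tokens straight to ids,
# so no torchtext vocabulary is needed for the text field.
TEXT = data.Field(
    use_vocab=False,
    tokenize=tokenizer.tokenize,
    preprocessing=tokenizer.convert_tokens_to_ids,
    pad_token=tokenizer.pad_token_id,
)

# One (name, field) pair per column; only the first (word) column is kept.
fields = [("text", TEXT), (None, None), (None, None), (None, None)]

train = datasets.SequenceTaggingDataset(path="train.txt", fields=fields)
```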
The data comes from here, and is a tab-separated file with four columns. The first column contains words, the last contains labels, and sentences are separated by blank lines, e.g.
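(The example fragment was lost in this copy; the one below is invented to illustrate the format.)

```
Paris	NNP	B-NP	B-LOC
is	VBZ	B-VP	O
in	IN	B-PP	O
France	NNP	B-NP	B-LOC
.	.	O	O
```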
But when I try to load the labels, e.g.
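(Again a reconstruction; per the fix suggested in the comments above, the original used data.LabelField(). Field names and path are illustrative.)

```python
# The label column gets its own field; LabelField is non-sequential,
# i.e. it expects a single label per example.
LABEL = data.LabelField()

fields = [("text", TEXT), (None, None), (None, None), ("label", LABEL)]
train = datasets.SequenceTaggingDataset(path="train.txt", fields=fields)
LABEL.build_vocab(train)
```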
I get errors when trying to access the batch (sketched below), which I am guessing arise because the number of items in the text and label fields is no longer the same.
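A sketch of the failing access, with illustrative iterator settings; the exact exception was not preserved here:

```python
train_iter = data.BucketIterator(train, batch_size=32)

# This is where things break: LabelField expects one label per example,
# but each example carries a list of per-token labels.
batch = next(iter(train_iter))
```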
Has anyone come across this issue and been able to solve it? I know how to write a function to re-align the labels with the wordpiece-tokenized text (a sketch follows). Where might I insert that function in the loading process?
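A hypothetical re-alignment helper, not from the original post: it spreads each word-level label across the wordpieces of its word, keeping the true label on the first piece and a placeholder on the rest.

```python
def align_labels(words, labels, tokenizer, subword_label="X"):
    """Expand word-level labels to match the wordpiece tokenization."""
    aligned = []
    for word, label in zip(words, labels):
        pieces = tokenizer.tokenize(word)
        # true label on the first piece, placeholder on continuation pieces
        aligned.extend([label] + [subword_label] * (len(pieces) - 1))
    return aligned
```

One option is to run something like this in a preprocessing pass over the raw files before torchtext loads them, so text and labels arrive already aligned.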