`arr = [[self.vocab.stoi[x] for x in ex] for ex in arr]` KeyError: None #618
Comments
This happens due to redefining the unk token, which is hardcoded at one point. It's a pretty big bug, thanks for spotting it. I will issue a fix soon.
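A minimal pure-Python sketch of the failure mode described above (the names and mechanism here are assumptions for illustration, not torchtext's actual source): numericalization falls back to a hardcoded `'<unk>'` key, which is absent once the unk token has been renamed, so out-of-vocabulary lookups yield `None` and a later lookup raises `KeyError: None`.

```python
# Sketch of the bug (assumed mechanism, illustrative names only):
# the vocab was built with a custom unk token, but the fallback key is hardcoded.
stoi = {"my_unk_token": 0, "the": 1, "cat": 2}

def numericalize(token, stoi):
    # stoi.get("<unk>") is None because only "my_unk_token" exists,
    # so out-of-vocabulary tokens map to None instead of the unk index
    return stoi.get(token, stoi.get("<unk>"))

print(numericalize("dog", stoi))  # None -> the later KeyError: None
```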
When running on my own custom dataset using
Could somebody help me fix this issue?
Your vocab is likely initialized without an UNK token. Can you paste the initialization of the Fields for the dataset?
@mttk
It runs successfully on the training dataset; however, on the validation/testing dataset it shows this error:
The testing code is below:
From what I see, you have token-wise labels (the POS tags). Since LabelField assumes there is no tokenization (a single label per example), it treats the whole tag sequence as one label. To fix this, for every output Field that contains sequential data (in this case, I assume that is the POS tag field), use a Field suited to sequential data.
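To illustrate the distinction, here is a pure-Python sketch mirroring the failing line `arr = [[self.vocab.stoi[x] for x in ex] for ex in arr]` (all data and names below are made up for illustration):

```python
# One label per example (LabelField-style) vs one tag per token (sequential Field)
labels = ["pos", "neg"]                # a single label per example
tags = [["DET", "NOUN"], ["VERB"]]     # token-wise labels (POS tags)

label_stoi = {l: i for i, l in enumerate(sorted(set(labels)))}
tag_stoi = {t: i for i, t in enumerate(sorted({t for ex in tags for t in ex}))}

# token-wise numericalization needs the nested lookup torchtext performs;
# a single-label vocab would miss on every individual tag
encoded = [[tag_stoi[t] for t in ex] for ex in tags]
print(encoded)  # [[0, 1], [2]]
```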
@mttk
And using LabelField, it works!
Great, I'm glad that I inadvertently helped!
Hi @mttk, I get this error:
I used the
Any ideas to fix this issue?
Can you try setting use_vocab=False on that Field? Right now, you use the HF tokenizer to convert tokens to IDs, but torchtext Fields by default construct a vocabulary (and expect strings as keys). You don't need a vocab because you're already using the pretrained one from HF, so you can just disable it in torchtext.
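A pure-Python sketch of why the default vocab breaks in this setup (the ids and names below are illustrative, not real tokenizer output): a HF tokenizer already emits integer ids, while a torchtext-style vocab is keyed on strings, so every lookup misses.

```python
# ids as a pretrained (e.g. HF) tokenizer would emit them (illustrative values)
ids = [101, 7592, 102]
stoi = {"hello": 0, "<unk>": 1}   # a string-keyed, torchtext-style vocab

# integer keys never match string keys, so every token "falls through"
looked_up = [stoi.get(i) for i in ids]
print(looked_up)  # [None, None, None] -> downstream KeyError: None
```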
@mttk
Hi @mttk, any ideas? KeyError:
Preprocessing:
I think it's weird in the
This will be retired soon; see #985.
🐛 Bug
Describe the bug
I came across this error when using data.Field. It only happens when I define my own unk_token and set min_freq > 1 at the same time.
To Reproduce
the code I use:

```python
from torchtext import data, datasets  # torchtext legacy API

SRC = data.Field(lower=True, unk_token="my_unk_token")
TGT = data.Field(lower=True)
train, val, test = datasets.IWSLT.splits(exts=('.de', '.en'), fields=(SRC, TGT))
SRC.build_vocab(train, min_freq=10)
TGT.build_vocab(train, min_freq=10)  # target vocab is also needed before batching
train_iter = data.BucketIterator(dataset=train, batch_size=64,
                                 sort_key=lambda x: data.interleave_keys(len(x.src), len(x.trg)))
batch = next(iter(train_iter))  # raises KeyError: None with the custom unk_token
```
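Until the fix lands, one possible workaround is sketched below in pure Python (this patches an illustrative string-to-index dict, not the actual torchtext Vocab API): make the mapping a defaultdict so any miss, including the hardcoded '<unk>' fallback, resolves to the custom unk index.

```python
from collections import defaultdict

# patch the vocab mapping so every missed lookup returns the custom unk index
stoi = {"my_unk_token": 0, "the": 1}
unk_index = stoi["my_unk_token"]
stoi = defaultdict(lambda: unk_index, stoi)

print(stoi["dog"])  # 0 -- unknown tokens now resolve to the custom unk token
```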