How might I use the tokenizers from the HuggingFace Transformers library #609
To your questions: I think you need to check the dimension of the `encode` output. If you set `use_vocab=False`, torchtext expects the data to already be numericalized. I recently added a third-party tokenizer (sentencepiece) to torchtext; it is based on a different structure, so you may also take a look (here).
@zhangguanheng66 Hi, and thanks for the response.
Sorry, I am not sure what "dimension" means in this context?
My understanding is that the tokenizer's `encode` method already returns a list of token IDs:

```python
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
>>> tokenizer.encode("This is a test")
[1188, 1110, 170, 2774]
```

Is there something I am misunderstanding?
The dimension means the output from the `encode` method. I see, so you don't need to build a torchtext vocab then. @mttk Under the current torchtext structure, is there a way to combine the tokenization and numericalization steps?
@JohnGiorgi your code is exactly how this should be done. You don't use torchtext's vocab and instead provide your own tokenization. The error happens due to the padding step done while batching data, where the default padding token in `Field` is the string `'<pad>'`, which can't be converted to an integer. The solution is simply to fetch the pad index from the tokenizer and set that (int) value as the field's `pad_token`:

```python
pad_index = tokenizer.convert_tokens_to_ids(tokenizer.pad_token)
TEXT = data.Field(use_vocab=False, tokenize=tokenizer.encode, pad_token=pad_index)
```

This works for my case (I copied your code and used it with a different dataset).
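Pulling the pieces together, a minimal end-to-end sketch (assuming the legacy torchtext `Field` API used throughout this thread, and a BERT tokenizer, which ships with a pad token defined):

```python
from torchtext import data
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')

# The pad index is an int from the tokenizer's own vocab ('[PAD]' is 0 for BERT).
pad_index = tokenizer.convert_tokens_to_ids(tokenizer.pad_token)

# use_vocab=False: the field stores the ints produced by tokenizer.encode, and
# batching pads with the tokenizer's own pad id instead of the string '<pad>'.
TEXT = data.Field(use_vocab=False, tokenize=tokenizer.encode, pad_token=pad_index)
```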
@mttk Thank you so much. This is exactly what I was looking for! Final question: do I still need to call `TEXT.build_vocab()` in my case?
In this case, no. `build_vocab` is relevant only when you use a vocab.
I'm doing something of the form

```python
from pytorch_transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
```

and then going through the above suggestions to get

```python
pad_index = tokenizer.convert_tokens_to_ids(tokenizer.pad_token)
unk_index = tokenizer.convert_tokens_to_ids(tokenizer.unk_token)

TEXT = torchtext.data.Field(use_vocab=False, tokenize=tokenizer.encode, pad_token=pad_index, unk_token=unk_index)
train, test, val = dataset.splits(TEXT)
train_iter, test_iter, valid_iter = BPTTIterator.splits((train, test, val), batch_size=batch_size, bptt_len=bptt_len)
```

but I still get the error. Any idea on how to fix this?
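One thing worth checking here (an assumption on my part, since the trace isn't shown): GPT-2's pretrained tokenizer does not define a pad token, so `tokenizer.pad_token` comes back empty and `pad_index` never becomes a valid integer. A common workaround is to register one before deriving the index:

```python
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# GPT-2 has no pad token out of the box; reuse EOS ('<|endoftext|>') for padding.
tokenizer.pad_token = tokenizer.eos_token
pad_index = tokenizer.convert_tokens_to_ids(tokenizer.pad_token)  # now a real int
```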
Can you paste the full error trace?
I did some work on it this afternoon and dug into the data reaching the field. I also tried setting an EOS token:

```python
eos_index = tokenizer.convert_tokens_to_ids(tokenizer.eos_token)

TEXT = torchtext.data.Field(use_vocab=False,
                            tokenize=tokenizer.encode,
                            pad_token=pad_index,
                            unk_token=unk_index,
                            eos_token=eos_index)
```

but that's still not fixing the issue.
I understand the issue, but without the whole script or the Python error trace, it's hard to pinpoint where the error occurred. If you could paste either, it would be a great help.
I'm also getting the same error. I want to use `GPT2Tokenizer` to encode sentences with torchtext.

```python
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

TEXT = data.Field(use_vocab=False, tokenize=tokenizer.encode)
fields = [("src", TEXT), ("trg", TEXT)]
train_data, valid_data = data.TabularDataset.splits(
    path=data_dir,
    train=train_file,
    test=valid_file,
    format="CSV",
    fields=fields,
)
train_iterator, valid_iterator = data.BucketIterator.splits(
    (train_data, valid_data),
    batch_size=batch_size,
    device=device)

next(iter(train_iterator))
```

Error:

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "~/miniconda3/envs/LSP/lib/python3.6/site-packages/torchtext/data/iterator.py", line 156, in __iter__
    yield Batch(minibatch, self.dataset, self.device)
  File "~/miniconda3/envs/LSP/lib/python3.6/site-packages/torchtext/data/batch.py", line 34, in __init__
    setattr(self, name, field.process(batch, device=device))
  File "~/miniconda3/envs/LSP/lib/python3.6/site-packages/torchtext/data/field.py", line 237, in process
    tensor = self.numericalize(padded, device=device)
  File "~/miniconda3/envs/LSP/lib/python3.6/site-packages/torchtext/data/field.py", line 359, in numericalize
    var = torch.tensor(arr, dtype=self.dtype, device=device)
TypeError: an integer is required (got type str)
```
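For reference, that last frame reproduces in isolation whenever a string survives into the array handed to `torch.tensor`, e.g. torchtext's default string pad token (the exact message can vary by PyTorch version):

```python
import torch

# Field pads with the string '<pad>' unless told otherwise, so the padded
# batch mixes ints and strings and the tensor conversion rejects it.
torch.tensor([[1188, 1110], [170, '<pad>']], dtype=torch.long)
# TypeError: an integer is required (got type str)
```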
Could you check this reply and see if it works for you?
Maybe I have a similar question: when I use torchtext to process NER data, a training example looks like

TEXT: ['海', '钓', '比', '赛', '地', '点', '在', '厦', '门', '与', '金', '门', '之', '间', '的', '海', '域', '。']
LABEL: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'I-LOC', 'O', 'B-LOC', 'I-LOC', 'O', 'O', 'O', 'O', 'O', 'O']

When I generate the batched data and feed it to the model (Bi-LSTM+CRF), it does not work:

```
too many dimensions 'str'
```
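Not from this comment, but one hypothetical setup that avoids that error by making sure no raw strings reach `torch.tensor`: give the tags their own field and vocab (the field names here are made up):

```python
from torchtext import data

# Both words and tags are strings, so both need numericalizing before batching.
# Pre-tokenized lists pass through Field.preprocess untouched.
TEXT = data.Field()
LABEL = data.Field(unk_token=None)  # the tag set is closed; no <unk> needed

# fields = [('text', TEXT), ('label', LABEL)]
# ... build the dataset, then:
# TEXT.build_vocab(train_data)
# LABEL.build_vocab(train_data)
```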
@JohnGiorgi and @mttk, may I ask one or two questions? The tokenizers can also return a dict of the form

```python
{
    'input_ids': list[int] or tensor,
    'attention_mask': list[int] or tensor
}
```

Does torchtext still apply in these cases?
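Not something the thread settles, but one way it could still work: pass only the `input_ids` through the field and rebuild the attention mask from the pad index per batch. A sketch assuming a transformers version with callable tokenizers:

```python
from torchtext import data
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
pad_index = tokenizer.convert_tokens_to_ids(tokenizer.pad_token)

# Keep only input_ids for torchtext; the mask is recoverable from the padding.
TEXT = data.Field(
    use_vocab=False,
    tokenize=lambda s: tokenizer(s)['input_ids'],
    pad_token=pad_index,
)

# Per padded batch tensor: attention_mask = (batch_text != pad_index).long()
```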
Btw, I don't find the given solutions very efficient. Combining numericalization and tokenization into one step (`tokenize=tokenizer.encode`) means everything is computed up front when the dataset is built, instead of numericalizing on the fly while batching.
@jindal2309 Were you able to fix this issue? I am also facing the same error.
@neerajsharma9195 @jindal2309 @Mrxiexianzhao The array getting passed to `torch.tensor` still contains strings, which is exactly what the `TypeError` complains about.

@mateuszpieniak A simple solution is to write a custom class that inherits from `Field` and numericalizes with the tokenizer on the fly:

```python
import torch
from torchtext.data import Field


class HuggingFaceField(Field):
    def __init__(self, tokenizer):
        # Tokenize with the HuggingFace tokenizer; ids are produced later,
        # per batch, in numericalize.
        super().__init__(tokenize=tokenizer.tokenize)
        self.tokenizer = tokenizer

    def numericalize(self, arr, device=None):
        # arr is the padded batch of token strings produced by Field.pad.
        arr = [self.tokenizer.convert_tokens_to_ids(x) for x in arr]
        return torch.tensor(arr, device=device)
```

Although... I'm not sure why it's more efficient to numericalize on the fly. If you are going to tokenize the whole dataset from the start, why not numericalize it too?
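A quick (hypothetical) check that the override behaves as intended, calling `numericalize` directly on an already-padded batch:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
field = HuggingFaceField(tokenizer)

# Rows padded by hand with BERT's own pad token, as Field.pad would produce.
padded = [['This', 'is', 'a', 'test'], ['Hello', '[PAD]', '[PAD]', '[PAD]']]
print(field.numericalize(padded))  # a LongTensor of ids, one row per example
```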
@neerajsharma9195 @jindal2309 @Mrxiexianzhao I was able to reproduce the error.
If anyone else ran into a similar issue with the sentencepiece tokenizer, I got it working by letting the field call sentencepiece directly, and with that you can create a BucketIterator as usual.
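Something along these lines (a sketch, not the exact code from that comment; the model path is a placeholder):

```python
import sentencepiece as spm
from torchtext import data

sp = spm.SentencePieceProcessor()
sp.Load('spm.model')  # placeholder path to a trained sentencepiece model

# encode_as_ids returns ints directly, mirroring the tokenizer.encode trick above.
# pad_id() is only valid if the model was trained with a pad piece; it is -1 otherwise.
TEXT = data.Field(use_vocab=False, tokenize=sp.encode_as_ids, pad_token=sp.pad_id())

# train_iter, valid_iter = data.BucketIterator.splits(
#     (train_data, valid_data), batch_size=32)
```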
Here is an example that uses sentencepiece as a building block in a data processing pipeline: #887
@zhangguanheng66 I was using the Brown Corpus, which is tokenized already.
❓ Questions and Help
Description
TL;DR: Has anyone been able to successfully integrate the transformers library tokenizer with torchtext?
I wanted to use the torchtext library to process/load data for use with the transformers library. I was able to set their tokenizer in a `Field` object and build a vocabulary without issue. But I am stuck on how to numericalize according to their tokenizer's vocab, so I tried to numericalize in the field with their tokenizer's `encode` method and set `use_vocab=False`. But then I get strange issues when trying to access the batch. Any suggestions on how to go about this?