
How might I use the tokenizers from the HuggingFace Transformers library? #609

Closed

JohnGiorgi opened this issue Oct 2, 2019 · 21 comments

@JohnGiorgi

JohnGiorgi commented Oct 2, 2019

❓ Questions and Help

Description

TL;DR: Has anyone been able to successfully integrate the transformers library tokenizer with torchtext?

I wanted to use the torchtext library to process/load data for use with the transformers library. I was able to set their tokenizer in a Field object and build a vocabulary without issue:

from torchtext import data
from torchtext import datasets
from transformers import AutoTokenizer

path = 'path/to/med_nli/'

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

TEXT = data.Field(use_vocab=True, tokenize=tokenizer.tokenize)
LABEL = data.LabelField()

fields = {'sentence1': ('premise', TEXT),
          'sentence2': ('hypothesis', TEXT),
          'gold_label': ('label', LABEL)}

train, valid, test = data.TabularDataset.splits(
    path=path, 
    train='mli_train_v1.jsonl',
    validation='mli_dev_v1.jsonl',
    test='mli_test_v1.jsonl',
    format='json', 
    fields=fields
)

train_iter, valid_iter, test_iter = data.BucketIterator.splits(
    (train, valid, test), batch_sizes=(16, 256, 256)
)

TEXT.build_vocab(train)
LABEL.build_vocab(train)

Note: I am using the MedNLI dataset, but it appears to be formatted the same way as the SNLI dataset.

But I am stuck on how to numericalize according to their tokenizer's vocab. So I tried to numericalize in the Field with their tokenizer's encode method and set use_vocab=False:

from torchtext import data
from torchtext import datasets
from transformers import AutoTokenizer

path = 'path/to/med_nli/'

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

TEXT = data.Field(use_vocab=False, tokenize=tokenizer.encode)
LABEL = data.LabelField()

fields = {'sentence1': ('premise', TEXT),
          'sentence2': ('hypothesis', TEXT),
          'gold_label': ('label', LABEL)}

train, valid, test = data.TabularDataset.splits(
    path=path, 
    train='mli_train_v1.jsonl',
    validation='mli_dev_v1.jsonl',
    test='mli_test_v1.jsonl',
    format='json', 
    fields=fields
)

train_iter, valid_iter, test_iter = data.BucketIterator.splits(
    (train, valid, test), batch_sizes=(16, 256, 256)
)

# TEXT.build_vocab(train)
LABEL.build_vocab(train)

But then I get strange issues when trying to access a batch:

batch = next(iter(train_iter))
print("Numericalize premises:\n", batch.premise)
print("Numericalize hypotheses:\n", batch.hypothesis)
print("Entailment labels:\n", batch.label)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-55-9919119fad82> in <module>
----> 1 batch = next(iter(train_iter))

~/miniconda3/envs/ml4h/lib/python3.7/site-packages/torchtext/data/iterator.py in __iter__(self)

~/miniconda3/envs/ml4h/lib/python3.7/site-packages/torchtext/data/batch.py in __init__(self, data, dataset, device)

~/miniconda3/envs/ml4h/lib/python3.7/site-packages/torchtext/data/field.py in process(self, batch, device)

~/miniconda3/envs/ml4h/lib/python3.7/site-packages/torchtext/data/field.py in numericalize(self, arr, device)

ValueError: too many dimensions 'str'

Any suggestions on how to go about this?

@zhangguanheng66
Contributor

To your questions, I think you need to check the dimension of the encode func.

If you set use_vocab=True and call build_vocab(), it will build a vocab based on torchtext.vocab. To use HuggingFace's vocab, you need to find their API for tokenization + numericalization.

I recently added a third-party tokenizer (SentencePiece) to torchtext. It is based on a different structure, so you may also want to take a look (here).
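
For anyone who wants to try that route, here is a minimal sketch based on the torchtext.data.functional helpers; the model filename is a placeholder for a trained SentencePiece model file:

from torchtext.data.functional import load_sp_model, sentencepiece_numericalizer

# "spm_user.model" is a placeholder for a trained SentencePiece model file
sp_model = load_sp_model("spm_user.model")
sp_id_generator = sentencepiece_numericalizer(sp_model)

# yields one list of ids per input sentence (tokenize + numericalize in one step)
print(list(sp_id_generator(["sentencepiece encode as ids", "another example to try"])))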

@JohnGiorgi
Author

@zhangguanheng66 Hi, and thanks for the response.

I think you need to check the dimension of the encode func.

Sorry, I am not sure what "dimension" means in this context.

To use HuggingFace's vocab, you need to find their API for tokenization+numericalization.

My understanding is that encode is their API for tokenization+numericalization. E.g.

>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
>>> tokenizer.encode("This is a test")
[1188, 1110, 170, 2774]

Is there something I am misunderstanding?

@zhangguanheng66
Contributor

zhangguanheng66 commented Oct 2, 2019

By "dimension" I mean the output of the encode func.

I see. So you don't need to build a torchtext vocab here; the API is already a tokenizer plus numericalizer. You just need to convert the output into a tensor, i.e. torch.tensor(tokenizer.encode("This is a test")). In the SentencePiece example, sentencepiece_numericalizer (a.k.a. EncodeAsIds) is equivalent to tokenizer.encode().

@mttk Under the current torchtext structure, is there a way to combine the tokenization and numericalization steps?
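
For reference, a minimal sketch of that tokenize-plus-numericalize step, assuming the bert-base-cased checkpoint from the earlier example:

import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')

# encode() tokenizes and maps to vocabulary ids in one call;
# wrapping the result in torch.tensor gives the integer tensor a model expects
ids = tokenizer.encode("This is a test")
tensor = torch.tensor(ids)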

@mttk
Contributor

mttk commented Oct 3, 2019

@JohnGiorgi your code is exactly how this should be done. You don't use torchtext's vocab and instead provide your own tokenization. The error happens in the padding step during batching: the default padding token in data.Field is the string '<pad>', and since you're not using a vocab, there is no way to convert it to an index.

The solution is simply to fetch the pad index from the tokenizer and pass that (int) value as the pad_token argument:

pad_index = tokenizer.convert_tokens_to_ids(tokenizer.pad_token)
TEXT = data.Field(use_vocab=False, tokenize=tokenizer.encode, pad_token=pad_index)

This works for my case (I copied your code and used it with a different dataset).
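
As a quick sanity check (following the MedNLI setup from the top of the thread), the batch attributes should now come back as padded integer tensors:

batch = next(iter(train_iter))
print(batch.premise.dtype)    # torch.int64: token ids, padded with pad_index
print(batch.label)            # label indices from LABEL's torchtext vocab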

@JohnGiorgi
Author

@mttk Thank you so much. This is exactly what I was looking for!

Final question, do I still need to call TEXT.build_vocab() in my case?

@mttk
Contributor

mttk commented Oct 6, 2019 via email

@rtolsma

rtolsma commented Mar 2, 2020

I'm doing something of the form

from pytorch_transformers import GPT2Tokenizer
tokenizer =  GPT2Tokenizer.from_pretrained('gpt2')

and then going through the above suggestions to get

pad_index = tokenizer.convert_tokens_to_ids(tokenizer.pad_token)
unk_index = tokenizer.convert_tokens_to_ids(tokenizer.unk_token)
TEXT = torchtext.data.Field(use_vocab=False, tokenize=tokenizer.encode, pad_token=pad_index, unk_token=unk_index)

train, test, val = dataset.splits(TEXT)
train_iter, test_iter, valid_iter = BPTTIterator.splits((train, test, val), batch_size=batch_size, bptt_len=bptt_len)

but I still get the error "an integer is required (got type str)" when calling next(iter(train_iter)).

Any idea how to fix this?

@mttk
Contributor

mttk commented Mar 2, 2020

Can you paste the full error trace?

@rtolsma

rtolsma commented Mar 3, 2020

Can you paste the full error trace?

I did some work on it this afternoon and found that the data in train after calling dataset.splits(TEXT) contains the GPT-2 tokenizer's tokens, but also contains the symbol '<eos>', which is why the error complains about integer vs. str. I attempted the same fix as before, adding

TEXT = torchtext.data.Field(use_vocab=False,
                            tokenize=tokenizer.encode,
                            pad_token=pad_index,
                            unk_token=unk_index,
                            eos_token=eos_index)

but that's still not fixing the issue, and '<eos>' is still showing in the data after splitting.

@mttk
Contributor

mttk commented Mar 3, 2020

I understand the issue, but without the whole script or the Python error trace, it's hard to pinpoint where the error occurred. If you could paste either, it would be a great help.

@jindal2309

jindal2309 commented Mar 15, 2020

I'm also getting the same error. I want to use GPT2Tokenizer to encode sentences with torchtext.

from transformers import GPT2Tokenizer, GPT2LMHeadModel
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

TEXT = data.Field(use_vocab=False, tokenize=tokenizer.encode)
fields = [("src", TEXT), ("trg", TEXT)]

train_data, valid_data = data.TabularDataset.splits(
    path=data_dir,
    train=train_file,
    test=valid_file,
    format="CSV",
    fields=fields,
)
train_iterator, valid_iterator = data.BucketIterator.splits(
    (train_data, valid_data),
    batch_size=batch_size,
    device = device)

next(iter(train_iterator))

Error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "~/miniconda3/envs/LSP/lib/python3.6/site-packages/torchtext/data/iterator.py", line 156, in __iter__
    yield Batch(minibatch, self.dataset, self.device)
  File "~/miniconda3/envs/LSP/lib/python3.6/site-packages/torchtext/data/batch.py", line 34, in __init__
    setattr(self, name, field.process(batch, device=device))
  File "~/miniconda3/envs/LSP/lib/python3.6/site-packages/torchtext/data/field.py", line 237, in process
    tensor = self.numericalize(padded, device=device)
  File "~/miniconda3/envs/LSP/lib/python3.6/site-packages/torchtext/data/field.py", line 359, in numericalize
    var = torch.tensor(arr, dtype=self.dtype, device=device)
TypeError: an integer is required (got type str)

@mttk
Contributor

mttk commented Mar 15, 2020

Could you check this reply and see if it works for you?
#609 (comment)

@Mrxiexianzhao

Maybe I have a similar question: when I use torchtext to process NER data, a training example looks like TEXT: ['海', '钓', '比', '赛', '地', '点', '在', '厦', '门', '与', '金', '门', '之', '间', '的', '海', '域', '。'] and LABEL: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'I-LOC', 'O', 'B-LOC', 'I-LOC', 'O', 'O', 'O', 'O', 'O', 'O']. When I generate the batch data and feed it to the model (Bi-LSTM+CRF), it does not work:

ValueError: too many dimensions 'str'

The problem arises when using:

train_iter, val_iter, test_iter, vocab_text, label_num = data_iter(train_path, valid_path, test_path, TEXT, LABEL)
for epoch, batch in enumerate(train_iter):

@celsofranssa

@JohnGiorgi and @mttk,

May I ask one or two questions?

  1. The tokenizer's encode method accepts some parameters; how do I pass these parameters through the TEXT Field?
  2. Some tasks require that, in addition to input_ids, an attention_mask is also forwarded to the Transformer, in which case the tokenizer returns a dictionary:
{
    'input_ids': list[int] or tensor,
    'attention_mask': list[int] or tensor
}

Does torchtext still apply in these cases?

@elkotito

@ceceu

  1. For example, you can use partial functions (see the sketch after this list):
from functools import partial
partial_encode = partial(tokenizer.encode, max_length=2)
  2. I don't really know, but it should be more flexible for sure.
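
A sketch of how that partial would then plug into the earlier Field setup (the max_length value is only an illustration; pad_index is computed as earlier in this thread):

from functools import partial
from torchtext import data
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
pad_index = tokenizer.convert_tokens_to_ids(tokenizer.pad_token)

# max_length=128 is just an example; pick whatever your model needs
partial_encode = partial(tokenizer.encode, max_length=128)
TEXT = data.Field(use_vocab=False, tokenize=partial_encode, pad_token=pad_index)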

Btw, I don't find the given solutions very efficient. Combining tokenization and numericalization into one step (tokenizer.encode) and passing it as an argument to a Dataset means doing both for the whole input data up front! Numericalization should be done on the fly with the DataLoader.
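
For illustration, a rough sketch of that on-the-fly approach with a plain DataLoader (assuming a recent transformers release where the tokenizer object is callable; the sentences here are made up):

import torch
from torch.utils.data import DataLoader
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

def collate(batch):
    # batch is a list of raw strings; tokenization, numericalization and
    # padding all happen here, one mini-batch at a time
    enc = tokenizer(batch, padding=True, truncation=True, return_tensors='pt')
    return enc['input_ids'], enc['attention_mask']

sentences = ["The first premise.", "A second, slightly longer premise."]
loader = DataLoader(sentences, batch_size=2, collate_fn=collate)
input_ids, attention_mask = next(iter(loader))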

@neerajsharma9195

@jindal2309 Were you able to fix this issue? I am also facing the same error.

@makarr

makarr commented Jul 17, 2020

@neerajsharma9195 @jindal2309 @Mrxiexianzhao

The array getting passed to torch.tensor() has strings in it, instead of integers. A likely reason is that tokenizer.encode() is not getting called when the dataset is constructed. Another possibility is that tokenizer.encode() is failing on some inputs. The first thing I would do is look at every Example in each Dataset, before they are passed to the iterators.
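
For example, something along these lines (field names follow the MedNLI setup at the top of the thread) makes it easy to spot entries that stayed as strings:

# assumes the `train` TabularDataset built earlier in this thread
for example in train.examples[:5]:
    print(example.premise)      # should be a list of ints if tokenizer.encode() ran
    print(example.hypothesis)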

@mateuszpieniak

A simple solution is to write a custom class that inherits from Field and overrides the numericalize() method. Something like this:

import torch
from torchtext.data import Field


class HuggingFaceField(Field):
    def __init__(self, tokenizer):
        # use the tokenizer's own pad token so padding maps to the right id
        # (assumes the tokenizer defines one)
        super().__init__(tokenize=tokenizer.tokenize, pad_token=tokenizer.pad_token)
        self.tokenizer = tokenizer

    def numericalize(self, arr, device=None):
        # Field.process() calls numericalize(padded, device=device), so accept device here
        arr = [self.tokenizer.convert_tokens_to_ids(x) for x in arr]
        return torch.tensor(arr, device=device)
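
Usage would then look something like this (a sketch, assuming a BERT tokenizer):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
TEXT = HuggingFaceField(tokenizer)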

Although... I'm not sure why it's more efficient to numericalize on the fly. If you are going to tokenize the whole dataset from the start, why not numericalize it too?

@makarr

makarr commented Jul 20, 2020

@neerajsharma9195 @jindal2309 @Mrxiexianzhao

I was able to reproduce the error. tokenizer.encode() is not getting called from within Field.preprocess() if the input is already tokenized as a list of strings. The Field will only call the tokenize method passed in if the input is a string. There are 3 possible solutions:

  1. Call ' '.join() on the list of tokens,
  2. Pass tokenizer.encode() into the Field constructor as a preprocessing Pipeline (a sketch follows below), or
  3. Inherit from the Field class and override the methods as you see fit.
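
A rough sketch of option 2, using convert_tokens_to_ids rather than encode() because data.Pipeline applies the callable to each token of an already-tokenized example (so the tokens need to be units the tokenizer's vocab already knows; anything else maps to the unknown token):

from torchtext import data
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
pad_index = tokenizer.convert_tokens_to_ids(tokenizer.pad_token)

# preprocessing runs after (or instead of) tokenization, so it is applied
# even when the input already arrives as a list of tokens
TEXT = data.Field(use_vocab=False,
                  preprocessing=data.Pipeline(tokenizer.convert_tokens_to_ids),
                  pad_token=pad_index)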

@neerajsharma9195

neerajsharma9195 commented Jul 22, 2020

If anyone else ran into a similar issue with a SentencePiece tokenizer, I got it working this way:

from torchtext.data.functional import load_sp_model
from torchtext.data import Field, BucketIterator, TabularDataset
from torchtext.vocab import Vectors
sp_model = load_sp_model("model_name.model")
sp_model.set_encode_extra_options('bos:eos')
pad_index = sp_model.piece_to_id("<pad>")
SRC = Field(use_vocab=False, tokenize=sp_model.encode, pad_token=pad_index)
TGT = Field(use_vocab=False, tokenize=sp_model.encode, pad_token=pad_index)
data_fields = [('src', SRC), ('tgt', TGT)]
train_ds, val_ds, test_ds = TabularDataset.splits(
    path='../dir_name/', train='train.csv', validation='val.csv', test='test.csv',
    format='csv', fields=data_fields)

With this you can now create the BucketIterators like this:

import torch

BATCH_SIZE = 100
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iter = BucketIterator(train_ds, batch_size=BATCH_SIZE, device=device, sort_key=lambda x: len(x))
val_iter = BucketIterator(val_ds, batch_size=BATCH_SIZE, device=device, sort_key=lambda x: len(x))
test_iter = BucketIterator(test_ds, batch_size=BATCH_SIZE, device=device, sort_key=lambda x: len(x))

@zhangguanheng66
Contributor

zhangguanheng66 commented Jul 22, 2020

Here is an example that uses SentencePiece as a building block in a data processing pipeline: #887

@makarr

makarr commented Jul 22, 2020

@zhangguanheng66 I was using the Brown Corpus, which is tokenized already.
