
How might I use the tokenizers from the HuggingFace Transformers library? #609

Closed

JohnGiorgi opened this issue Oct 2, 2019 · 21 comments

@JohnGiorgi

JohnGiorgi commented Oct 2, 2019

❓ Questions and Help

Description

TL;DR: Has anyone been able to successfully integrate the transformers library tokenizer with torchtext?

I wanted to use the torchtext library to process/load data for use with the transformers library. I was able to set their tokenizer in a Field object and build a vocabulary without issue:

from torchtext import data
from torchtext import datasets
from transformers import AutoTokenizer

path = 'path/to/med_nli/'

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

TEXT = data.Field(use_vocab=True, tokenize=tokenizer.tokenize)
LABEL = data.LabelField()

fields = {'sentence1': ('premise', TEXT),
          'sentence2': ('hypothesis', TEXT),
          'gold_label': ('label', LABEL)}

train, valid, test = data.TabularDataset.splits(
    path=path, 
    train='mli_train_v1.jsonl',
    validation='mli_dev_v1.jsonl',
    test='mli_test_v1.jsonl',
    format='json', 
    fields=fields
)

train_iter, valid_iter, test_iter = data.BucketIterator.splits(
    (train, valid, test), batch_sizes=(16, 256, 256)
)

TEXT.build_vocab(train)
LABEL.build_vocab(train)

Note: I am using the MedNLI dataset, but it appears to be formatted the same way as the SNLI dataset.

But I am stuck on how to numericalize according to their tokenizer's vocab. So I tried to numericalize in the Field with their tokenizer's encode method and set use_vocab=False:

from torchtext import data
from torchtext import datasets
from transformers import AutoTokenizer

path = 'path/to/med_nli/'

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

TEXT = data.Field(use_vocab=False, tokenize=tokenizer.encode)
LABEL = data.LabelField()

fields = {'sentence1': ('premise', TEXT),
          'sentence2': ('hypothesis', TEXT),
          'gold_label': ('label', LABEL)}

train, valid, test = data.TabularDataset.splits(
    path=path, 
    train='mli_train_v1.jsonl',
    validation='mli_dev_v1.jsonl',
    test='mli_test_v1.jsonl',
    format='json', 
    fields=fields
)

train_iter, valid_iter, test_iter = data.BucketIterator.splits(
    (train, valid, test), batch_sizes=(16, 256, 256)
)

# TEXT.build_vocab(train)
LABEL.build_vocab(train)

But then I get strange issues when trying to access a batch:

batch = next(iter(train_iter))
print("Numericalize premises:\n", batch.premise)
print("Numericalize hypotheses:\n", batch.hypothesis)
print("Entailment labels:\n", batch.label)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-55-9919119fad82> in <module>
----> 1 batch = next(iter(train_iter))

~/miniconda3/envs/ml4h/lib/python3.7/site-packages/torchtext/data/iterator.py in __iter__(self)

~/miniconda3/envs/ml4h/lib/python3.7/site-packages/torchtext/data/batch.py in __init__(self, data, dataset, device)

~/miniconda3/envs/ml4h/lib/python3.7/site-packages/torchtext/data/field.py in process(self, batch, device)

~/miniconda3/envs/ml4h/lib/python3.7/site-packages/torchtext/data/field.py in numericalize(self, arr, device)

ValueError: too many dimensions 'str'

Any suggestions on how to go about this?

@zhangguanheng66
Contributor

To your questions, I think you need to check the dimension of the encode func.

If you set use_vocab=True and call build_vocab(), it will build a vocab based on torchtext.vocab. To use HuggingFace's vocab, you need to find their API for tokenization + numericalization.

I recently added a third-party tokenizer (SentencePiece) to torchtext. It is based on a different structure, so you may also want to take a look (here).
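
For anyone who wants to try that route, here is a minimal sketch based on the torchtext.data.functional helpers; the model filename is a placeholder for a trained SentencePiece model file:

from torchtext.data.functional import load_sp_model, sentencepiece_numericalizer

# "spm_user.model" is a placeholder for a trained SentencePiece model file
sp_model = load_sp_model("spm_user.model")
sp_id_generator = sentencepiece_numericalizer(sp_model)

# yields one list of ids per input sentence (tokenize + numericalize in one step)
print(list(sp_id_generator(["sentencepiece encode as ids", "another example to try"])))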

@JohnGiorgi
Author

@zhangguanheng66 Hi, and thanks for the response.

I think you need to check the dimension of the encode func.

Sorry, I am not sure what "dimension" means in this context.

To use HuggingFace's vocab, you need to find their API for tokenization+numericalization.

My understanding is that encode is their API for tokenization+numericalization. E.g.

>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
>>> tokenizer.encode("This is a test")
[1188, 1110, 170, 2774]

Is there something I am misunderstanding?

@zhangguanheng66
Contributor

zhangguanheng66 commented Oct 2, 2019

By "dimension" I mean the output of the encode func.

I see. So you don't need to build a torchtext vocab here; the API is already a tokenizer plus numericalizer. You just need to convert the output into a tensor, i.e. torch.tensor(tokenizer.encode("This is a test")). In the SentencePiece example, sentencepiece_numericalizer (a.k.a. EncodeAsIds) is equivalent to tokenizer.encode().

@mttk Under the current torchtext structure, is there a way to combine the tokenization and numericalization steps?
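
For reference, a minimal sketch of that tokenize-plus-numericalize step, assuming the bert-base-cased checkpoint from the earlier example:

import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')

# encode() tokenizes and maps to vocabulary ids in one call;
# wrapping the result in torch.tensor gives the integer tensor a model expects
ids = tokenizer.encode("This is a test")
tensor = torch.tensor(ids)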

@mttk
Contributor

mttk commented Oct 3, 2019

@JohnGiorgi your code is exactly how this should be done. You don't use torchtext's vocab and instead provide your own tokenization. The error happens in the padding step during batching: the default padding token in data.Field is the string '<pad>', and since you're not using a vocab, there is no way to convert it to an index.

The solution is simply to fetch the pad index from the tokenizer and pass that (int) value as the pad_token argument:

pad_index = tokenizer.convert_tokens_to_ids(tokenizer.pad_token)
TEXT = data.Field(use_vocab=False, tokenize=tokenizer.encode, pad_token=pad_index)

This works for my case (I copied your code and used it with a different dataset).
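
As a quick sanity check (following the MedNLI setup from the top of the thread), the batch attributes should now come back as padded integer tensors:

batch = next(iter(train_iter))
print(batch.premise.dtype)    # torch.int64: token ids, padded with pad_index
print(batch.label)            # label indices from LABEL's torchtext vocab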

@JohnGiorgi
Author

@mttk Thank you so much. This is exactly what I was looking for!

Final question, do I still need to call TEXT.build_vocab() in my case?

@mttk
Contributor

mttk commented Oct 6, 2019 via email

@rtolsma

rtolsma commented Mar 2, 2020

I'm doing something of the form

from pytorch_transformers import GPT2Tokenizer
tokenizer =  GPT2Tokenizer.from_pretrained('gpt2')

and then going through the above suggestions to get

pad_index = tokenizer.convert_tokens_to_ids(tokenizer.pad_token)
unk_index = tokenizer.convert_tokens_to_ids(tokenizer.unk_token)
TEXT = torchtext.data.Field(use_vocab=False, tokenize=tokenizer.encode, pad_token=pad_index, unk_token=unk_index)

train, test, val = dataset.splits(TEXT)
train_iter, test_iter, valid_iter = BPTTIterator.splits((train, test, val), batch_size=batch_size, bptt_len=bptt_len)

but I still get the error "an integer is required (got type str)" when calling next(iter(train_iter)).

Any idea how to fix this?

@mttk
Contributor

mttk commented Mar 2, 2020

Can you paste the full error trace?

@rtolsma

rtolsma commented Mar 3, 2020

Can you paste the full error trace?

I did some work on it this afternoon and found that the data in train after calling dataset.splits(TEXT) contains the GPT-2 tokenizer's tokens, but also contains the symbol '<eos>', which is why the error complains about integer vs. str. I attempted the same fix as before, adding

TEXT = torchtext.data.Field(use_vocab=False,
                            tokenize=tokenizer.encode,
                            pad_token=pad_index,
                            unk_token=unk_index,
                            eos_token=eos_index)

but that's still not fixing the issue, and '<eos>' is still showing in the data after splitting.

@mttk
Contributor

mttk commented Mar 3, 2020

I understand the issue, but without the whole script or the Python error trace, it's hard to pinpoint where the error occurred. If you could paste either, it would be a great help.

@jindal2309

jindal2309 commented Mar 15, 2020

I'm also getting the same error. I want to use GPT2Tokenizer to encode sentences with torchtext.

from transformers import GPT2Tokenizer, GPT2LMHeadModel
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

TEXT = data.Field(use_vocab=False, tokenize=tokenizer.encode)
fields = [("src", TEXT), ("trg", TEXT)]

train_data, valid_data = data.TabularDataset.splits(
    path=data_dir,
    train=train_file,
    test=valid_file,
    format="CSV",
    fields=fields,
)
train_iterator, valid_iterator = data.BucketIterator.splits(
    (train_data, valid_data),
    batch_size=batch_size,
    device = device)

next(iter(train_iterator))

Error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "~/miniconda3/envs/LSP/lib/python3.6/site-packages/torchtext/data/iterator.py", line 156, in __iter__
    yield Batch(minibatch, self.dataset, self.device)
  File "~/miniconda3/envs/LSP/lib/python3.6/site-packages/torchtext/data/batch.py", line 34, in __init__
    setattr(self, name, field.process(batch, device=device))
  File "~/miniconda3/envs/LSP/lib/python3.6/site-packages/torchtext/data/field.py", line 237, in process
    tensor = self.numericalize(padded, device=device)
  File "~/miniconda3/envs/LSP/lib/python3.6/site-packages/torchtext/data/field.py", line 359, in numericalize
    var = torch.tensor(arr, dtype=self.dtype, device=device)
TypeError: an integer is required (got type str)

@mttk
Contributor

mttk commented Mar 15, 2020

Could you check this reply and see if it works for you?
#609 (comment)

@Mrxiexianzhao

Maybe I have a similar question: when I use torchtext to process NER data, a training example looks like TEXT: ['海', '钓', '比', '赛', '地', '点', '在', '厦', '门', '与', '金', '门', '之', '间', '的', '海', '域', '。'] and LABEL: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'I-LOC', 'O', 'B-LOC', 'I-LOC', 'O', 'O', 'O', 'O', 'O', 'O']. When I generate the batch data and feed it to the model (Bi-LSTM+CRF), it does not work:

ValueError: too many dimensions 'str'

The problem arises when using:

train_iter, val_iter, test_iter, vocab_text, label_num = data_iter(train_path, valid_path, test_path, TEXT, LABEL)
for epoch, batch in enumerate(train_iter):

@celsofranssa

@JohnGiorgi and @mttk,

May I ask one or two questions?

  1. The tokenizer's encode method accepts some parameters; how do I pass these parameters through the TEXT Field?
  2. Some tasks require that, in addition to input_ids, an attention_mask is also forwarded to the Transformer, in which case the tokenizer returns a dictionary:
{
    'input_ids': list[int] or tensor,
    'attention_mask': list[int] or tensor
}

Does torchtext still apply in these cases?

@elkotito

@ceceu

  1. For example, you can use partial functions (see the sketch after this list):
from functools import partial
partial_encode = partial(tokenizer.encode, max_length=2)
  2. I don't really know, but it should be more flexible for sure.
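
A sketch of how that partial would then plug into the earlier Field setup (the max_length value is only an illustration; pad_index is computed as earlier in this thread):

from functools import partial
from torchtext import data
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
pad_index = tokenizer.convert_tokens_to_ids(tokenizer.pad_token)

# max_length=128 is just an example; pick whatever your model needs
partial_encode = partial(tokenizer.encode, max_length=128)
TEXT = data.Field(use_vocab=False, tokenize=partial_encode, pad_token=pad_index)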

Btw, I don't find the given solutions very efficient. Combining tokenization and numericalization into one step (tokenizer.encode) and passing it as an argument to a Dataset means doing both for the whole input data up front! Numericalization should be done on the fly with the DataLoader.
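
For illustration, a rough sketch of that on-the-fly approach with a plain DataLoader (assuming a recent transformers release where the tokenizer object is callable; the sentences here are made up):

import torch
from torch.utils.data import DataLoader
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

def collate(batch):
    # batch is a list of raw strings; tokenization, numericalization and
    # padding all happen here, one mini-batch at a time
    enc = tokenizer(batch, padding=True, truncation=True, return_tensors='pt')
    return enc['input_ids'], enc['attention_mask']

sentences = ["The first premise.", "A second, slightly longer premise."]
loader = DataLoader(sentences, batch_size=2, collate_fn=collate)
input_ids, attention_mask = next(iter(loader))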

@neerajsharma9195

@jindal2309 Were you able to fix this issue? I am also facing the same error.

@makarr

makarr commented Jul 17, 2020

@neerajsharma9195 @jindal2309 @Mrxiexianzhao

The array getting passed to torch.tensor() has strings in it, instead of integers. A likely reason is that tokenizer.encode() is not getting called when the dataset is constructed. Another possibility is that tokenizer.encode() is failing on some inputs. The first thing I would do is look at every Example in each Dataset, before they are passed to the iterators.
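
For example, something along these lines (field names follow the MedNLI setup at the top of the thread) makes it easy to spot entries that stayed as strings:

# assumes the `train` TabularDataset built earlier in this thread
for example in train.examples[:5]:
    print(example.premise)      # should be a list of ints if tokenizer.encode() ran
    print(example.hypothesis)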

@mateuszpieniak

A simple solution is to write a custom class that inherits from Field and overrides the numericalize() method. Something like this:

import torch
from torchtext.data import Field


class HuggingFaceField(Field):
    def __init__(self, tokenizer):
        # use the tokenizer's own pad token so padding maps to the right id
        # (assumes the tokenizer defines one)
        super().__init__(tokenize=tokenizer.tokenize, pad_token=tokenizer.pad_token)
        self.tokenizer = tokenizer

    def numericalize(self, arr, device=None):
        # Field.process() calls numericalize(padded, device=device), so accept device here
        arr = [self.tokenizer.convert_tokens_to_ids(x) for x in arr]
        return torch.tensor(arr, device=device)
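
Usage would then look something like this (a sketch, assuming a BERT tokenizer):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
TEXT = HuggingFaceField(tokenizer)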

Although... I'm not sure why it's more efficient to numericalize on the fly. If you are going to tokenize the whole dataset from the start, why not numericalize it too?

@makarr

makarr commented Jul 20, 2020

@neerajsharma9195 @jindal2309 @Mrxiexianzhao

I was able to reproduce the error. tokenizer.encode() is not getting called from within Field.preprocess() if the input is already tokenized as a list of strings. The Field will only call the tokenize method passed in if the input is a string. There are 3 possible solutions:

  1. Call ' '.join() on the list of tokens,
  2. Pass tokenizer.encode() into the Field constructor as a preprocessing Pipeline (a sketch follows below), or
  3. Inherit from the Field class and override the methods as you see fit.
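
A rough sketch of option 2, using convert_tokens_to_ids rather than encode() because data.Pipeline applies the callable to each token of an already-tokenized example (so the tokens need to be units the tokenizer's vocab already knows; anything else maps to the unknown token):

from torchtext import data
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
pad_index = tokenizer.convert_tokens_to_ids(tokenizer.pad_token)

# preprocessing runs after (or instead of) tokenization, so it is applied
# even when the input already arrives as a list of tokens
TEXT = data.Field(use_vocab=False,
                  preprocessing=data.Pipeline(tokenizer.convert_tokens_to_ids),
                  pad_token=pad_index)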

@neerajsharma9195

neerajsharma9195 commented Jul 22, 2020

If anyone else ran into a similar issue with a SentencePiece tokenizer, I got it working this way:

from torchtext.data.functional import load_sp_model
from torchtext.data import Field, BucketIterator, TabularDataset
from torchtext.vocab import Vectors
sp_model = load_sp_model("model_name.model")
sp_model.set_encode_extra_options('bos:eos')
pad_index = sp_model.piece_to_id("<pad>")
SRC = Field(use_vocab=False, tokenize=sp_model.encode, pad_token=pad_index)
TGT = Field(use_vocab=False, tokenize=sp_model.encode, pad_token=pad_index)
data_fields = [('src', SRC), ('tgt', TGT)]
train_ds, val_ds, test_ds = TabularDataset.splits(
    path='../dir_name/', train='train.csv', validation='val.csv', test='test.csv',
    format='csv', fields=data_fields)

With this you can now create the BucketIterators like this:

import torch

BATCH_SIZE = 100
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iter = BucketIterator(train_ds, batch_size=BATCH_SIZE, device=device, sort_key=lambda x: len(x))
val_iter = BucketIterator(val_ds, batch_size=BATCH_SIZE, device=device, sort_key=lambda x: len(x))
test_iter = BucketIterator(test_ds, batch_size=BATCH_SIZE, device=device, sort_key=lambda x: len(x))

@zhangguanheng66
Contributor

zhangguanheng66 commented Jul 22, 2020

Here is an example that uses SentencePiece as a building block in a data processing pipeline: #887

@makarr

makarr commented Jul 22, 2020

@zhangguanheng66 I was using the Brown Corpus, which is tokenized already.
