
Adding LUKE to HuggingFace Transformers #38

Closed
uahmad235 opened this issue Dec 21, 2020 · 46 comments

@uahmad235

Hi,
Is it possible to reproduce the NER results on a CPU instead of the default GPU configuration? I can't find any resource for this in the repo.

I am using the following command, but there seems to be no flag or argument available to switch between CPU and GPU:

python -m examples.cli --model-file=luke_large_500k.tar.gz --output-dir=<OUTPUT_DIR> ner run --data-dir=<DATA_DIR> --fp16 --train-batch-size=2 --gradient-accumulation-steps=2 --learning-rate=1e-5 --num-train-epochs=5

Thanks in advance!

@uahmad235 uahmad235 changed the title Reproduce Experimenta Results on CPU? Reproduce Experimental Results (NER) on CPU? Dec 22, 2020
@ikuyamada
Member

Hi,
We are sorry, but we conducted all experiments on GPUs and haven't tested whether the code can be run on a CPU.

@uahmad235
Author

Thanks for the reply!
Does that mean that the inference is also not possible on the CPU?

@uahmad235
Author

uahmad235 commented Dec 22, 2020

Alright.
I think there should at least be support for inference on the CPU, because the model is huge and requires a large, costly GPU. I hope we can get that in the future.
BTW, thanks for the response, much appreciated!

@ahmad-alismail

Hi, could the model be trained on Google Colab?
Thanks!

@ikuyamada
Member

We haven't tested that, but the code should run on a Colab GPU.

@NielsRogge

NielsRogge commented Jan 7, 2021

Hi @afi1289, I've created a Google Colab notebook in which I would like to fine-tune LUKE on the entity typing task as explained in the README. Anyone can run this notebook, but I've stored the checkpoint as well as the data in my personal Drive.

However, once I've installed all packages using Poetry, the "transformers" package is not recognized as being installed. I guess this is because Poetry installs the dependencies in a different location (pip would install them in <virtualenv_name>/lib/<python_ver>/site-packages, but I'm not sure where Poetry installs its packages; pip list does not show any of the required packages):

Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/content/luke/examples/cli.py", line 14, in <module>
    from luke.utils.model_utils import ModelArchive
  File "/content/luke/luke/__init__.py", line 1, in <module>
    from .model import LukeConfig, LukeModel
  File "/content/luke/luke/model.py", line 8, in <module>
    from transformers.modeling_bert import (
ModuleNotFoundError: No module named 'transformers'

@ikuyamada it would be great if you could take a look.

@ikuyamada
Member

ikuyamada commented Jan 9, 2021

@NielsRogge Thank you for creating the Colab notebook! I think you need to activate the virtualenv created by Poetry by running !poetry shell. Alternatively, you can generate a requirements.txt file using poetry export (see here).

@ikuyamada ikuyamada reopened this Jan 9, 2021
@NielsRogge

Thank you for the reply. When typing !poetry shell, I get

Spawning shell within /root/.cache/pypoetry/virtualenvs/luke-_5-nTlMy-py3.6
ate
(luke-_5-nTlMy-py3.6) /content/luke#

Then I'm getting a window in which I can type things. I'm not sure how Poetry works; should I type in a password there or something else? Could you please edit my notebook?

@ikuyamada
Member

Thanks for the reply. It seems that the poetry shell command does not work properly on Colab. The easiest way might be to export a requirements.txt file using poetry export and install the packages without Poetry, i.e., pip install -r requirements.txt.
poetry export: https://python-poetry.org/docs/cli/#export
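For reference, the full workaround in a Colab cell could look roughly like this (a sketch only; the export flags follow Poetry's documented CLI, and the working directory is assumed to be the cloned repository):

!pip install poetry
!poetry export -f requirements.txt --output requirements.txt --without-hashes
!pip install -r requirements.txt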

@NielsRogge

NielsRogge commented Jan 14, 2021

Ok this works, thanks! Getting the following output:

Results: {
  "best_epoch": 2,
  "dev_f1": 0.7838427947598253,
  "dev_f1_epoch0": 0.7705413575158011,
  "dev_f1_epoch1": 0.7727025557368136,
  "dev_f1_epoch2": 0.7838427947598253,
  "dev_precision": 0.8224513172966781,
  "dev_precision_epoch0": 0.8146426496223126,
  "dev_precision_epoch1": 0.8073863636363636,
  "dev_precision_epoch2": 0.8224513172966781,
  "dev_recall": 0.748696558915537,
  "dev_recall_epoch0": 0.7309697601668405,
  "dev_recall_epoch1": 0.7408759124087592,
  "dev_recall_epoch2": 0.748696558915537,
  "test_f1": 0.7700739928747602,
  "test_precision": 0.8051575931232091,
  "test_recall": 0.7379201680672269
} (run@main.py:118)

The reason I run it is because I'm interested in adding the model to the Transformers library by HuggingFace. However, the model seems to have a lot of dependencies, so I wonder whether this is possible.

For each model in the Transformers library, 3 files need to be defined:

  • configuration_luke.py
  • modeling_luke.py
  • tokenization_luke.py.

The big challenge with this model is probably the tokenizer, since it's different from the usual ones (RobertaTokenizer for example has a rather simple vocabulary of 50K tokens compared to LukeTokenizer which has 50K + 500K). Would it be possible to create a txt file with 500K lines?

Also, looking at a random example of the dev set of Open Entity, the tokenization happens as follows:

Sentence:
It is always held on the May 1 holiday and draws hundreds of thousands of spectators . "
Tokens:
['<s>', '[ENTITY]', 'ĠIt', '[ENTITY]', 'Ġis', 'Ġalways', 'Ġheld', 'Ġon', 'Ġthe', 'ĠMay', 'Ġ1', 'Ġholiday', 'Ġand', 'Ġdraws', 'Ġhundreds', 'Ġof', 'Ġthousands', 'Ġof', 'Ġspectators', 'Ġ.', 'Ġ"', '</s>']

However, this is only for the word sequence. The entity sequence is then set to [1, 0]. To what do 1 and 0 correspond, [MASK] and [PAD] respectively? I also wonder why the [ENTITY] token is not mentioned in the paper.

@ikuyamada
Member

ikuyamada commented Jan 15, 2021

Hi,
I am also interested in adding LUKE to the transformers library 😃
Although LUKE depends on a few extra libraries, our model should be implementable without adding extra dependencies to transformers.
As LUKE takes an entity input sequence in addition to the word sequence, implementing LukeTokenizer may be challenging.
The 500K entity vocabulary is contained in the pretrained model file:

$ tar xvzf luke_large_500k.tar.gz
$ head entity_vocab.tsv
[PAD]   0
[UNK]   0
[MASK]  0
[MASK2] 0
Race and ethnicity in the United States Census  284050
United States   261500
Association football    197415
World War II    165432
France  135415
Germany 114339

We do not use the [MASK2] token in our current model.
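For reference, the entity vocabulary file above can be read into a {title: id} mapping roughly like this (a sketch; assigning ids by line order is an assumption, and the second column is not used here):

entity_vocab = {}
with open("entity_vocab.tsv", encoding="utf-8") as f:
    for index, line in enumerate(f):
        title, _count = line.rstrip("\n").split("\t")
        entity_vocab[title] = index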

The entity sequence [1, 0] corresponds to the [MASK] and [PAD] entities, respectively. Please note that we use only the pretrained [MASK] and [PAD] entity embeddings when addressing the fine-tuning tasks, except for SQuAD.
Also, following past work, we add the [ENTITY] word token to the word sequence, which slightly stabilizes training.

@NielsRogge

NielsRogge commented Jan 15, 2021

Ok, maybe we can first clear up some things about LUKE and then work together on adding it to the library. To get a good understanding of how LukeTokenizer should work, I think it is really helpful to just take an example sentence and then define how it should be prepared for LUKE. Let's take the sentence from the paper, "Beyonce lives in Los Angeles".

Pre-training

During pre-training, sentences come from Wikipedia so we know which entities are in the sentence (in this case, "Beyonce" and "Los Angeles"). We randomly mask both tokens and entire entities which the model needs to predict. This happens as follows (let's tokenize the sentence first):

from transformers import RobertaTokenizer

sentence = "Beyonce lives in Los Angeles"

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
input_ids = tokenizer(sentence)["input_ids"]

for id in input_ids:
  print(id, tokenizer.decode([id]))

0 <s>
40401 Bey
25252 once
1074  lives
11  in
1287  Los
1422  Angeles
2 </s>

So for example we replace the "Bey" and "Los" tokens by [MASK], as well as the "Los Angeles" entity. So this sentence is represented as follows:

  • token sequence:
    <s>, [MASK], once, lives, in, [MASK], Angeles, </s>
  • entity sequence:
    Beyonce, [MASK]

Fine-tuning

There are different fine-tuning tasks, so we go over each of them to show how sentences are prepared for the model.

Entity typing (e.g. Open Entity)

In case of entity typing, the task is to predict the label for a given entity in a sequence. The entity for which we need to predict the label is surrounded by [ENTITY] tokens.

So, given for example that we need to predict the label of the entity "Beyonce", then this sentence is prepared as follows:

  • token sequence:
    <s>, [ENTITY], Bey, once, [ENTITY], lives, in, Los, Angeles, </s>
  • entity sequence:
    [MASK], [PAD]

As one can see, the entity whose label we need to predict is surrounded by [ENTITY] tokens. This special token is learned only during fine-tuning (but its embedding is initialized using the "@" word token, as can be seen here). The entity sequence consists of [MASK] and [PAD]; here, we start from their pre-trained representations.

The label is predicted by placing a linear classifier on top of the final hidden representation of the [MASK] token.

-> input to LukeTokenizer: single sequence + information on which entity to predict the label for
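To make this concrete, here is a rough sketch (not the repository's actual code) of the preparation described above; the [ENTITY] marker is added to a plain RobertaTokenizer purely for illustration, and the entity ids follow the [MASK]=1, [PAD]=0 convention mentioned earlier:

from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
tokenizer.add_tokens(["[ENTITY]"])  # keep the marker as a single token (illustration only)

text = "Beyonce lives in Los Angeles"
start, end = 0, 7  # character span of the mention "Beyonce"

# Surround the mention with [ENTITY] markers and tokenize the marked text.
marked = text[:start] + "[ENTITY] " + text[start:end] + " [ENTITY]" + text[end:]
input_ids = tokenizer(marked)["input_ids"]

# Entity side: a single [MASK] entity for the mention plus a [PAD] entity.
entity_ids = [1, 0]  # [MASK], [PAD]
# entity_position_ids would list the token indices covered by the mention,
# padded up to max_mention_length.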

Relation classification (e.g. TACRED)

In the case of relation classification, the task is: given two entities in the sentence (one head, one tail), predict the relation between them. Suppose that "Beyonce" is the head entity and "Los Angeles" the tail entity. Then the data is prepared as follows:

  • token sequence:
    <s>, [HEAD], Bey, once, [HEAD], lives, in, [TAIL], Los, Angeles, [TAIL], </s>
  • entity sequence:
    [HEAD], [TAIL]

=> as one can see, the head entity is surrounded by a special [HEAD] token and the tail entity by a special [TAIL] token. The code specifies entity_ids = [1, 2], which refer to [HEAD] and [TAIL], respectively. Their embeddings are initialized using the pre-trained [MASK] embedding.

-> input to LukeTokenizer: single sequence + information on which entity is the head and which one is the tail
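A small sketch of the marker insertion this describes (illustration only; the spans are character offsets of the two mentions):

text = "Beyonce lives in Los Angeles"
head_span, tail_span = (0, 7), (17, 28)  # "Beyonce", "Los Angeles"

def add_markers(s, span, marker):
    start, end = span
    return s[:start] + marker + " " + s[start:end] + " " + marker + s[end:]

# Insert the later span first so the earlier character offsets stay valid.
marked = add_markers(text, tail_span, "[TAIL]")
marked = add_markers(marked, head_span, "[HEAD]")
print(marked)  # [HEAD] Beyonce [HEAD] lives in [TAIL] Los Angeles [TAIL]

entity_ids = [1, 2]  # [HEAD], [TAIL], matching the code referenced above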

Extractive question-answering (e.g. SQuAD)

Suppose we have the question "Where does Beyonce live?", and we consider the example sentence to be the passage. Here the 500K vocabulary of entities is used, as the passage is from Wikipedia. Then this is represented as follows:

  • token sequence:
    <s>, Where, does, Beyon, ce, live, ?, </s>, Bey, once, lives, in, Los, Angeles, </s>
  • entity sequence (how entities are retrieved from the input sentence: see appendix C in the paper):
    "Beyonce, Beyonce, Los Angeles"

=> here, no [MASK] entity is used (only entities mentioned in the question + passage are included). Because the task is to predict the start and end positions of the answer, we place two linear layers, corresponding to start and end positions, on top of the output word representations to predict the answer span. This is the same architecture as the SQuAD model of BERT except for the presence of entity inputs.

-> input to LukeTokenizer: pair of sequences (question + passage)

Cloze-style question-answering (e.g. ReCoRD)

Suppose we consider the example sentence to be the passage, and the query to be "X lives in LA". Then this is represented as follows:

  • token sequence:
    <s>, X, lives, in, LA, </s>, </s>, Bey, once, lives, in, Los, Angeles, </s>
  • entity sequence:
    [MASK] [MASK] [MASK]

The paper says "we input [MASK] entities corresponding to the missing entity and all entities in the passage". But how do you know all the entities in the passage? Never mind, I read in the paper that entity spans are provided by the ReCoRD dataset. So in this case, there is one missing entity and there are two entities mentioned in the passage, hence three [MASK] tokens in the entity sequence.

-> input to LukeTokenizer: pair of sequences (question + passage)

Named-entity recognition (e.g. CoNLL 2003)

In the case of NER, since the entity spans are not provided, we need to enumerate all possible n-grams in the sentence. Therefore, the entity sequence consists of a very long list of [MASK] entities, each of whose position_ids correspond to one n-gram (e.g., the first [MASK] entity corresponds to "Bey", the second one to "Bey, once", the third one to "Bey, once, lives", and so on); see the sketch after the example below.

So in case of our example sentence:

  • token sequence:
    <s>, Bey, once, lives, in, Los, Angeles, </s>
  • entity sequence:
    [MASK], [MASK], [MASK], (...), [MASK]

-> input to LukeTokenizer: single sequence
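A short sketch of the span enumeration this implies (illustration only; 1 stands for the [MASK] entity id here):

tokens = ["Bey", "once", "lives", "in", "Los", "Angeles"]
max_mention_length = 30  # maximum number of tokens inside one entity span

# Every span of up to max_mention_length tokens becomes one candidate entity.
candidate_spans = [
    (start, end)
    for start in range(len(tokens))
    for end in range(start + 1, min(start + max_mention_length, len(tokens)) + 1)
]

entity_ids = [1] * len(candidate_spans)  # one [MASK] entity per candidate span
entity_position_ids = [list(range(s, e)) for s, e in candidate_spans]
print(len(candidate_spans))  # 21 candidate spans for this 6-token sentence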

=> Looking at the different ways data is prepared for the model, LukeTokenizer should probably receive a parameter indicating for which task it needs to prepare the data, as well as additional information about the entities (such as the head and tail in case of relation classification, or the entity in case of entity typing).

The output of LukeTokenizer is always the same:

  • input_ids, token_type_ids, attention_mask (token sequence) and entity_ids, entity_token_type_ids, entity_position_ids and entity_attention_mask (entity sequence)

(I'm updating this comment once I receive more information and read more in the paper)

@ikuyamada
Member

ikuyamada commented Jan 15, 2021

Hi,
Thank you so much for your detailed and accurate understanding of LUKE!

Entity typing (e.g. Open Entity)

Here, the [ENTITY] token is learned only during fine-tuning, but for [MASK] and [PAD] we start from the pre-trained representations. The label is predicted by placing a linear classifier on top of the final hidden representation of the [MASK] token. Correct?

This is mostly correct, but the [ENTITY] token is initialized using the "@" word token.
https://github.com/studio-ousia/luke/blob/master/examples/entity_typing/main.py#L48
Also, we initialize the special word tokens in the other tasks (e.g., [HEAD] and [TAIL]) using the same technique to improve the stability of training. However, this generally has only a very minor effect on performance.

Relation classification

here it's not clear to me which tokens are present in the entity sequence. I see in the code that entity_ids = [1, 2], but I'm not sure what 1 and 2 refer to.

The entity sequence consists of the [HEAD] (ID: 1) and [TAIL] (ID: 2) entities. The embeddings of these entities are initialized using the embedding of the [MASK] entity. The entity embedding matrix, consisting of three rows ([PAD], [HEAD], and [TAIL]), is created here:
https://github.com/studio-ousia/luke/blob/master/examples/relation_classification/main.py#L55

Extractive question-answering (e.g. SQuAD)

=> correct? Then the relevance score (logit) for each entity is computed using a linear layer on top of the concatenation of the final hidden representation of the [MASK] token together with that of the entity.

In the SQuAD task, we do not use the [MASK] entity. Because the task is to predict the start and end positions of the answer, we place two linear layers, corresponding to start and end positions, on top of the output word representations to predict the answer span. This is the same architecture as the SQuAD model of BERT except for the presence of entity inputs.

Cloze-style question-answering (e.g. ReCoRD)

entity sequence:
[MASK], Beyonce, Los Angeles

We do not input entities except for the [MASK] entity in this task. Here, the entity sequence should be "[MASK] [MASK] [MASK]", where the first one corresponds to the missing entity.

Named-entity recognition (e.g. CoNLL 2003)

entity sequence:
[MASK], Beyonce, Los, Los Angeles

In the NER task, since the entity spans are not provided, we need to enumerate all possible n-grams in the sentence. Therefore, the entity sequence consists of a very long list of [MASK] entities, each of whose position_ids correspond to one n-gram (e.g., the first [MASK] entity corresponds to "Bey", the second one to "Bey, once", the third one to "Bey, once, lives", and so on).

=> Looking at the different ways data is prepared for the model, LukeTokenizer should probably receive a parameter indicating for which task it needs to prepare the data, as well as additional information about the entities (such as the head and tail in case of relation classification, or the entity in case of entity typing).

Though I am not familiar with the implementation of recent transformers models, I think it is a good idea to add a parameter indicating the task type to the tokenizer, because it significantly reduces the effort needed to implement the fine-tuning tasks.

@NielsRogge

NielsRogge commented Jan 18, 2021

Ok thank you for the reply! I've updated my previous comment based on your answers (as I'll use that comment as a sort of standard which defines how LUKE works). Could you verify whether the token sequence and entity sequence for each of the downstream tasks is now defined correctly? I also updated each downstream task with what data should be provided to LukeTokenizer (i.e. a single sequence/a pair of sequences, and whether any additional information should be provided). It looks like only in case of entity typing and relation classification, additional info should be provided to LukeTokenizer.

I've listed further questions here:

  • In the paper, "a sequence of words" is mentioned a lot of times (besides the entity sequence), but I guess this is really a sequence of tokens?
  • In the case of relation classification, does that mean we have [HEAD] and [TAIL] tokens both in the vocab of the regular tokens AND in the vocab of the entities?
  • During pre-training, do you have a [MASK] token both in the vocab of the regular tokens and in the vocab of the entities, which are learned independently, and only the latter is used for downstream tasks? Ok, never mind, I see this is mentioned in the paper.
  • What do the numbers in the second column of the entity vocab mean?
[PAD]   0
[UNK]   0
[MASK]  0
[MASK2] 0
Race and ethnicity in the United States Census  284050
United States   261500
Association football    197415
World War II    165432
France  135415
Germany 114339

Once this is all clarified, I'll start working on a pull request. Would you be interested in working on this together?

Kind regards,

Niels

@ikuyamada
Member

@NielsRogge Thanks a lot for your reply!

Could you verify whether the token sequence and entity sequence for each of the downstream tasks is now defined correctly?

Regarding the SQuAD task, the entity sequence should be "Beyonce, Beyonce, Los Angeles" not "Beyonce, Los Angeles" since the input sentence contains "Beyonce" twice.

In the paper, a lot of times "a sequence of words" is mentioned (besides the entity sequence), but I guess this is a sequence of tokens in reality?

Yes, although we use "words" for brevity, LUKE takes a sequence of tokens (or subwords).

in case of relation classification, does that mean that we have a [HEAD] and [TAIL] token both for the vocab of the regular tokens AND the vocab of the entities?

Yes, we need [HEAD] and [TAIL] in both the regular subword vocabulary and the entity vocabulary.

What do the numbers in the second column of the entity vocab mean?

Each number represents the number of Wikipedia hyperlinks that point to the corresponding entity. For example, Wikipedia contained 261,500 hyperlinks that point to the entity United States.

I am interested in adding LUKE to transformers, so please feel free to ask questions anytime.
I do not currently have time to work on this closely this month, but I think I can make time to work on this next month!

@NielsRogge

NielsRogge commented Jan 19, 2021

Thank you, your replies help me a lot in understanding the model! I've started defining modeling_luke.py and configuration_luke.py here. I've already added some documentation (the Transformers library then auto-generates an HTML web page based on a .rst file as well as the docstrings - you can see it by running make html from the docs directory). It then looks like this:

(screenshot: the generated luke_model documentation page)

I start with the base model, and once I know that it outputs the same word_hidden_state and entity_hidden_state as the original implementation on the same input data, I can add the models with classification heads on top. Note that in the Transformers library, some things need to be renamed to make it consistent with the API, for example I renamed word_ids to input_ids.

Inputs of LukeModel

One thing that might be challenging is making sure inputs to LUKE can be padded/truncated similarly to other models in the Transformers library. LUKE seems to be different from other models in the sense that it uses different sequence lengths for different batches, rather than padding them all up to 512 tokens.

LUKE takes two inputs: word_ids and entity_ids (and the corresponding token_type_ids, position_ids and attention_mask). However, as I see in the code, these are not actually padded up to a sequence length of 512 tokens but rather up to the max length in a batch, because the inputs are always rather short (for example, in the case of entity typing, the word_ids of the first batch have shape (2, 33) and the entity_ids shape (2, 2)).

However, we should make it compatible with the API of Transformers. It looks like we will have two extra parameters for LukeTokenizer, namely max_entity_length and max_mention_length. The API of LukeTokenizer could then look as follows:

from transformers import LukeTokenizer

tokenizer = LukeTokenizer.from_pretrained("luke-large", max_entity_length=50)
text = "Beyonce lives in Los Angeles"
encoding = tokenizer(text, padding='max_length', truncation=True, return_tensors='pt')

print(encoding["input_ids"].shape) # prints (1, 462)
print(encoding["entity_ids"].shape) # prints (1, 50)

Outputs of LukeModel and LukeEntityAwareAttentionModel

LukeModel outputs the same as BertModel, namely:

  • last_hidden_state of shape (batch_size, seq_len, hidden_size). This can be split up into word_hidden_states and entity_hidden_states, based on input_ids.shape[1].
  • pooler_output of shape (batch_size, hidden_size).

LukeEntityAwareAttentionModel outputs 2 things:

  • word_hidden_states
  • entity_hidden_states

=> what is the easiest way to perform a forward pass on these models on some input data? This is very important for the integration tests.

@ikuyamada
Member

Hi @NielsRogge,
Thank you for your excellent work! The documentation page looks awesome!👏

However, we should make it compatible with the API of Transformers. It looks like we will have two extra parameters to Luketokenizer, namely max_entity_length and max_mention_length.

We do not pad up to the maximum length simply for computational efficiency; adding extra padding should therefore not affect the performance.

=> what is the easiest way to perform a forward pass on these models on some input data? This is very important for the integration tests.

I am not sure I correctly understand what you mean here, but the data to be used to run the forward pass can be created using textual inputs with entity annotations, e.g., entity typing dataset.

@NielsRogge

NielsRogge commented Jan 19, 2021

Sorry I could have been more clear, I mean that I want to compare the outputs of my implementation with those of the original implementation on the same input data (to make sure my implementation is OK). So I wonder what the easiest way is to perform a forward pass on some dummy data with LukeModel and LukeEntityAwareAttentionModel in a single Python script.

@ikuyamada
Member

ikuyamada commented Jan 19, 2021

Thanks for your reply.

LukeModel and LukeEntityAwareAttentionModel are transformers-based models with a forward function, so the model can be called directly on some input data.

Also, LUKE provides a ModelArchive class that loads the pretrained model data (see here for example usage). You can extract the pretrained model weights with ModelArchive.load("luke_large_500k.tar.gz"). The returned object has a state_dict property, which can be used to load the pretrained weights into a model instance (LukeModel or LukeEntityAwareAttentionModel) using torch's load_state_dict function.
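For example, a forward pass on the original implementation could look roughly like this (a sketch; apart from ModelArchive.load and the state_dict property mentioned above, the attribute and argument names are assumptions, not verified against the repo):

import torch
from luke.model import LukeModel  # or LukeEntityAwareAttentionModel
from luke.utils.model_utils import ModelArchive

archive = ModelArchive.load("luke_large_500k.tar.gz")
model = LukeModel(archive.config)  # assumes the archive exposes the model config
model.load_state_dict(archive.state_dict, strict=False)  # head weights may be missing
model.eval()

with torch.no_grad():
    # `encoding` is a dict of batched tensors matching the arguments of
    # LukeModel.forward (word_ids, word_segment_ids, word_attention_mask,
    # entity_ids, entity_position_ids, entity_segment_ids, entity_attention_mask)
    outputs = model(**encoding)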

@Raabia-Asif
Contributor

Finally I am able to reproduce the results using the Google Colab notebook by @NielsRogge. Thank you!
But there is one problem I am facing: when I try to fine-tune, the execution somehow stops after epoch 1, and hence the pytorch_model.bin file is not created in the output folder; probably Google Colab disconnects during execution.
I have also tried methods to stop Colab from disconnecting, like running the following code in the Colab console:
function ClickConnect(){
console.log("Working");
document.querySelector("colab-toolbar-button#connect").click()
}
setInterval(ClickConnect,60000)
No luck so far.
Any suggestions from anyone?

@ikuyamada
Member

@Raabia-Asif I am sorry, but I am not familiar with Google Colab.

@NielsRogge How are things going with this? I have almost completed my current project, so I can now work closely with you to integrate LUKE into transformers.

@NielsRogge

Hey @ikuyamada, I'm currently working on another algorithm to be added to the Transformers library, but after that I can work on adding LUKE. My current implementation is here.

The main thing to do is to make LUKE compatible with the existing API of the Transformers library.

@ikuyamada
Member

@NielsRogge Did you face any major issues integrating LUKE into transformers? I am not familiar with the process of integrating new models into transformers, but if you are still interested, we can work on this together. I will review the code to understand the progress. Also, I plan to release the base model of LUKE soon.

@NielsRogge

NielsRogge commented Feb 2, 2021

@ikuyamada so as already said above, 3 files are required for each model in the Transformers library:

  • configuration_luke.py
  • tokenization_luke.py
  • modeling_luke.py

(plus a conversion script that loads the weights from the original model into the modeling_luke.py model; see convert_luke_original_pytorch_checkpoint_to_pytorch.py, which I already defined (but still to be updated) here.)

The configuration and modeling files are mainly copies from model.py, so this is not a problem. However, for the tokenizer, the API should look something like this:

from transformers import LukeTokenizer

tokenizer = LukeTokenizer.from_pretrained("studio-ousia/luke-large")
text = "This is a sentence"
encoding = tokenizer(text, return_tensors="pt", padding="max_length")

The encoding should then be a dictionary (in the Transformers library this is actually a BatchEncoding object, but it behaves like a dict with some additional functionality) with the keys input_ids, attention_mask, and token_type_ids (the latter are called segment_ids in your code, I see). However, as LUKE also uses entity_ids, entity_attention_mask and entity_token_type_ids, these should be added to the encoding. And as these are created very differently depending on the downstream task, I suggest adding a 'task' parameter to the __call__ method of LukeTokenizer, and an 'entities' parameter for downstream tasks that require additional information about the sentence. When people want to prepare a sentence for relation classification, for example, it could look like this:

text = "Beyonce lives in Los Angeles"
encoding = tokenizer(text, task='relation_classification', entities=[('Beyonce', (0, 7)), ('Los Angeles', (17, 28))], return_tensors='pt', padding='max_length')

In this case, encoding['entity_ids'] should then be equal to torch.tensor([[1, 2]]). The __call__ method automatically adds a batch dimension if only a single sentence is provided. Note that each tokenizer in the Transformers library inherits from the base PretrainedTokenizer class, which defines some common functionality such as padding and truncation, so this actually overrides the __call__ method.

So that effectively means we should mainly translate the data preparation (utils.py) files in the examples directory of this repository into a single __call__ method of LukeTokenizer. For some tasks (such as entity typing) this seems rather straightforward, for others (such as SQuAD) more difficult. A rough skeleton of the idea is sketched below.
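As a very rough skeleton of that idea (not a working implementation; the real tokenizer would also need to build entity_position_ids, entity_attention_mask and entity_token_type_ids, and handle padding/truncation via the base class):

from transformers import RobertaTokenizer

class LukeTokenizer(RobertaTokenizer):
    def __call__(self, text, task=None, entities=None, **kwargs):
        # Delegate the word-level tokenization to the Roberta logic.
        encoding = dict(super().__call__(text, **kwargs))
        # Add the entity-level inputs depending on the downstream task.
        if task == "entity_typing":
            encoding["entity_ids"] = [[1, 0]]  # [MASK], [PAD]
        elif task == "relation_classification":
            encoding["entity_ids"] = [[1, 2]]  # [HEAD], [TAIL]
        return encoding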

@ikuyamada
Member

@NielsRogge Thank you very much for the detailed explanation! I think it is a very good idea to add a "task" parameter to the tokenizer. I also think one possible direction might be to support only common NLP tasks in the tokenizer, such as entity typing (or single entity prediction), relation classification (or entity pair prediction), and named entity recognition.

As you may have already noticed, if we support SQuAD in the tokenizer, we need to perform the heuristic entity linking within the tokenizer, which may be too specific to that task. One idea is to treat SQuAD differently from the other tasks and add its implementation as a research project. If we do not support SQuAD in the tokenizer, we only need to deal with the [PAD] and [MASK] entities.

Also, if the tokenizer is to directly output the input data (without the model being modified at fine-tuning time), we need to modify the model before uploading it to the cloud storage of transformers.
I think we need to change the word vocabulary and pretrained weights by adding the special word tokens (and their embeddings) used in the downstream tasks, i.e., the [ENTITY], [HEAD], and [TAIL] word tokens.
In the relation classification task, we use two different [MASK] entities, i.e., the [HEAD] and [TAIL] entities. Fortunately, our entity vocabulary includes the unused [MASK2] entity, so we can initialize the embedding of [MASK2] using the embedding of [MASK] and treat the [MASK] and [MASK2] entities as the [HEAD] and [TAIL] entities, respectively, in the relation classification task.

If we implement based on the methods above, I think the implementation can be done in a straightforward way.
I would appreciate hearing your thoughts.

@NielsRogge

I think one idea is that we treat SQuAD differently from other tasks, and add its implementation as a research project. If we do not support SQuAD in the tokenizer, we need to deal only with [PAD] and [MASK] entities.

Very good idea! This would indeed make the implementation of the tokenizer a lot easier.

Also, if the tokenizer directly outputs the input data without modifying the model, we need to modify the model before uploading it to the cloud storage of transformers.
I think we need to change the word vocabulary and pretrained weights by adding some special word tokens and their embeddings used in the downstream tasks, i.e., [ENTITY], [HEAD], and [TAIL] word tokens.

So you mean these should be added to the embedding layer of the model? Will the tokenizer then just have a single vocab file, or will it have two?

@ikuyamada
Member

Thanks for your reply!

So you mean these should be added to the embedding layer of the model? Will the tokenizer then just have a single vocab file, or will it have two?

Yes, so the word embedding layer of LUKE will have the embeddings of the added special tokens. We need to modify the word vocabulary file by adding these special tokens. The tokenizer file will have a single modified word vocabulary file and a single entity vocabulary file.

Then the tokenizer can directly output the input sequences, e.g., <s>, [HEAD], Bey, once, [HEAD], lives, in, [TAIL], Los, Angeles, [TAIL], </s> and [MASK1], [MASK2], since the vocabularies already contain these special word tokens and special entities.

@NielsRogge

Ok :) well, that will be the first tokenizer in the library with two vocabulary files. But what I find a bit odd is that the entity vocab consists of 500K entities, yet these are not used in any downstream task except SQuAD, right?

If I understand correctly, the following special tokens will have to be added to the word vocabulary:

  • [ENTITY] (for entity typing)
  • [HEAD] (for relation classification)
  • [TAIL] (for relation classification)

=> what is their initialized embedding vector?

And for the entity vocabulary, we have the following special tokens:

  • [MASK]
  • [MASK2],

both of which are initialized with the pre-trained embedding of [MASK].

Also, in modeling_luke.py we should add the head models, namely:

  • LukeForEntityTyping
  • LukeForRelationClassification
  • LukeForClozeStyleQuestionAnswering
  • LukeForNamedEntityRecognition

=> however, I'm not sure whether LukeForNamedEntityRecognition will be allowed by the maintainers of the library, since all existing models for NER are simply called xxxForTokenClassification (since these simply add a token classification head on top of the base model).

I can maybe already start today or tomorrow, I'll update you!

@ikuyamada
Member

ikuyamada commented Feb 4, 2021

But what I do find a bit weird is, the entity vocab consists of 500K entities, but these are not used in any downstream task except for SQuAD, right?

Yes, the entities other than [MASK] and [PAD] are used only in the pretraining and SQuAD tasks. If we do not support SQuAD, these entity vocab entries and the corresponding embeddings can be removed from the pretrained model, which significantly reduces the model size.
However, similar to the method adopted in the SQuAD task, enriching the word sequence with entities (based on heuristic entity linking or an off-the-shelf entity linking system) might be beneficial for applications requiring real-world knowledge.
One idea is to upload two kinds of LUKE model to the transformers servers, e.g., LUKE with Wikipedia entities and LUKE without Wikipedia entities.
I would like to hear your thoughts about this.

If I understand correctly, the following special tokens will have to be added to the word vocabulary:
[ENTITY] (for entity typing)
[HEAD] (for relation classification)
[TAIL] (for relation classification)
=> what is their initialized embedding vector?

I think it is better to add general [ENT] and [ENT2] special tokens to the word vocab, and use [ENT] for [ENTITY] and [HEAD], and [ENT2] for [TAIL]. In this case, the [ENT] and [ENT2] embeddings should be initialized using the embeddings of "@" and "#", respectively.
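For concreteness, that initialization might look roughly like this in the conversion script (a sketch; state_dict is assumed to be the original checkpoint's state dict, with the embeddings.word_embeddings.weight key as in the original model):

import torch
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-large")
at_id = tokenizer.convert_tokens_to_ids("@")
hash_id = tokenizer.convert_tokens_to_ids("#")

word_emb = state_dict["embeddings.word_embeddings.weight"]
ent_init = word_emb[at_id].unsqueeze(0)     # initialization for the new [ENT] row
ent2_init = word_emb[hash_id].unsqueeze(0)  # initialization for the new [ENT2] row

# Append the two new rows; the config's vocab_size must then grow by 2 to match.
state_dict["embeddings.word_embeddings.weight"] = torch.cat([word_emb, ent_init, ent2_init])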

LukeForEntityTyping
LukeForRelationClassification
LukeForClozeStyleQuestionAnswering
LukeForNamedEntityRecognition

I think "LukeForClozeStyleQuestionAnswering" is quite specific to the ReCoRD task, I think it may be reasonable to exclude it.

=> however, I'm not sure whether LukeForNamedEntityRecognition will be allowed by the maintainers of the library, since all existing models for NER are simply called xxxForTokenClassification (since these simply add a token classification head on top of the base model).

Yes, since LUKE is based on the span-based NER approach (Sohrab and Miwa, 2018), the architecture of LUKE is different from other token-based NER models. I think names like LUKEForSpanBasedNER can be an alternative.

I can maybe already start today or tomorrow, I'll update you!

Thanks 😍
Looking forward to hearing your updates!

@ikuyamada
Member

@NielsRogge I have time to work on the task this week. So, if you agree to the methods described above, I can create the modified pretrained LUKE model and vocab files with a new conversion script. I think the implementation of the tokenizer and head models is relatively straightforward for the modified LUKE model.

@NielsRogge

Hi,

Yes, that's fine with me. I haven't had time to work on it yet, since I also have a full-time job. I hope I can free up some time this week.

Kind regards,

Niels

@ikuyamada
Member

Hi,
Thank you for your prompt reply! It would be appreciated if you work on this when you have enough spare time.
If you are okay with it, I am planning to modify your convert_luke_original_pytorch_checkpoint_to_pytorch.py to implement what we discussed above.

Best regards,
Ikuya

@ikuyamada
Member

I have just created a PR that updates the conversion script to implement what we have discussed here. I also added some TODOs and NOTEs that can potentially improve the implementation.
I would appreciate it if you could review the PR when you have time :)

Thanks,
Ikuya

@NielsRogge

NielsRogge commented Feb 11, 2021

Hi @ikuyamada,

Thank you for the PR. I'm currently setting up a simple example from the Open Entity dev set, on which I want to test the forward pass of both the original implementation and our implementation in modeling_luke.py. We can use this for the "integration tests" in test_modeling_luke.py, which only pass if both implementations return the same output tensors on the same input data.

However, when setting up a simple example, I've got 2 questions:

  • when using RobertaTokenizer, how does it know the index for the [ENTITY] token when calling convert_tokens_to_ids? What is the index for this? When I now create the word_ids of a sentence using this method, and decode them back, it prints <unk> instead of [ENTITY] because this special token is not part of the vocab of RobertaTokenizer.
  • What is max_mention_length set to?

Colab to reproduce: https://colab.research.google.com/drive/1-vetwRrfox77w4kTbi8dQSlad-c5EFns?usp=sharing

@ikuyamada
Member

ikuyamada commented Feb 11, 2021

Hi @NielsRogge,

Thank you for working on creating integration tests! Please let me know if there is anything I can do.

when using RobertaTokenizer, how does it know the index for the [ENTITY] token when calling convert_tokens_to_ids?

In the new conversion script, the tokenizer is also serialized into pytorch_dump_folder_path. Therefore, the tokenizer should be able to be reconstructed by RobertaTokenizer.from_pretrained(pytorch_dump_folder_path).

>>> tokenizer = RobertaTokenizer.from_pretrained('out')
>>> tokenizer.tokenize('[ENT] Beyonce [ENT] lives in Los Angeles')
 ['[ENT]', 'Bey', 'once', '[ENT]', 'l', 'ives', 'Ġin', 'ĠLos', 'ĠAngeles']
>>> tokenizer.encode('[ENT] Beyonce [ENT] lives in Los Angeles')
[0, 50265, 40401, 25252, 50265, 462, 3699, 11, 1287, 1422, 2]

If I understand correctly, it is possible to upload these serialized tokenizer files to the cloud storage of Transformers.

What is max_mention_length set to?

max_mention_length is also a hyper-parameter defined in the metadata file (metadata.json). It represents the maximum number of tokens inside an entity span and is set to 30 by default.

@ikuyamada ikuyamada changed the title Reproduce Experimental Results (NER) on CPU? Adding LUKE to HuggingFace Transformers (Originally: Reproduce Experimental Results (NER) on CPU?) Feb 11, 2021
@NielsRogge

NielsRogge commented Feb 11, 2021

Thank you for the quick response! Could you take a look at this notebook? https://colab.research.google.com/drive/1-vetwRrfox77w4kTbi8dQSlad-c5EFns?usp=sharing

In the notebook, I would like to perform a forward pass on the original implementation (so prior to the conversion of the HuggingFace implementation). I just want to obtain word_hidden_states and entity_hidden_states for the given sentence.

I'm getting an error when performing a forward pass, not sure why:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-11-5dd95f53c60f> in <module>()
      5     encoding[key] = torch.as_tensor(encoding[key])
      6 
----> 7 outputs = model(**encoding)
      8 outputs

4 frames
/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    545             result = self._slow_forward(*input, **kwargs)
    546         else:
--> 547             result = self.forward(*input, **kwargs)
    548         for hook in self._forward_hooks.values():
    549             hook_result = hook(self, input, result)

/content/luke/luke/model.py in forward(self, word_ids, word_segment_ids, word_attention_mask, entity_ids, entity_position_ids, entity_segment_ids, entity_attention_mask)
    207         entity_attention_mask,
    208     ):
--> 209         word_embeddings = self.embeddings(word_ids, word_segment_ids)
    210         entity_embeddings = self.entity_embeddings(entity_ids, entity_position_ids, entity_segment_ids)
    211         attention_mask = self._compute_extended_attention_mask(word_attention_mask, entity_attention_mask)

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    545             result = self._slow_forward(*input, **kwargs)
    546         else:
--> 547             result = self.forward(*input, **kwargs)
    548         for hook in self._forward_hooks.values():
    549             hook_result = hook(self, input, result)

/usr/local/lib/python3.6/dist-packages/transformers/modeling_roberta.py in forward(self, input_ids, token_type_ids, position_ids, inputs_embeds)
     57             if input_ids is not None:
     58                 # Create the position ids from the input token ids. Any padded tokens remain padded.
---> 59                 position_ids = self.create_position_ids_from_input_ids(input_ids).to(input_ids.device)
     60             else:
     61                 position_ids = self.create_position_ids_from_inputs_embeds(inputs_embeds)

/usr/local/lib/python3.6/dist-packages/transformers/modeling_roberta.py in create_position_ids_from_input_ids(self, x)
     74         """
     75         mask = x.ne(self.padding_idx).long()
---> 76         incremental_indicies = torch.cumsum(mask, dim=1) * mask
     77         return incremental_indicies + self.padding_idx
     78 

IndexError: Dimension out of range (expected to be in range of [-1, 0], but got 1)

Once that works, I can provide the same sentence to my implementation in modeling_luke.py, and compare the hidden states!

@ikuyamada
Member

Thank you for your response! I am not sure but maybe we need to pass input tensors (e.g., word_ids) as matrices instead of vectors: torch.as_tensor(encoding[key]) -> torch.as_tensor(encoding[key]).view(1, -1)

@NielsRogge

NielsRogge commented Feb 11, 2021

Oh yes of course I should have added the batch dimension, thank you. Hmm now getting this:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-15-548a8dfc9b1b> in <module>()
      5     encoding[key] = torch.as_tensor(encoding[key]).unsqueeze(0)
      6 
----> 7 outputs = model(**encoding)
      8 outputs

7 frames
/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
   1465         # remove once script supports set_grad_enabled
   1466         _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 1467     return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
   1468 
   1469 

RuntimeError: index out of range: Tried to access index 50265 out of table with 50264 rows. at /pytorch/aten/src/TH/generic/THTensorEvenMoreMath.cpp:237

That shouldn't happen, right? The word embedding matrix does indeed have 50265 rows, as seen when printing the state_dict:

embeddings.word_embeddings.weight torch.Size([50265, 1024])
embeddings.position_embeddings.weight torch.Size([514, 1024])
embeddings.token_type_embeddings.weight torch.Size([1, 1024])
embeddings.LayerNorm.weight torch.Size([1024])
embeddings.LayerNorm.bias torch.Size([1024])

Btw, do you have slack? Could be handier for communication 😅

@ikuyamada
Member

ikuyamada commented Feb 11, 2021

I am not sure why this happens, however, when adding special tokens to the tokenizer, I think we also need to change the model by (1) increasing the vocab size (convert_luke_original_pytorch_checkpoint_to_pytorch.py#L43) and (2) extending word embeddings in the state dict (convert_luke_original_pytorch_checkpoint_to_pytorch.py#L48). Maybe it fixes the issue.
Yes, I use Slack. Do you already have a Slack room?

@NielsRogge

NielsRogge commented Feb 11, 2021

Normally you will receive an email for a Slack invite!

am not sure why this happens, however, when adding special tokens to the tokenizer, I think we also need to change the model by (1) increasing the vocab size (convert_luke_original_pytorch_checkpoint_to_pytorch.py#L43) and (2) extending word embeddings in the state dict (convert_luke_original_pytorch_checkpoint_to_pytorch.py#L48). Maybe it fixes the issue.

=> Finally, it works!! I'm now getting the word_hidden_states and entity_hidden_states of EntityAwareAttentionModel on the original implementation. These will be used for the integration tests :) next step: provide the same data to my implementation

@msiahbani

@NielsRogge @ikuyamada Any plans to add LUKE pretraining to transformers? Or fine-tuning for NER?
I would appreciate any help. Thanks

@ikuyamada
Member

Hi @msiahbani,

@Ryou0634 is currently developing our new example code based on transformers and AllenNLP. We will let you know when it is available!

@msiahbani

Hi @ikuyamada,
Great news. Thank you.

@msiahbani

@ikuyamada any updates? :">

@ryokan0123
Contributor

@msiahbani
Hi, here's a new example to solve NER with LUKE and allennlp/transformers.
https://github.com/studio-ousia/luke/tree/downstream_allennlp/examples_allennlp/ner
Hope you find this helpful!

@ikuyamada ikuyamada changed the title Adding LUKE to HuggingFace Transformers (Originally: Reproduce Experimental Results (NER) on CPU?) Adding LUKE to HuggingFace Transformers Nov 16, 2021
@aakashb95

@NielsRogge @ikuyamada
Apologies for tagging you directly here.
This issue has been extremely enlightening! A humble request: could you share your Slack transcripts pertaining to this issue? 😄
