Adding LUKE to HuggingFace Transformers #38
Hi,
Thanks for the reply!
Alright.
Hi, could the model be trained on Google Colab?
We haven't tested that, but the code should run on a Colab GPU.
Hi @afi1289, I've created a Google Colab notebook in which I would like to fine-tune LUKE on the entity typing task as explained in the README. Anyone can run this notebook, but I've stored the checkpoint as well as the data in my personal Drive. However, once I've installed all packages using Poetry, the "transformers" package is not recognized as being installed. I guess this is because Poetry installs the dependencies at a different location (pip would install them in …).
@ikuyamada it would be great if you could take a look.
@NielsRogge Thank you for creating the Colab notebook! I think you need to activate the virtualenv created by Poetry by running …
Thank you for the reply. When typing that command, I'm getting a window in which I can type things. Not sure how Poetry works; should I type in a password there or something else? Could you please edit my notebook?
Thanks for the reply. It seems that the …
OK, this works, thanks! I'm getting the following output:
The reason I'm running it is that I'm interested in adding the model to the Transformers library by HuggingFace. However, the model seems to have a lot of dependencies, so I wonder whether this is possible. For each model in the Transformers library, 3 files need to be defined: a configuration file, a modeling file, and a tokenization file.
The big challenge with this model is probably the tokenizer, since it's different from the usual ones. Also, looking at a random example from the dev set of Open Entity, the tokenization happens as follows:
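For illustration (this is not the actual Open Entity example, and "roberta-large" is just assumed here for concreteness), the RoBERTa subword tokenizer that LUKE relies on splits the running example sentence roughly like this:

```python
from transformers import RobertaTokenizer

# Illustration only: LUKE uses RoBERTa's byte-level BPE tokenizer for the word sequence.
tokenizer = RobertaTokenizer.from_pretrained("roberta-large")
print(tokenizer.tokenize("Beyonce lives in Los Angeles."))
# roughly: ['Bey', 'once', 'Ġlives', 'Ġin', 'ĠLos', 'ĠAngeles', '.']
# (the exact pieces depend on the BPE vocabulary)
```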
However, this is only for the word sequence. The entity sequence is then set to [1, 0]. Do 1 and 0 correspond to [MASK] and [PAD], respectively? I also wonder why the [ENTITY] token is not mentioned in the paper.
Hi,
We do not use the … The entity sequence …
OK, maybe we can first clear up some things about LUKE and then work together on adding this to the library. To get a good understanding of how LUKE works, let's use the example sentence "Beyonce lives in Los Angeles".

### Pre-training

During pre-training, sentences come from Wikipedia, so we know which entities are in the sentence (in this case, "Beyonce" and "Los Angeles"). We randomly mask both tokens and entire entities, which the model needs to predict. This happens as follows (let's tokenize the sentence first):
So, for example, we replace the "Bey" and "Los" tokens with [MASK], as well as the "Los Angeles" entity. This sentence is then represented as follows:
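As a purely illustrative sketch (not the original pre-processing code or its tensor names), the masked word and entity sequences could be thought of as:

```python
# Illustrative sketch only, not the original LUKE pre-processing code.
# Word sequence: "Bey" and "Los" have been replaced by the word-level [MASK].
masked_words = ["[MASK]", "once", "lives", "in", "[MASK]", "Angeles"]

# Entity sequence: "Beyonce" is kept, the "Los Angeles" entity is replaced by the
# entity-level [MASK]; each entity also records the word positions it spans,
# which become its entity position ids.
masked_entities = [
    ("Beyonce", [0, 1]),  # spans "Bey", "once"
    ("[MASK]", [4, 5]),   # spans "Los" (now masked), "Angeles"
]
```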
### Fine-tuning

There are different fine-tuning tasks, so we go over each of them to show how sentences are prepared for the model.

#### Entity typing (e.g. Open Entity)

In the case of entity typing, the task is to predict the label of a given entity in a sequence. The entity whose label we need to predict is surrounded by [ENTITY] tokens. So, given for example that we need to predict the label of the entity "Beyonce", the sentence is prepared as follows:
As one can see, the entity whose label we need to predict is surrounded by the [ENTITY] token. This special token is learned only during fine-tuning (but its embedding is initialized using the "@" word token, as can be seen here). The entity sequence consists of [MASK] and [PAD]; here, we start from their pre-trained representations. The label is predicted by placing a linear classifier on top of the final hidden representation of the [MASK] token. -> input to …

#### Relation classification (e.g. TACRED)

In the case of relation classification, the task is: given 2 entities in the sentence (one head, one tail), the model needs to predict the relation between them. Suppose that "Beyonce" is the head entity and "Los Angeles" the tail entity. Then the data is prepared as follows:
=> As one can see, the "head" entity is surrounded by a special [HEAD] token and the tail one by a special [TAIL] token. In the code it is specified that: …

#### Extractive question-answering (e.g. SQuAD)

Suppose we have the question "Where does Beyonce live?", and we consider the example sentence to be the passage. Here, the 500K vocabulary of entities is used, as the passage is from Wikipedia. Then this is represented as follows:
=> Here, no [MASK] entity is used (only entities mentioned in the question + passage are included). Because the task is to predict the start and end positions of the answer, we place two linear layers, corresponding to start and end positions, on top of the output word representations to predict the answer span. This is the same architecture as the SQuAD model of BERT, except for the presence of entity inputs. -> input to …

#### Cloze-style question-answering (e.g. ReCoRD)

Suppose we consider the example sentence to be the passage, and the query to be "X lives in LA". Then this is represented as follows:
In the paper it is said that "we input [MASK] entities corresponding to the missing entity and all entities in the passage". -> input to …

#### Named-entity recognition (e.g. CoNLL 2003)

In the case of NER, since the entity spans are not provided, we need to enumerate all possible n-grams in the sentence. Therefore, the entity sequence consists of a very long sequence of [MASK] entities, whose position_ids correspond to the n-grams (e.g., the first [MASK] entity corresponds to "Bey", the second one to "Bey, once", the third one to "Bey, once, lives", ...). So in the case of our example sentence:
-> input to …

=> Looking at the different ways data is prepared for the model, … The output of …
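To summarize my current understanding of the entity-side inputs per task (an illustrative overview of the discussion in this thread, not code from the repository):

```python
# Illustrative summary of which entities go into the entity sequence per task,
# based on the discussion in this thread (not code from the LUKE repository).
ENTITY_INPUTS_PER_TASK = {
    "entity_typing":           "[MASK] for the target entity, [PAD] for padding",
    "relation_classification": "[HEAD] for the head entity, [TAIL] for the tail entity",
    "extractive_qa_squad":     "Wikipedia entities found in the question and passage (500K vocab)",
    "cloze_qa_record":         "[MASK] entities, the first one being the missing entity",
    "ner":                     "one [MASK] entity per candidate span (all n-grams)",
}
```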
(I'm updating this comment as I receive more information and read more of the paper.)
Hi,

#### Entity typing (e.g. Open Entity)
This is mostly correct, but the [ENTITY] token is initialized using the "@" word token.

#### Relation classification
The entity sequence consists of [HEAD] (ID: 1) and [TAIL] (ID: 2) entities. The embeddings of these entities are initialized using the embedding of the [MASK] entity. The entity embeddings, consisting of three rows ([PAD], [HEAD], and [TAIL]), are created here: …
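As a minimal illustration (assumed tensor names and shapes, not the code linked above), the three-row matrix could be built like this:

```python
import torch

def build_entity_pair_embeddings(
    pretrained_entity_emb: torch.Tensor,  # (entity_vocab_size, hidden_size)
    pad_id: int,
    mask_id: int,
) -> torch.Tensor:
    """Return a 3-row entity embedding matrix for [PAD], [HEAD] and [TAIL],
    where [HEAD] and [TAIL] are initialized from the pre-trained [MASK] entity."""
    pad = pretrained_entity_emb[pad_id]
    head = pretrained_entity_emb[mask_id].clone()
    tail = pretrained_entity_emb[mask_id].clone()
    return torch.stack([pad, head, tail], dim=0)
```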
#### Extractive question-answering (e.g. SQuAD)

In the SQuAD task, we do not use the [MASK] entity. Because the task is to predict the start and end positions of the answer, we place two linear layers, corresponding to start and end positions, on top of the output word representations to predict the answer span. This is the same architecture as the SQuAD model of BERT, except for the presence of entity inputs.

#### Cloze-style question-answering (e.g. ReCoRD)
We do not input entities other than the [MASK] entity in this task. Here, the entity sequence should be "[MASK] [MASK] [MASK]", where the first one corresponds to the missing entity.

#### Named-entity recognition (e.g. CoNLL 2003)
In the NER task, since the entity spans are not provided, we need to enumerate all possible n-grams in the sentence. Therefore, the entity sequence consists of a very long sequence of [MASK] entities, whose position_ids correspond to the n-grams (e.g., the first [MASK] entity corresponds to "Bey", the second one to "Bey, once", the third one to "Bey, once, lives", ...).
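As a rough illustration of that span enumeration (the maximum span length used here is an assumption, not the value used in the repository):

```python
from typing import List, Tuple

def enumerate_spans(num_tokens: int, max_span_length: int = 16) -> List[Tuple[int, int]]:
    """Enumerate all candidate (start, end) word spans; each span gets its own
    [MASK] entity whose position ids point at the span's word positions."""
    spans = []
    for start in range(num_tokens):
        for end in range(start, min(start + max_span_length, num_tokens)):
            spans.append((start, end))  # inclusive end index
    return spans

# e.g. for the 6-token example "Bey once lives in Los Angeles":
print(enumerate_spans(6, max_span_length=3))
```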
Though I am not familiar with the implementation of recent Transformers models, I think it is a good idea to add a parameter indicating the task type to the tokenizer, because it significantly reduces the effort needed to implement the fine-tuning tasks.
OK, thank you for the reply! I've updated my previous comment based on your answers (as I'll use that comment as a sort of standard that defines how LUKE works). Could you verify whether the token sequence and entity sequence for each of the downstream tasks are now defined correctly? I also updated each downstream task with what data should be provided to … I've listed further questions here:
Once this is all clarified, I'll start working on a pull request. I wonder whether you would be interested in working on this together? Kind regards, Niels
@NielsRogge Thanks a lot for your reply!
Regarding the SQuAD task, the entity sequence should be "Beyonce, Beyonce, Los Angeles" not "Beyonce, Los Angeles" since the input sentence contains "Beyonce" twice.
Yes, although we use "words" for brevity, LUKE takes a sequence of tokens (or subwords).
Yes, we need [HEAD] and [TAIL] in both the regular subword vocabulary and the entity vocabulary.
Each number represents the number of Wikipedia hyperlinks that point to the corresponding entity. For example, Wikipedia contained 261,500 hyperlinks pointing to the entity "United States". I am interested in adding LUKE to transformers, so please feel free to ask questions anytime.
Thank you, your replies help me a lot in understanding the model! I've started defining … I'll start with the base model, and once I know that it outputs the same …

#### Inputs of LukeModel

One thing that might be challenging is making sure inputs to LUKE can be padded/truncated similarly to other models in the Transformers library. It seems like LUKE is different from other models, in the sense that it uses different sequence lengths for different batches, rather than padding them all up to 512 tokens. LUKE takes in 2 inputs: … However, we should make it compatible with the API of Transformers. It looks like we will have two extra parameters to …
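To make this concrete, here is a rough sketch of what such a forward signature could look like (all parameter names and shapes are assumptions based on the inputs discussed above, not a settled API):

```python
import torch

def luke_forward_inputs_sketch(
    input_ids: torch.Tensor,               # (batch, word_seq_len) - subword ids
    attention_mask: torch.Tensor,          # (batch, word_seq_len)
    token_type_ids: torch.Tensor,          # (batch, word_seq_len)
    entity_ids: torch.Tensor,              # (batch, entity_seq_len) - entity vocab ids
    entity_attention_mask: torch.Tensor,   # (batch, entity_seq_len)
    entity_token_type_ids: torch.Tensor,   # (batch, entity_seq_len)
    entity_position_ids: torch.Tensor,     # (batch, entity_seq_len, max_mention_length)
) -> None:
    """Placeholder illustrating the expected word-side and entity-side inputs;
    not an implementation."""
```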
#### Outputs of LukeModel and LukeEntityAwareAttentionModel
=> What is the easiest way to perform a forward pass with these models on some input data? This is very important for the integration tests.
Hi @NielsRogge,
We do not add padding to the max length for computational efficiency. Therefore, adding extra padding should not affect the performance.
I am not sure I correctly understand what you mean here, but the data to be used to run the forward pass can be created using textual inputs with entity annotations, e.g., the entity typing dataset.
Sorry, I could have been clearer. I mean that I want to compare the outputs of my implementation with those of the original implementation on the same input data (to make sure my implementation is OK). So I wonder what the easiest way is to perform a forward pass on some dummy data with …
Thanks for your reply.
Also, LUKE provides …
Finally, I am able to reproduce the results using the Google Colab notebook by @NielsRogge. Thank you!
@Raabia-Asif I am sorry, but I am not familiar with Google Colab. @NielsRogge How are things going with this? I have almost completed my current project, so I can work closely with you to integrate LUKE into transformers.
Hey @ikuyamada, I'm currently working on another algorithm to be added to the Transformers library, but after that I can work on adding LUKE. My current implementation is here. The main thing to do is to make LUKE compatible with the existing API of the Transformers library.
@NielsRogge Did you face any major issues integrating LUKE into transformers? I am not familiar with the integration process for new models in transformers, but if you are still interested, we can work together on this. I will review the code to understand the progress. Also, I plan to release the base model of LUKE soon.
@ikuyamada so as already said above, 3 files are required for each model in the Transformers library: a configuration file, a modeling file, and a tokenization file
(and a conversion script that loads the weights from the original model into the HuggingFace implementation). The configuration and modeling files are mainly copies of model.py, so this is not a problem. However, for the tokenizer, the API should look something like this:
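Something in this direction, perhaps (a purely hypothetical sketch of the data preparation the tokenizer could perform for entity typing; none of these names are an existing API):

```python
from typing import Dict, List, Tuple

def encode_for_entity_typing(words: List[str], entity_word_span: Tuple[int, int]) -> Dict[str, list]:
    """Hypothetical sketch: surround the target entity with [ENTITY] markers on the
    word side, and add a single [MASK] entity (plus [PAD]) on the entity side whose
    position ids point at the marked span."""
    start, end = entity_word_span  # inclusive word indices of the target entity
    word_sequence = words[:start] + ["[ENTITY]"] + words[start:end + 1] + ["[ENTITY]"] + words[end + 1:]
    entity_ids = ["[MASK]", "[PAD]"]
    entity_position_ids = [list(range(start + 1, end + 2)), []]  # +1 for the inserted [ENTITY]
    return {
        "words": word_sequence,
        "entity_ids": entity_ids,
        "entity_position_ids": entity_position_ids,
    }

print(encode_for_entity_typing(["Bey", "once", "lives", "in", "Los", "Angeles"], (0, 1)))
```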
The encoding should then be a dictionary (in the Transformers library this is actually a BatchEncoding) …
In this case, … So that effectively means that we should mainly translate the data preparation (…).
@NielsRogge Thank you very much for the detailed explanation! I think it is a very good idea to add a "task" parameter to the tokenizer. I also think that one possible direction might be to support only common NLP tasks in the tokenizer, such as entity typing (or single entity prediction), relation classification (or entity pair prediction), and named entity recognition. As you may have already noticed, if we support SQuAD in the tokenizer, we need to perform the heuristic entity linking within the tokenizer, which may be too specific to this task. One idea is to treat SQuAD differently from other tasks and add its implementation as a research project. If we do not support SQuAD in the tokenizer, we need to deal only with [PAD] and [MASK] entities. Also, if the tokenizer directly outputs the input data without modifying the model, we need to modify the model before uploading it to the cloud storage of transformers. If we implement it based on the methods above, I think the implementation can be done in a straightforward way.
Very good idea! This would indeed make the implementation of the tokenizer a lot easier.
So you mean these should be added to the embedding layer of the model? Will the tokenizer then just have a single vocab file, or will it have two?
Thanks for your reply!
Yes, so the word embedding layer of LUKE will have the embeddings of the added special tokens. We need to modify the word vocabulary file by adding these special tokens. The tokenizer will have a single modified word vocabulary file and a single entity vocabulary file. Then the tokenizer can directly output the input sequences, e.g., …
OK :) well, that will be the first tokenizer in the library which has 2 vocabulary files. But what I do find a bit weird is that the entity vocab consists of 500K entities, yet these are not used in any downstream task except for SQuAD, right? If I understand correctly, the following special tokens will have to be added to the word vocabulary:
=> What are their initial embedding vectors? And for the entity vocabulary, we have the following special tokens:
both of which are initialized with the pre-trained embedding of [MASK]. Also, in …
=> However, I'm not sure whether … I can maybe already start today or tomorrow; I'll update you!
Yes, the entities other than [MASK] and [PAD] are used only in the pretraining and SQuAD tasks. If we do not support SQuAD, these entity vocab entries and corresponding embeddings can be removed from the pretrained model, which significantly reduces the model size.
I think it is better to add general [ENT] and [ENT2] special tokens to the word vocab, and use [ENT] for [ENTITY] and [HEAD], and [ENT2] for [TAIL]. In this case, the [ENT] and [ENT2] embeddings should be initialized using the embeddings of "@" and "#", respectively.
I think "LukeForClozeStyleQuestionAnswering" is quite specific to the ReCoRD task, I think it may be reasonable to exclude it.
Yes, since LUKE is based on the span-based NER approach (Sohrab and Miwa, 2018), the architecture of LUKE is different from other token-based NER models. I think names like …
Thanks 😍
@NielsRogge I have time to work on the task this week. So, if you agree with the methods described above, I can create the modified pretrained LUKE model and vocab files with a new conversion script. I think the implementation of the tokenizer and head models is relatively straightforward for the modified LUKE model.
Hi, yes, that's fine with me. I haven't had the time yet to work on it, since I also have a full-time job. I hope I can make some time available this week. Kind regards, Niels
Hi, … Best regards,
I have just created a PR that updates the conversion script to implement what we have discussed here. I also added some TODOs and NOTEs that can potentially improve the implementation. Thanks,
Hi @ikuyamada, thank you for the PR. I'm currently setting up a simple example from the Open Entity dev set, on which I want to test both the forward pass of the original implementation and our implementation in … However, when setting up a simple example, I've got 2 questions:
Colab to reproduce: https://colab.research.google.com/drive/1-vetwRrfox77w4kTbi8dQSlad-c5EFns?usp=sharing
Hi @NielsRogge, thank you for working on the integration tests! Please let me know if there is anything I can do.
In the new conversion script, the tokenizer is also serialized into the output directory:

```python
>>> tokenizer = RobertaTokenizer.from_pretrained('out')
>>> tokenizer.tokenize('[ENT] Beyonce [ENT] lives in Los Angeles')
['[ENT]', 'Bey', 'once', '[ENT]', 'l', 'ives', 'Ġin', 'ĠLos', 'ĠAngeles']
>>> tokenizer.encode('[ENT] Beyonce [ENT] lives in Los Angeles')
[0, 50265, 40401, 25252, 50265, 462, 3699, 11, 1287, 1422, 2]
```

If I understand correctly, it is possible to upload these serialized tokenizer files to the cloud storage of Transformers.
Thank you for the quick response! Could you take a look at this notebook? https://colab.research.google.com/drive/1-vetwRrfox77w4kTbi8dQSlad-c5EFns?usp=sharing In the notebook, I would like to perform a forward pass on the original implementation (so prior to the conversion to the HuggingFace implementation). I just want to obtain … I'm getting an error when performing the forward pass, not sure why:
Once that works, I can provide the same sentence to my implementation in …
Thank you for your response! I am not sure, but maybe we need to pass input tensors (e.g., …) with a batch dimension.
Oh yes, of course, I should have added the batch dimension, thank you. Hmm, now I'm getting this:
That shouldn't happen, right? The word embedding matrix does indeed have 50265 rows, as seen when printing the state_dict:
Btw, do you have Slack? It could be handier for communication 😅
I am not sure why this happens; however, when adding special tokens to the tokenizer, I think we also need to change the model by (1) increasing the vocab size (convert_luke_original_pytorch_checkpoint_to_pytorch.py#L43) and (2) extending the word embeddings in the state dict (convert_luke_original_pytorch_checkpoint_to_pytorch.py#L48). Maybe that fixes the issue.
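For illustration, the second step could look roughly like this (a sketch with an assumed state-dict key name, not the actual lines of the conversion script):

```python
import torch
from transformers import RobertaTokenizer

def extend_word_embeddings(state_dict: dict, tokenizer: RobertaTokenizer) -> dict:
    """Append two rows for the added special tokens ([ENT], [ENT2]) to the word
    embedding matrix, initializing them from the "@" and "#" token embeddings.
    The state-dict key used below is an assumption for illustration."""
    emb = state_dict["embeddings.word_embeddings.weight"]  # (vocab_size, hidden_size)
    ent_init = emb[tokenizer.convert_tokens_to_ids("@")].unsqueeze(0)
    ent2_init = emb[tokenizer.convert_tokens_to_ids("#")].unsqueeze(0)
    state_dict["embeddings.word_embeddings.weight"] = torch.cat([emb, ent_init, ent2_init], dim=0)
    return state_dict
```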
Normally you will receive an email for a Slack invite!
=> Finally, it works!! I'm now getting the …
@NielsRogge @ikuyamada Any plan to add LUKE model pretraining to transformers? Or fine-tuning for NER?
Hi @msiahbani, @Ryou0634 is currently developing our new example code based on transformers and AllenNLP. We will let you know when it is available!
Hi @ikuyamada, |
@ikuyamada any updates? :"> |
@msiahbani |
@NielsRogge @ikuyamada |
Hi,
Is it possible to reproduce the results for NER on CPU instead of the default GPU configuration? I am unable to find any resource for this in the repo.
I am using the following command, but there seems to be no flag/argument available to switch between CPU and GPU?
Thanks in advance!