This repository contains resources developed within the following paper:
Gizem Aydin, Seyed Amin Tabatabaei, Georgios Tsatsaronis, and Faegheh Hasibi. “Find the Funding: Entity Linking with Incomplete Funding Knowledge Bases”,
In Proceedings of the 29th International Conference on Computational Linguistics (COLING '22), Gyeongju, Republic of Korea, Oct 2022.
You can check the paper for detailed information.
The EDFund and ELFund datasets can be downloaded from here.
The datasets were created by Elsevier, thanks to the efforts of Zubair Afzal, Johan Boots, Heber Mc Mahon, Nishant Mintri, Seyedamin Tabatabaei, and George Tsatsaronis.
Run the notebook `Domain Adaptation of BERT/BERT_TAPT.ipynb`.
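
For reference, the following is a minimal sketch of what task-adaptive pretraining (TAPT) of BERT typically looks like with the Hugging Face `transformers` library. The corpus path, hyperparameters, and output directory are illustrative assumptions, not the exact settings used in the notebook.

```python
# Minimal TAPT sketch: continue masked-language-model pretraining of BERT on a
# domain corpus. All paths and hyperparameters below are illustrative only.
from datasets import load_dataset
from transformers import (BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# Hypothetical plain-text corpus of funding statements, one sentence per line.
corpus = load_dataset("text", data_files={"train": "funding_corpus.txt"})
tokenized = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"])

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
args = TrainingArguments(output_dir="bert_tapt", num_train_epochs=3,
                         per_device_train_batch_size=16)
Trainer(model=model, args=args, train_dataset=tokenized["train"],
        data_collator=collator).train()

model.save_pretrained("bert_tapt")       # domain-adapted encoder for later steps
tokenizer.save_pretrained("bert_tapt")
```
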
- Training $BERT_{TAPT}^{MD}$:
  - Run `Named Entity Recognition/Train_BERT_TAPT_MD.ipynb`.
- Obtain predictions:
  - Run `Named Entity Recognition/NER_Predictions.ipynb`.
- Evaluate the results:
  - Run `Named Entity Recognition/Evaluate_NER.ipynb`.
In the first cell of each notebook, you will find instructions on how to provide the inputs. Once the inputs are provided, the code can be run from top to bottom. The code also supports grant mentions if they are available.
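
As a rough illustration of how $BERT_{TAPT}^{MD}$ can be framed, here is a hedged sketch of mention detection as BERT token classification. The BIO label scheme, checkpoint directory (`bert_tapt` from the sketch above), and example sentence are assumptions for illustration, not the notebooks' exact interface; in practice the classification head is first fine-tuned by the training notebook.

```python
# Sketch of mention detection as token classification with a TAPT BERT encoder.
# The BIO label set and checkpoint path are assumed for illustration.
import torch
from transformers import BertForTokenClassification, BertTokenizerFast

labels = ["O", "B-FUNDER", "I-FUNDER", "B-GRANT", "I-GRANT"]   # assumed scheme
tokenizer = BertTokenizerFast.from_pretrained("bert_tapt")
model = BertForTokenClassification.from_pretrained("bert_tapt", num_labels=len(labels))
model.eval()

text = "This work was supported by the National Science Foundation under grant ABC-1234."
encoding = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    predictions = model(**encoding).logits.argmax(dim=-1)[0]

# Pair each word-piece token with its predicted BIO tag.
tokens = tokenizer.convert_ids_to_tokens(encoding["input_ids"][0].tolist())
print([(tok, labels[p]) for tok, p in zip(tokens, predictions.tolist())])
```
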
- Training FunD:
  - Training of the Biencoder:
    1. In the first round, train with random negatives:
       - Run `Entity Disambiguation/BiEncoder RandomNegative Training.ipynb`.
    2. The rest of the training uses both hard and random negatives. For this purpose, the predictions on the Training and Val sets should be obtained:
       - Compute entity embeddings: Run `Entity Disambiguation/Compute Entity Embeddings.ipynb`.
       - Run the notebook `Entity Disambiguation/Hard Negative Mining.ipynb` twice: once for the Training set, and a second time for the Val set (a sketch of this mining step is given after this list).
       - Run the notebook `Entity Disambiguation/Number Random Negatives.ipynb` to see how many random negatives per mention will be sampled for the next round. This notebook also shows some statistics on the number of hard negatives found.
    3. Train with hard negatives:
       - Run `Entity Disambiguation/BiEncoder HardNegative Training.ipynb`.
    4. Repeat steps (2-3) for the desired number of hard-negative training rounds. In the original research, these steps are repeated 3 times.
  - Training of GBM$_{F5}$:
    - Run the notebook `Entity Disambiguation/Prediction with Biencoder.ipynb` twice: once for the Training set, and a second time for the Val set. This notebook retrieves the candidate entities for these datasets, which are later used for training.
    - Run the notebook `Entity Disambiguation/Train GBM F5.ipynb`.
- Obtain predictions:
  - Run `Entity Disambiguation/Neural Entity Disambiguation Predictions.ipynb`.
- Evaluate the results:
  - Run `Entity Disambiguation/Evalute ED Model.ipynb`.
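
The hard-negative mining step referenced above can be sketched as follows, under the assumption that the bi-encoder has separate mention and entity encoders whose vectors are scored by dot product; the function and variable names here are hypothetical and do not match the notebooks' exact code.

```python
# Hard-negative mining sketch: for each training mention, retrieve the top-scoring
# entities under the current bi-encoder and keep the wrong ones as hard negatives.
# entity_emb stands in for the matrix from the "Compute Entity Embeddings" step;
# encode_mention is a hypothetical stand-in for the mention-side encoder.
import numpy as np

def mine_hard_negatives(mentions, gold_ids, entity_ids, entity_emb,
                        encode_mention, top_k=10):
    hard_negatives = []
    for mention, gold in zip(mentions, gold_ids):
        scores = entity_emb @ encode_mention(mention)   # dot-product similarity
        ranked = np.argsort(-scores)                    # highest score first
        negatives = [entity_ids[i] for i in ranked if entity_ids[i] != gold][:top_k]
        hard_negatives.append(negatives)
    return hard_negatives
```

In the next training round, the mined negatives are combined with the number of random negatives per mention reported by `Entity Disambiguation/Number Random Negatives.ipynb`.
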
- Obtain end-to-end predictions:
  - Run `Entity Linking/Neural Entity Linking Predictions.ipynb`.
- Evaluate the results:
  - Code for evaluation:
    - Import the `Evaluate_End2End` function from one of the following files, depending on the evaluation mode:
      - `EvaluationPoolStrict.py`: Strict matching, `Normal` setting.
      - `EvaluationPoolStrictEE.py`: Strict matching, `EE` setting.
      - `EvaluationPoolStrictInKB.py`: Strict matching, `InKB` setting.
  - Usage (an example call is given after this list):
    - Inputs:
      - `all_gold_ann`: List of lists. The length of the main list is equal to the number of documents. For each document, a list stores the correct annotations; each annotation is itself a list of 3 elements: the start index of the mention, the length of the mention, and the correct entity ID. Example annotation list for a document: `[ [5,10,"Entity_A"], [25,3,None] ]`. In this example there are two mentions. One starts at character index 5 and has a length of 10; the correct link for this mention is `"Entity_A"`. The other mention starts at index 25 and has a length of 3; this is a NIL mention.
      - `all_preds`: Similar to `all_gold_ann`. The only difference is that it contains the predicted annotations instead of the gold ones.
      - `entity_pool`: A dictionary where keys are entity IDs and the values are Python sets containing only those entity IDs.
    - Output: Prints Micro and Macro averaged Precision, Recall, and F1 scores. Returns the Micro averaged ones.
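
An example call following the input format described above; the documents and annotations are made up, and the positional argument order (gold, predictions, pool) and the returned value are assumptions based on the description of the inputs and output.

```python
# Example call of the end-to-end evaluation; all values are illustrative.
from EvaluationPoolStrict import Evaluate_End2End   # swap the module for the EE / InKB settings

# One inner list per document; each annotation is [start_index, length, entity_id or None].
all_gold_ann = [
    [[5, 10, "Entity_A"], [25, 3, None]],   # doc 1: one linked mention, one NIL mention
    [[0, 7, "Entity_B"]],                   # doc 2
]
all_preds = [
    [[5, 10, "Entity_A"]],                  # doc 1 predictions
    [[0, 7, "Entity_C"]],                   # doc 2 predictions
]
# Keys are entity IDs; values are sets containing only that ID.
entity_pool = {eid: {eid} for eid in ["Entity_A", "Entity_B", "Entity_C"]}

micro_scores = Evaluate_End2End(all_gold_ann, all_preds, entity_pool)
print(micro_scores)   # assumed to hold the Micro averaged Precision, Recall, and F1
```
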