This repository contains resources developed within the following paper:
Gizem Aydin, Seyed Amin Tabatabaei, Georgios Tsatsaronis, and Faegheh Hasibi. “Find the Funding: Entity Linking with Incomplete Funding Knowledge Bases”,
In Proceedings of the 29th International Conference on Computational Linguistics (COLING '22), Gyeongju, Republic of Korea, Oct 2022.
You can check the paper for detailed information.
The EDFund and ELFund datasets can be downloaded from here.
The datasets were created by Elsevier, thanks to the efforts of Zubair Afzal, Johan Boots, Heber Mc Mahon, Nishant Mintri, Seyedamin Tabatabaei, and George Tsatsaronis.
Run the notebook `Domain Adaptation of BERT/BERT_TAPT.ipynb`.
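
For reference, the following is a minimal sketch of what task-adaptive pretraining (TAPT) of BERT typically looks like with the Hugging Face `transformers` library. The corpus path, hyperparameters, and output directory are illustrative assumptions, not the exact settings used in the notebook.

```python
# Minimal TAPT sketch: continue masked-language-model pretraining of BERT on a
# domain corpus. All paths and hyperparameters below are illustrative only.
from datasets import load_dataset
from transformers import (BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# Hypothetical plain-text corpus of funding statements, one sentence per line.
corpus = load_dataset("text", data_files={"train": "funding_corpus.txt"})
tokenized = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"])

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
args = TrainingArguments(output_dir="bert_tapt", num_train_epochs=3,
                         per_device_train_batch_size=16)
Trainer(model=model, args=args, train_dataset=tokenized["train"],
        data_collator=collator).train()

model.save_pretrained("bert_tapt")       # domain-adapted encoder for later steps
tokenizer.save_pretrained("bert_tapt")
```
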
- Training $BERT_{TAPT}^{MD}$:
  - Run `Named Entity Recognition/Train_BERT_TAPT_MD.ipynb`.
- Obtain predictions:
  - Run `Named Entity Recognition/NER_Predictions.ipynb`.
- Evaluate the results:
  - Run `Named Entity Recognition/Evaluate_NER.ipynb`.
In the first cell of each notebook, you will find instructions on how to provide the inputs. Once the inputs are provided, the code can be run from top to bottom. The code also supports grant mentions if they are available.
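
As a rough illustration of how $BERT_{TAPT}^{MD}$ can be framed, here is a hedged sketch of mention detection as BERT token classification. The BIO label scheme, checkpoint directory (`bert_tapt` from the sketch above), and example sentence are assumptions for illustration, not the notebooks' exact interface; in practice the classification head is first fine-tuned by the training notebook.

```python
# Sketch of mention detection as token classification with a TAPT BERT encoder.
# The BIO label set and checkpoint path are assumed for illustration.
import torch
from transformers import BertForTokenClassification, BertTokenizerFast

labels = ["O", "B-FUNDER", "I-FUNDER", "B-GRANT", "I-GRANT"]   # assumed scheme
tokenizer = BertTokenizerFast.from_pretrained("bert_tapt")
model = BertForTokenClassification.from_pretrained("bert_tapt", num_labels=len(labels))
model.eval()

text = "This work was supported by the National Science Foundation under grant ABC-1234."
encoding = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    predictions = model(**encoding).logits.argmax(dim=-1)[0]

# Pair each word-piece token with its predicted BIO tag.
tokens = tokenizer.convert_ids_to_tokens(encoding["input_ids"][0].tolist())
print([(tok, labels[p]) for tok, p in zip(tokens, predictions.tolist())])
```
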
- Training FunD:
  - Training of the Biencoder:
    1. In the first round, train with random negatives:
       - Run `Entity Disambiguation/BiEncoder RandomNegative Training.ipynb`.
    2. The rest of the training uses both hard and random negatives. For this purpose, the predictions on the Training and Val sets should be obtained:
       - Compute entity embeddings: Run `Entity Disambiguation/Compute Entity Embeddings.ipynb`.
       - Run the notebook `Entity Disambiguation/Hard Negative Mining.ipynb` twice: once for the Training set, and a second time for the Val set (a sketch of this mining step is given after this list).
       - Run the notebook `Entity Disambiguation/Number Random Negatives.ipynb` to see how many random negatives per mention will be sampled for the next round. This notebook also shows some statistics on the number of hard negatives found.
    3. Train with hard negatives:
       - Run `Entity Disambiguation/BiEncoder HardNegative Training.ipynb`.
    4. Repeat steps (2-3) for the desired number of hard-negative training rounds. In the original research, these steps are repeated 3 times.
  - Training of GBM$_{F5}$:
    - Run the notebook `Entity Disambiguation/Prediction with Biencoder.ipynb` twice: once for the Training set, and a second time for the Val set. This notebook retrieves the candidate entities for these datasets, which are later used for training.
    - Run the notebook `Entity Disambiguation/Train GBM F5.ipynb`.
- Obtain predictions:
  - Run `Entity Disambiguation/Neural Entity Disambiguation Predictions.ipynb`.
- Evaluate the results:
  - Run `Entity Disambiguation/Evalute ED Model.ipynb`.
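
The hard-negative mining step referenced above can be sketched as follows, under the assumption that the bi-encoder has separate mention and entity encoders whose vectors are scored by dot product; the function and variable names here are hypothetical and do not match the notebooks' exact code.

```python
# Hard-negative mining sketch: for each training mention, retrieve the top-scoring
# entities under the current bi-encoder and keep the wrong ones as hard negatives.
# entity_emb stands in for the matrix from the "Compute Entity Embeddings" step;
# encode_mention is a hypothetical stand-in for the mention-side encoder.
import numpy as np

def mine_hard_negatives(mentions, gold_ids, entity_ids, entity_emb,
                        encode_mention, top_k=10):
    hard_negatives = []
    for mention, gold in zip(mentions, gold_ids):
        scores = entity_emb @ encode_mention(mention)   # dot-product similarity
        ranked = np.argsort(-scores)                    # highest score first
        negatives = [entity_ids[i] for i in ranked if entity_ids[i] != gold][:top_k]
        hard_negatives.append(negatives)
    return hard_negatives
```

In the next training round, the mined negatives are combined with the number of random negatives per mention reported by `Entity Disambiguation/Number Random Negatives.ipynb`.
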
- Obtain end-to-end predictions:
  - Run `Entity Linking/Neural Entity Linking Predictions.ipynb`.
- Evaluate the results:
  - Code for evaluation:
    - Import the `Evaluate_End2End` function from one of the following files, depending on the evaluation mode:
      - `EvaluationPoolStrict.py`: Strict matching, `Normal` setting.
      - `EvaluationPoolStrictEE.py`: Strict matching, `EE` setting.
      - `EvaluationPoolStrictInKB.py`: Strict matching, `InKB` setting.
  - Usage (an example call is given after this list):
    - Inputs:
      - `all_gold_ann`: List of lists. The length of the main list is equal to the number of documents. For each document, a list stores the correct annotations; each annotation is itself a list of 3 elements: the start index of the mention, the length of the mention, and the correct entity ID. Example annotation list for a document: `[ [5,10,"Entity_A"], [25,3,None] ]`. In this example there are two mentions. One starts at character index 5 and has a length of 10; the correct link for this mention is `"Entity_A"`. The other mention starts at index 25 and has a length of 3; this is a NIL mention.
      - `all_preds`: Similar to `all_gold_ann`. The only difference is that it contains the predicted annotations instead of the gold ones.
      - `entity_pool`: A dictionary where keys are entity IDs and the values are Python sets containing only those entity IDs.
    - Output: Prints Micro and Macro averaged Precision, Recall, and F1 scores. Returns the Micro averaged ones.
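
An example call following the input format described above; the documents and annotations are made up, and the positional argument order (gold, predictions, pool) and the returned value are assumptions based on the description of the inputs and output.

```python
# Example call of the end-to-end evaluation; all values are illustrative.
from EvaluationPoolStrict import Evaluate_End2End   # swap the module for the EE / InKB settings

# One inner list per document; each annotation is [start_index, length, entity_id or None].
all_gold_ann = [
    [[5, 10, "Entity_A"], [25, 3, None]],   # doc 1: one linked mention, one NIL mention
    [[0, 7, "Entity_B"]],                   # doc 2
]
all_preds = [
    [[5, 10, "Entity_A"]],                  # doc 1 predictions
    [[0, 7, "Entity_C"]],                   # doc 2 predictions
]
# Keys are entity IDs; values are sets containing only that ID.
entity_pool = {eid: {eid} for eid in ["Entity_A", "Entity_B", "Entity_C"]}

micro_scores = Evaluate_End2End(all_gold_ann, all_preds, entity_pool)
print(micro_scores)   # assumed to hold the Micro averaged Precision, Recall, and F1
```
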