Datasets, scripts, and the fine-tuned model for the paper *LLMAEL: Large Language Models are Good Context Augmenters for Entity Linking*.

- 📖 Paper: LLMAEL: Large Language Models are Good Context Augmenters for Entity Linking
- 🤗 Fine-tuned ReFinED model on HuggingFace: LLMAEL-REFINED-FT
We introduce LLM-Augmented Entity Linking (LLMAEL), a plug-and-play approach that enhances entity linking through LLM data augmentation. We leverage LLMs as knowledgeable context augmenters, generating mention-centered descriptions as additional input, while preserving traditional EL models for EL execution. Experiments on 6 standard datasets show that vanilla LLMAEL outperforms baseline EL models in most cases, while fine-tuned LLMAEL sets new state-of-the-art results across all 6 benchmarks.
This repository contains LLMAEL's testing and training datasets, LLM prompts, data fusion scripts, and the fine-tuned ReFinED model checkpoint.
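The core idea can be sketched in a few lines of Python. This is a minimal illustration, not the repository's actual implementation: the prompt wording, the `llm` callable, and the `el_model.link_entities` call are all hypothetical stand-ins.

```python
from typing import Callable, List

def llmael_link(mention: str, context: str,
                llm: Callable[[str], str], el_model) -> List:
    """Sketch of LLM-augmented entity linking.

    `llm` is any callable mapping a prompt to generated text; `el_model`
    is an off-the-shelf EL model -- both are hypothetical stand-ins here.
    """
    # 1. Ask the backbone LLM for a short, mention-centered description.
    #    (Illustrative prompt; the actual prompts ship with this repository.)
    prompt = (f"Context: {context}\n"
              f"Briefly describe the entity that the mention '{mention}' refers to.")
    description = llm(prompt)

    # 2. Fuse the description with the original context. Plain concatenation
    #    is just one of the paper's several context-joining strategies.
    augmented_context = f"{context} {description}"

    # 3. Run the unchanged, traditional EL model on the augmented input.
    return el_model.link_entities(augmented_context, mention)
```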
The `datasets` directory is organized as follows:

- The `original_el_benchmarks` sub-directory stores the original EL benchmarks of the 3 EL models BLINK, GENRE, and ReFinED, downloaded from their respective GitHub repositories.
- The `llm_raw_augmentations` sub-directory contains the raw context generated by backbone LLMs, specifically short entity descriptions based on the original mentions and contexts. These entries precisely match the content and order of their corresponding benchmarks in `original_el_benchmarks` (a quick alignment check is sketched after this list).
- The `llm_augmented_el_benchmarks` sub-directory stores the final LLM-augmented EL benchmarks, produced by fusing data from `llm_raw_augmentations` into `original_el_benchmarks`. (The data in this sub-directory must be generated by running `augment_all_datasets.sh`; see the section "To Generate Testing and Training Data".)
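Because the augmentation entries are required to line up one-to-one and in order with their benchmark counterparts, a sanity check is easy to write. A minimal sketch, assuming both files are JSON lists of records; the field name `mention` is a hypothetical placeholder for whatever the actual schema uses:

```python
import json

def check_alignment(benchmark_path: str, augmentation_path: str) -> None:
    """Verify that LLM augmentations line up with the original benchmark."""
    with open(benchmark_path) as f:
        benchmark = json.load(f)
    with open(augmentation_path) as f:
        augmentations = json.load(f)

    # Same number of entries, in the same order.
    assert len(benchmark) == len(augmentations), "entry counts differ"

    # Field name below is hypothetical; adapt it to the actual schema.
    for example, augmentation in zip(benchmark, augmentations):
        assert example["mention"] == augmentation["mention"], "mention mismatch"
```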
To get started, clone this repository:

```bash
git clone https://github.com/THU-KEG/LLMAEL.git
cd LLMAEL
```
To generate the testing and training datasets used in our paper's main experiment and ablation studies, run the following commands:
```bash
cd scripts
bash augment_all_datasets.sh
```
To generate datasets with other options, select and run one of the following Python scripts:

```bash
python augment_blink_datasets_with_llm.py
python augment_genre_datasets_with_llm.py
python augment_refined_datasets_with_llm.py
```

with your custom parameters `--llm_name`, `--join_strategy`, `--original_benchmarks_path`, `--llm_contexts_path`, and `--output_path`.
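For example, an invocation might look like the following. The argument values here are illustrative and unverified; check each script's `--help` output for the exact accepted names and paths.

```bash
# Illustrative values only: augment the ReFinED benchmarks with Llama-3-70B
# contexts under context-joining strategy 4.
python augment_refined_datasets_with_llm.py \
    --llm_name llama3-70b \
    --join_strategy 4 \
    --original_benchmarks_path ../datasets/original_el_benchmarks \
    --llm_contexts_path ../datasets/llm_raw_augmentations \
    --output_path ../datasets/llm_augmented_el_benchmarks
```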
To obtain the fine-tuned ReFinED model checkpoint, please download it from our 🤗 HuggingFace hub: LLMAEL-REFINED-FT
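One way to fetch the checkpoint programmatically is via the `huggingface_hub` library; the `repo_id` below is a placeholder, so substitute the ID shown on the model page:

```python
from huggingface_hub import snapshot_download

# Placeholder repo ID -- replace with the actual LLMAEL-REFINED-FT model page ID.
local_dir = snapshot_download(repo_id="<org>/LLMAEL-REFINED-FT")
print(f"Checkpoint downloaded to {local_dir}")
```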
- Clone the official GitHub repositories of our 3 selected EL models: BLINK, GENRE, and ReFinED
- Find the official test script of each model
- Change the test datasets to our augmented datasets from `llm_augmented_el_benchmarks`, and run the official test script

Our main experiments use the 6 augmented test sets synthesized under context-joining strategy 4. For vanilla LLMAEL, we used BLINK's full cross-encoder model, GENRE's AIDA model without the candidate set, and ReFinED's AIDA model. For fine-tuned LLMAEL, we fine-tuned ReFinED's Wikipedia model on the Llama-3-70B-augmented AIDA training set.
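As a concrete example, ReFinED exposes a simple inference API. A sketch of running it on an LLM-augmented input might look like the following; loading our fine-tuned checkpoint instead of a stock model may require ReFinED-specific arguments not shown here, so consult the ReFinED repository for custom-checkpoint options.

```python
from refined.inference.processor import Refined

# Load a stock pretrained ReFinED model (the fine-tuned LLMAEL checkpoint
# would be loaded analogously).
refined = Refined.from_pretrained(model_name="aida_model", entity_set="wikipedia")

# An augmented input: original context followed by an LLM-generated
# mention-centered description (example text is illustrative).
text = ("Jobs founded the company in 1976. "
        "Steve Jobs was an American entrepreneur who co-founded Apple Inc.")
spans = refined.process_text(text)
for span in spans:
    print(span.text, span.predicted_entity)
```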
Our results (accuracy) are as follows:
Method | AIDA | MSNBC | AQUA | ACE04 | CWEB | WIKI | AVG |
---|---|---|---|---|---|---|---|
LLMAEL x BLINK | 81.94 | 86.56 | 85.16 | 86.01 | 69.17 | 81.14 | 81.61 |
LLMAEL x GENRE | 88.27 | 85.67 | 85.14 | 85.21 | 70.67 | 82.95 | 82.99 |
LLMAEL x ReFinED | 92.38 | 86.94 | 88.09 | 88.14 | 73.16 | 85.90 | 85.76 |
LLMAEL x ReFinED (FT) | 92.34 | 88.79 | 89.06 | 88.14 | 75.07 | 86.62 | 86.67 |
If you find our work helpful, please cite our paper:
@article{xin2024llmael,
  title={LLMAEL: Large Language Models are Good Context Augmenters for Entity Linking},
  author={Xin, Amy and Qi, Yunjia and Yao, Zijun and Zhu, Fangwei and Zeng, Kaisheng and Xu, Bin and Hou, Lei and Li, Juanzi},
  journal={arXiv preprint arXiv:2407.04020},
  year={2024}
}