- Code for Unified Structure Generation for Universal Information Extraction
- Please contact Yaojie Lu (@luyaojie) for questions and suggestions.
- [2022-06-12] Update pre-training code.
- [2022-05-10] Update data preprocessing code.
General
- Python (verified on 3.8)
- CUDA (verified on 11.1/10.2)
Python Packages
CUDA 10.2
conda create -n uie python=3.8
conda install -y pytorch==1.8.0 torchvision==0.9.0 torchaudio==0.8.0 cudatoolkit=10.2 -c pytorch
pip install -r requirements.txt
CUDA 11.1
conda create -n uie python=3.8
pip install torch==1.8.0+cu111 torchvision==0.9.0+cu111 torchaudio==0.8.0 -f https://download.pytorch.org/whl/torch_stable.html
pip install -r requirements.txt
For preprocessing details, see Data preprocessing.
After preprocessing, link the converted dataset as follows:
ln -s dataset_processing/converted_data/ data
The data folder contains seven files:
data/text2spotasoc/absa/14lap
├── entity.schema # Entity Types for converting SEL to Record
├── relation.schema # Relation Types for converting SEL to Record
├── event.schema # Event Types for converting SEL to Record
├── record.schema # Spot/Asoc Type for constructing SSI
├── test.json
├── train.json
└── val.json
train.json, val.json, and test.json are data files; each line is one JSON instance.
Each JSON instance contains text and record fields, where text is the plain input text and record is the SEL representation of the extraction structure.
For detailed definitions, see DATASETS.md.
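As a quick sanity check of a data file, the line-per-instance layout described above can be read with a few lines of Python. This is a minimal sketch, not part of the repository: the helper name read_instances is hypothetical, and it only checks for the two fields this README names.

```python
import json
from pathlib import Path


def read_instances(path):
    """Yield one dict per non-empty line of a JSON-lines data file."""
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            instance = json.loads(line)
            # Only the two fields named in this README are checked;
            # any other fields are passed through untouched.
            missing = {"text", "record"} - set(instance)
            if missing:
                raise ValueError(f"{path}:{line_number} is missing {sorted(missing)}")
            yield instance
```

For example, assuming the symlink above is in place, `next(read_instances("data/text2spotasoc/absa/14lap/train.json"))` returns the first training instance as a dict.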
Note:
- The extra tokens of T5, such as <extra_id_0>, <extra_id_1>, and <extra_id_5>, are used as structure indicators.
Token | Role |
---|---|
<extra_id_0> | Start of Label Name |
<extra_id_1> | End of Label Name |
<extra_id_2> | Start of Input Text |
<extra_id_5> | Start of Text Span |
<extra_id_6> | NULL span for Rejection |
- record.schema is the record schema file for building SSI. It contains three lines: the first line is the spot name list, the second line is the asoc name list, and the third line is the spot-to-asoc dictionary (not used in the code and can be ignored).
["aspect", "opinion"]
["neutral", "positive", "negative"]
{"aspect": ["neutral", "positive", "negative"], "opinion": []}
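The three-line layout of record.schema can be parsed as three independent JSON values. The following is a minimal sketch under that assumption; RecordSchema and parse_record_schema are hypothetical names for illustration and are not the loader used by the repository.

```python
import json
from typing import Dict, List, NamedTuple


class RecordSchema(NamedTuple):
    spots: List[str]                    # line 1: spot name list
    asocs: List[str]                    # line 2: asoc name list
    spot_to_asoc: Dict[str, List[str]]  # line 3: spot-to-asoc dictionary


def parse_record_schema(content: str) -> RecordSchema:
    """Parse the text of a record.schema file (one JSON value per line)."""
    lines = [line for line in content.splitlines() if line.strip()]
    if len(lines) != 3:
        raise ValueError(f"expected 3 lines, found {len(lines)}")
    spots, asocs, spot_to_asoc = (json.loads(line) for line in lines)
    return RecordSchema(spots, asocs, spot_to_asoc)


# The example schema shown in this README.
example = """["aspect", "opinion"]
["neutral", "positive", "negative"]
{"aspect": ["neutral", "positive", "negative"], "opinion": []}"""

schema = parse_record_schema(example)
print(schema.spots)
print(schema.asocs)
```

To load a real file, pass the result of reading it as text, for example `parse_record_schema(open("data/text2spotasoc/absa/14lap/record.schema", encoding="utf-8").read())`.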
The pre-trained models are available from the CAS Cloud Box, Google Drive, and Huggingface links below, or can be downloaded with the gdown command (pip install gdown).
uie-en-base [CAS Cloud Box] [Google Drive] [Huggingface]
uie-en-large [CAS Cloud Box] [Google Drive] [Huggingface]
uie-char-small (chinese) [CAS Cloud Box]
# Example of Google Drive
gdown 12Dkh6KLDPvXrkQ1I-1xLqODQSYjkwnvs && unzip uie-base-en.zip
gdown 15OFkWw8kJA1k2g_zehZ0pxcjTABY2iF1 && unzip uie-large-en.zip
Put all models in hf_models/, where the default running scripts expect them.
First, create the output directory.
The training scripts are as follows:
- run_uie_finetune.py: Python code entry.
- run_uie_finetune.bash: Model training and evaluation process script.
- scripts_exp/run_exp.bash: Model environment configuration and parameter setting entry.
The training command is as follows (see the bash scripts and Python files for the corresponding command-line arguments):
. config/data_conf/base_model_conf_absa.ini && model_name=uie-base-en dataset_name=absa/14lap bash scripts_exp/run_exp.bash
- config/data_conf/base_model_conf_absa.ini refers to using the training settings in base_model_conf_absa.ini.
- model_name=uie-base-en refers to using uie-base-en.
- dataset_name=absa/14lap refers to the dataset path.
Trained models are saved in the output_dir specified by run_uie_finetune.bash.
Simple Training Command
bash run_uie_finetune.bash -v -d 0 \
-b 16 \
-k 3 \
--lr 1e-4 \
--warmup_ratio 0.06 \
-i absa/14lap \
--epoch 50 \
--spot_noise 0.1 \
--asoc_noise 0.1 \
-f spotasoc \
--map_config config/offset_map/closest_offset_en.yaml \
-m hf_models/uie-base-en \
--random_prompt
Progress logs
...
***** Running training *****
Num examples = 906
Num Epochs = 50
Instantaneous batch size per device = 16
Total train batch size (w. parallel, distributed & accumulation) = 16
Gradient Accumulation steps = 1
Total optimization steps = 2850
Num examples = 219
Batch size = 64
...
Final results (exact scores may differ across machines and environments)
...
test offset-rel-strict-P 67.01461377870564
test offset-rel-strict-R 59.11602209944752
test offset-rel-strict-F1 62.81800391389433
...
Metric | Definition |
---|---|
ent-(P/R/F1) | Micro-F1 of Entity (Entity Type, Entity Span) |
rel-strict-(P/R/F1) | Micro-F1 of Relation Strict (Relation Type, Arg1 Span, Arg1 Type, Arg2 Span, Arg2 Type) |
rel-boundary-(P/R/F1) | Micro-F1 of Relation Boundary (Relation Type, Arg1 Span, Arg2 Span) |
evt-trigger-(P/R/F1) | Micro-F1 of Event Trigger (Event Type, Trigger Span) |
evt-role-(P/R/F1) | Micro-F1 of Event Argument (Event Type, Arg Role, Arg Span) |
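All metrics in the table are micro-averaged over tuples, so they share one computation: count matching tuples across all instances, then derive precision, recall, and F1. The sketch below illustrates that computation only. The function micro_prf is a hypothetical name, not the repository's scorer, and it simplifies in two ways that are assumptions: each instance is a set of tuples (so duplicates collapse), and spans are compared as strings rather than offsets.

```python
from typing import Iterable, Set, Tuple


def micro_prf(gold: Iterable[Set[tuple]], pred: Iterable[Set[tuple]]) -> Tuple[float, float, float]:
    """Return micro precision, recall, and F1 (in percent) over per-instance tuple sets."""
    true_positive = gold_num = pred_num = 0
    for gold_items, pred_items in zip(gold, pred):
        true_positive += len(gold_items & pred_items)  # exact tuple matches
        gold_num += len(gold_items)
        pred_num += len(pred_items)
    precision = 100.0 * true_positive / pred_num if pred_num else 0.0
    recall = 100.0 * true_positive / gold_num if gold_num else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Entity tuples are (Entity Type, Entity Span), as in the ent-(P/R/F1) row.
gold = [{("aspect", "battery life"), ("opinion", "great")}]
pred = [{("aspect", "battery life"), ("opinion", "good")}]
print(micro_prf(gold, pred))  # → (50.0, 50.0, 50.0)
```

The same function applies to the other rows by changing the tuple shape, for example (Relation Type, Arg1 Span, Arg2 Span) for rel-boundary.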
[TODO] Add detailed description.
We construct different sequence-to-sequence tasks using different data collators.
- For pre-training, HybirdDataCollator constructs different seq2seq pairs for different tasks, and DataCollatorForMetaSeq2Seq constructs the SSI with the Sampling Strategy.
- For fine-tuning, DataCollatorForMetaSeq2Seq constructs the dynamic seq2seq pair with the Rejection Mechanism.
We unify different types of (text, structure) pairs for pre-training with HybirdDataCollator. It contains multiple data collators for different instances:
- DataCollatorForMetaSeq2Seq for the pair task, similar to the fine-tuning stage
- DataCollatorForSeq2Seq for the record task
- DataCollatorForT5MLM for the text task
The Sampling Strategy and the Rejection Mechanism can be adopted in the training process.
- uie/seq2seq/data_collator/meta_data_collator.py: class DataCollatorForMetaSeq2Seq collates data, and class DynamicSSIGenerator samples prompts.
- run_uie_finetune.py: class DataTrainingArguments contains the related parameters.
The related parameters in class DataTrainingArguments are briefly introduced here:
- About Sampling Strategy
  - max_prefix_length: Maximum length of the SSI
  - ordered_prompt: Whether to sort the spot/asoc names of the SSI
  - record_schema: Record schema read from record.schema
- About Rejection Mechanism
  - spot_noise: The noise rate of null spots
  - asoc_noise: The noise rate of null asocs
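To give a rough idea of what ordered_prompt controls, the sketch below assembles the list of names for one prompt from the names present in an instance plus a few sampled absent ones, then either sorts or shuffles them. It is an illustration only and not the repository's DynamicSSIGenerator: the function build_prompt_names, the num_negative parameter, and the exact sampling behavior are all assumptions made for this example.

```python
import random
from typing import List, Optional, Sequence


def build_prompt_names(
    positive: Sequence[str],
    all_names: Sequence[str],
    num_negative: int,
    ordered_prompt: bool,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Combine names present in an instance with sampled absent names."""
    rng = rng or random.Random()
    positive_set = set(positive)
    # Candidates for negative sampling: schema names absent from this instance.
    negatives = [name for name in all_names if name not in positive_set]
    sampled = rng.sample(negatives, min(num_negative, len(negatives)))
    names = sorted(positive_set) + sampled
    if ordered_prompt:
        names.sort()        # fixed, alphabetical order
    else:
        rng.shuffle(names)  # random order
    return names


# Asoc names taken from the record.schema example in this README.
asocs = ["neutral", "positive", "negative"]
print(build_prompt_names(["positive"], asocs, num_negative=1,
                         ordered_prompt=True, rng=random.Random(0)))
```

Passing a seeded random.Random keeps the sketch reproducible; the real training code may draw its randomness differently.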
Evaluating UIE requires converting the generated SEL expressions into records and then scoring them.
After training, pred_folder will contain 'eval_preds_seq2seq.txt' or 'test_preds_seq2seq.txt'.
$ python scripts/sel2record.py -h
usage: sel2record.py [-h] [-g GOLD_FOLDER] [-p PRED_FOLDER [PRED_FOLDER ...]] [-c MAP_CONFIG] [-d DECODING] [-v]
optional arguments:
-h, --help show this help message and exit
-g GOLD_FOLDER folder of golden answer
-p PRED_FOLDER [PRED_FOLDER ...]
multiple different prediction folders
-c MAP_CONFIG, --config MAP_CONFIG
offset matching strategy configuration file, more configuration files are placed in config/offset_map
-d DECODING specify structure parser, default is SpotAsoc structure
-v, --verbose print more detailed log information
After conversion, pred_folder will contain 'eval_preds_record.txt' or 'test_preds_record.txt'.
$ python scripts/eval_extraction.py -h
usage: eval_extraction.py [-h] [-g GOLD_FOLDER] [-p PRED_FOLDER [PRED_FOLDER ...]] [-v] [-w] [-m] [-case]
optional arguments:
-h, --help show this help message and exit
-g GOLD_FOLDER Golden Dataset folder
-p PRED_FOLDER [PRED_FOLDER ...]
Predicted model folder
-v Show more information during running
-w Write evaluation results to predicted folder
-m Refers to the matching policy
-case Show case study
To verify the effect of the structure parser, we take the golden answer SEL as the prediction result and evaluate its performance.
bash scripts/check_offset_map_gold_as_pred.bash <data-folder> <map-config>
If this repository helps you, please cite this paper:
Yaojie Lu, Qing Liu, Dai Dai, Xinyan Xiao, Hongyu Lin, Xianpei Han, Le Sun, Hua Wu. Unified Structure Generation for Universal Information Extraction. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5755–5772, Dublin, Ireland. Association for Computational Linguistics.
@inproceedings{lu-etal-2022-unified,
title = "Unified Structure Generation for Universal Information Extraction",
author = "Lu, Yaojie and
Liu, Qing and
Dai, Dai and
Xiao, Xinyan and
Lin, Hongyu and
Han, Xianpei and
Sun, Le and
Wu, Hua",
booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = may,
year = "2022",
address = "Dublin, Ireland",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.acl-long.395",
pages = "5755--5772",
}
The code is released under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License for Noncommercial use only. Any commercial use should get formal permission first.