The code is tested under Python 3.10.

```bash
pip install -r irra_requirements.txt
```
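If you want an isolated environment first, a minimal sketch (the environment name `irra` is our choice, not the repo's):

```bash
conda create -n irra python=3.10 -y
conda activate irra
pip install -r irra_requirements.txt
```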
- IEPile: https://github.com/zjunlp/IEPile
- CrossNER: we provide the post-processed data (see the directory layout below)
- Database: https://drive.google.com/drive/folders/1xDAaTwruESNmleuIsln7IaNleNsYlHGn
- LLaMA-Factory: https://github.com/hiyouga/LLaMA-Factory/tree/main
- BCEmbedding: https://github.com/netease-youdao/BCEmbedding
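To fetch the database folder and the two BCE models from the command line, something like the following works (a sketch: `gdown` and `huggingface-cli` are assumed to be installed, and the Hugging Face model IDs are the ones published by the BCEmbedding repo):

```bash
# Database folder from Google Drive (requires: pip install gdown)
gdown --folder https://drive.google.com/drive/folders/1xDAaTwruESNmleuIsln7IaNleNsYlHGn -O ./data/database

# BCE embedding and reranker weights (requires: pip install -U "huggingface_hub[cli]")
huggingface-cli download maidalun1020/bce-embedding-base_v1 --local-dir ./models/bce-embedding
huggingface-cli download maidalun1020/bce-reranker-base_v1 --local-dir ./models/bce-reranker
```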
We provide a CLI script (`src/run.py`) to easily get the data for each step. The expected project layout:
```
data
├── database
│   ├── chroma-{chunksize}
│   ├── docs                          # unannotated data for different domains
│   │   ├── ai.txt
│   │   ├── ....txt
│   │   └── science.txt
│   └── prompts                       # prompts for different models and domains
│       └── {model_name}
│           └── docs                  # based on retrieved documents
│               ├── ai.md
│               ├── ....md
│               └── science.md
├── test
│   ├── crossner
│   │   ├── base                      # with the extraction instruction added
│   │   │   ├── cross.json
│   │   │   ├── ai.json
│   │   │   ├── ....json
│   │   │   └── science.json
│   │   └── raw                       # original files
│   │       ├── ai.json
│   │       ├── ....json
│   │       └── science.json
│   ├── crossner-ec                   # with the expansion-extraction instruction added
│   │   └── {model_name}
│   │       ├── cross.ec.json
│   │       └── cross.base.jsonl      # results of the base extractor
│   ├── crossner-tc                   # with the type-correction instruction added
│   │   └── {model_name}
│   │       ├── cross.tc.{chunksize}.{top_k}.json
│   │       └── cross.ec.jsonl        # results of the expansion complementor
│   └── crossner-results
│       └── {model_name}
│           └── cross.tc.{chunksize}.{top_k}.jsonl   # final results
└── train
    ├── iepile
    │   └── train.ner.jsonl
    ├── iepile-augmentation
    │   └── train.ner.json
    └── iepile-ec
        └── {model_name}
            ├── train.base.jsonl
            └── train.ec.json
home
└── history                           # CLI execution history
models
├── bce-embedding                     # BCEmbedding model
├── bce-reranker                      # BCE reranker model
└── trained_llm                       # trained LLMs
src                                   # script files
└── run.py                            # run this via the CLI!
```
Run the CLI and enter `build_database`.
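The CLI is interactive; a typical session looks like this (a sketch, assuming `src/run.py` is the entry point flagged in the layout above and that the menu items match the wording used in the steps below):

```bash
python src/run.py
# at the interactive prompt, type the step you want, e.g.:
#   build_database
# the same entry point serves the train, test, and evaluate steps below
```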
For the base extractor, we directly use the trained weights provided by IEPile.
We first obtain the augmented IEPile dataset:

- Run the CLI, enter `train`, then choose `Get the augmented iepile dataset.` to get `train.ner.json` in `./data/train/iepile-augmentation`.
- Use the following LLaMA-Factory command to get the base extractor's inference results on `train.ner.json`:
```bash
llamafactory-cli train \
    --model_name_or_path {model_name} \
    --template {model_template} \
    --flash_attn auto \
    --dataset_dir ./data/train/iepile-augmentation \
    --dataset train.ner \
    --cutoff_len 8192 \
    --max_samples 1000000 \
    --per_device_eval_batch_size 8 \
    --predict_with_generate True \
    --max_new_tokens 1024 \
    --top_p 0.1 \
    --temperature 0.01 \
    --repetition_penalty 1.1 \
    --output_dir ./data/train/iepile-ec/{model_name} \
    --do_predict True \
    --fp16 True
```
- Rename the inference result to `train.base.jsonl` and put it in `./data/train/iepile-ec/{model_name}` (see the rename sketch just below), then go to the CLI, enter `train`, and choose `Get the extension correction iepile dataset.` (no need to process dev) to get `train.ec.json` in `./data/train/iepile-ec/{model_name}`.
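LLaMA-Factory's `--do_predict` run normally writes its outputs, including `generated_predictions.jsonl`, into `--output_dir`, so the rename is a single `mv` (a sketch; the default output filename is worth verifying against your LLaMA-Factory version). The same pattern applies to the rename steps in the test pipeline below.

```bash
# Assumes LLaMA-Factory wrote generated_predictions.jsonl into --output_dir
mv ./data/train/iepile-ec/{model_name}/generated_predictions.jsonl \
   ./data/train/iepile-ec/{model_name}/train.base.jsonl
```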
- Use the following LLaMA-Factory command to train the base extractor on `train.ec.json`; this yields the expansion complementor:
```bash
llamafactory-cli train \
    --stage sft \
    --do_train True \
    --model_name_or_path {model_name} \
    --preprocessing_num_workers 16 \
    --finetuning_type lora \
    --quantization_method bitsandbytes \
    --template {model_template} \
    --flash_attn auto \
    --dataset_dir ./data/train/iepile-ec/{model_name} \
    --dataset train.ec \
    --cutoff_len 2048 \
    --learning_rate 5e-05 \
    --num_train_epochs 3.0 \
    --max_samples 100000 \
    --per_device_train_batch_size 64 \
    --gradient_accumulation_steps 8 \
    --lr_scheduler_type cosine \
    --max_grad_norm 1.0 \
    --logging_steps 5 \
    --save_steps 100 \
    --warmup_steps 0 \
    --optim adamw_torch \
    --packing False \
    --report_to none \
    --output_dir ./models/trained_llm/{model_name}/expansion_complementor \
    --fp16 True \
    --plot_loss True \
    --ddp_timeout 180000000 \
    --include_num_input_tokens_seen True \
    --lora_rank 8 \
    --lora_alpha 16 \
    --lora_dropout 0 \
    --lora_target all
```
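Since this trains with `--finetuning_type lora`, the output directory holds a LoRA adapter rather than full model weights. When you later run the expansion complementor (e.g. on `cross.ec.json` below), you will likely need to load the adapter explicitly or merge it first; a sketch using standard LLaMA-Factory options (the merged-model path is our naming, and whether your setup needs this at all is an assumption worth checking):

```bash
# Option A: load the adapter at inference time by adding, to the
# llamafactory-cli train --do_predict command:
#   --adapter_name_or_path ./models/trained_llm/{model_name}/expansion_complementor

# Option B: merge the adapter into a standalone model
llamafactory-cli export \
    --model_name_or_path {model_name} \
    --adapter_name_or_path ./models/trained_llm/{model_name}/expansion_complementor \
    --template {model_template} \
    --finetuning_type lora \
    --export_dir ./models/trained_llm/{model_name}-expansion-merged
```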
- First you need to get the basic CrossNER dataset. Run the CLI, enter `test`, then choose `Get the base crossner dataset` (num_schema: 3) to get `cross.json` in `./data/test/crossner`.
- Use the following LLaMA-Factory command to get the base extractor's inference results on `cross.json`:
```bash
llamafactory-cli train \
    --model_name_or_path {model_name} \
    --template {model_template} \
    --flash_attn auto \
    --dataset_dir ./data/test/crossner/base \
    --dataset cross \
    --cutoff_len 8192 \
    --max_samples 1000000 \
    --per_device_eval_batch_size 8 \
    --predict_with_generate True \
    --max_new_tokens 1024 \
    --top_p 0.1 \
    --temperature 0.01 \
    --repetition_penalty 1.1 \
    --output_dir ./data/test/crossner-ec/{model_name} \
    --do_predict True \
    --fp16 True
```
- Rename the inference result to `cross.base.jsonl` and put it in `./data/test/crossner-ec/{model_name}`, then go to the CLI, enter `test`, and choose `Get the extension correction crossner dataset.` to get `cross.ec.json` in `./data/test/crossner-ec/{model_name}`.
- Use the following LLaMA-Factory command to get the expansion complementor's inference results on `cross.ec.json` (this step uses the expansion complementor trained above; see the note after its training command about loading the LoRA adapter):
```bash
llamafactory-cli train \
    --model_name_or_path {model_name} \
    --template {model_template} \
    --flash_attn auto \
    --dataset_dir ./data/test/crossner-ec/{model_name} \
    --dataset cross.ec \
    --cutoff_len 8192 \
    --max_samples 1000000 \
    --per_device_eval_batch_size 8 \
    --predict_with_generate True \
    --max_new_tokens 1024 \
    --top_p 0.1 \
    --temperature 0.01 \
    --repetition_penalty 1.1 \
    --output_dir ./data/test/crossner-tc/{model_name} \
    --do_predict True \
    --fp16 True
```
- Rename the inference result to `cross.ec.jsonl` and put it in `./data/test/crossner-tc/{model_name}`, then go to the CLI, enter `test`, and choose `Get the documents-based correction crossner dataset.` (top_n: 20) to get `cross.tc.{chunksize}.{top_k}.json` in `./data/test/crossner-tc/{model_name}`.
- Use the following LLaMA-Factory command to get the vanilla model's inference results on `cross.tc.{chunksize}.{top_k}.json`:
```bash
llamafactory-cli train \
    --model_name_or_path {model_name} \
    --template {model_template} \
    --flash_attn auto \
    --dataset_dir ./data/test/crossner-tc/{model_name} \
    --dataset cross.tc.{chunksize}.{top_k} \
    --cutoff_len 8192 \
    --max_samples 1000000 \
    --per_device_eval_batch_size 8 \
    --predict_with_generate True \
    --max_new_tokens 1024 \
    --top_p 0.1 \
    --temperature 0.01 \
    --repetition_penalty 1.1 \
    --output_dir ./data/test/crossner-results/{model_name} \
    --do_predict True \
    --fp16 True
```
- Rename the inference result to `cross.tc.{chunksize}.{top_k}.jsonl` and put it in `./data/test/crossner-results/{model_name}`.
- Run the CLI, enter `evaluate`, then choose `Evaluate the results on crossner.`