
1. Introduction

The code for our paper Cross-lingual Contextualized Phrase Retrieval.

NOTE: Since this project contains many pipelines and each part was finished separately over the course of this long-term project, I have not tested the whole project from scratch again, which is one item on my TODO list. However, I think the code and scripts are helpful for people who are curious about how we implement our method. Please feel free to ask any questions about this project. My email address: li.huayang.lh6 [at] is.naist.jp

2. TODO List

  • Unify the python environment
  • Test those scripts one more time
  • Release the human annotated data for retrieval
  • Release the code for training
  • Release the pre-trained model
  • Release the code for retrieval inference
  • Release the code for MT inference

3. Environment

TODO: Unify the python environment

  1. [Preparing Training Data] GIZA++ requires python2.7
  2. [Training Model] Our project requires python3.9 + transformers==4.27.1
  3. [SFT LLM for MT] Platypus requires transformers>=4.31.0
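
Until the environments are unified, one workable setup is simply to keep three separate environments. The sketch below uses conda and is only an illustration; the environment names and the Python version used for Platypus are my own choices, not part of the project:

# Python 2.7 environment for GIZA++ data preparation
conda create -n ccpr-giza python=2.7
# Python 3.9 environment for training the retriever
conda create -n ccpr python=3.9
conda activate ccpr && pip install transformers==4.27.1
# separate environment for Platypus (SFT of the LLM translator)
conda create -n platypus python=3.10
conda activate platypus && pip install "transformers>=4.31.0"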

Below is a short explanation of the four key folders:

  1. mytools: This is a library containing commonly used functions in this project, such as reading and saving files.
  2. mgiza: The code of GIZA++ for automatically inducing word alignment information from parallel data, which is important for collecting training data for CCPR.
  3. code: The main code for our project, including the code for model, dataloader, indexing, searching, etc.
  4. Platypus: The code for the LLM-based translator. In our paper, we use the CCPR model to augment the LLM-based translator by integrating the retrieved information into the LLM prompt.

!!!Please install those libraries according to their README files!!!
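
As a rough guide only (the authoritative instructions are each sub-project's README), the Python parts can usually be installed as below; this assumes mytools ships a standard setup.py/pyproject and that Platypus provides a requirements.txt. Building mgiza is covered in Section 5.3:

# install the mytools helper library into the active environment
pip install -e ./mytools
# install the dependencies of the Platypus translator (in its own environment, see Section 3)
pip install -r ./Platypus/requirements.txt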

4. Download

4.1 HF Model

Please ensure the mytools library is installed.

python ./mytools/mytools/hf_tools.py download_hf_model "sentence-transformers/LaBSE" ./huggingface/LaBSE
python ./mytools/mytools/hf_tools.py download_hf_model "FacebookAI/xlm-roberta-base" ./huggingface/xlm-roberta-base

Please also make sure you have the checkpoint of Llama-2-7B, which will be used for the MT task.
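
If you do not have it locally yet, the same helper script can in principle be used to fetch it. Note that the repository id below refers to the gated official checkpoint on Hugging Face, so you must have been granted access and be logged in; adjust the id and path if you use a different copy:

python ./mytools/mytools/hf_tools.py download_hf_model "meta-llama/Llama-2-7b-hf" ./huggingface/Llama-2-7b-hf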

4.2 HF Dataset

for L1 in "de" "cs" "fi" "ru" "ro" "tr"
do
    python ./mytools/mytools/hf_tools.py download_wmt --ds_name wmt16 --lang_pair ${L1}-en --save_path ./wmt16_${L1}en
done

4.3 Human-annotated Word Alignment (for Retrieval Evaluation)

If you want to pre-process the human-annotated word alignment data yourself, please download the raw data as follows:

  • De->En: https://www-i6.informatik.rwth-aachen.de/goldAlignment/ → save as ./align_data/DeEn
  • Cs->En: https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-1804 → save as ./align_data/CzEn
  • Ro->En: http://web.eecs.umich.edu/~mihalcea/wpt05/data/Romanian-English.test.tar.gz → save as ./align_data/RoEn
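
A possible way to fetch the data and create the expected layout is sketched below. The De->En and Cs->En resources are distributed through landing pages, so they have to be downloaded manually; only the Ro->En archive can be fetched directly:

# create the expected folder layout
mkdir -p align_data/DeEn align_data/CzEn align_data/RoEn
# Ro->En test data can be downloaded and unpacked directly
wget http://web.eecs.umich.edu/~mihalcea/wpt05/data/Romanian-English.test.tar.gz -P align_data/RoEn
tar -xzf align_data/RoEn/Romanian-English.test.tar.gz -C align_data/RoEn
# De->En and Cs->En: download from the pages above and place the files under
# align_data/DeEn and align_data/CzEn respectively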

4.4 Newscrawl Monolingual Data (for MT Evaluation)

# make sure you are under the root directory of the project
mkdir -p newscrawl
cd newscrawl
YY=16 # an example
LANG=tr # an example
wget https://data.statmt.org/news-crawl/${LANG}/news.20${YY}.${LANG}.shuffled.deduped.gz
gzip -d news.20${YY}.${LANG}.shuffled.deduped.gz

where YY is the two-digit year, e.g., 16, and LANG is the language of the data, e.g., tr.
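
If you need the data for several languages, the same pattern can be wrapped in a loop. Which years are available differs per language, so check https://data.statmt.org/news-crawl/ first; the year below is only a placeholder:

YY=16
for LANG in de cs fi ru ro tr
do
    wget https://data.statmt.org/news-crawl/${LANG}/news.20${YY}.${LANG}.shuffled.deduped.gz
    gzip -d news.20${YY}.${LANG}.shuffled.deduped.gz
done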

5. Usage

5.1 Inference: Retrieval

Before running the following script, please remember to fill in some configuration values in the script, e.g., the path to your Python interpreter, and also make sure that you have installed the required libraries.

Please download the pre-processed data-bin from [this link], and put it in the root directory of this project, i.e., this folder.

You can download our pre-trained retriever through this link. If you want to train your own model, please refer to Section 5.3. In addition, please download the pre-processed human-annotated data and the indices of high-quality phrases selected by humans, and put them under the root directory of our project (this folder).
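
After downloading, a quick check that everything sits in the expected place might look like this; the folder names are illustrative and should match whatever is inside the downloaded archives:

# run from the project root
ls ./data-bin    # pre-processed retrieval data
ls ./ckpts       # pre-trained retriever checkpoint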

cd code
bash eval_retriever.sh

If you want to pre-process your own data for retrieval, please un-comment the data-processing code in eval_retriever.sh.

5.2 Inference: MT

Step-1: Model training

Please unzip the pre-trained retriever and save the ckpts folder to the root path of this project (this folder). If you don't want to use the pre-trained model, please see Section 5.3 to train your own model.
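
A minimal sketch of this step, assuming the retriever is distributed as a zip archive (the archive name below is illustrative; use the file you actually downloaded):

# run from the project root
unzip ccpr_retriever.zip    # or: tar -xzf <archive>.tar.gz
ls ./ckpts                  # the ckpts folder must end up at the project root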

Step-2: Data Processing & Indexing & Searching

Please make sure you have downloaded the newscrawl monolingual data and installed the required libraries.

cd code
bash index_and_search.sh

Step-4: Instruction-tuning LLM for translation

Please unzip the pre-trained LLM-based translator and save the ckpts folder to the root path of this project (this folder). If you don't want to use the pre-trained model, prepare the training data and train it yourself following the README of Platypus. For the training data, you need to sample from the WMT training dataset and retrieve cross-lingual phrases using the index built in the previous step.

# skip this if you want to use the pre-trained translator
cd Platypus
bash fine-tuning.sh

Step-5: Decoding & Reporting Score

You can use the following script to prepare the prompts for translation. Note that you can also prepare the training data based on the same script.

cd Platypus
bash prepare_alpaca_data_phrase_test_enxx.sh

Then, run the script for decoding and evaluation. The score for the method will be printed.

cd Platypus
bash inference_test.sh

5.3 CCPR Training

Please check whether you need to set project configs, e.g., the project path, for each script.

Step-1: Get the word alignment information from parallel data using GIZA++

cd mgiza/mgizapp/build
bash install.sh
bash run.sh

Step-2: Run the following code to automatically induce the cross-lingual phrase pairs from the parallel data.

cd code
bash preprocess_data.sh

Step-3: Train the model

cd code
bash train_retriever_labse_multilingual.sh