The code for our paper Cross-lingual Contextualized Phrase Retrieval.
NOTE: Since this project contains many pipelines and each part was finished separately over the course of this long-term project, I have not tested the whole project from scratch again, which is one item on my TODO list. However, I think the code and scripts are helpful for people who are curious about how we implement our method. Please feel free to ask any questions about this project. My email address: li.huayang.lh6 [at] is.naist.jp
- Unify the python environment
- Test those scripts one more time
- Release the human annotated data for retrieval
- Release the code for training
- Release the pre-trained model
- Release the code for retrieval inference
- Release the code for MT inference
TODO: Unify the python environment
- [Preparing Training Data] GIZA++ requires python2.7
- [Training Model] Our project requires python3.9 + transformers==4.27.1
- [SFT LLM for MT] Platypus requires transformers>=4.31.0
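Until the environments are unified, one practical workaround is to keep an isolated environment per pipeline stage. A hypothetical sketch using conda (the environment names are our own, not part of the project; installing Python 2.7 may require an older conda channel):

```shell
# Hypothetical per-stage environments; names are illustrative.
conda create -n ccpr-giza python=2.7 -y    # preparing training data (GIZA++)
conda create -n ccpr-train python=3.9 -y   # training the CCPR model
conda run -n ccpr-train pip install transformers==4.27.1
conda create -n ccpr-sft python=3.9 -y     # SFT of the LLM (Platypus)
conda run -n ccpr-sft pip install "transformers>=4.31.0"
```

Activate the matching environment before running each stage's scripts.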
Below is a short explanation of the four critical folders:
- `mytools`: A library containing commonly used functions in this project, such as reading and saving files.
- `mgiza`: The code of GIZA++ for automatically inducing word alignment information from parallel data, which is important for collecting training data for CCPR.
- `code`: The main code for our project, including the code for the model, dataloader, indexing, searching, etc.
- `Platypus`: The code for the LLM-based translator. In our paper, we use the CCPR model to augment the LLM-based translator by integrating the retrieved information into the LLM prompt.
!!!Please install those libraries according to their README files!!!
Please ensure the `mytools` library is installed.
python ./mytools/mytools/hf_tools.py download_hf_model "sentence-transformers/LaBSE" ./huggingface/LaBSE
python ./mytools/mytools/hf_tools.py download_hf_model "FacebookAI/xlm-roberta-base" ./huggingface/xlm-roberta-base
Please also make sure you have the checkpoint of Llama-2-7B, which will be used for the MT task.
for L1 in "de" "cs" "fi" "ru" "ro" "tr"
do
python ./mytools/mytools/hf_tools.py download_wmt --ds_name wmt16 --lang_pair ${L1}-en --save_path ./wmt16_${L1}en
done
If you want to pre-process the human-annotated word alignment data yourself, please download the raw data as follows:
| Pair | URL | Re-name |
|---|---|---|
| De->En | https://www-i6.informatik.rwth-aachen.de/goldAlignment/ | ./align_data/DeEn |
| Cs->En | https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-1804 | ./align_data/CzEn |
| Ro->En | http://web.eecs.umich.edu/~mihalcea/wpt05/data/Romanian-English.test.tar.gz | ./align_data/RoEn |
# make sure you are under the root directory of the project
mkdir -p newscrawl
cd newscrawl
YY=16 # an example
LANG=tr # an example
wget https://data.statmt.org/news-crawl/${LANG}/news.20${YY}.${LANG}.shuffled.deduped.gz
gzip -d news.20${YY}.${LANG}.shuffled.deduped.gz
where `YY` is the last two digits of the year, e.g., `16`, and `LANG` is the language code of the data, e.g., `tr`.
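After decompression, it can be useful to sanity-check the corpus size. A minimal sketch (the helper name is ours, not part of the project's tooling); it also handles the still-compressed `.gz` file:

```python
import gzip

def count_sentences(path):
    """Count newline-separated sentences in a (possibly gzipped) corpus file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as f:
        return sum(1 for _ in f)
```

For example, `count_sentences("./newscrawl/news.2016.tr.shuffled.deduped")` reports how many monolingual sentences were downloaded.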
Before running the following script, please remember to complete some configs in the script, e.g., the path to Python, and also make sure that you have installed the required libraries.
Please download the pre-processed data-bin from [this link], and put it in the root directory of this project, i.e., this folder.
You can download our pre-trained retriever through this link. If you want to train your own model, please refer to Section 5.3. In addition, please download the pre-processed human-annotated data and the indices of high-quality phrases selected by humans, and put them under the root directory of our project (this folder).
cd code
bash eval_retriever.sh
If you want to pre-process your own data for retrieval, please uncomment the code for data processing in `eval_retriever.sh`.
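`eval_retriever.sh` computes the paper's retrieval metrics against the human-annotated data. As a rough illustration of how phrase-level retrieval can be scored against gold annotations, here is a generic precision/recall sketch (an illustration only, not the script's actual metric):

```python
def phrase_retrieval_prf(predicted, gold):
    """Precision/recall/F1 of predicted cross-lingual phrase pairs against a
    gold set. Pairs are (source_phrase, target_phrase) tuples."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)              # correctly retrieved pairs
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```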
Step-1: Model training
Please unzip the pre-trained retriever and save the `ckpts` folder to the root path of this project (this folder). If you don't want to use the pre-trained model, please see Section 5.3 to train your own model.
Step-2: Data Processing, Indexing & Searching
Please make sure you have downloaded the NewsCrawl monolingual data and installed the required libraries.
cd code
bash index_and_search.sh
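`index_and_search.sh` embeds phrases, builds an index, and searches it; the exact details are in the script (dense indexing is typically delegated to a dedicated library such as FAISS). For intuition, here is a minimal numpy sketch of the underlying idea: cosine-similarity search over normalized phrase embeddings (function names are ours):

```python
import numpy as np

def build_index(embeddings):
    """L2-normalize phrase embeddings so inner product equals cosine similarity."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    return embeddings / np.clip(norms, 1e-12, None)

def search(index, query, k=5):
    """Return indices and scores of the top-k most similar phrases for one query."""
    q = query / max(np.linalg.norm(query), 1e-12)
    scores = index @ q
    top = np.argsort(-scores)[:k]
    return top, scores[top]
```

A real index over millions of phrases would use an approximate nearest-neighbor structure instead of the brute-force matrix product shown here.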
Step-4: Instruction-tuning LLM for translation
Please unzip the pre-trained LLM-based translator and save the `ckpts` folder to the root path of this project (this folder). If you don't want to use the pre-trained model, prepare the training data and train it yourself following the README of Platypus. For the training data, you need to sample examples from the WMT training dataset and retrieve cross-lingual phrases using the index built in the previous step.
# skip this if you want to use the pre-trained translator
cd Platypus
bash fine-tuning.sh
Step-5: Decoding & Reporting Score
You can use the following script to prepare the prompts for translation. Note that you can also prepare the training data based on the same script.
cd Platypus
bash prepare_alpaca_data_phrase_test_enxx.sh
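The script above builds the translation prompts. The template below is only an illustrative guess at how retrieved phrase pairs can be injected into an LLM prompt; the actual template used in the paper is defined in `prepare_alpaca_data_phrase_test_enxx.sh`:

```python
def build_prompt(source, src_lang, tgt_lang, phrase_pairs):
    """Compose a translation prompt augmented with retrieved phrase pairs.
    The template is illustrative, not the one used in the paper."""
    hints = "\n".join(f'- "{s}" -> "{t}"' for s, t in phrase_pairs)
    return (
        f"Translate the following {src_lang} sentence into {tgt_lang}.\n"
        f"Relevant phrase translations:\n{hints}\n"
        f"Sentence: {source}\nTranslation:"
    )
```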
Then, run the script for decoding and evaluation. The score for the method will be printed.
cd Platypus
bash inference_test.sh
Please check whether you need to set project configs, e.g., the project path, for each script.
Step-1: get the word alignment information from parallel data using GIZA++
cd mgiza/mgizapp/build
bash install.sh
bash run.sh
Step-2: run the following code to automatically induce the cross-lingual phrase pairs from the parallel data.
cd code
bash preprocess_data.sh
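`preprocess_data.sh` induces cross-lingual phrase pairs from the GIZA++ alignments. For intuition, here is a compact sketch of the classic consistent-phrase-extraction idea over one word-aligned sentence pair; it is a simplification (it skips the usual extension over unaligned boundary words), not the project's exact procedure:

```python
def extract_phrase_pairs(src_len, tgt_len, alignment, max_len=4):
    """Extract phrase pairs consistent with a word alignment.
    `alignment` is a set of (src_idx, tgt_idx) links, 0-based."""
    pairs = []
    for i1 in range(src_len):
        for i2 in range(i1, min(i1 + max_len, src_len)):
            # target positions linked to the source span [i1, i2]
            tgt_pos = {t for s, t in alignment if i1 <= s <= i2}
            if not tgt_pos:
                continue
            j1, j2 = min(tgt_pos), max(tgt_pos)
            if j2 - j1 >= max_len:
                continue
            # consistency: no alignment link may cross the phrase-pair box
            if all(i1 <= s <= i2 for s, t in alignment if j1 <= t <= j2):
                pairs.append(((i1, i2), (j1, j2)))
    return pairs
```

For "das Haus" / "the house" with links {(0,0), (1,1)}, this yields the pairs (das, the), (Haus, house), and (das Haus, the house) as index spans.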
Step-3: train model
cd code
bash train_retriever_labse_multilingual.sh