The code for our paper Cross-lingual Contextualized Phrase Retrieval.
NOTE: Since this project contains many pipelines and each part was finished separately over the course of this long-term project, I have not tested the whole project from scratch again, which is one item on my TODO list. However, I think the code and scripts are helpful for people who are curious about how we implement our method. Please feel free to ask any questions about this project. My email address: li.huayang.lh6 [at] is.naist.jp
- Unify the python environment
- Test those scripts one more time
- Release the human annotated data for retrieval
- Release the code for training
- Release the pre-trained model
- Release the code for retrieval inference
- Release the code for MT inference
TODO: Unify the python environment
- [Preparing Training Data] GIZA++ requires python2.7
- [Training Model] Our project requires python3.9 + transformers==4.27.1
- [SFT LLM for MT] Platypus requires transformers>=4.31.0
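Until the environments are unified, one practical workaround is to keep an isolated environment per pipeline stage. A hypothetical sketch using conda (the environment names are our own, not part of the project; installing Python 2.7 may require an older conda channel):

```shell
# Hypothetical per-stage environments; names are illustrative.
conda create -n ccpr-giza python=2.7 -y    # preparing training data (GIZA++)
conda create -n ccpr-train python=3.9 -y   # training the CCPR model
conda run -n ccpr-train pip install transformers==4.27.1
conda create -n ccpr-sft python=3.9 -y     # SFT of the LLM (Platypus)
conda run -n ccpr-sft pip install "transformers>=4.31.0"
```

Activate the matching environment before running each stage's scripts.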
Below is a short explanation of the four critical folders:
- `mytools`: A library containing commonly used functions in this project, such as reading and saving files.
- `mgiza`: The code of GIZA++ for automatically inducing word alignment information from parallel data, which is important for collecting training data for CCPR.
- `code`: The main code for our project, including the code for the model, dataloader, indexing, searching, etc.
- `Platypus`: The code for the LLM-based translator. In our paper, we use the CCPR model to augment the LLM-based translator by integrating the retrieved information into the LLM prompt.
!!!Please install those libraries according to their README files!!!
Please ensure the `mytools` library is installed.
python ./mytools/mytools/hf_tools.py download_hf_model "sentence-transformers/LaBSE" ./huggingface/LaBSE
python ./mytools/mytools/hf_tools.py download_hf_model "FacebookAI/xlm-roberta-base" ./huggingface/xlm-roberta-base
Please also make sure you have the checkpoint of Llama-2-7B, which will be used for the MT task.
for L1 in "de" "cs" "fi" "ru" "ro" "tr"
do
python ./mytools/mytools/hf_tools.py download_wmt --ds_name wmt16 --lang_pair ${L1}-en --save_path ./wmt16_${L1}en
done
If you want to pre-process the human-annotated word alignment data yourself, please download the raw data as follows:
| Pair | URL | Re-name |
|---|---|---|
| De->En | https://www-i6.informatik.rwth-aachen.de/goldAlignment/ | ./align_data/DeEn |
| Cs->En | https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-1804 | ./align_data/CzEn |
| Ro->En | http://web.eecs.umich.edu/~mihalcea/wpt05/data/Romanian-English.test.tar.gz | ./align_data/RoEn |
# make sure you are under the root directory of the project
mkdir -p newscrawl
cd newscrawl
YY=16 # an example
LANG=tr # an example
wget https://data.statmt.org/news-crawl/${LANG}/news.20${YY}.${LANG}.shuffled.deduped.gz
gzip -d news.20${YY}.${LANG}.shuffled.deduped.gz
where `YY` is the last two digits of the year, e.g., `16`, and `LANG` is the language code of the data, e.g., `tr`.
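After decompression, it can be useful to sanity-check the corpus size. A minimal sketch (the helper name is ours, not part of the project's tooling); it also handles the still-compressed `.gz` file:

```python
import gzip

def count_sentences(path):
    """Count newline-separated sentences in a (possibly gzipped) corpus file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as f:
        return sum(1 for _ in f)
```

For example, `count_sentences("./newscrawl/news.2016.tr.shuffled.deduped")` reports how many monolingual sentences were downloaded.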
Before running the following script, please remember to complete some configs in the script, e.g., the path to Python, and also make sure that you have installed the required libraries.
Please download the pre-processed data-bin from [this link], and put it in the root directory of this project, i.e., this folder.
You can download our pre-trained retriever through this link. If you want to train your own model, please refer to Section 5.3. In addition, please download the pre-processed human-annotated data and the indices of high-quality phrases selected by humans, and put them under the root directory of our project (this folder).
cd code
bash eval_retriever.sh
If you want to pre-process your own data for retrieval, please uncomment the code for data processing in `eval_retriever.sh`.
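`eval_retriever.sh` computes the paper's retrieval metrics against the human-annotated data. As a rough illustration of how phrase-level retrieval can be scored against gold annotations, here is a generic precision/recall sketch (an illustration only, not the script's actual metric):

```python
def phrase_retrieval_prf(predicted, gold):
    """Precision/recall/F1 of predicted cross-lingual phrase pairs against a
    gold set. Pairs are (source_phrase, target_phrase) tuples."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)              # correctly retrieved pairs
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```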
Step-1: Model training
Please unzip the pre-trained retriever and save the `ckpts` folder to the root path of this project (this folder). If you don't want to use the pre-trained model, please see Section 5.3 to train your own model.
Step-2: Data Processing, Indexing & Searching
Please make sure you have downloaded the NewsCrawl monolingual data and installed the required libraries.
cd code
bash index_and_search.sh
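`index_and_search.sh` embeds phrases, builds an index, and searches it; the exact details are in the script (dense indexing is typically delegated to a dedicated library such as FAISS). For intuition, here is a minimal numpy sketch of the underlying idea: cosine-similarity search over normalized phrase embeddings (function names are ours):

```python
import numpy as np

def build_index(embeddings):
    """L2-normalize phrase embeddings so inner product equals cosine similarity."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    return embeddings / np.clip(norms, 1e-12, None)

def search(index, query, k=5):
    """Return indices and scores of the top-k most similar phrases for one query."""
    q = query / max(np.linalg.norm(query), 1e-12)
    scores = index @ q
    top = np.argsort(-scores)[:k]
    return top, scores[top]
```

A real index over millions of phrases would use an approximate nearest-neighbor structure instead of the brute-force matrix product shown here.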
Step-4: Instruction-tuning LLM for translation
Please unzip the pre-trained LLM-based translator and save the `ckpts` folder to the root path of this project (this folder). If you don't want to use the pre-trained model, prepare the training data and train it yourself following the README of Platypus. For the training data, you need to sample examples from the WMT training dataset and retrieve cross-lingual phrases using the index built in the previous step.
# skip this if you want to use the pre-trained translator
cd Platypus
bash fine-tuning.sh
Step-5: Decoding & Reporting Score
You can use the following script to prepare the prompts for translation. Note that you can also prepare the training data based on the same script.
cd Platypus
bash prepare_alpaca_data_phrase_test_enxx.sh
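The script above builds the translation prompts. The template below is only an illustrative guess at how retrieved phrase pairs can be injected into an LLM prompt; the actual template used in the paper is defined in `prepare_alpaca_data_phrase_test_enxx.sh`:

```python
def build_prompt(source, src_lang, tgt_lang, phrase_pairs):
    """Compose a translation prompt augmented with retrieved phrase pairs.
    The template is illustrative, not the one used in the paper."""
    hints = "\n".join(f'- "{s}" -> "{t}"' for s, t in phrase_pairs)
    return (
        f"Translate the following {src_lang} sentence into {tgt_lang}.\n"
        f"Relevant phrase translations:\n{hints}\n"
        f"Sentence: {source}\nTranslation:"
    )
```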
Then, run the script for decoding and evaluation. The score for the method will be printed.
cd Platypus
bash inference_test.sh
Please check whether you need to set project configs, e.g., the project path, for each script.
Step-1: get the word alignment information from parallel data using GIZA++
cd mgiza/mgizapp/build
bash install.sh
bash run.sh
Step-2: run the following code to automatically induce the cross-lingual phrase pairs from the parallel data.
cd code
bash preprocess_data.sh
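`preprocess_data.sh` induces cross-lingual phrase pairs from the GIZA++ alignments. For intuition, here is a compact sketch of the classic consistent-phrase-extraction idea over one word-aligned sentence pair; it is a simplification (it skips the usual extension over unaligned boundary words), not the project's exact procedure:

```python
def extract_phrase_pairs(src_len, tgt_len, alignment, max_len=4):
    """Extract phrase pairs consistent with a word alignment.
    `alignment` is a set of (src_idx, tgt_idx) links, 0-based."""
    pairs = []
    for i1 in range(src_len):
        for i2 in range(i1, min(i1 + max_len, src_len)):
            # target positions linked to the source span [i1, i2]
            tgt_pos = {t for s, t in alignment if i1 <= s <= i2}
            if not tgt_pos:
                continue
            j1, j2 = min(tgt_pos), max(tgt_pos)
            if j2 - j1 >= max_len:
                continue
            # consistency: no alignment link may cross the phrase-pair box
            if all(i1 <= s <= i2 for s, t in alignment if j1 <= t <= j2):
                pairs.append(((i1, i2), (j1, j2)))
    return pairs
```

For "das Haus" / "the house" with links {(0,0), (1,1)}, this yields the pairs (das, the), (Haus, house), and (das Haus, the house) as index spans.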
Step-3: train model
cd code
bash train_retriever_labse_multilingual.sh