Please cite:
@article{sr-codemix,
author = {Mieradilijiang Maimaiti and
Yuanhang Zheng and
Ji Zhang and
Fei Huang and
Yue Zhang and
Wenpei Luo and
Kaiyu Huang},
title = {Improving Cross-lingual Representation for Semantic Retrieval with
Code-switching},
journal = {arXiv preprint arXiv:2403.01364},
year = {2024},
}
- Prepare the query and label data (e.g. `raw_data/Tatoeba.de-en.en` and `raw_data/Tatoeba.de-en.de`)
Both the query and label files should contain one sentence per line.
Example:
`raw_data/Tatoeba.de-en.en`:
Let 's try something .
What is it ?
Today is June 18th and it is Muiriel 's birthday !
...
`raw_data/Tatoeba.de-en.de`:
Lass uns etwas versuchen !
Was ist das ?
Heute ist der 18. Juni und das ist der Geburtstag von Muiriel !
...
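A quick sanity check on the two files can catch problems before the later steps. The sketch below assumes that line i of the query file pairs with line i of the label file, as the Tatoeba example suggests; `read_sentences` and `check_parallel` are hypothetical helpers, not part of this repository.

```python
from pathlib import Path


def read_sentences(path):
    """Return the lines of a one-sentence-per-line file, without newlines."""
    return Path(path).read_text(encoding="utf-8").splitlines()


def check_parallel(query_path, label_path):
    """Check that both files have the same number of non-empty lines.

    Assumption: line i of the query file pairs with line i of the label file.
    Returns the number of sentence pairs.
    """
    queries = read_sentences(query_path)
    labels = read_sentences(label_path)
    if len(queries) != len(labels):
        raise ValueError(
            f"line count mismatch: {len(queries)} queries vs {len(labels)} labels"
        )
    for i, (query, label) in enumerate(zip(queries, labels), start=1):
        if not query.strip() or not label.strip():
            raise ValueError(f"empty sentence on line {i}")
    return len(queries)
```

For example, `check_parallel("raw_data/Tatoeba.de-en.en", "raw_data/Tatoeba.de-en.de")` would return the number of sentence pairs or raise a `ValueError` describing the first problem found.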
- Prepare the dictionary data (e.g. `raw_data/en-de.txt` and `raw_data/de-en.txt`)
The dictionary data should follow the MUSE format.
You can download a dictionary directly from MUSE or prepare your own, for example from ConceptNet.
Example:
`raw_data/en-de.txt`:
the die
the der
the dem
the den
the das
and sowie
and und
was war
...
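As the example shows, each line holds one whitespace-separated source-target pair, and a source word may appear on several lines with different translations. A minimal sketch of reading such a file into a lookup table follows; `load_muse_dictionary` is a hypothetical helper for illustration and is not taken from `codemix.py`.

```python
from collections import defaultdict


def load_muse_dictionary(path):
    """Map each source word to the list of its translations, in file order."""
    translations = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) != 2:
                continue  # skip blank or malformed lines
            source, target = parts
            translations[source].append(target)
    return dict(translations)
```

With the example file above, the entry for `the` would hold `die`, `der`, `dem`, `den` and `das`, in that order.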
- Run the Python script `codemix.py`.
- Download the pre-trained language model from Hugging Face.
- Run the script `train.sh` in the directory `cntptm`.
- Run the script `train.sh` in the directory `ft` to train the model.
- Run the script `predict.sh` to obtain the vector representation of the test query and label files.
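Once both sets of vectors are available, retrieval typically amounts to ranking the label vectors by their similarity to each query vector. The sketch below is an illustration under stated assumptions, not the repository's evaluation code: it assumes the vectors are already loaded as lists of floats (the output format of `predict.sh` is not described here, so loading is left out) and uses cosine similarity as the score. `cosine` and `retrieve` are hypothetical helper names.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve(query_vectors, label_vectors):
    """For each query vector, return the index of the most similar label vector."""
    return [
        max(range(len(label_vectors)), key=lambda j: cosine(q, label_vectors[j]))
        for q in query_vectors
    ]
```

Each returned index points at the label sentence that the model considers closest to the corresponding query.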