This repository is the official implementation of the paper Setting up the Data Printer with Improved English to Ukrainian Machine Translation (accepted to UNLP 2024 at LREC-COLING 2024).
Using a two-phase data cleaning and data selection approach, we achieved state-of-the-art performance on the FLORES-101 English-Ukrainian devtest subset with a BLEU score of 32.34.
Online demo: https://huggingface.co/spaces/lang-uk/dragoman
We designed this model for sentence-level English -> Ukrainian translation. Please be aware that performance on multi-sentence texts is not guaranteed.
# pip install bitsandbytes transformers peft torch
import torch
from peft import PeftConfig, PeftModel
from transformers import AutoTokenizer, BitsAndBytesConfig, MistralForCausalLM

config = PeftConfig.from_pretrained("lang-uk/dragoman")
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=False,
)

model = MistralForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1", quantization_config=quant_config
)
model = PeftModel.from_pretrained(model, "lang-uk/dragoman").to("cuda")

tokenizer = AutoTokenizer.from_pretrained(
    "mistralai/Mistral-7B-v0.1", use_fast=False, add_bos_token=False
)

input_text = "[INST] who holds this neighborhood? [/INST]"  # model input should adhere to this format
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
outputs = model.generate(**input_ids)
print(tokenizer.decode(outputs[0]))
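The snippet above uses the default greedy decoding. The scores reported below were obtained with beam search (10 beams); a minimal sketch of such a generation call follows, where max_new_tokens and early_stopping are illustrative assumptions rather than the paper's exact settings:

# Beam-search decoding; num_beams follows the "10 beams" setting in the results table,
# the remaining arguments are illustrative assumptions.
outputs = model.generate(
    **input_ids,
    num_beams=10,
    max_new_tokens=100,
    early_stopping=True,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))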
We merged the Dragoman PT adapter into the base model and uploaded the quantized version to https://huggingface.co/lang-uk/dragoman-4bit.
You can run the model using mlx-lm.
python -m mlx_lm.generate --model lang-uk/dragoman-4bit --prompt '[INST] who holds this neighborhood? [/INST]' --temp 0 --max-tokens 100
MLX is the recommended way to run the model on Apple computers with an M1 chip or newer.
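If you prefer calling the model from Python instead of the CLI, mlx-lm also exposes a small programmatic API; a minimal sketch (argument names follow the mlx-lm documentation at the time of writing and may differ between versions):

# pip install mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("lang-uk/dragoman-4bit")
prompt = "[INST] who holds this neighborhood? [/INST]"
print(generate(model, tokenizer, prompt=prompt, max_tokens=100))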
We converted the Dragoman PT adapter into the GGLA format.
You can download the Mistral-7B-v0.1 base model in GGUF format (e.g., mistral-7b-v0.1.Q4_K_M.gguf) and use the ggml-adapter-model.bin file from this repository like this:
./main -ngl 32 -m mistral-7b-v0.1.Q4_K_M.gguf --color -c 4096 --temp 0 --repeat_penalty 1.1 -n -1 -p "[INST] who holds this neighborhood? [/INST]" --lora ./ggml-adapter-model.bin
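As an alternative to the llama.cpp CLI, the same base GGUF plus LoRA adapter combination can be loaded through the llama-cpp-python bindings; a minimal sketch, assuming both files sit in the current directory (parameter values are illustrative):

# pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama(
    model_path="./mistral-7b-v0.1.Q4_K_M.gguf",  # base model in GGUF format
    lora_path="./ggml-adapter-model.bin",        # Dragoman PT adapter from this repository
    n_ctx=4096,
    n_gpu_layers=32,
)
out = llm("[INST] who holds this neighborhood? [/INST]", max_tokens=100, temperature=0.0)
print(out["choices"][0]["text"])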
Evaluation results on the FLORES-101 English-Ukrainian devtest subset:

| Model | BLEU | spBLEU | chrF | chrF++ |
|---|---|---|---|---|
| Finetuned | | | | |
| Dragoman P, 10 beams | 30.38 | 37.93 | 59.49 | 56.41 |
| Dragoman PT, 10 beams | 32.34 | 39.93 | 60.72 | 57.82 |
| Zero shot and few shot | | | | |
| LLaMa-2-7B 2-shot | 20.1 | 26.78 | 49.22 | 46.29 |
| RWKV-5-World-7B 0-shot | 21.06 | 26.20 | 49.46 | 46.46 |
| gpt-4 10-shot | 29.48 | 37.94 | 58.37 | 55.38 |
| gpt-4-turbo-preview 0-shot | 30.36 | 36.75 | 59.18 | 56.19 |
| Google Translate 0-shot | 25.85 | 32.49 | 55.88 | 52.48 |
| Pretrained | | | | |
| NLLB 3B, 10 beams | 30.46 | 37.22 | 58.11 | 55.32 |
| OPUS-MT, 10 beams | 32.2 | 39.76 | 60.23 | 57.38 |
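For reference, BLEU, spBLEU, chrF, and chrF++ can be computed with sacrebleu given system outputs and the FLORES references; a minimal sketch, where the flores101 tokenizer for spBLEU and the chrF settings reflect common sacrebleu usage and are not necessarily the repository's exact evaluation setup:

# pip install sacrebleu sentencepiece
import sacrebleu

hyps = ["..."]    # system translations, one string per devtest sentence
refs = [["..."]]  # FLORES-101 Ukrainian references (list of reference streams)

bleu = sacrebleu.corpus_bleu(hyps, refs)
spbleu = sacrebleu.corpus_bleu(hyps, refs, tokenize="flores101")  # spBLEU
chrf = sacrebleu.corpus_chrf(hyps, refs)
chrf_pp = sacrebleu.corpus_chrf(hyps, refs, word_order=2)         # chrF++
print(bleu.score, spbleu.score, chrf.score, chrf_pp.score)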
- Cleaned Paracrawl (first phase): lang-uk/paracrawl_3m
- Cleaned Multi30K (second phase): lang-uk/multi30k-extended-17k
For more details, please refer to the paper.
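Both cleaned datasets can be pulled directly from the Hugging Face Hub with the datasets library; a minimal sketch (split and column names are not shown here, check the dataset cards):

# pip install datasets
from datasets import load_dataset

paracrawl_3m = load_dataset("lang-uk/paracrawl_3m")           # first-phase training data
multi30k_ext = load_dataset("lang-uk/multi30k-extended-17k")  # second-phase selection data
print(paracrawl_3m)
print(multi30k_ext)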
- First phase: data cleaning of the Paracrawl dataset.
- Second phase: unsupervised data selection using k-fold perplexity filtering on Extended Multi30k-Uk (a conceptual sketch follows below).
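Conceptually, the second phase trains one model per fold on the other folds, scores each held-out sentence pair by perplexity, and keeps only pairs below a threshold. A minimal illustrative sketch of the scoring and filtering step, not the repository's actual API (the prompt format and helper names are assumptions):

import math
import torch

def perplexity(model, tokenizer, text):
    # Perplexity of one example under a fold model that did not see it during training.
    enc = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        loss = model(**enc, labels=enc["input_ids"]).loss
    return math.exp(loss.item())

def select_pairs(fold_model, tokenizer, held_out_pairs, threshold=60.0):
    # Keep pairs whose perplexity is below the threshold
    # (60 mirrors the --threshold 60 used in the commands below).
    kept = []
    for src, tgt in held_out_pairs:
        text = f"[INST] {src} [/INST] {tgt}"  # assumed training format, based on the inference template above
        if perplexity(fold_model, tokenizer, text) < threshold:
            kept.append((src, tgt))
    return kept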
# export turuta/Multi30k-uk to a local dataset
python download_dataset.py
# generate k-folds for perplexity evaluation
python generate_dataset.py --N 29_000 --dataset multi-30k-uk.jsonl
# train 5 models on 5 folds, resume from previous phase
python finetune_ppl.py --N 29_000 --run_type folds --prefix fold-training --lora_checkpoint exps/dragoman-p --lr 2e-5
# calculate perplexity on the held-out (out-of-fold) data
python perplexity_evaluate.py --N 29_000 --lr 2e-5 --prefix fold-training
# apply perplexity filtering
python ppl_analysis.py --threshold 60
# train on the selected data
python finetune_ppl.py --N 29_000 --run_type cleaned --prefix fold-training --lora_checkpoint exps/dragoman-p --lr 2e-5
# train on the full data for comparison
python finetune_ppl.py --N 29_000 --run_type full --prefix fold-training --lora_checkpoint exps/dragoman-p --lr 2e-5
# evaluate on flores dev
python decode.py --checkpoint exps/fold-training_epochs_1_lr_2e-05_R_128_ALPH_256_N_29000_full --subset dev
# evaluate on flores devtest
python decode.py --checkpoint exps/fold-training_epochs_1_lr_2e-05_R_128_ALPH_256_N_29000_full --subset devtest
@inproceedings{paniv-etal-2024-setting,
title = "Setting up the Data Printer with Improved {E}nglish to {U}krainian Machine Translation",
author = "Paniv, Yurii and
Chaplynskyi, Dmytro and
Trynus, Nikita and
Kyrylov, Volodymyr",
editor = "Romanyshyn, Mariana and
Romanyshyn, Nataliia and
Hlybovets, Andrii and
Ignatenko, Oleksii",
booktitle = "Proceedings of the Third Ukrainian Natural Language Processing Workshop (UNLP) @ LREC-COLING 2024",
month = may,
year = "2024",
address = "Torino, Italia",
publisher = "ELRA and ICCL",
url = "https://aclanthology.org/2024.unlp-1.6",
pages = "41--50",
abstract = "To build large language models for Ukrainian we need to expand our corpora with large amounts of new algorithmic tasks expressed in natural language. Examples of task performance expressed in English are abundant, so with a high-quality translation system our community will be enabled to curate datasets faster. To aid this goal, we introduce a recipe to build a translation system using supervised finetuning of a large pretrained language model with a noisy parallel dataset of 3M pairs of Ukrainian and English sentences followed by a second phase of training using 17K examples selected by k-fold perplexity filtering on another dataset of higher quality. Our decoder-only model named Dragoman beats performance of previous state of the art encoder-decoder models on the FLORES devtest set.",
}
Yurii Paniv, Dmytro Chaplynskyi, Nikita Trynus, Volodymyr Kyrylov