Japanese→Korean Translator specialized in Final Fantasy XIV
FINAL FANTASY is a registered trademark of Square Enix Holdings Co., Ltd.
[Model Link (Huggingface)] [Model Link (Github)] [Demo Link]
This project was originally created to solve an [issue] in IronworksTranslator: providing more accurate translations of Final Fantasy XIV in-game chat.
Papago and DeepL can be great choices in common situations, but not for the text of a specific game. So I'm building an alternative to help people who want to better understand and communicate in Japanese games.
You can try the web demo [Here].
This is an example Windows GUI app using the ONNX-converted model and ONNXRuntime. For more information, please visit [Here].
1.2 Training report
Before you run the code, make sure the required environment is installed.
Check `docker-compose.yml`. You may want to edit the following lines:
```yaml
devices:
  - driver: nvidia
    # device_ids: [ '1' ]
    count: 1
    capabilities: [ gpu ]
```
If you have only one GPU, use `count: 1`.
If you need to use a specific GPU, comment out the `count: 1` line and uncomment the `device_ids: [ '1' ]` line instead. Device IDs start from 0, so `1` means the 2nd GPU, as in the sketch below.
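For example, a reservation pinned to the 2nd GPU might look like this (a sketch; the actual ID depends on your hardware):

```yaml
devices:
  - driver: nvidia
    device_ids: [ '1' ]  # 2nd GPU; IDs start from 0
    capabilities: [ gpu ]
```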
Then you need to create the Docker volume `huggingface-cache`. Refer to the following command:
```bash
docker volume create --driver local --opt type=none --opt o=bind \
  --opt device=/mnt/disk1/huggingface-cache \
  huggingface-cache
```
If everything is set properly, run `docker compose build` and then `docker compose up -d` to start the container.
Check [requirements.txt]. You can install the dependencies with pip (`pip install -r requirements.txt`).
```python
from transformers import (
    EncoderDecoderModel,
    PreTrainedTokenizerFast,
    BertJapaneseTokenizer,
)
import torch

encoder_model_name = "cl-tohoku/bert-base-japanese-v2"
decoder_model_name = "skt/kogpt2-base-v2"

src_tokenizer = BertJapaneseTokenizer.from_pretrained(encoder_model_name)
trg_tokenizer = PreTrainedTokenizerFast.from_pretrained(decoder_model_name)

# Change `./best_model` below to the path of your model **directory**
model = EncoderDecoderModel.from_pretrained("./best_model")

text = "ギルガメッシュ討伐戦"
# text = "ギルガメッシュ討伐戦に行ってきます。一緒に行きましょうか?"

def translate(text_src):
    # Tokenize the Japanese source text into model inputs
    embeddings = src_tokenizer(text_src, return_attention_mask=False,
                               return_token_type_ids=False, return_tensors='pt')
    embeddings = {k: v for k, v in embeddings.items()}
    # Generate, then strip the leading BOS and trailing EOS tokens
    output = model.generate(**embeddings, max_length=500)[0, 1:-1]
    text_trg = trg_tokenizer.decode(output.cpu())
    return text_trg

print(translate(text))
```
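If you have a CUDA-capable GPU, you can run inference there instead. This is a minimal sketch, assuming a CUDA build of PyTorch; `translate_gpu` is a hypothetical variant, not part of the original example:

```python
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

def translate_gpu(text_src):
    embeddings = src_tokenizer(text_src, return_attention_mask=False,
                               return_token_type_ids=False, return_tensors='pt')
    # Move the input tensors to the same device as the model
    embeddings = {k: v.to(device) for k, v in embeddings.items()}
    with torch.no_grad():  # inference only; no gradients needed
        output = model.generate(**embeddings, max_length=500)[0, 1:-1]
    return trg_tokenizer.decode(output.cpu())
```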
Note that the current Optimum.OnnxRuntime still requires PyTorch as its backend. [Issue] You can use either the [ONNX] or the [quantized ONNX] model.
```python
from transformers import BertJapaneseTokenizer, PreTrainedTokenizerFast
from optimum.onnxruntime import ORTModelForSeq2SeqLM
from onnxruntime import SessionOptions
import torch

encoder_model_name = "cl-tohoku/bert-base-japanese-v2"
decoder_model_name = "skt/kogpt2-base-v2"

src_tokenizer = BertJapaneseTokenizer.from_pretrained(encoder_model_name)
trg_tokenizer = PreTrainedTokenizerFast.from_pretrained(decoder_model_name)

sess_options = SessionOptions()
sess_options.log_severity_level = 3  # mute warnings including CleanUnusedInitializersAndNodeArgs

# change subfolder to "onnxq" if you want to use the quantized model
model = ORTModelForSeq2SeqLM.from_pretrained("sappho192/ffxiv-ja-ko-translator",
                                             sess_options=sess_options, subfolder="onnx")
```
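As the comment above notes, loading the quantized model only changes the subfolder; the rest of the example below is identical either way:

```python
# Quantized ONNX variant (smaller, usually faster on CPU)
model = ORTModelForSeq2SeqLM.from_pretrained("sappho192/ffxiv-ja-ko-translator",
                                             sess_options=sess_options, subfolder="onnxq")
```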
```python
texts = [
    "逃げろ!",  # Should be "도망쳐!"
    "初めまして.",  # "반가워요"
    "よろしくお願いします.",  # "잘 부탁드립니다."
    "ギルガメッシュ討伐戦",  # "길가메쉬 토벌전"
    "ギルガメッシュ討伐戦に行ってきます。一緒に行きましょうか?",  # "길가메쉬 토벌전에 갑니다. 같이 가실래요?"
    "夜になりました",  # "밤이 되었습니다"
    "ご飯を食べましょう."  # "음, 이제 식사도 해볼까요"
]

def translate(text_src):
    embeddings = src_tokenizer(text_src, return_attention_mask=False,
                               return_token_type_ids=False, return_tensors='pt')
    print(f'Src tokens: {embeddings.data["input_ids"]}')
    embeddings = {k: v for k, v in embeddings.items()}
    # Generate, then strip the leading BOS and trailing EOS tokens
    output = model.generate(**embeddings, max_length=500)[0, 1:-1]
    print(f'Trg tokens: {output}')
    text_trg = trg_tokenizer.decode(output.cpu())
    return text_trg

for text in texts:
    print(translate(text))
    print()
```
Check `training.ipynb` for the full example.
Before you run the training code, make sure the datasets exist in the designated path. In `training.ipynb`, the dataset-related code is in the `Data` section, like below:
```python
DATA_ROOT = './output'
# FILE_FFAC_FULL = 'ffac_full.csv'
# FILE_FFAC_TEST = 'ffac_test.csv'
FILE_JA_KO_TRAIN = 'ja_ko_train.csv'
FILE_JA_KO_TEST = 'ja_ko_test.csv'

# train_dataset = PairedDataset(src_tokenizer, trg_tokenizer, f'{DATA_ROOT}/{FILE_FFAC_FULL}')
# eval_dataset = PairedDataset(src_tokenizer, trg_tokenizer, f'{DATA_ROOT}/{FILE_FFAC_TEST}')
train_dataset = PairedDataset(src_tokenizer, trg_tokenizer, f'{DATA_ROOT}/{FILE_JA_KO_TRAIN}')
eval_dataset = PairedDataset(src_tokenizer, trg_tokenizer, f'{DATA_ROOT}/{FILE_JA_KO_TEST}')
```
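`PairedDataset` is defined in `training.ipynb`. A minimal sketch of such a dataset, assuming the headerless two-column CSV layout described below, might look like this (hypothetical; the real implementation is in the notebook):

```python
import csv
from torch.utils.data import Dataset

class PairedDataset(Dataset):
    """Hypothetical sketch: yields tokenized (Japanese, Korean) sentence pairs."""
    def __init__(self, src_tokenizer, trg_tokenizer, file_path):
        self.src_tokenizer = src_tokenizer
        self.trg_tokenizer = trg_tokenizer
        # Each CSV row is (Japanese sentence, Korean sentence), no header
        with open(file_path, newline='', encoding='utf-8') as f:
            self.pairs = [(row[0], row[1]) for row in csv.reader(f)]

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        src_text, trg_text = self.pairs[idx]
        src = self.src_tokenizer(src_text, truncation=True, max_length=512)
        trg = self.trg_tokenizer(trg_text, truncation=True, max_length=512)
        return {'input_ids': src['input_ids'], 'labels': trg['input_ids']}
```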
These `.csv` files contain two-column data pairs: the first column contains the sentence in Japanese and the second column contains the sentence in Korean.
Because of the encoder and decoder model specs, each sentence must not exceed 512 tokens, so I recommend removing rows where either column is longer than 500 characters (see the cleanup sketch below).
You don't need to define column names in the first row.
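A minimal cleanup sketch with pandas, assuming the headerless layout above (the file path is just an example):

```python
import pandas as pd

# Load the headerless two-column CSV: (Japanese, Korean)
df = pd.read_csv('./output/ja_ko_train.csv', header=None, names=['ja', 'ko'])

# Keep only rows where both sentences are at most 500 characters
mask = (df['ja'].str.len() <= 500) & (df['ko'].str.len() <= 500)
df[mask].to_csv('./output/ja_ko_train.csv', header=False, index=False)
```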
The training code uses `wandb` for telemetry. Please create an account at [wandb.ai] if you don't have one, then log in from the command line as shown below.
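Typical setup with the standard wandb CLI (skip the install step if requirements.txt already pulls it in):

```bash
pip install wandb
wandb login  # paste your API key when prompted
```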
Since the main goal of this project is to help Koreans communicate in Japanese games, I'm not considering other languages. However, I believe you can use the structure of this project to create your own translator for other language combinations.
- A. A model trained on a small amount of game terms can correctly translate those same terms
- B. Somewhat translates sentences that contain some game terms
- A. Properly translates sentences that contain some game terms
- B. Somewhat translates sentences that contain most of the game terms
The translator model trained in this repository used the `jpn-kor` [sub-dataset] of [Helsinki-NLP/tatoeba_mt]. This dataset is shared under the [CC BY-NC-SA 4.0] license [Source].
You can acquire the specific `jpn-kor` dataset from [HuggingFace], for example as sketched below.
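A minimal loading sketch with the Hugging Face `datasets` library (the `jpn-kor` config name is an assumption; check the dataset card for the exact subset and split names):

```python
from datasets import load_dataset

# Hypothetical config name; verify against the dataset card
ds = load_dataset("Helsinki-NLP/tatoeba_mt", "jpn-kor")
print(ds)
```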
© SQUARE ENIX CO., LTD. All Rights Reserved.
The auto-translator is a feature in Final Fantasy XIV: A Realm Reborn that auto-translates text into whatever language a player's client is set to.
From [Final Fantasy XIV: A Realm Reborn Wiki] (CC BY-SA 3.0)
Since the Auto-Translate words and sentences contain essential terms mainly used in the game, I used this dataset as a primary source to accurately train the model.
According to the Materials Usage License ([EN] [JP]) of Final Fantasy XIV, I can use "All art, text, logos, videos, screenshots, images, sounds, music and recordings from FFXIV" without "any sales or commercial use" or "license fees or advertising revenue", but even so, I must "immediately comply with any request by Square Enix to remove any Materials, in Square Enix's sole discretion".
Based on the above conditions, I have gathered Auto-Translate text ① that I saw in the game myself, and ② by referring to the fandom wiki page [eLeMeN - FF14 - その他_定型文辞書].
As announced in #9 Release the dataset, I've established my own rules for this repository about sharing the dataset directly:
- Never disseminate any part of the dataset gathered by data-mining unless Square Enix requests or permits it
- If some part of the dataset is gathered from a valid source, don't share it here directly; instead attach a link or a guide to acquire the same data
This is to fulfill the request from Naoki Yoshida announced in Regarding Third-party Tools:
... I've made this request before, and I make it again: please refrain from disseminating mined data.
However, I will provide guides to reproduce my training results as closely as possible. And you can always ask questions via the Discussions page.
This repository contains libraries and data under various licenses. If you encounter a restriction, consider replacing that part with an alternative.
- [KoGPT2](decoder): CC BY-NC-SA 4.0 (Creative Commons Attribution-NonCommercial-ShareAlike 4.0)
- [bert-japanese](encoder): Apache-2.0 License
- [CySharp/csbindgen]: MIT License
- [sappho192/BertJapaneseTokenizer]: MIT License
- [unidic-mecab-2.1.2_bin]: BSD License
- [microsoft/onnxruntime]: MIT License
- [SciSharp/NumSharp]: Apache-2.0 License
- [MeCab.DotNet]: GPL-2.0 or LGPL-2.1 License