Web application: https://sci.ponomar.net/translate
Architecture:
- Uses a large corpus of Church Slavonic texts to collect a word list (with frequencies).
- Words are automatically converted to civic script, preserving accents. This mapping is the basis for the reverse conversion from civic script to Church Slavonic. Sometimes different Church Slavonic forms reduce to the same civic form; in that case we pick the most frequent Church Slavonic variant.
- Out-of-vocabulary words are converted using an ML-trained interpolator (see below for the training instructions).
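The vocabulary step described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `build_lookup` and `toy_to_civic` are hypothetical names, and `toy_to_civic` is only a toy stand-in for the real conversion to civic script (it ignores accents entirely).

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, Iterable


def toy_to_civic(word: str) -> str:
    """Toy stand-in (assumption): map 'і' to 'и' and drop a trailing 'ъ'."""
    return word.replace("і", "и").rstrip("ъ")


def build_lookup(
    words: Iterable[str], to_civic: Callable[[str], str]
) -> Dict[str, str]:
    """Map each civic form to its most frequent Church Slavonic source form."""
    freq = Counter(words)  # word list with frequencies
    candidates = defaultdict(list)
    for cs_form, count in freq.items():
        candidates[to_civic(cs_form)].append((count, cs_form))
    # Several Church Slavonic forms may share one civic form:
    # keep the one with the highest count (ties resolved by string order).
    return {civic: max(forms)[1] for civic, forms in candidates.items()}


corpus = ["миръ", "миръ", "міръ"]
lookup = build_lookup(corpus, toy_to_civic)
print(lookup)
```

Here both corpus forms reduce to the same civic form, so the lookup keeps only the more frequent one. Words missing from such a table are the out-of-vocabulary case handled by the trained model.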
Install project dependencies from PyPI using pip:
pip install -r requirements.txt
For translator: https://wandb.ai/elbat/translator/reports/---VmlldzoxNjc4NDQy https://wandb.ai/elbat/translator/reports/-2022-06-29--VmlldzoyMjQ0MDM1
For accentor: https://wandb.ai/elbat/accent/reports/Accent-training--VmlldzoyMjQwNDM4
See data/README.md
python -m translator.train
python -m accent.train
python -m translator.review
python -m accent.review
This command uses the validation partition to compute error rates. It computes error rates on accented and unaccented input separately, and also reports an overall (balanced) error rate.
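As a rough illustration of how the three numbers could relate, here is a minimal sketch. It assumes that an error is an exact-match failure on a whole word and that "balanced" means the unweighted mean of the two partition rates; the review commands may define either differently. The function names are hypothetical.

```python
from typing import Sequence


def error_rate(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of items where the prediction differs from the reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        return 0.0
    errors = sum(p != r for p, r in zip(predictions, references))
    return errors / len(references)


def balanced_error_rate(accented_rate: float, unaccented_rate: float) -> float:
    """Assumption: the overall rate weighs both input kinds equally."""
    return (accented_rate + unaccented_rate) / 2


# Placeholder tokens stand in for real model output and references.
accented = error_rate(["a", "b"], ["a", "c"])
unaccented = error_rate(["x", "y"], ["x", "y"])
print(accented, unaccented, balanced_error_rate(accented, unaccented))
```

Averaging the two rates, rather than pooling all samples, keeps the overall figure from being dominated by whichever input kind has more validation examples.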
Exporting to ONNX (and extracting vocab)
python -m translator.onnx_export
python -m accent.onnx_export
This command takes model.ckpt (the result of training) and exports the model to ONNX format, creating model.onnx and vocab.json.
The ONNX model can be used with different runtimes, for example with an in-browser JS runtime.
The web application using the trained model is in the ui/ sub-directory.
This is a standard Svelte-based web app. Here is the development stanza:
Step 1. Build the dependency ctc-beam-search:
cd ctc-beam-search/
npm i
npm run build
Step 2. Run UI:
cd ui/
npm i
npm run dev