Fast and accurate phonemizer[1], built in Rust
Celosia (/siːˈloʊʃiə/ see-LOH-shee-ə) is a Rust crate that automatically turns a natural-language sentence into its phoneme transcription. It supports English (amepd), Japanese (romaji), Mandarin (pinyin), and French and German (prosodylab), and includes language-specific data (stress, accent, and tones).
🚧 WIP, DO NOT USE 🚧
This section briefly introduces the phonemization pipeline for each language.
English
- Look up words in amepd for pronunciation and stress.
- For words that have multiple pronunciations, use the POS tag provided by amepd and an Averaged Perceptron Tagger to disambiguate them.
- For OOV (out-of-vocabulary) words, predict the pronunciation with a G2P[2] model.
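The English steps above can be sketched as a lookup with two fallbacks. This is a minimal illustration and not Celosia's actual API: `Pos`, `Entry`, `toy_dict` and `pronounce` are hypothetical names, the two "object" transcriptions are illustrative rather than copied from amepd, and the G2P model is replaced by a stand-in closure.

```rust
use std::collections::HashMap;

/// Coarse part-of-speech tag (hypothetical; a real tagger emits a richer tag set).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Pos {
    Noun,
    Verb,
}

/// One dictionary pronunciation, optionally tied to a part of speech.
struct Entry {
    pos: Option<Pos>,
    phonemes: &'static str,
}

/// Toy dictionary. "world" follows the transcription used in the footnotes;
/// the two "object" entries are illustrative and not copied from amepd.
fn toy_dict() -> HashMap<&'static str, Vec<Entry>> {
    let mut dict = HashMap::new();
    dict.insert("world", vec![Entry { pos: None, phonemes: "w er1 l d" }]);
    dict.insert(
        "object",
        vec![
            Entry { pos: Some(Pos::Noun), phonemes: "aa1 b jh eh0 k t" },
            Entry { pos: Some(Pos::Verb), phonemes: "ax b jh eh1 k t" },
        ],
    );
    dict
}

/// Dictionary lookup, then POS disambiguation, then G2P fallback for OOV words.
fn pronounce(
    dict: &HashMap<&'static str, Vec<Entry>>,
    word: &str,
    pos: Pos,
    g2p: &dyn Fn(&str) -> String,
) -> String {
    let key = word.to_lowercase();
    match dict.get(key.as_str()) {
        // Known word: prefer the entry whose POS matches, else the first entry.
        Some(entries) if !entries.is_empty() => entries
            .iter()
            .find(|e| e.pos == Some(pos))
            .unwrap_or(&entries[0])
            .phonemes
            .to_string(),
        // OOV word: defer to the G2P model.
        _ => g2p(&key),
    }
}

fn main() {
    let dict = toy_dict();
    // Stand-in for the G2P model: it only marks the word as predicted.
    let g2p = |w: &str| format!("<g2p:{w}>");
    let samples = [
        ("object", Pos::Noun),
        ("object", Pos::Verb),
        ("World", Pos::Noun),
        ("celosia", Pos::Noun),
    ];
    for (word, pos) in samples {
        println!("{word} ({pos:?}) -> {}", pronounce(&dict, word, pos, &g2p));
    }
}
```

Falling back to the first entry when no POS matches keeps the lookup total; whether the real pipeline does the same is an assumption.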
Japanese
- Retrieve the full context label from the text with openjtalk-rs.
- Parse the context label to retrieve accent phrases and their mora boundary information.
- OOV words are ignored, because a character's code point alone carries no information about its pronunciation.
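Parsing a context label can be sketched as below. The sketch assumes the HTS-style layout commonly produced by Open JTalk, where each line starts with `p1^p2-p3+p4=p5` (current phoneme `p3`) and the `/A:a1+a2+a3` field holds the offset from the accent nucleus and the mora position counted from the start and the end of the accent phrase. The sample labels are hand-written and truncated, `MoraInfo` and `parse_label` are hypothetical names, and the field layout should be checked against the labels openjtalk-rs actually emits.

```rust
/// Accent-related fields read from one full-context label line.
#[derive(Debug, PartialEq)]
struct MoraInfo {
    phoneme: String,
    /// a1: mora position relative to the accent nucleus (assumed meaning).
    accent_offset: i32,
    /// a2: 1-based mora position from the start of the accent phrase.
    mora_from_start: u32,
    /// a3: 1-based mora position from the end of the accent phrase.
    mora_from_end: u32,
}

/// Returns `None` for lines without numeric A-fields, such as silences ("xx").
fn parse_label(label: &str) -> Option<MoraInfo> {
    let (head, rest) = label.split_once("/A:")?;
    // The current phoneme sits between '-' and '+' in the leading context.
    let phoneme = head.split_once('-')?.1.split_once('+')?.0;
    // The A field ends at the next '/'.
    let a_field = rest.split('/').next()?;
    let mut parts = a_field.split('+');
    let accent_offset = parts.next()?.parse::<i32>().ok()?;
    let mora_from_start = parts.next()?.parse::<u32>().ok()?;
    let mora_from_end = parts.next()?.parse::<u32>().ok()?;
    Some(MoraInfo {
        phoneme: phoneme.to_string(),
        accent_offset,
        mora_from_start,
        mora_from_end,
    })
}

/// A mora with forward position 1 opens a new accent phrase.
fn starts_accent_phrase(info: &MoraInfo) -> bool {
    info.mora_from_start == 1
}

fn main() {
    // Hand-written, truncated labels in the assumed format.
    let labels = [
        "xx^xx-sil+k=o/A:xx+xx+xx/B:xx-xx_xx",
        "xx^sil-k+o=N/A:-3+1+4/B:xx-xx_xx",
    ];
    for label in labels {
        match parse_label(label) {
            Some(info) => println!("{info:?} phrase_start={}", starts_accent_phrase(&info)),
            None => println!("skipped: {label}"),
        }
    }
}
```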
Mandarin
- Segment text into words with jieba-rs.
- Look up words in CC-CEDICT for pinyin and tones.
- For words that have multiple pronunciations, use a CRF model to disambiguate them.
- OOV words are ignored, because a character's code point alone carries no information about its pronunciation.
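The dictionary lookup step relies on CC-CEDICT entries of the form `Traditional Simplified [pin1 yin1] /gloss/`, where a trailing digit marks the tone. The sketch below parses one such line into a simplified headword and its toned syllables. `Syllable`, `split_tone` and `parse_cedict_line` are hypothetical names, the sample entry is abbreviated, and segmentation and CRF disambiguation are left out.

```rust
/// One pinyin syllable with its tone number (5 is the neutral tone in CC-CEDICT).
#[derive(Debug, PartialEq)]
struct Syllable {
    base: String,
    tone: u8,
}

/// Splits "hang2" into ("hang", 2); returns `None` if no tone digit is present.
fn split_tone(s: &str) -> Option<Syllable> {
    let tone = s.chars().last()?.to_digit(10)? as u8;
    // The last character is an ASCII digit, so it occupies exactly one byte.
    let base = &s[..s.len() - 1];
    Some(Syllable { base: base.to_lowercase(), tone })
}

/// Parses one dictionary line into (simplified headword, syllables).
/// Comment lines and entries with untoned syllables yield `None`.
fn parse_cedict_line(line: &str) -> Option<(String, Vec<Syllable>)> {
    if line.starts_with('#') {
        return None;
    }
    let (head, rest) = line.split_once(" [")?;
    let (pinyin, _glosses) = rest.split_once(']')?;
    // The headword field is "Traditional Simplified".
    let simplified = head.split_whitespace().nth(1)?;
    let syllables = pinyin
        .split_whitespace()
        .map(split_tone)
        .collect::<Option<Vec<_>>>()?;
    Some((simplified.to_string(), syllables))
}

fn main() {
    // Abbreviated sample entry written in the CC-CEDICT line format.
    let line = "銀行 银行 [yin2 hang2] /bank/";
    if let Some((word, syllables)) = parse_cedict_line(line) {
        println!("{word}: {syllables:?}");
    }
}
```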
French and German
Thanks to their orthography, French and German generally don't have the disambiguation problem (one word, multiple pronunciations) that is common in the languages above.
- Look up words in prosody-lab's dictionaries.
- For OOV words, predict the pronunciation with a G2P[2] model.
The G2P model we use is a seq2seq transformer; you can find more information in the module.
# test
cargo test
# benchmark
cargo bench
# build
cargo build --release
Celosia is dual-licensed under the Apache License, Version 2.0, or the MIT license, at your option. This file may not be copied, modified, or distributed except according to those terms.
Celosia (see Wikipedia) is a small genus of edible and ornamental plants in the amaranth family, Amaranthaceae.
Image credit: Hariya1234, own work, CC BY-SA 3.0.
Third-party licenses are listed in third_party/README.md.
Footnotes
- [1] Phonemize here refers to the procedure of transforming one or more words into a phoneme sequence:
  "hello, world" -> "hh ax l ow1 _ w er1 l d" # Yes
  "world" -> "w er1 l d" # Yes
- [2] G2P (g2p) here refers to the procedure of transforming a single word into a phoneme sequence:
  "hello, world" -> "hh ax l ow1 _ w er1 l d" # No
  "world" -> "w er1 l d" # Yes
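The difference between the two terms can be shown in a few lines: G2P maps one word, while phonemizing splits a sentence into words, maps each one, and joins the results with the `_` separator seen in the examples above. This is a minimal sketch, not the crate's API: `g2p` here is a hypothetical two-word table standing in for the real model, and unknown words are simply dropped.

```rust
/// Word-level step (G2P): one word in, one phoneme sequence out.
/// A tiny table stands in for the real model; both entries come from the footnotes.
fn g2p(word: &str) -> Option<&'static str> {
    match word {
        "hello" => Some("hh ax l ow1"),
        "world" => Some("w er1 l d"),
        _ => None,
    }
}

/// Sentence-level step (phonemize): split into words, run G2P on each,
/// and join the per-word sequences with " _ ".
fn phonemize(sentence: &str) -> String {
    sentence
        // Treat anything that is not a letter, digit, or apostrophe as a separator.
        .split(|c: char| !c.is_alphanumeric() && c != '\'')
        .filter(|w| !w.is_empty())
        // Unknown words are dropped here; a real pipeline would handle them.
        .filter_map(|w| g2p(&w.to_lowercase()))
        .collect::<Vec<_>>()
        .join(" _ ")
}

fn main() {
    println!("{}", phonemize("hello, world"));
    println!("{:?}", g2p("world"));
}
```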