Skip to content

Fast and accurate phonemizer, built in Rust

License

Apache-2.0, MIT licenses found

Licenses found

Apache-2.0
LICENSE-APACHE
MIT
LICENSE-MIT
Notifications You must be signed in to change notification settings

Patchethium/Celosia

Repository files navigation

Celosia

Fast and accurate phonemizer1, built in Rust

Celosia (/siːˈloʊʃiə/ see-LOH-shee-ə) is a Rust crate that turns a sentence of natural language into its phoneme transcript automatically. It supports English (amepd), Japanese (romaji), Mandarin (pinyin), French and German (prosodylab), with language-specific data (stress, accent and tones).

🚧 WIP, DO NOT USE 🚧

Overview

This section briefly introduces the phonemization pipeline for each language.

English

  1. Look up words in amepd for spelling and stress.
  2. For words that have multiple spellings, use the POS tag provided by amepd and a Averaged Perceptron Tagger to disambiguate them.
  3. For OOV (out-of-vocabulary) words, predict the spelling with a g2p2 model.

Japanese

  1. Retrieve the full context label from the text with openjtalk-rs.
  2. Parse the context label, retrieve accent phrases and their mora boundary information.
  3. We ignore OOV words for the UTF code doesn't contain any information of spelling.

Mandarin Chinese

  1. Segment text with jieba-rs into words.
  2. Look up words in CC-CEDICT for pinyin and tones.
  3. For words that have multiple spellings, use a CRF model to disambiguate them.
  4. We ignore OOV words for the UTF code doesn't contain any information of spelling.

French & German

Thanks for the orthography, French & German generally don't have the disambiguation (i.e. one word, multiple spelling) problem that is commonly seen in the languages above.

  1. Look up words in prosody-lab's dictionaries.
  2. For OOVs, we predict the spelling with a g2p2 model.

G2P

The G2P model we're using is a seq2seq transformer model, you can find more information in the module.

Development

# test
cargo test
# benchmark
cargo bench
# build
cargo build --release

License

Celosia is dual-licensed under the Apache License, Version 2.0 or the MIT license, at your option. This file may not be copied, modified, or distributed except according to those terms.

About the name

Celosiawikipedia is a small genus of edible and ornamental plants in the amaranth family, Amaranthaceae.

Celosia.JPG
By Hariya1234 - Own work, CC BY-SA 3.0, Link

Credits

The 3-rd party licences can be referred at third_party/README.md.

Footnotes

  1. Phonemize here refers to the procedure of transforming one or more word(s) into a phoneme sequence:
    "hello, world" -> "hh ax l ow1 _ w er1 l d" # Yes
    "world" -> "w er1 l d" # Yes

  2. G2P/g2p here refers to the procedure of transforming one single word into a phoneme sequence:
    "hello, world" -> "hh ax l ow1 _ w er1 l d" # No
    "world" -> "w er1 l d" # Yes 2

About

Fast and accurate phonemizer, built in Rust

Resources

License

Apache-2.0, MIT licenses found

Licenses found

Apache-2.0
LICENSE-APACHE
MIT
LICENSE-MIT

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published