Update to 2.1 (#1)
* fixed bootphon#26

* phonemizer-2.0.1

* phonemizer-2.0.1

* phonemizer-2.0.1

* CI upload to pypi

* fixed bootphon#31

* bugfix in parsing espeak-ng version

* bugfix in parsing espeak-ng version

* update copyright

* WIP

* Allow sampa for espeak

* option to specify an alternative espeak/espeak-ng binary

* deploy only on new tags

* WIP

* Add replacing content

* add PyYaml requirement

* add test and replacement as str

* WIP

* merge PR bootphon#34 from @Rachine

* can specify an alternative festival executable

* bugfix in setup.py

* fixed sampa mapping for French

* corrected ChangeLog

* CI on multiple versions of espeak

* CI on multiple versions of espeak

* minor improvments

* punctuation processing implemented

* release phonemizer-2.1

* updated README

* updated CHANGELOG

* fixing gitlab CI

* fixing gitlab CI

* fixed issue bootphon#39

* pep8

* fixed issue bootphon#40

* fixed a test on espeak>=1.50

Co-authored-by: Mathieu Bernard <mathieu.a.bernard@inria.fr>
Co-authored-by: Rachid Riad <riadrachid3@gmail.com>
3 people authored and ZohaibAhmed committed Feb 17, 2021
1 parent 2d7d85f commit aa7d4c2
Showing 27 changed files with 457 additions and 543 deletions.
4 changes: 2 additions & 2 deletions .gitlab-ci.yml
@@ -1,6 +1,6 @@
before_script:
# load the requested modules on oberon
- module load anaconda/3 festival/2.4 mbrola
- module load anaconda/3 festival/2.4

phonemizer-build:
stage: build
@@ -19,7 +19,7 @@ phonemizer-build:
# run the unit tests within the CI environment
- conda activate phonemizer-ci
- phonemize --version
- coverage run && coverage report
- python setup.py test

phonemizer-test-espeak-1-48-04:
stage: test
74 changes: 2 additions & 72 deletions CHANGELOG.md
@@ -2,85 +2,15 @@

Version numbers follow [semantic versioning](https://semver.org)

## not yet released

* **improvements**

* phonemizer's logger no more conflicts with other loggers when imported from
Python (see PR [#61](https://github.com/bootphon/phonemizer/pull/61)).

## phonemizer-2.2.2

* **bugfixes**

* Fixed installation from source (bug introduced in 2.2.1, see
issue [#52](https://github.com/bootphon/phonemizer/issues/52)).

* Fixed a bug when trying to restore punctuation on an empty text (see issue
[#54](https://github.com/bootphon/phonemizer/issues/54)).

* Fixed an edge case bug when using custom punctuation marks (see issue
[#55](https://github.com/bootphon/phonemizer/issues/55)).

* Fixed regex issue that causes digits to be considered punctuation (see
issue [#60](https://github.com/bootphon/phonemizer/pull/60)).


## phonemizer-2.2.1

* **improvements**

From Python import the phonemize function using `from phonemizer import
phonemize` instead of `from phonemizer.phonemize import phonemize`. The
second import is still available for compatibility.

* **bugfixes**

* Fixed a minor bug in `utils.chunks`.

* Fixed warnings on language switching for espeak backend when using parallel
jobs (see issue [#50](https://github.com/bootphon/phonemizer/issues/50)).

* Save file in utf-8 explicitly for Windows compat (see issue
[#43](https://github.com/bootphon/phonemizer/issues/43)).

* Fixed build and tests in Dockerfile (see issue
[#45](https://github.com/bootphon/phonemizer/issues/45)).


## phonemizer-2.2

* **new features**

* New option ``--list-languages`` to list the available languages for a given
backend from the command line.

* The ``--sampa`` option of the ``espeak`` backend has been replaced by a new
backend ``espeak-mbrola``.

* The former ``--sampa`` option (introduced in phonemizer-2.0) outputs
phones that are not standard SAMPA but are adapted to the espeak TTS
front-end.

* On the other hand the ``espeak-mbrola`` backend allows espeak to output
phones in standard SAMPA (adapted to the mbrola TTS front-end). This
backend requires mbrola to be installed, as well as additional mbrola
voices to support needed languages. **This backend does not support word
separation nor punctuation preservation**.
## not yet released

* **bugfixes**

* Fixed issues with punctuation processing on some corner cases, see issues
* fixed issues with punctuation processing on some corner cases, see issues
[#39](https://github.com/bootphon/phonemizer/issues/39) and
[#40](https://github.com/bootphon/phonemizer/issues/40).

* Improvements and updates in the documentation (Readme, ``phonemize --help``
and Python code).

* Fixed a test when using ``espeak>=1.50``.

* Empty lines are correctly ignored when reading text from a file.


## phonemizer-2.1

141 changes: 133 additions & 8 deletions README.md
@@ -9,7 +9,10 @@ https://doi.org/10.5281/zenodo.1045825)

# Phonemizer -- *foʊnmaɪzɚ*

* The phonemizer allows simple phonemization of words and texts in many languages.
* Simple text to phones converter for multiple languages, based on
[festival](http://www.cstr.ed.ac.uk/projects/festival),
[espeak-ng](https://github.com/espeak-ng/espeak-ng/)
and [segments](https://github.com/cldf/segments).

* Provides both the `phonemize` command-line tool and the Python function
`phonemizer.phonemize`.
@@ -76,7 +79,7 @@ the phonemizer.
### Docker image

Alternatively you can run the phonemizer within docker, using the
provided `Dockerfile**. To build the docker image, have a:
provided `Dockerfile`. To build the docker image, have a:

$ git clone https://github.com/bootphon/phonemizer
$ cd phonemizer
@@ -116,8 +119,8 @@ For a complete list of available options, have a:
See the installed backends with the `--version` option:

$ phonemize --version
phonemizer-2.2
available backends: espeak-ng-1.49.3, espeak-mbrola, festival-2.5.0, segments-2.0.1
phonemizer-2.0
available backends: festival-2.5.0, espeak-ng-1.49.3, segments-2.0.1


### Input/output examples
@@ -202,8 +205,8 @@ The exhaustive list of supported languages is available with the command

### Token separators

You can specify separators for phones, syllables (**festival** only) and
words (excepted **espeak-mbrola**).
You can specify separators for phones, syllables (festival only) and
words.

$ echo "hello world" | phonemize -b festival -w ' ' -p ''
hhaxlow werld
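
As a rough illustration of how the phone and word separators combine, the sketch below joins a pre-segmented utterance in the way the festival example above is printed. It is a toy model, not phonemizer's implementation: the `join_tokens` helper and the phone segmentation of "hello world" are assumptions made for the example.

```python
def join_tokens(words, phone_sep=" ", word_sep=" "):
    """Join phones into words and words into one utterance.

    `words` is a list of words, each given as a list of phone strings.
    Every phone is followed by `phone_sep` and every word by `word_sep`,
    so trailing separators are kept (no stripping is applied).
    """
    return "".join(
        "".join(phone + phone_sep for phone in phones) + word_sep
        for phones in words
    )


# Hypothetical festival-style segmentation of "hello world"
utterance = [["hh", "ax", "l", "ow"], ["w", "er", "l", "d"]]

# Same separators as `phonemize -b festival -w ' ' -p ''`
print(join_tokens(utterance, phone_sep="", word_sep=" "))
```

Passing a visible word separator such as `';eword '` to the same helper makes the word boundaries explicit in the output.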
@@ -230,6 +233,18 @@ a space for both phones and words):

### Punctuation

By default the punctuation is removed in the phonemized output. You can preserve
it using the ``--preserve-punctuation`` option:

$ echo "hello, world!" | phonemize --strip
həloʊ wɜːld

$ echo "hello, world!" | phonemize --preserve-punctuation --strip
həloʊ, wɜːld!


### Options

By default the punctuation is removed in the phonemized output. You can preserve
it using the ``--preserve-punctuation`` option (not supported by the
**espeak-mbrola** backend):
@@ -243,7 +258,25 @@ it using the ``--preserve-punctuation`` option (not supported by the

### Espeak specific options

* The **espeak** backend can output the stresses on phones:
$ echo "bonjour le monde" | phonemize -b espeak -l fr-fr -p ' ' -w ';eword '
b ɔ̃ ʒ u ʁ ;eword l ə- ;eword m ɔ̃ d ;eword

* In Japanese, using **segments**

$ echo 'konnichiwa' | phonemize -b segments -l japanese
konnitʃiwa

$ echo 'konnichiwa' | phonemize -b segments -l ./phonemizer/share/japanese.g2p
konnitʃiwa

* **Espeak** can output SAMPA phonemes instead of IPA ones (this is only supported
by espeak-ng, not by the original espeak)

$ echo "hello world" | phonemize -l en-us -b espeak --sampa
h@loU w3:ld

* **Espeak** can output the stresses on phones (this is not supported by festival
or segments backends)

$ echo "hello world" | phonemize -l en-us -b espeak --with-stress
həlˈoʊ wˈɜːld
@@ -267,9 +300,101 @@ it using the ``--preserve-punctuation`` option (not supported by the
[WARNING] removed 1 utterances containing language switches (applying "remove-utterance" policy)


### Supported languages

* Languages supported by festival are:

en-us -> english-us

* Languages supported by the segments backend are:

chintang -> ./phonemizer/share/chintang.g2p
cree -> ./phonemizer/share/cree.g2p
inuktitut -> ./phonemizer/share/inuktitut.g2p
japanese -> ./phonemizer/share/japanese.g2p
sesotho -> ./phonemizer/share/sesotho.g2p
yucatec -> ./phonemizer/share/yucatec.g2p

Instead of a language you can also provide a file specifying a
grapheme to phone mapping (see the files above for examples).

* Languages supported by espeak are (espeak-ng supports even more of
them), type `phonemize --help` for an exhaustive list:

af -> afrikaans
an -> aragonese
bg -> bulgarian
bs -> bosnian
ca -> catalan
cs -> czech
cy -> welsh
da -> danish
de -> german
el -> greek
en -> default
en-gb -> english
en-sc -> en-scottish
en-uk-north -> english-north
en-uk-rp -> english_rp
en-uk-wmids -> english_wmids
en-us -> english-us
en-wi -> en-westindies
eo -> esperanto
es -> spanish
es-la -> spanish-latin-am
et -> estonian
fa -> persian
fa-pin -> persian-pinglish
fi -> finnish
fr-be -> french-Belgium
fr-fr -> french
ga -> irish-gaeilge
grc -> greek-ancient
hi -> hindi
hr -> croatian
hu -> hungarian
hy -> armenian
hy-west -> armenian-west
id -> indonesian
is -> icelandic
it -> italian
jbo -> lojban
ka -> georgian
kn -> kannada
ku -> kurdish
la -> latin
lfn -> lingua_franca_nova
lt -> lithuanian
lv -> latvian
mk -> macedonian
ml -> malayalam
ms -> malay
ne -> nepali
nl -> dutch
no -> norwegian
pa -> punjabi
pl -> polish
pt-br -> brazil
pt-pt -> portugal
ro -> romanian
ru -> russian
sk -> slovak
sq -> albanian
sr -> serbian
sv -> swedish
sw -> swahili-test
ta -> tamil
tr -> turkish
vi -> vietnam
vi-hue -> vietnam_hue
vi-sgn -> vietnam_sgn
zh -> Mandarin
zh-yue -> cantonese
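
The segments backend listed above works from a grapheme to phone mapping. To give an idea of what such a mapping does, here is a minimal sketch of a greedy longest-match conversion. It assumes nothing about the actual `.g2p` file format or about how the segments library tokenizes text: the `apply_g2p` helper and the tiny `toy_mapping` dictionary are hypothetical and exist only for this example.

```python
def apply_g2p(text, mapping):
    """Convert `text` using a grapheme -> phone dictionary.

    At each position the longest grapheme found in `mapping` wins;
    characters with no entry are copied unchanged.
    """
    longest = max((len(grapheme) for grapheme in mapping), default=1)
    output = []
    i = 0
    while i < len(text):
        for size in range(min(longest, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if chunk in mapping:
                output.append(mapping[chunk])
                i += size
                break
        else:
            # no grapheme matched at this position: keep the character
            output.append(text[i])
            i += 1
    return "".join(output)


# Hypothetical mapping, just large enough for the example word
toy_mapping = {"ch": "tʃ", "k": "k", "o": "o", "n": "n",
               "i": "i", "w": "w", "a": "a"}

print(apply_g2p("konnichiwa", toy_mapping))  # konnitʃiwa
```

Matching the longest grapheme first is what lets a digraph such as `ch` map to a single phone instead of being read as two separate letters.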


## Licence

**Copyright 2015-2021 Mathieu Bernard**
**Copyright 2015-2020 Mathieu Bernard**

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
4 changes: 2 additions & 2 deletions phonemizer/__init__.py
@@ -1,4 +1,4 @@
-# Copyright 2015-2021 Mathieu Bernard
+# Copyright 2015-2020 Mathieu Bernard
 #
 # This file is part of phonologizer: you can redistribute it and/or
 # modify it under the terms of the GNU General Public License as
@@ -14,4 +14,4 @@
 # along with phonologizer. If not, see <http://www.gnu.org/licenses/>.
 """Multilingual text to phones converter"""

-__version__ = '2.0.2-resemble'
+__version__ = '2.1-resemble'
2 changes: 1 addition & 1 deletion phonemizer/backend/__init__.py
@@ -1,4 +1,4 @@
-# Copyright 2015-2021 Mathieu Bernard
+# Copyright 2015-2020 Mathieu Bernard
 #
 # This file is part of phonologizer: you can redistribute it and/or
 # modify it under the terms of the GNU General Public License as
22 changes: 19 additions & 3 deletions phonemizer/backend/base.py
@@ -1,4 +1,4 @@
-# Copyright 2015-2021 Mathieu Bernard
+# Copyright 2015-2020 Mathieu Bernard
 #
 # This file is part of phonemizer: you can redistribute it and/or
 # modify it under the terms of the GNU General Public License as
@@ -90,7 +90,16 @@ def is_supported_language(cls, language):
     def phonemize(self, text, separator=default_separator,
                   strip=False, njobs=1):
         """Returns the `text` phonemized for the given language"""
-        text, text_type, punctuation_marks = self._phonemize_preprocess(text)
+        # remember the text type for output (either list or string)
+        text_type = type(text)
+
+        # deals with punctuation: remove it and keep track of it for
+        # restoration at the end if asked for
+        punctuation_marks = []
+        if self.preserve_punctuation:
+            text, punctuation_marks = self._punctuator.preserve(text)
+        else:
+            text = self._punctuator.remove(text)

         if njobs == 1:
             # phonemize the text forced as a string
@@ -113,7 +122,14 @@ def phonemize(self, text, separator=default_separator,
             # restore the log as it was before parallel processing
             self.logger = log_storage

-        return self._phonemize_postprocess(text, text_type, punctuation_marks)
+        # restore the punctuation if asked for
+        if self.preserve_punctuation:
+            text = self._punctuator.restore(text, punctuation_marks)
+
+        # output the result formatted as a string or a list of strings
+        # according to type(text)
+        return (list2str(text) if text_type in six.string_types
+                else str2list(text))

     @abc.abstractmethod
     def _phonemize_aux(self, text, separator, strip):
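
The hunks above inline a three-step flow in `phonemize`: set the punctuation aside, phonemize the remaining text, then put the punctuation back if it was asked for. The sketch below mimics that flow on a single string. It is only an approximation under stated assumptions: the functions are standalone toys that merely borrow the names `remove`, `preserve` and `restore`, the punctuation set in `_MARKS` is made up, and `backend` stands for any string to string callable rather than a real phonemizer backend.

```python
import re

# Hypothetical set of punctuation marks; a capturing group makes
# re.split() return the marks along with the text chunks.
_MARKS = re.compile(r"([,;:.!?]+)")


def remove(text):
    """Drop punctuation and normalise the remaining whitespace."""
    return " ".join(_MARKS.sub(" ", text).split())


def preserve(text):
    """Split `text` into punctuation-free chunks and the marks between them."""
    parts = _MARKS.split(text)
    chunks = [part.strip() for part in parts[0::2]]
    marks = parts[1::2]
    return chunks, marks


def restore(chunks, marks):
    """Re-insert each mark after the chunk it originally followed."""
    pieces = []
    for index, chunk in enumerate(chunks):
        pieces.append(chunk)
        if index < len(marks):
            pieces.append(marks[index] + " ")
    return "".join(pieces).strip()


def phonemize(text, backend, preserve_punctuation=False):
    """Run `backend` on `text`, keeping punctuation only if requested."""
    if preserve_punctuation:
        chunks, marks = preserve(text)
        # empty chunks (e.g. after a final mark) are passed through as-is
        return restore([backend(c) if c else c for c in chunks], marks)
    return backend(remove(text))


# str.upper stands in for a real backend
print(phonemize("hello, world!", str.upper))
print(phonemize("hello, world!", str.upper, preserve_punctuation=True))
```

Keeping the marks in a separate list means the backend never sees punctuation, which matches the intent of the comments in the diff.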