Update to 2.1 (#1)

* fixed bootphon#26
* phonemizer-2.0.1
* phonemizer-2.0.1
* phonemizer-2.0.1
* CI upload to pypi
* fixed bootphon#31
* bugfix in parsing espeak-ng version
* bugfix in parsing espeak-ng version
* update copyright
* WIP
* Allow sampa for espeak
* option to specify an alternative espeak/espeak-ng binary
* deploy only on new tags
* WIP
* Add replacing content
* add PyYaml requirement
* add test and replacement as str
* WIP
* merge PR bootphon#34 from @Rachine
* can specify an alternative festival executable
* bugfix in setup.py
* fixed sampa mapping for French
* corrected ChangeLog
* CI on multiple versions of espeak
* CI on multiple versions of espeak
* minor improvments
* punctuation processing implemented
* release phonemizer-2.1
* updated README
* updated CHANGELOG
* fixing gitlab CI
* fixing gitlab CI
* fixed issue bootphon#39
* pep8
* fixed issue bootphon#40
* fixed a test on espeak>=1.50

Co-authored-by: Mathieu Bernard <mathieu.a.bernard@inria.fr>
Co-authored-by: Rachid Riad <riadrachid3@gmail.com>
resemble-ai · Feb 17, 2021 · aa7d4c2
1 parent 2d7d85f

Showing 27 changed files with 457 additions and 543 deletions.
diff --git a/.gitlab-ci.yml b/.gitlab-ci.yml
@@ -1,6 +1,6 @@
 before_script:
   # load the requested modules on oberon
-  - module load anaconda/3 festival/2.4 mbrola
+  - module load anaconda/3 festival/2.4
 
 phonemizer-build:
   stage: build
@@ -19,7 +19,7 @@ phonemizer-build:
 # run the unit tests within the CI environment
 - conda activate phonemizer-ci
 - phonemize --version
-- coverage run && coverage report
+- python setup.py test
 
 phonemizer-test-espeak-1-48-04:
   stage: test

diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -2,85 +2,15 @@
 
 Version numbers follow [semantic versioning](https://semver.org)
 
-## not yet released
-
-* **improvements**
-
-  * phonemizer's logger no more conflicts with other loggers when imported from
-    Python (see PR [#61](https://github.com/bootphon/phonemizer/pull/61)).
-
-## phonemizer-2.2.2
-
-* **bugfixes**
-
-  * Fixed installation from source (bug introduced in 2.2.1, see
-    issue [#52](https://github.com/bootphon/phonemizer/issues/52)).
-
-  * Fixed a bug when trying to restore punctuation on an empty text (see issue
-    [#54](https://github.com/bootphon/phonemizer/issues/54)).
 
-  * Fixed an edge case bug when using custom punctuation marks (see issue
-    [#55](https://github.com/bootphon/phonemizer/issues/55)).
-
-  * Fixed regex issue that causes digits to be considered punctuation (see
-    issue [#60](https://github.com/bootphon/phonemizer/pull/60)).
-
-
-## phonemizer-2.2.1
-
-* **improvements**
-
-  From Python import the phonemize function using `from phonemizer import
-  phonemize` instead of `from phonemizer.phonemize import phonemize`. The
-  second import is still available for compatibility.
-
-* **bugfixes**
-
-  * Fixed a minor bug in `utils.chunks`.
-
-  * Fixed warnings on language switching for espeak backend when using parallel
-    jobs (see issue [#50](https://github.com/bootphon/phonemizer/issues/50)).
-
-  * Save file in utf-8 explicitly for Windows compat (see issue
-    [#43](https://github.com/bootphon/phonemizer/issues/43)).
-
-  * Fixed build and tests in Dockerfile (see issue
-    [#45](https://github.com/bootphon/phonemizer/issues/45)).
-
-
-## phonemizer-2.2
-
-* **new features**
-
-  * New option ``--list-languages`` to list the available languages for a given
-    backend from the command line.
-
-  * The ``--sampa`` option of the ``espeak`` backend has been replaced by a new
-    backend ``espeak-mbrola``.
-
-    * The former ``--sampa`` option (introduced in phonemizer-2.0) outputs
-      phones that are not standard SAMPA but are adapted to the espeak TTS
-      front-end.
-
-    * On the other hand the ``espeak-mbrola`` backend allows espeak to output
-      phones in standard SAMPA (adapted to the mbrola TTS front-end). This
-      backend requires mbrola to be installed, as well as additional mbrola
-      voices to support needed languages. **This backend does not support word
-      separation nor punctuation preservation**.
+## not yet released
 
 * **bugfixes**
 
-  * Fixed issues with punctuation processing on some corner cases, see issues
+  * fixed issues with punctuation processing on some corner cases, see issues
     [#39](https://github.com/bootphon/phonemizer/issues/39) and
     [#40](https://github.com/bootphon/phonemizer/issues/40).
 
-  * Improvments and updates in the documentation (Readme, ``phonemize --help``
-    and Python code).
-
-  * Fixed a test when using ``espeak>=1.50``.
-
-  * Empty lines are correctly ignored when reading text from a file.
-
 
 ## phonemizer-2.1
 

diff --git a/README.md b/README.md
@@ -9,7 +9,10 @@ https://doi.org/10.5281/zenodo.1045825)
 
 # Phonemizer -- *foʊnmaɪzɚ*
 
-* The phonemizer allows simple phonemization of words and texts in many languages.
+* Simple text to phones converter for multiple languages, based on
+  [festival](http://www.cstr.ed.ac.uk/projects/festival),
+  [espeak-ng](https://github.com/espeak-ng/espeak-ng/)
+  and [segments](https://github.com/cldf/segments).
 
 * Provides both the `phonemize` command-line tool and the Python function
   `phonemizer.phonemize`.
@@ -76,7 +79,7 @@ the phonemizer.
 ### Docker image
 
 Alternatively you can run the phonemizer within docker, using the
-provided `Dockerfile**. To build the docker image, have a:
+provided `Dockerfile`. To build the docker image, have a:
 
     $ git clone https://github.com/bootphon/phonemizer
     $ cd phonemizer
@@ -116,8 +119,8 @@ For a complete list of available options, have a:
 See the installed backends with the `--version` option:
 
     $ phonemize --version
-    phonemizer-2.2
-    available backends: espeak-ng-1.49.3, espeak-mbrola, festival-2.5.0, segments-2.0.1
+    phonemizer-2.0
+    available backends: festival-2.5.0, espeak-ng-1.49.3, segments-2.0.1
 
 
 ### Input/output exemples
@@ -202,8 +205,8 @@ The exhaustive list of supported languages is available with the command
 
 ### Token separators
 
-You can specify separators for phones, syllables (**festival** only) and
-words (excepted **espeak-mbrola**).
+You can specify separators for phones, syllables (festival only) and
+words.
 
     $ echo "hello world" | phonemize -b festival -w ' ' -p ''
     hhaxlow werld
@@ -230,6 +233,18 @@ a space for both phones and words):
 
 ### Punctuation
 
+By default the punctuation is removed in the phonemized output. You can preserve
+it using the ``--preserve-punctuation`` option:
+
+    $ echo "hello, world!" | phonemize --strip
+    həloʊ wɜːld
+
+    $ echo "hello, world!" | phonemize --preserve-punctuation --strip
+    həloʊ, wɜːld!
+
+
+### Options
+
 By default the punctuation is removed in the phonemized output. You can preserve
 it using the ``--preserve-punctuation`` option (not supported by the
 **espeak-mbrola** backend):
@@ -243,7 +258,25 @@ it using the ``--preserve-punctuation`` option (not supported by the
 
 ### Espeak specific options
 
-* The **espeak** backend can output the stresses on phones:
+        $ echo "bonjour le monde" | phonemize -b espeak -l fr-fr -p ' ' -w ';eword '
+        b ɔ̃ ʒ u ʁ ;eword l ə- ;eword m ɔ̃ d ;eword
+
+* In Japanese, using **segments**
+
+        $ echo 'konnichiwa' | phonemize -b segments -l japanese
+        konnitʃiwa
+
+        $ echo 'konnichiwa' | phonemize -b segments -l ./phonemizer/share/japanese.g2p
+        konnitʃiwa
+
+* **Espeak** can output SAMPA phonemes instead of IPA ones (this is only supported
+  by espeak-ng, not by the original espeak)
+
+        $ echo "hello world" | phonemize -l en-us -b espeak --sampa
+        h@loU w3:ld
+
+* **Espeak** can output the stresses on phones (this is not supported by festival
+  or segments backends)
 
         $ echo "hello world" | phonemize -l en-us -b espeak --with-stress
         həlˈoʊ wˈɜːld
@@ -267,9 +300,101 @@ it using the ``--preserve-punctuation`` option (not supported by the
         [WARNING] removed 1 utterances containing language switches (applying "remove-utterance" policy)
 
 
+### Supported languages
+
+* Languages supported by festival are:
+
+        en-us	->	english-us
+
+* Languages supported by the segments backend are:
+
+        chintang  -> ./phonemizer/share/chintang.g2p
+	    cree	  -> ./phonemizer/share/cree.g2p
+	    inuktitut -> ./phonemizer/share/inuktitut.g2p
+	    japanese  -> ./phonemizer/share/japanese.g2p
+	    sesotho	  -> ./phonemizer/share/sesotho.g2p
+	    yucatec	  -> ./phonemizer/share/yucatec.g2p
+
+  Instead of a language you can also provide a file specifying a
+  grapheme to phone mapping (see the files above for exemples).
+
+* Languages supported by espeak are (espeak-ng supports even more of
+  them), type `phonemize --help` for an exhaustive list:
+
+        af	->	afrikaans
+        an	->	aragonese
+        bg	->	bulgarian
+        bs	->	bosnian
+        ca	->	catalan
+        cs	->	czech
+        cy	->	welsh
+        da	->	danish
+        de	->	german
+        el	->	greek
+        en	->	default
+        en-gb	->	english
+        en-sc	->	en-scottish
+        en-uk-north	->	english-north
+        en-uk-rp	->	english_rp
+        en-uk-wmids	->	english_wmids
+        en-us	->	english-us
+        en-wi	->	en-westindies
+        eo	->	esperanto
+        es	->	spanish
+        es-la	->	spanish-latin-am
+        et	->	estonian
+        fa	->	persian
+        fa-pin	->	persian-pinglish
+        fi	->	finnish
+        fr-be	->	french-Belgium
+        fr-fr	->	french
+        ga	->	irish-gaeilge
+        grc	->	greek-ancient
+        hi	->	hindi
+        hr	->	croatian
+        hu	->	hungarian
+        hy	->	armenian
+        hy-west	->	armenian-west
+        id	->	indonesian
+        is	->	icelandic
+        it	->	italian
+        jbo	->	lojban
+        ka	->	georgian
+        kn	->	kannada
+        ku	->	kurdish
+        la	->	latin
+        lfn	->	lingua_franca_nova
+        lt	->	lithuanian
+        lv	->	latvian
+        mk	->	macedonian
+        ml	->	malayalam
+        ms	->	malay
+        ne	->	nepali
+        nl	->	dutch
+        no	->	norwegian
+        pa	->	punjabi
+        pl	->	polish
+        pt-br	->	brazil
+        pt-pt	->	portugal
+        ro	->	romanian
+        ru	->	russian
+        sk	->	slovak
+        sq	->	albanian
+        sr	->	serbian
+        sv	->	swedish
+        sw	->	swahili-test
+        ta	->	tamil
+        tr	->	turkish
+        vi	->	vietnam
+        vi-hue	->	vietnam_hue
+        vi-sgn	->	vietnam_sgn
+        zh	->	Mandarin
+        zh-yue	->	cantonese
+
+
 ## Licence
 
-**Copyright 2015-2021 Mathieu Bernard**
+**Copyright 2015-2020 Mathieu Bernard**
 
 This program is free software: you can redistribute it and/or modify
 it under the terms of the GNU General Public License as published by

diff --git a/phonemizer/__init__.py b/phonemizer/__init__.py
@@ -1,4 +1,4 @@
-# Copyright 2015-2021 Mathieu Bernard
+# Copyright 2015-2020 Mathieu Bernard
 #
 # This file is part of phonologizer: you can redistribute it and/or
 # modify it under the terms of the GNU General Public License as
@@ -14,4 +14,4 @@
 # along with phonologizer. If not, see <http://www.gnu.org/licenses/>.
 """Multilingual text to phones converter"""
 
-__version__ = '2.0.2-resemble'
+__version__ = '2.1-resemble'
diff --git a/phonemizer/backend/__init__.py b/phonemizer/backend/__init__.py
@@ -1,4 +1,4 @@
-# Copyright 2015-2021 Mathieu Bernard
+# Copyright 2015-2020 Mathieu Bernard
 #
 # This file is part of phonologizer: you can redistribute it and/or
 # modify it under the terms of the GNU General Public License as

diff --git a/phonemizer/backend/base.py b/phonemizer/backend/base.py
@@ -1,4 +1,4 @@
-# Copyright 2015-2021 Mathieu Bernard
+# Copyright 2015-2020 Mathieu Bernard
 #
 # This file is part of phonemizer: you can redistribute it and/or
 # modify it under the terms of the GNU General Public License as
@@ -90,7 +90,16 @@ def is_supported_language(cls, language):
     def phonemize(self, text, separator=default_separator,
                   strip=False, njobs=1):
         """Returns the `text` phonemized for the given language"""
-        text, text_type, punctuation_marks = self._phonemize_preprocess(text)
+        # remember the text type for output (either list or string)
+        text_type = type(text)
+
+        # deals with punctuation: remove it and keep track of it for
+        # restoration at the end if asked for
+        punctuation_marks = []
+        if self.preserve_punctuation:
+            text, punctuation_marks = self._punctuator.preserve(text)
+        else:
+            text = self._punctuator.remove(text)
 
         if njobs == 1:
             # phonemize the text forced as a string
@@ -113,7 +122,14 @@ def phonemize(self, text, separator=default_separator,
             # restore the log as it was before parallel processing
             self.logger = log_storage
 
-        return self._phonemize_postprocess(text, text_type, punctuation_marks)
+        # restore the punctuation is asked for
+        if self.preserve_punctuation:
+            text = self._punctuator.restore(text, punctuation_marks)
+
+        # output the result formatted as a string or a list of strings
+        # according to type(text)
+        return (list2str(text) if text_type in six.string_types
+                else str2list(text))
 
     @abc.abstractmethod
     def _phonemize_aux(self, text, separator, strip):
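
The `phonemize` hunk above inlines a three-step punctuation flow: strip the marks before calling the backend, remember them, and put them back afterwards when `preserve_punctuation` is set. The sketch below illustrates that flow in isolation. It is not the project's `_punctuator` implementation: the mark set, the regular expression, the `process` helper and the `str.upper` stand-in for a backend are all assumptions made for illustration.

```python
import re

# Hypothetical mark set; the library's actual default may differ.
_MARKS = ';:,.!?'
# One punctuation mark, plus any surrounding whitespace and further marks.
_MARKS_RE = re.compile(r'\s*[' + re.escape(_MARKS) + r'][' + re.escape(_MARKS) + r'\s]*')


def remove(text):
    """Replace each run of punctuation by a single space."""
    return _MARKS_RE.sub(' ', text).strip()


def preserve(text):
    """Split text into punctuation-free chunks and record the marks between them."""
    chunks, marks = [], []
    position = 0
    for match in _MARKS_RE.finditer(text):
        chunks.append(text[position:match.start()])
        marks.append(match.group())
        position = match.end()
    chunks.append(text[position:])
    return chunks, marks


def restore(chunks, marks):
    """Interleave the processed chunks with the recorded marks."""
    out = []
    for index, chunk in enumerate(chunks):
        out.append(chunk)
        if index < len(marks):
            out.append(marks[index])
    return ''.join(out)


def process(text, preserve_punctuation, backend=str.upper):
    """Mimic the control flow of the hunk; `backend` stands in for a phonemizer."""
    if preserve_punctuation:
        chunks, marks = preserve(text)
        return restore([backend(chunk) for chunk in chunks], marks)
    return backend(remove(text))


print(process('hello, world!', preserve_punctuation=True))   # HELLO, WORLD!
print(process('hello, world!', preserve_punctuation=False))  # HELLO WORLD
```

Recording the marks together with their surrounding whitespace keeps `restore` a plain interleave, at the cost of passing an empty trailing chunk to the backend when the text ends with punctuation.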