Sequitur G2P
============
A trainable Grapheme-to-Phoneme converter.
Introduction
------------
Sequitur G2P is a data-driven grapheme-to-phoneme converter written at
RWTH Aachen University by Maximilian Bisani.
The method used in this software is described in
M. Bisani and H. Ney: "Joint-Sequence Models for Grapheme-to-Phoneme
Conversion". Speech Communication, Volume 50, Issue 5, May 2008,
Pages 434-451
(available online at http://dx.doi.org/10.1016/j.specom.2008.01.002)
This software is made available to you under the terms of the GNU
General Public License. It can be used for experimentation and as part
of other free software projects. For details see the licensing terms below.
If you publish about work that involves the use of this software,
please cite the above paper. (You should feel obliged to do so by
rules of good scientific conduct.)
You may contact the author with any questions or comments via e-mail:
maximilian.bisani@rwth-aachen.de
License
-------
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License Version 2 (June
1991) as published by the Free Software Foundation.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program; if not, you will find it at
http://www.gnu.org/licenses/gpl.html, or write to the Free Software
Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110,
USA.
Should a provision of no. 9 and 10 of the GNU General Public License
be invalid or become invalid, a valid provision is deemed to have been
agreed upon which comes closest to what the parties intended
commercially. In any case guarantee/warranty shall be limited to gross
negligent actions or intended actions or fraudulent concealment.
Installing
----------
To build and use this software you need to have the following installed:
- Python (http://www.python.org)
tested with 2.5, 2.7 and 3.6
- SWIG (http://www.swig.org)
tested with 1.3.31
- NumPy (http://numpy.scipy.org)
tested with 1.0.4
- a C++ compiler that's recognized by Python's distutils.
tested with GCC 4.1, 4.2 and 4.3
To install, change to the source directory and type:
python setup.py install --prefix /usr/local
You may substitute /usr/local with some other directory. If you do so,
make sure that some-other-directory/lib/python2.5/site-packages/ is in
your PYTHONPATH, e.g. by typing
export PYTHONPATH=some-other-directory/lib/python2.5/site-packages
You can also install via `pip` by pointing it at this repository. You still
need SWIG and a C++ compiler.
```
pip install numpy
pip install git+https://github.com/sequitur-g2p/sequitur-g2p@master
```
Using
-----
Sequitur G2P is a data-driven grapheme-to-phoneme converter. In fact,
it can be applied to any monotonic sequence translation problem,
provided the source and target alphabets are small (fewer than
255 symbols). Data-driven means that you need to train it with
example pronunciations. It has no built-in linguistic knowledge
whatsoever, which means it should work for any alphabetic
language. Training takes a pronunciation dictionary and creates a
model file. The model file can then be used to transcribe words that
were not in the dictionary.
Here is a step-by-step guide to get you started:
1. Obtain a pronunciation dictionary for training.
The format is one word per line. Each line contains the
orthographic form of the word followed by the corresponding
phonemic transcription. The word and all phonemes need to be
separated by whitespace; the word and phoneme symbols therefore must
not contain blanks. We'll assume your training lexicon is called
train.lex, and that you set aside some portion for testing purposes
as test.lex, which is disjoint from train.lex.
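The format is easy to check from the command line. This sketch splits a
hypothetical entry (the word "mixing" and its phonemes are made up for
illustration, not taken from any real lexicon) into its two parts:

```shell
# A hypothetical lexicon line: the orthographic word, then its
# phonemes, all separated by whitespace.
line='mixing m ih k s ih ng'
word=${line%% *}      # first field: the word
phonemes=${line#* }   # remaining fields: the phonemes
echo "$word | $phonemes"
```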
2. Train a model.
To create a first model type:
g2p.py --train train.lex --devel 5% --write-model model-1
This first model will be rather poor because it is only a unigram.
To create higher order models you need to run g2p.py again:
g2p.py --model model-1 --ramp-up --train train.lex --devel 5% --write-model model-2
Repeat this a couple of times:
g2p.py --model model-2 --ramp-up --train train.lex --devel 5% --write-model model-3
g2p.py --model model-3 --ramp-up --train train.lex --devel 5% --write-model model-4
...
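Since the ramp-up iterations all follow the same pattern, they can be
scripted. The sketch below only prints the command sequence (from
model-1 up to model-6); it assumes g2p.py and train.lex are in place
when you actually run the printed commands:

```shell
# Print one ramp-up command per higher-order model.
prev=model-1
for i in 2 3 4 5 6; do
  echo "g2p.py --model $prev --ramp-up --train train.lex --devel 5% --write-model model-$i"
  prev=model-$i
done
```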
3. Evaluate the model.
To find out how accurately your model can transcribe unseen words type:
g2p.py --model model-6 --test test.lex
4. Transcribe new words.
Prepare a list of words you want to transcribe as a simple text
file words.txt with one word per line (and no phonemic
transcription), then type:
g2p.py --model model-3 --apply words.txt
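For example, a two-word list (the words themselves are arbitrary
illustrations) can be prepared like this:

```shell
# One word per line, no phonemic transcriptions.
printf 'mixing\nwonderful\n' > words.txt
cat words.txt
```

Then pass words.txt to g2p.py via --apply as shown above.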
Random comments:
- You cannot open models created in a python3 environment inside a
python2 environment. The opposite works.
- Whenever a file name is required, you can specify "-" to mean
  standard input or standard output.
- If a file name ends in ".gz", it is assumed that the file is (or
should be) compressed using gzip.
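For instance, a gzip-compressed training lexicon can be prepared and
inspected with standard tools (the single entry shown is a made-up
example; any lexicon in the format described above will do):

```shell
# Write one lexicon entry, gzip-compressed; a ".gz" suffix tells
# g2p.py to decompress it on the fly.
printf 'mixing m ih k s ih ng\n' | gzip > train.lex.gz
gzip -dc train.lex.gz   # decompress to standard output to inspect it
```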
- For the time being you need to type g2p.py --help and/or read the
source to find out the other things g2p.py can do. Sorry about that.