The Greedy and Recursive Search for Morphological Productivity.
Caleb Belth,
Sarah Payne,
Deniz Beser,
Jordan Kodner,
Charles Yang
CogSci, 2021
If used, please cite:
@inproceedings{belth21greedy,
title={The Greedy and Recursive Search for Morphological Productivity.},
author={Belth, Caleb and Payne, Sarah and Beser, Deniz and Kodner, Jordan and Yang, Charles},
booktitle={CogSci},
year={2021}
}
$ git clone git@github.com:cbelth/ATP-morphology.git
$ cd ATP-morphology
$ python setup.py
To test the setup, run
$ cd test/
$ python tester.py
Unless stated otherwise, the following examples assume that they are being run from the src/ directory.
# import code
>> from atp import ATP
>> from utils import load_german_CHILDES
>> pairs, feature_space = load_german_CHILDES() # load some data
>> atp = ATP(feature_space=feature_space) # initialize an ATP model
>> atp.train(pairs) # train ATP
>> atp.inflect('Sache', ('F',)) # ATP produces the correct inflection
'Sachen'
>> atp.inflect('Gleis', ('N',)) # again for a neuter noun
'Gleise'
The train() method returns the trained model, so we can also initialize and train a model in a single line:
>> atp = ATP(feature_space=feature_space).train(pairs)
...
>> atp.inflect_no_feat('Sache', ()) # the result is still correct
'Sachen'
>> atp.inflect_no_feat('Gleis', ())
'Gleise'
>> atp.inflect_no_feat('Kach', ()) # for a nonce word with unknown gender, ATP produces the -er suffix, as do a majority of humans
'Kacher'
Running ATP on new data is simple! All you need to do is create a list of tuples. Each tuple is an instance and should be ordered (lemma, inflection, features). The lemma and inflection should be strings, and features should be a tuple of features, each of which should be included in the feature_space. Let's look at an example.
Suppose we have a simple language with just four known lemmas: 'a', 'b', 'c', and 'd', which oddly can be inflected as either nouns or verbs.
Let's say that nouns are marked with a '-' suffix and verbs with a '+' suffix, with the exception of 'd', which takes '*' as a noun and '**' as a verb.
We can initialize the data as below,
>> pairs = [('a', 'a-', ('Noun',)),
            ('b', 'b-', ('Noun',)),
            ('c', 'c-', ('Noun',)),
            ('d', 'd*', ('Noun',)),
            ('a', 'a+', ('Verb',)),
            ('b', 'b+', ('Verb',)),
            ('c', 'c+', ('Verb',)),
            ('d', 'd**', ('Verb',))]
>> feature_space = {'Noun', 'Verb'}
and train an ATP model as before,
>> atp = ATP(feature_space)
>> atp.train(pairs)
If we then introduce a new lemma 'e', the model that ATP learned correctly inflects it:
>> atp.inflect('e', ('Noun',))
'e-'
>> atp.inflect('e', ('Verb',))
'e+'
Moreover, since it has seen the exception 'd' during training, it can still correctly produce its odd suffixes too:
>> atp.inflect('d', ('Noun',))
'd*'
>> atp.inflect('d', ('Verb',))
'd**'
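Before training on your own data, it can help to check that every instance follows the (lemma, inflection, features) layout described above. The following is a minimal sketch of such a check; validate_pairs is a hypothetical helper written for this illustration and is not part of the ATP code.

```python
def validate_pairs(pairs, feature_space):
    """Return a list of (index, message) problems; an empty list means no problems were found.

    Hypothetical helper for illustration only; it is not part of the ATP repository.
    """
    problems = []
    for i, instance in enumerate(pairs):
        if not isinstance(instance, tuple) or len(instance) != 3:
            problems.append((i, 'expected a (lemma, inflection, features) tuple'))
            continue
        lemma, inflection, features = instance
        if not isinstance(lemma, str) or not isinstance(inflection, str):
            problems.append((i, 'lemma and inflection should be strings'))
        if not isinstance(features, tuple):
            problems.append((i, 'features should be a tuple'))
            continue
        unknown = [f for f in features if f not in feature_space]
        if unknown:
            problems.append((i, f'features not in feature_space: {unknown}'))
    return problems


feature_space = {'Noun', 'Verb'}
pairs = [('a', 'a-', ('Noun',)), ('d', 'd**', ('Verb',))]
print(validate_pairs(pairs, feature_space))  # prints []

bad = [('a', 'a-', 'Noun')]  # features given as a bare string instead of a tuple
print(validate_pairs(bad, feature_space))  # prints [(0, 'features should be a tuple')]
```

A common slip this catches is writing ('Noun') instead of ('Noun',): without the trailing comma, Python treats the value as a plain string rather than a one-element tuple.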
In utils.py, the function load_pairs(path) will load files of several formats.
The parameter path should specify the path to the data that you wish to load.
There are also a number of optional parameters:
- sep is a string specifying what separates columns. By default, it is a tab, '\t'.
- feat_sep is a string specifying what separates features in the feature column. By default, it is a semicolon, ';'.
- preprocessing is a string-to-string function (e.g., a lambda) that allows you to add custom pre-processing to lemmas and inflections. By default, it removes umlauts.
- skip_header is a boolean, which, if True, skips the first line of the file, treating it as a header. By default, it is False.
- with_freq is a boolean, which, if True, will return frequencies for each pair (or zero if no frequencies are given in the file). By default, it is False.
As an example, we can load one of the files for the English development data, which is saved in orthography, but can be converted to IPA.
>> from utils import load_pairs, load_word_to_ipa
>> word_to_ipa = load_word_to_ipa() # load a dictionary of english word-to-IPA mappings
>> pairs, features = load_pairs('../data/english/growth/child-0/100.txt', sep=' ')
>> pairs[0]
('pretend', 'pretending', ('V', 'V.PTCP', 'PRS'))
>> pairs, features = load_pairs('../data/english/growth/child-0/100.txt',
                                sep=' ',
                                preprocessing=lambda s: word_to_ipa[s]) # map every lemma/inflection to its IPA
>> pairs[0]
('pritɛnd', 'pritɛndɪŋ', ('V', 'V.PTCP', 'PRS'))
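Note that a lambda such as lambda s: word_to_ipa[s] raises a KeyError if a word is missing from the dictionary, as with any Python dict lookup. The sketch below shows one possible alternative that falls back to the original string. The small dictionary and the to_ipa function are stand-ins invented for this illustration, and the fallback policy is an assumption rather than something the repository prescribes.

```python
# A tiny stand-in dictionary for illustration; in practice the mapping
# would come from load_word_to_ipa().
word_to_ipa = {'pretend': 'pritɛnd', 'pretending': 'pritɛndɪŋ'}


def to_ipa(s, mapping=word_to_ipa):
    """Return the IPA form of s if known; otherwise return s unchanged.

    Hypothetical helper: the fall-back-to-orthography policy is an assumption.
    """
    return mapping.get(s, s)


print(to_ipa('pretend'))  # prints pritɛnd
print(to_ipa('blick'))    # prints blick (not in the dictionary, so returned as-is)
```

Since preprocessing is described as a string-to-string function, a function like this could presumably be passed as preprocessing=to_ipa. Whether mixing orthographic and IPA forms is acceptable depends on your data; raising an error on unknown words may be the safer choice.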
The following row formats are supported:
{lemma}{sep}{inflected}{sep}{features}
{lemma}{sep}{inflected}{sep}{features}{sep}{frequency}
{ignored}{sep}{lemma}{sep}{ignored}{sep}{inflected}{sep}{features}
{ignored}{sep}{lemma}{sep}{ignored}{sep}{inflected}{sep}{features}{sep}{frequency}
An example of each (where sep = ' '):
kɔz kɔzd V;PST
kɔz kɔzd V;PST 3074
cause kɔz caused kɔzd V;PST
cause kɔz caused kɔzd V;PST 3074
The {ignored} columns can be anything and are simply skipped by load_pairs().
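To make the four row formats concrete, here is a minimal sketch of how a single row could be parsed. It tells the formats apart by the number of columns, which is an assumption made for this illustration; parse_row is a hypothetical function and does not necessarily reflect how load_pairs() is implemented.

```python
def parse_row(line, sep=' ', feat_sep=';'):
    """Parse one row into (lemma, inflected, features, frequency).

    Hypothetical sketch: the format is inferred from the column count,
    and the frequency defaults to 0 when the row has no frequency column.
    """
    cols = line.rstrip('\n').split(sep)
    if len(cols) in (3, 4):
        # {lemma} {inflected} {features} [{frequency}]
        lemma, inflected, feats = cols[0], cols[1], cols[2]
    elif len(cols) in (5, 6):
        # {ignored} {lemma} {ignored} {inflected} {features} [{frequency}]
        lemma, inflected, feats = cols[1], cols[3], cols[4]
    else:
        raise ValueError(f'unsupported number of columns: {len(cols)}')
    # An even column count (4 or 6) means a trailing frequency column is present.
    freq = int(cols[-1]) if len(cols) % 2 == 0 else 0
    return lemma, inflected, tuple(feats.split(feat_sep)), freq


print(parse_row('kɔz kɔzd V;PST'))
# prints ('kɔz', 'kɔzd', ('V', 'PST'), 0)
print(parse_row('cause kɔz caused kɔzd V;PST 3074'))
# prints ('kɔz', 'kɔzd', ('V', 'PST'), 3074)
```

All four example rows above yield the same lemma, inflection, and features; only the frequency differs depending on whether the final column is present.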
The visualization depends on the Graphviz library, which requires installation beyond Python packages. To install it, follow the official Graphviz instructions for your operating system at https://graphviz.org/download/.
This setup is optional: you can skip it if you do not wish to view any trees.
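If you are unsure whether Graphviz is already installed, one quick check is to look for its dot executable on your PATH. This is a minimal sketch using only the Python standard library; it assumes that tree plotting relies on the dot executable being discoverable this way, which the text above does not state explicitly.

```python
import shutil


def graphviz_available():
    """Return True if the Graphviz `dot` executable can be found on the PATH."""
    return shutil.which('dot') is not None


if graphviz_available():
    print('Graphviz found; tree plotting should be possible.')
else:
    print('Graphviz not found; see https://graphviz.org/download/ before plotting trees.')
```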
ATP constructs a decision tree, which can be plotted automatically using the plot_tree(save_path) method of ATP.
The tree can be written, as a PDF, to any location. The setup script automatically created a temp/ directory that is not checked into git and can be used for this purpose.
The following code will generate and open the tree for the German CHILDES data (Figure 4 in the paper).
>> from atp import ATP
>> from utils import load_german_CHILDES
>> pairs, feature_space = load_german_CHILDES()
>> atp = ATP(feature_space=feature_space).train(pairs)
>> atp.plot_tree('../temp/german', open_pdf=True)
The optional open_pdf parameter, if set to True, will automatically open the PDF of the tree in your computer's default PDF viewer. If you do not use open_pdf=True, you can navigate to the location where you saved the PDF and open it from there.
Some of ATP's functionality is available from the command line by treating atp.py as a script.
The following command will train ATP on 60 words of German and test it on one of the test sets. The resulting inflections are written, in the same order as the input file, to the specified path ../temp/german_out.txt.
$ python atp.py -i ../data/german/quant/train60_0.txt -t ../data/german/quant/test_0.txt -o ../temp/german_out.txt
The full command-line usage is shown below. See the "Loading Data From a File" section for further details on the relevant parameters.
usage: atp.py [-h] --input INPUT [--test_path TEST_PATH] [--out_path OUT_PATH] [--sep SEP] [--feat_sep FEAT_SEP] [--skip_header SKIP_HEADER]
optional arguments:
-h, --help show this help message and exit
--input INPUT, -i INPUT
A path to a dataset of training pairs.
--test_path TEST_PATH, -t TEST_PATH
A path to a dataset of test pairs.
--out_path OUT_PATH, -o OUT_PATH
A path to write the test results to. If None, it will print to stdout.
--sep SEP, -s SEP The column seperator for the input file.
--feat_sep FEAT_SEP, -fs FEAT_SEP
The seperator for features in the input file.
--skip_header SKIP_HEADER, -sh SKIP_HEADER
If True, skips the first line of the input file, treating it as a header.
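Because the results are written in the same order as the input file, predictions can be lined up one-to-one with the gold inflections to compute accuracy. The sketch below assumes the predicted and gold forms have already been read into two lists; how to read them depends on the exact output format, which is not specified here. The accuracy function and the example predictions are made up for illustration.

```python
def accuracy(predictions, gold):
    """Return the fraction of predictions that exactly match the gold forms.

    Hypothetical helper; both lists are assumed to be in the same order.
    """
    if len(predictions) != len(gold):
        raise ValueError('predictions and gold must have the same length')
    if not gold:
        return 0.0
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


# Made-up example: two of the three predictions match the gold forms.
gold = ['Sachen', 'Gleise', 'Kinder']
predictions = ['Sachen', 'Gleise', 'Kinde']
print(f'{accuracy(predictions, gold):.2f}')  # prints 0.67
```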
To import from a location other than src/, do the following first:
>> import sys
>> sys.path.append('{path_to_repository}/src')
>> from atp import ATP
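The same idea can be written with pathlib, which avoids hand-building the path string and guards against adding the same entry twice. This is a minimal sketch; the repository location used here is a hypothetical placeholder that you would replace with your own checkout.

```python
import sys
from pathlib import Path

# Hypothetical location of the cloned repository; adjust to your own checkout.
repo_root = Path.home() / 'ATP-morphology'
src_dir = str(repo_root / 'src')

# Avoid adding the same entry twice if this script or cell runs more than once.
if src_dir not in sys.path:
    sys.path.append(src_dir)

# from atp import ATP  # uncomment once src_dir points at a real checkout
```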
To replicate the experiments, see the Jupyter notebook at notebooks/Experiments.ipynb.
If you have questions, comments, or feedback, please email Caleb Belth at cbelth@umich.edu.