Tutorial
This page covers some of what you need to do to start training models. To follow the instructions on this page, you must have successfully compiled and installed Chalk as described in the README. You must also be using a clone of the repository (the last release, v1.0, won't do).
The Open American National Corpus has provided a set of open, unencumbered annotations for multiple domains (yay!) in the Manually Annotated Sub-Corpus (MASC). We'll use MASC v3.0.0 here.
Note: You may find there are things you wish were different about the MASC annotations (choices about tokenization, etc). They love to get feedback, so be sure to let them know by writing to anc@anc.org.
The MASC annotations are provided in multiple XML files. Chalk provides a conversion utility that transforms the XML into the input formats needed for training sentence detection, tokenizer, and named-entity recognition models (for both Chalk and OpenNLP).
$ cd /tmp/
$ mkdir masc
$ cd masc
$ wget http://www.anc.org/MASC/download/MASC-3.0.0.tgz
$ tar xzf MASC-3.0.0.tgz
$ chalk run chalk.corpora.MascTransform data/written /tmp/chalk-masc-data
Creating train
Success: data/written/ficlets,1401
Success: data/written/ficlets,1403
Success: data/written/ficlets,1402
Failure: data/written/non-fiction,CUP1
Success: data/written/non-fiction,rybczynski-ch3
<...more status output...>
$ cd /tmp/chalk-masc-data
$ ls
dev test train
The three directories contain data splits for training models (train), evaluating their performance while tweaking them (dev), and a held out test set for evaluating them blindly (test). Each directory contains files for sentence detection, tokenization and named entity recognition.
$ ls train/
train-ner.txt train-sent.txt train-tok.txt
Check that you've got the right output by running the following command and comparing your output to this.
$ tail -3 train/train-tok.txt
A $1,000 house<SPLIT>, which could be fixed up into maybe a $<SPLIT>30<SPLIT>-<SPLIT>40,000 house comes with a tax bill of $<SPLIT>4<SPLIT>-<SPLIT>6<SPLIT>K per year<SPLIT>!
The taxes on my $140K house in an urban area of Mississippi are only $<SPLIT>1500<SPLIT>/<SPLIT>year<SPLIT>.
Assuming things went smoothly, you are ready to train models. All of the following instructions assume you are in the chalk-masc-data directory.
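A note on the train-tok.txt lines shown above: tokens are separated either by whitespace or by an explicit <SPLIT> tag, which marks a token boundary where the raw text has no space. The following is a minimal sketch for viewing the tokens a line encodes; the function name split_to_tokens is made up for illustration and is not part of Chalk.

```shell
# Hypothetical helper (not part of Chalk): show the tokens encoded by a
# <SPLIT>-annotated line by turning every <SPLIT> tag into a space.
split_to_tokens() {
  sed 's/<SPLIT>/ /g'
}

# Example on a fragment of the training data shown above.
printf '%s\n' 'only $<SPLIT>1500<SPLIT>/<SPLIT>year<SPLIT>.' | split_to_tokens
# prints: only $ 1500 / year .
```

To inspect the real data, pipe a few lines through it, e.g. tail -3 train/train-tok.txt | split_to_tokens.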
We need an example text, so let's use one about Aravind Joshi's ACL lifetime achievement award. (Note: I've made a few modifications and edits to make it a better example.)
The Association for Computational Linguistics is proud to present its first Lifetime Achievement Award to Prof. Aravind Joshi of the University of Pennsylvania. Aravind Joshi was born in 1929 in Pune, India, where he completed his secondary education as well as his first degree in Mechanical and Electrical Engineering, the latter in 1950. He worked as a research assistant in Linguistics at Penn from 1958-60, while completing his Ph.D. in Electrical Engineering, in 1960. Joshi's work and the work of his Penn colleagues at the frontiers of Cognitive Science was rewarded in 1991 by the establishment of a National Science Foundation Science and Technology Center for Research in Cognitive Science, which Aravind Joshi co-directed until 2001. Dr. Joshi has supervised thirty-six Ph.D. theses to-date, on topics including information and coding theory, and also pure linguistics.
Joshi rocks.
Run the following commands to get things set up with this text.
$ cd /tmp/chalk-masc-data
$ echo "The Association for Computational Linguistics is proud to present its first Lifetime Achievement Award to Prof. Aravind Joshi of the University of Pennsylvania. Aravind Joshi was born in 1929 in Pune, India, where he completed his secondary education as well as his first degree in Mechanical and Electrical Engineering, the latter in 1950. He worked as a research assistant in Linguistics at Penn from 1958-60, while completing his Ph.D. in Electrical Engineering, in 1960. Joshi's work and the work of his Penn colleagues at the frontiers of Cognitive Science was rewarded in 1991 by the establishment of a National Science Foundation Science and Technology Center for Research in Cognitive Science, which Aravind Joshi co-directed until 2001. Dr. Joshi has supervised thirty-six Ph.D. theses to-date, on topics including information and coding theory, and also pure linguistics." > joshi.txt
Do the following to train a sentence detector.
$ chalk cli SentenceDetectorTrainer -encoding UTF-8 -lang en -data train/train-sent.txt -model eng-masc-sent-tmp.bin
Indexing events using cutoff of 5
Computing event counts... done. 19168 events
Indexing... done.
Sorting and merging events... done. Reduced 19168 events to 14302.
Done indexing.
Incorporating indexed data for training...
done.
Number of Event Tokens: 14302
Number of Outcomes: 2
Number of Predicates: 1667
...done.
Computing model parameters ...
Performing 100 iterations.
1: ... loglikelihood=-13286.245156975032 0.7741026711185309
2: ... loglikelihood=-7936.714729168212 0.8232470784641068
3: ... loglikelihood=-6605.629415117238 0.8643050918196995
<more iterations>
98: ... loglikelihood=-3055.191153644887 0.9465254590984975
99: ... loglikelihood=-3050.6109732525756 0.9467341402337228
100: ... loglikelihood=-3046.0972508791187 0.9467341402337228
Writing sentence detector model ... done (0.073s)
Wrote sentence detector model to
path: /tmp/chalk-masc-data/eng-masc-sent-tmp.bin
Now run it on the example text.
$ chalk cli SentenceDetector eng-masc-sent-tmp.bin < joshi.txt
Loading Sentence Detector model ... done (0.039s)
The Association for Computational Linguistics is proud to present its first Lifetime Achievement Award to Prof.
Aravind Joshi of the University of Pennsylvania.
Aravind Joshi was born in 1929 in Pune, India, where he completed his secondary education as well as his first degree in Mechanical and Electrical Engineering, the latter in 1950.
He worked as a research assistant in Linguistics at Penn from 1958-60, while completing his Ph.D. in Electrical Engineering, in 1960.
Joshi's work and the work of his Penn colleagues at the frontiers of Cognitive Science was rewarded in 1991 by the establishment of a National Science Foundation Science and Technology Center for Research in Cognitive Science, which Aravind Joshi co-directed until 2001.
Dr. Joshi has supervised thirty-six Ph.D. theses to-date, on topics including information and coding theory, and also pure linguistics.
Average: 3000.0 sent/s
Total: 6 sent
Runtime: 0.0020s
Overall, things look fine except for the split on 'Prof.'. There is only one training example of it in train-sent.txt, so the model has little to go on and treats the period as the end of a sentence rather than as part of an abbreviation.
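If you want to check how much evidence the training data has for a given abbreviation, a rough count is easy to get with grep. This is only a sketch: abbrev_counts is a made-up helper name, it counts lines that contain the string (not individual occurrences), and it matches the string anywhere in a line.

```shell
# Hypothetical helper (not part of Chalk): for each abbreviation given after
# the file name, print the abbreviation and the number of lines containing it.
abbrev_counts() {
  file=$1
  shift
  for abbr in "$@"; do
    # -F treats the abbreviation as a fixed string, so "." is a literal period.
    printf '%s\t%s\n' "$abbr" "$(grep -c -F -- "$abbr" "$file")"
  done
}
```

From the chalk-masc-data directory you could then run, for example, abbrev_counts train/train-sent.txt Prof. Dr. Ph.D.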
Evaluate the model.
$ chalk cli SentenceDetectorEvaluator -model eng-masc-sent-tmp.bin -data dev/dev-sent.txt -lang en
Loading Sentence Detector model ... done (0.033s)
Evaluating ... done
Precision: 0.8338596065095943
Recall: 0.8108171941426547
F-Measure: 0.8221769847922404
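The F-Measure reported above is consistent with the usual harmonic mean of precision and recall, 2PR / (P + R). Here is a small sketch for reproducing it from the other two numbers; f_measure is a made-up helper name, and the rounding to four decimals is my choice.

```shell
# Hypothetical helper: f_measure PRECISION RECALL
# Prints the harmonic mean of the two values, rounded to four decimals.
f_measure() {
  awk -v p="$1" -v r="$2" 'BEGIN {
    if (p + r == 0) print "0.0000"
    else printf "%.4f\n", 2 * p * r / (p + r)
  }'
}

# Precision and recall from the sentence detector evaluation above.
f_measure 0.8338596065095943 0.8108171941426547
# prints: 0.8222
```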
This performance is lower than we'd like. Looking at the data, there are probably some changes that need to be made to the MASC conversion. For example, dev/dev-sent.txt includes lines like this:
"Ready!"?
I say.
"Let's go then."?
Chris says.
and
"Not now. It's too confusing to get into now."
?
I hated how, no matter how rude and angry I could be, Karon would always stay so calm.
So there are ?'s showing up oddly, probably an encoding issue. I'll look into this eventually.
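To get a feel for how widespread the problem is, you can count the suspicious lines. The sketch below counts lines that consist of a lone ? or that end in a closing double quote immediately followed by ?, which are the two patterns in the examples above. The helper name count_stray_marks is made up, and the patterns are a guess at what the stray characters look like, so expect some misses and false hits.

```shell
# Hypothetical helper: count_stray_marks FILE
# Counts lines that are just "?" or that end with a double quote followed by "?".
count_stray_marks() {
  grep -c -E '^\?$|"\?$' "$1"
}

# Small demonstration on made-up lines in the same shape as the examples above.
sample=$(mktemp)
printf '%s\n' '"Ready!"?' 'I say.' '?' 'Is it late?' > "$sample"
count_stray_marks "$sample"
# prints: 2
rm -f "$sample"
```

On the real data this would be count_stray_marks dev/dev-sent.txt.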
Do the following to train a tokenizer. (I'll suppress the output from here on.)
$ chalk cli TokenizerTrainer -encoding UTF-8 -lang en -data train/train-tok.txt -model eng-masc-token-tmp.bin
To test the tokenizer on the example text, we need to pass it through the sentence detector first and then on to the tokenizer.
$ chalk cli SentenceDetector eng-masc-sent-tmp.bin < joshi.txt | chalk cli TokenizerME eng-masc-token-tmp.bin
Loading Sentence Detector model ... Loading Tokenizer model ... done (0.062s)
Average: 2000.0 sent/s
Total: 6 sent
Runtime: 0.0030s
done (0.294s)
The Association for Computational Linguistics is proud to present its first Lifetime Achievement Award to Prof .
Aravind Joshi of the University of Pennsylvania .
Aravind Joshi was born in 1929 in Pune , India , where he completed his secondary education as well as his first degree in Mechanical and Electrical Engineering , the latter in 1950 .
He worked as a research assistant in Linguistics at Penn from 1958- 60 , while completing his Ph .D. in Electrical Engineering , in 1960 .
Joshi 's work and the work of his Penn colleagues at the frontiers of Cognitive Science was rewarded in 1991 by the establishment of a National Science Foundation Science and Technology Center for Research in Cognitive Science , which Aravind Joshi co-directed until 2001 .
Dr . Joshi has supervised thirty-six Ph .D. theses to-date , on topics including information and coding theory , and also pure linguistics .
Average: 67.3 sent/s
Total: 7 sent
Runtime: 0.104s
There are definitely some odd tokenizations there, but the model is doing what it is supposed to do given the annotation. For example, MASC has two tokens "Dr" and "." for "Dr.". (I've contacted the MASC creators about this, since, e.g., the Penn Treebank tends to tokenize these as "Dr." and "Ph." "D.", etc.)
You can evaluate the performance of the trained tokenizer against the development data as follows.
$ chalk cli TokenizerMEEvaluator -model eng-masc-token-tmp.bin -data dev/dev-tok.txt -lang en
Loading Tokenizer model ... done (0.280s)
Evaluating ... done
Precision: 0.9870173026240403
Recall: 0.9801661356395084
F-Measure: 0.9835797887524979
The MASC conversion utility in Chalk produces CoNLL 2003 formatted annotations, e.g.:
Isabella NNP NNP B-PER
Shae NNP NNP I-PER
, , , O
a DT DT O
girl NN NN O
from IN IN O
Mebane NNP NNP B-LOC
North NNP NNP B-LOC
Carolina NNP NNP I-LOC
. . . O
However, Chalk (currently) needs NER training data in OpenNLP format, so we first need to convert it. The OpenNLP format looks like this:
<START:person> Isabella Shae <END> , a girl from <START:location> Mebane <END> <START:location> North Carolina <END> .
To convert the data, use the format converter.
$ chalk cli TokenNameFinderConverter conll03 -lang en -encoding UTF-8 -types person,location,organization -data train/train-ner.txt > train/train-ner-opennlp.txt
$ chalk cli TokenNameFinderConverter conll03 -lang en -encoding UTF-8 -types person,location,organization -data dev/dev-ner.txt > dev/dev-ner-opennlp.txt
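To make the mapping between the two formats concrete, here is a rough awk sketch that turns the CoNLL-style example above into the OpenNLP-style line. It is not Chalk's converter and is no substitute for it: the function name conll_to_opennlp and the tag-to-name table are my own assumptions, it reads only the first and last columns, it does not filter by type the way -types does, and it ignores corner cases such as document markers.

```shell
# Hypothetical illustration (not Chalk's converter): read CoNLL-style lines
# (token in the first column, B-/I-/O tag in the last, blank line between
# sentences) and print one OpenNLP-style line per sentence.
conll_to_opennlp() {
  awk '
    # Close the currently open name span, if there is one.
    function close_span() {
      if (in_span) { out = out " <END>"; in_span = 0 }
    }
    BEGIN {
      # Assumed mapping from CoNLL tag suffixes to OpenNLP type names.
      name["PER"] = "person"; name["LOC"] = "location"; name["ORG"] = "organization"
    }
    # A blank line ends the sentence: flush what has been collected so far.
    NF == 0 { close_span(); if (out != "") print substr(out, 2); out = ""; next }
    {
      tag = $NF
      # "O" and "B-" both end any span that is still open.
      if (tag == "O" || tag ~ /^B-/) close_span()
      # Any entity tag starts a new span when none is open.
      if (tag != "O" && !in_span) {
        type = substr(tag, 3)
        label = (type in name) ? name[type] : type
        out = out " <START:" label ">"
        in_span = 1
      }
      out = out " " $1
    }
    END { close_span(); if (out != "") print substr(out, 2) }
  '
}

# The example sentence from above.
printf '%s\n' \
  'Isabella NNP NNP B-PER' 'Shae NNP NNP I-PER' ', , , O' 'a DT DT O' \
  'girl NN NN O' 'from IN IN O' 'Mebane NNP NNP B-LOC' \
  'North NNP NNP B-LOC' 'Carolina NNP NNP I-LOC' '. . . O' | conll_to_opennlp
```

Note how the two adjacent B-LOC tags produce two separate location spans, matching the OpenNLP-format example.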
Now train the model.
$ chalk cli TokenNameFinderTrainer -lang en -encoding UTF-8 -data train/train-ner-opennlp.txt -model eng-masc-ner-tmp.bin
Run it on the example text. (I have removed the timing information from the output given here.)
$ chalk cli SentenceDetector eng-masc-sent-tmp.bin < joshi.txt | chalk cli TokenizerME eng-masc-token-tmp.bin | chalk cli TokenNameFinder eng-masc-ner-tmp.bin
Loading Sentence Detector model ... Loading Tokenizer model ... Loading Token Name Finder model ... done (0.066s)
The Association for Computational Linguistics is proud to present its first Lifetime Achievement Award to Prof .
Aravind Joshi of the <START:organization> University of Pennsylvania <END> .
Aravind Joshi was born in 1929 in <START:location> Pune <END> , <START:location> India <END> , where he completed his secondary education as well as his first degree in Mechanical and Electrical Engineering , the latter in 1950 .
He worked as a research assistant in Linguistics at Penn from 1958- 60 , while completing his Ph .D. in Electrical Engineering , in 1960 .
Joshi 's work and the work of his Penn colleagues at the frontiers of Cognitive Science was rewarded in 1991 by the establishment of a <START:organization> National Science Foundation Science <END> and <START:organization> Technology Center <END> for Research in Cognitive Science , which Aravind Joshi co-directed until 2001 .
Dr . Joshi has supervised thirty-six Ph .D. theses to-date , on topics including information and coding theory , and also pure linguistics .
Clearly some things can be improved! This will require some changes to the annotations and perhaps some modifications to the features, etc.
Evaluate the model.
$ chalk cli TokenNameFinderEvaluator -lang en -encoding UTF-8 -model eng-masc-ner-tmp.bin -data dev/dev-ner-opennlp.txt
Precision: 0.6956187548039969
Recall: 0.3271872740419378
F-Measure: 0.44504548807474803
More confirmation that there is more work to do. (Which could include checking/debugging the MASC transformation code to make sure it isn't messing up.)
Once you are satisfied with the development cycle, you probably want to train models on all the available data for use in applications. I'll make this easier in the future, but here's a straightforward way to do it, e.g. for the tokenizer:
$ cat */*-tok.txt > all-tok.txt
$ chalk cli TokenizerTrainer -encoding UTF-8 -lang en -data all-tok.txt -model eng-masc-token.bin
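The same concatenation works for the other tasks. Below is a small sketch that wraps it in a function; combine_splits is a made-up helper name, and it relies on the files following the <split>-<task>.txt naming seen above (e.g. train-sent.txt, dev-tok.txt).

```shell
# Hypothetical helper: combine_splits TASK
# Concatenates the TASK files from every split directory (dev, test, train)
# into all-TASK.txt in the current directory. TASK is sent, tok, or ner.
combine_splits() {
  task=$1
  cat */*-"$task".txt > "all-$task.txt"
}
```

From the chalk-masc-data directory, combine_splits sent followed by the SentenceDetectorTrainer command shown earlier, with -data all-sent.txt and a model name of your choosing, gives a sentence detector trained on all the data. For NER, the combined all-ner.txt is still in CoNLL format, so run it through the format converter before training.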