This README explains the pre-processing performed to create the cluster lexicons that are used as features in the IXA pipes tools [http://ixa2.si.ehu.es/ixa-pipes]. So far we use the following three methods: Brown, Clark and Word2vec.
We induce the following clustering types:
- Brown hierarchical word clustering algorithm: Brown et al.: Class-Based n-gram Models of Natural Language.
- Input: a sequence of words separated by whitespace with no punctuation. See brown-input.txt for an example.
- Output: for each word type, its cluster. See brown-output.txt for an example.
- In particular, each line is:
<cluster represented as a bit string> <word> <number of times word occurs in input>
- We use Percy Liang's implementation off-the-shelf.
- Liang: Semi-supervised learning for natural language processing.
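For illustration only (these values are made up, not taken from brown-output.txt), a line in the Brown output could look like:
0010110 house 3204
meaning that the word house occurs 3204 times in the input and was assigned the cluster whose path in the hierarchy is the bit string 0010110.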
- Clark clustering: Alexander Clark (2003). Combining distributional and morphological information for part of speech induction.
- Input: one lowercased token per line, punctuation removed, sentences separated by two newlines. See clark-input.txt
- Output: for each word type, its cluster and a weight. See clark-output.txt
- Each line consists of
<word> <cluster> <weight>
- We use Alexander Clark's implementation off-the-shelf.
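Again for illustration only (made-up values), a line in the Clark output could look like:
house 7 0.000342
where 7 is the cluster assigned to the word house and 0.000342 is its weight.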
- Word2vec Skip-gram word embeddings clustered via K-Means: Mikolov et al. (2013). Efficient Estimation of Word Representations in Vector Space.
- Input: lowercased tokens separated by space, punctuation removed. See word2vec-input.txt
- Output: for each word type, its cluster. See word2vec-output.txt
- Each line consists of
<word> <cluster>
- We use the Word2vec implementation off-the-shelf.
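As before, an illustrative (made-up) output line would be:
house 127
which assigns the word house to cluster 127.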
To induce Brown clusters, let us assume that the source data is in plain text format (i.e., without HTML or XML tags, etc.) and that every document is in a directory called corpus-directory. Then the following steps are performed:
- Remove all sentences or paragraphs consisting of less than 90% lowercase characters, as suggested by Liang (Semi-supervised learning for natural language processing); a rough shell approximation of this criterion is sketched after these steps.
This step is performed by using the following function in ixa-pipe-convert:
java -jar ixa-pipe-convert-$version.jar --brownClean corpus-directory/
ixa-pipe-convert will create a .clean file for each file contained in the folder corpus-directory.
- Move all .clean files into a new directory called, for example, corpus-preclean.
- Tokenize all the files in the folder to one line per sentence. This step is performed with ixa-pipe-tok via the following shell script:
./recursive-tok.sh $lang corpus-preclean
The tokenized version of each file in the directory corpus-preclean will be saved with a .tok suffix.
- cat to one large file: all the tokenized files are concatenated into a single large file called corpus-preclean.tok.
cd corpus-preclean
cat *.tok > corpus-preclean.tok
- Run the brown-clusters-preprocess.sh script like this to create the format required to induce Brown clusters using Percy Liang's program.
./brown-clusters-preprocess.sh corpus-preclean.tok > corpus-preclean.tok.punct
brown-cluster/wcluster --text corpus-preclean.tok.punct --c 1000 --threads 8
This trains 1000 Brown clusters using 8 threads in parallel.
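As a rough shell approximation of the cleaning criterion used in the first step (this is only a sketch: it keeps a line if at least 90% of its alphabetic characters are lowercase, which may not be exactly the ratio computed by ixa-pipe-convert; corpus-file.txt is a placeholder name):
# keep only lines in which at least 90% of the alphabetic characters are lowercase
awk '{ total = 0; lower = 0;
       for (i = 1; i <= length($0); i++) {
         c = substr($0, i, 1);
         if (c ~ /[[:alpha:]]/) { total++; if (c ~ /[[:lower:]]/) lower++ }
       }
       if (total > 0 && lower >= 0.9 * total) print
     }' corpus-file.txt > corpus-file.txt.clean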
To induce Clark clusters, let us again assume that the source data is in plain text format (without HTML or XML tags, etc.) and that every document is in a directory called corpus-directory. Then the following steps are performed:
- Tokenize all the files in the folder to one line per sentence. This step is performed with ixa-pipe-tok via the following shell script:
./recursive-tok.sh $lang corpus-directory
The tokenized version of each file in the directory corpus-directory will be saved with a .tok suffix.
- cat to one large file: all the tokenized files are concatenated into a single large file called corpus.tok.
cd corpus-directory
cat *.tok > corpus.tok
- Run the clark-clusters-preprocess.sh script like this to create the format required to induce Clark clusters using Clark's implementation.
./clark-clusters-preprocess.sh corpus.tok > corpus.tok.punct.lower
To train 100 word clusters, use the following command line:
cluster_neyessenmorph -s 5 -m 5 -i 10 corpus.tok.punct.lower - 100 > corpus.tok.punct.lower.100
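If clark-clusters-preprocess.sh is not at hand, the required input format (lowercased, punctuation removed, one token per line, sentences separated by a blank line) can be roughly reproduced with standard tools. This is only a minimal sketch, assuming corpus.tok contains one tokenized sentence per line and the text is ASCII (tr will not lowercase accented characters):
# lowercase, strip punctuation, print one token per line and a blank line after each sentence
tr '[:upper:]' '[:lower:]' < corpus.tok | sed 's/[[:punct:]]//g' | awk '{ for (i = 1; i <= NF; i++) print $i; print "" }' > corpus.tok.punct.lower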
To induce Word2vec clusters, again assume that the source data is in plain text format (without HTML or XML tags, etc.) and that every document is in a directory called corpus-directory. Then the following steps are performed:
- Tokenize all the files in the folder to one line per sentence. This step is performed with ixa-pipe-tok via the following shell script:
./recursive-tok.sh $lang corpus-directory
The tokenized version of each file in the directory corpus-directory will be saved with a .tok suffix.
- cat to one large file: all the tokenized files are concatenated into a single large file called corpus.tok.
cd corpus-directory
cat *.tok > corpus.tok
- Run the word2vec-clusters-preprocess.sh script like this to create the format required by Word2vec.
./word2vec-clusters-preprocess.sh corpus.tok > corpus-word2vec.txt
To train 400 word classes using 8 threads in parallel, we use the following command:
word2vec/word2vec -train corpus-word2vec.txt -output corpus-s50-w5.400 -cbow 0 -size 50 -window 5 -negative 0 -hs 1 -sample 1e-3 -threads 8 -classes 400
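The input expected by word2vec (lowercased tokens separated by spaces, punctuation removed, one sentence per line) can likewise be approximated if the preprocessing script is not available; again a minimal sketch under the same assumptions as above:
tr '[:upper:]' '[:lower:]' < corpus.tok | sed 's/[[:punct:]]//g' > corpus-word2vec.txt
With the -classes option, word2vec clusters the learned vectors via K-means and writes one <word> <cluster> pair per line to the output file (here corpus-s50-w5.400, whose name presumably encodes the vector size 50, window 5 and 400 classes).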
There are many ways of cleaning the XML, HTML and other markup that often comes with corpora. As we will usually be processing very large amounts of text, we do not pay too much attention to detail and crudely remove every tag using regular expressions. The scripts directory contains a shell script for this, which can be applied to a file like this:
./xmltotext.sh file.html > file.txt
NOTE that this script will replace your original files with a cleaned version of them.
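The core of such a crude clean-up amounts to deleting everything between angle brackets; a minimal sketch (not the actual xmltotext.sh, and writing to a new file instead of touching the original) would be:
sed 's/<[^>]*>//g' file.html > file.txt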
If you are interested in using the Wikipedia for your language, here you can find many Wikipedia dumps already extracted to XML, which can be directly fed to the xmltotext.sh script:
[http://linguatools.org/tools/corpora/wikipedia-monolingual-corpora/]
If your language is not among them, we usually use the Wikipedia Extractor and then the xmltotext.sh script:
[http://medialab.di.unipi.it/wiki/Wikipedia_Extractor]
Rodrigo Agerri
IXA NLP Group
University of the Basque Country (UPV/EHU)
E-20018 Donostia-San Sebastián
rodrigo.agerri@ehu.eus