# Sentence compression

Courtney Napoles, cdnapoles@gmail.com
last updated 18 September 2015
This program generates sentence-level compressions via deletion. It is a modified implementation of the ILP model described in Clarke and Lapata, 2008, "Global Inference for Sentence Compression: An Integer Linear Programming Approach".
To compile:

```
ant compile
```

ILOG CPLEX must be installed to run this program, and the paths in `build.xml` and `compress` should be updated accordingly.
```
Usage: ./compress -i path/to/input -l path/to/lm [-x]
  -i val  input file or directory
  -d      debug
  -l val  path to language model (binary or ARPA)
  -t      limit output to <= 120 characters
  -q      suppress CPLEX output (normally goes to stderr)
  -x      input file(s) in XML format
```
The program expects tokenized text with one sentence per line. Each line of output has the format:

```
<orig_len> <short_len> <compression> <orig_indices> <compression_rate>
```
For example, for the input sentence "At the camp , the rebel troops were welcomed with a banner that read : `` Welcome home . ''", the output is as follows:
```
20 8 At camp , the troops were welcomed . 1 3 4 5 7 8 9 19 0.4
```
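If you are consuming this output downstream, a line can be split back into its fields using the layout above. The following is a minimal sketch, not part of the release; the class and method names are my own, and the assumption that indices are 1-based comes from the example (`.` is the 19th token of the 20-token original):

```java
import java.util.Arrays;

// Hedged sketch: parse one output line of the compressor, assuming
// whitespace-separated fields in the order
// <orig_len> <short_len> <compression> <orig_indices> <compression_rate>.
public class OutputLineParser {

    // Returns the kept tokens: fields 2 .. 2+short_len.
    public static String[] parseTokens(String line) {
        String[] f = line.split(" ");
        int shortLen = Integer.parseInt(f[1]);
        return Arrays.copyOfRange(f, 2, 2 + shortLen);
    }

    // Returns the positions of the kept tokens in the original sentence
    // (1-based, judging from the README example).
    public static int[] parseIndices(String line) {
        String[] f = line.split(" ");
        int shortLen = Integer.parseInt(f[1]);
        int[] idx = new int[shortLen];
        for (int i = 0; i < shortLen; i++)
            idx[i] = Integer.parseInt(f[2 + shortLen + i]);
        return idx;
    }

    // Returns the compression rate, the final field (short_len / orig_len).
    public static double parseRate(String line) {
        return Double.parseDouble(line.substring(line.lastIndexOf(' ') + 1));
    }
}
```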
To generate extractive compressions (by deletion only) using an extended version of Clarke & Lapata (2008)'s ILP model:
```
java research.compression.SentenceCompressor
```
Required arguments:

```
-in=val   path to the input file or directory
-lm=val   path to the language model (trigram)
```

Optional arguments:

```
-char         use character-based constraints
-cr=val       minimum compression rate (default: 0.4)
-debug        debug output
-l=val        lambda value (tradeoff between n-gram probability and
              "significance" score in the objective function)
-ngram        use the n-gram constraint (each n-gram in the compression must
              appear in the Google n-grams; the n-gram server must be running)
-quiet        suppress CPLEX output
-target=val   target compression length for each sentence
-test_lambda  test varying values of lambda (for development)
-tweet        use a Twitter length constraint (120 characters)
-xml          input is in XML format
```
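To give a feel for what `-l` controls: Clarke and Lapata's objective rewards both the language-model score of the retained token sequence and a "significance" score for retained content words, I(w) = (l/N) * f_w * log(Fa/Fw), where f_w is the word's document frequency, Fw its corpus frequency, Fa the total corpus frequency of topic words, and l/N its clause-embedding depth relative to the deepest level. The sketch below only illustrates the weighting; the real objective is built inside the ILP solver, and all names here are my own:

```java
// Hedged sketch of the two objective terms that lambda trades off.
// significance() follows the formula from Clarke & Lapata (2008);
// objective() shows one plausible way a lambda weight combines the terms,
// not necessarily the exact combination used by this software.
public class ObjectiveSketch {

    // f: frequency of w in the document; Fw: corpus frequency of w;
    // Fa: total corpus frequency of all topic words;
    // depth: clause-embedding level of w; maxDepth: deepest level.
    public static double significance(double f, double Fw, double Fa,
                                      int depth, int maxDepth) {
        return ((double) depth / maxDepth) * f * Math.log(Fa / Fw);
    }

    // Combine the language-model term and the significance term.
    public static double objective(double lmLogProb, double sigScore,
                                   double lambda) {
        return lmLogProb + lambda * sigScore;
    }
}
```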
Example call:

```
java -Xms2g -Xmx10g -Djava.library.path=$ILOG/bin/x86-64_osx \
  -cp bin:lib/berkeleylm.jar:$ILOG/lib/cplex.jar:lib/stanford-parser.jar \
  research.compression.SentenceCompressor -in=data/sample_text -lm=your_lm.gz
```
The language model is not distributed with this software due to licensing restrictions. The program requires a trigram language model in ARPA format; in our research we used a model trained on English Gigaword 5 with SRILM. Pretrained language models are available for download online, but note that I have not tested or used any of them myself.
The LM reader used by this program expects each n-gram line to be in the format

```
log_prob<TAB>ngram<TAB>backoff
```

or, if there is no backoff weight,

```
log_prob<TAB>ngram
```
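Concretely, splitting on tabs should yield two or three fields. A minimal sketch of such a reader (my own class, not the one shipped with this software):

```java
// Hedged sketch: split an ARPA n-gram line into its fields, assuming the
// tab-separated layout described above, with the backoff weight optional.
public class ArpaLine {
    public final double logProb;
    public final String ngram;
    public final Double backoff;   // null when no backoff weight is present

    public ArpaLine(String line) {
        String[] f = line.split("\t");
        if (f.length < 2 || f.length > 3)
            throw new IllegalArgumentException("not an n-gram line: " + line);
        logProb = Double.parseDouble(f[0]);
        ngram = f[1];
        backoff = (f.length == 3) ? Double.valueOf(f[2]) : null;
    }
}
```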
If you get a `String index out of range` error and your LM is in ARPA format, the fields may be space-separated instead of tab-separated, or may have trailing spaces. I have added a script, `fix_spacing.pl`, to fix this issue. To run it:

```
zcat your_lm.gz | perl fix_spacing.pl | gzip > your_fixed_lm.gz
```
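The repair itself is simple enough to sketch. The version below is my own Java approximation of what such a fix does (the Perl script is the authoritative fix): given the n-gram order of the current ARPA section (known from its `\n-grams:` header) and whether its entries carry backoff weights, it rebuilds the line with tabs and drops trailing spaces.

```java
// Hedged sketch of re-tabbing a space-separated ARPA n-gram line.
// order: the n of the current \n-grams: section; hasBackoff: whether
// entries in this section carry a trailing backoff weight.
public class FixSpacing {
    public static String retab(String line, int order, boolean hasBackoff) {
        String[] t = line.trim().split("\\s+");
        StringBuilder sb = new StringBuilder(t[0]);      // log_prob
        sb.append('\t');
        for (int i = 1; i <= order; i++) {               // the n words
            if (i > 1) sb.append(' ');
            sb.append(t[i]);
        }
        if (hasBackoff)
            sb.append('\t').append(t[order + 1]);        // backoff weight
        return sb.toString();
    }
}
```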