-
- 3.1. External Libraries
- 3.1.1 NLP libraries
- 3.1.2 Semantic Similarity Libraries
-
INTRODUCTION =========================
STSModule (Semantic Textual Similarity Module) aims at helping users computing the semantic similarity between either sentences or documents in English. Similarity measures play an important role in a wide variety of NLP applications. By a way of example, Information Retrieval (IR) relies on semantic similarity in order to determine the best result for a related query. Semantic similarity also plays a crucial role in other applications such as Paraphrasing and Translation Memory (TM). However, computing semantic similarity between sentences and documents remains a complex and difficult task. As an attempt to fulfil this gap, STSModule aims at offering the user with a simple, yet very efficient approach to compute semantic similarity by combining various semantic resources with statistical methods.
- TECHNICAL INFORMATION =========================
-
This program provides several abstraction methods to compute the semantic similarity between sentences.
- The MySemanticSimilarityMeasures class wraps all the semantic similarity measures offered by the STSModule. Within the SemanticMeasuresManager class you will find a demo that demonstrates how you can use them. Please have a closer look at the main method located at 'src/measures/SemanticMeasuresManager'
- SemanticMeasuresManager semanticSimilarity = new SemanticMeasuresManager(Constants.EN); // receives the language
- semanticSimilarity.calculatingSemanticSimilarityScores(sentence1, sentence2); // computes the similarity between two sentences
- semanticSimilarity.getSemanticSimilarityMeasures_With_Disambiguation(); // returns various semantic similarity measures
- semanticSimilarity.getSemanticSimilarityMeasures_WITHOUT_Disambiguation(); // returns various semantic similarity measures
- The MySemanticSimilarityMeasures class wraps all the semantic similarity measures offered by the STSModule. Within the SemanticMeasuresManager class you will find a demo that demonstrates how you can use them. Please have a closer look at the main method located at 'src/measures/SemanticMeasuresManager'
-
Apart from that this program also includes several abstraction methods to perform various NLP tasks, such as: POS Tagging (TreeTagger); Lemmatisation (TreeTagger); Stemming (Snowball); Tokenisation (OpenNLP); Sentence Delimitation (OpenNLP); NER (OpenNLP); and Stopword Checker. Hereafter we describe how these methods can be called.
- NLPManager nlpManager = new NLPManager(Constants.EN); // receives the language
- The NLPManager class wraps all the NLP methods offered by the PreProcessor (http://github.com/hpcosta/PreProcessor). Within this class you will find a demo() that demonstrates how you can use all these methods for various languages. Please have a closer look at the demo() method located at 'src/nlp/NLPManager'
-
For more information about the program and how it can be used in a real scenario, please read "MiniExperts: An SVM approach for Measuring Semantic Textual Similarity" available through the following URL: http://alt.qcri.org/semeval2015/cdrom/pdf/SemEval017.pdf
-
INSTALATION =========================
-
Import the project to your Java editor.
-
The folder 'config' and 'internalResources' should be at the same level as the src folder.
-
The folder 'internalResources' contains models for the:
- TreeTagger (English, French, German, Italian, Portuguese and Spanish)
- OpenNLP (tokeniser, sentence splitter and NER - only for English)
- Stopword Checker (German, English, Italian, Portuguese and Spanish)
-
the folder 'config' contains configuration files for the Semantic Similarity Measures. You need to configure the folowing files and parameters (see step 3 first).
- adw.properties
- wn30g.ppv.path= path to: /externalResources/adwResources/ppvs.30g.5k/
- offset.map.file= path to: /externalResources/adwResources/offset2ID.map.tsv
- jlt.properties
- wordnet.wordnetData3.0= path to: /externalResources/adwResources/WordNet-3.0/dict
- stopwords.FilePrefix = path to: /externalResources/adwResources/jlt/stopwords/stopwords
- stanford.pos.model= path to: /externalResources/adwResources/jlt/stanford/left3words-wsj-0-18.tagger
- apart from that, the folder 'config' also contains a configuration file for the TreeTagger. You will need have the TreeTagger installed in your computer an configure the treetagger.properties file.
- adw.properties
-
-
Create a folder named 'externalResources', for example in your workspace.
-
this folder should contain the semantic signatures for the Semantic Similarity Measures
-
please download the Semantic signatures through the following url: http://lcl.uniroma1.it/adw/ppvs.30g.5k.tar.bz2.
-
for more information about the requirements visit http://lcl.uniroma1.it/adw/
-
This section is important to let you know what libraries are used in this project, as well as to know how to update the resources or models.
* ADW - Semantic Similarity Library
* adw.v1.0, read: ADW-README.txt
* This package provides an implementation of Align, Disambiguate, and Walk (ADW). ADW is a WordNet-based approach for measuring semantic similarity of arbitrary pairs of lexical items, from word senses to full texts. The approach leverages random walks on semantic networks for modelling lexical items.
* TreeTagger
* provides a POS Tagger for EN, SP, PT, FR, DE, IT and RU
* The following java library allows to use TreeTagger in Java.
* org.annolab.tt4j-1.0.15
* Stemmer
* provides a Stemmer for EN, SP, PT, FR, DE, IT and RU
* the following java library allows to use Stemmer in Java]
* org.tartarus.snowball
* OpenNLP
* provides a **sentence splitter** and **tokenization** in EN, but can be used for at least EN, PT and SP
* also provides **NER** for EN and SP, see models available through http://opennlp.sourceforge.net/models-1.5/
* the following java library allows to use OpenNLP in Java
* opennlp-maxent-3.0.3;
* opennlp-tools-1.5.3;
* opennlp-uima-1.5.3
* you can find these models inside the project folder, more specificaly in the "/resources/opennlpmodels/..." folder.
* contains the following models for English:
* Date name finder model.
* Location name finder model.
* Money name finder model.
* Organization name finder model.
* Percentage name finder model.
* Person name finder model.
* Time name finder model.
* and the following models for Spanish:
* Location name finder model, trained on conll02 shared task data.
* Organization name finder model, trained on conll02 shared task data.
* Person name finder model, trained on conll02 shared task data.
* Misc name finder model, trained on conll02 shared task data.
* the English and Spanish models are loaded by the 'NEREnModelsLoader' and 'NEREsModelsLoader' classes, respectively.
* BabelNet WebService
* requires:
* commons-io-2.4.jar
* jsoup-1.8.1.jar
- REQUIREMENTS =========================
-
Java 6 (JRE 1.6) or higher
-
Semantic signatures (see Installation)
-
WordNet 3.0 dictionary files (already included in the resources directory)
-
Several models (already included either in the 'internalResources' or in the 'externalResources' folder)
- LICENSE =========================
Copyright (c) 2013-2016 Hernani Costa @LEXYTRAD, University of Malaga, Spain. All rights reserved.
For more information please contact:
hercos (at) uma (dot) es