This section presents our Named Entity Recognition (NER) engine. We first walk through the stages of our workflow: pre-processing the data, creating the model and, finally, classification. The features used are presented in the final part of the documentation, along with a description of the classes involved.
- Model Language: English
- Resources directory: src/main/resources/ner/
The workflow of our model consists of three stages, which are described in detail below. The methods for all three stages can be found in the class ExecuteNER.java in src/main/java/ner. All resources used for training and classification can be found in the Resources directory.
- Writing the model: This stage is handled by the `writeModel` method, which runs a pipeline that pre-processes the training data. The pipeline is fed with the ner_eng.train data set and consists of the following engines:
  - The `NERReader.class` reads the data set, extracts the gold-value annotation of each token and adds it to the JCas indexes for further processing.
  - An engine that uses the `SnowballStemmer.class` for stemming.
  - The NERAnnotator.class, which contains our feature extractors for training the model. The output of the NERAnnotator is written to src/test/resources/model/, and the class `CrfSuiteStringOutcomeDataWriter.class` is used to write the output data in CRF format, which is used for training in the next step of the workflow.
- Training the model: This stage is handled by the `trainModel` method, which uses the `Train.main` method of ClearTK (`org.cleartk.ml.jar.Train.main`) to train the model on the output generated in the previous step (`writeModel`) in src/test/resources/model/. Once training is complete, this method generates the model `model.jar` in src/test/resources/model/.
- Classifying test data: This stage is handled by the `classifyTestFile` method. The method is fed with the test data ner_eng.test, which is processed by the same first two engines that are used in `writeModel`. After the `NERReader` and `SnowballStemmer`, the pipeline contains the following additional engines:
  - The `NERAnnotator.class` is used to annotate the test data set based on the trained model `model.jar`.
  - The AnalyzeFeatures.class engine, which compares and analyzes the results produced by the `NERAnnotator` engine by calculating the recall, precision and F1 score.
The AnalyzeFeatures class compares the value predicted by the `NERAnnotator` for each token instance (`extractor.extract(jCas, token).get(0).getValue().toString()`) with the gold value of the token included in the test data set. It then keeps a count of all true positive, true negative, false positive and false negative findings, which are used to calculate the recall, precision and F1 score.
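As an illustration of that final calculation, the sketch below derives precision, recall and F1 from the true positive, false positive and false negative counts using the standard formulas. The class `NerMetricsSketch` and its method names are hypothetical and are not taken from AnalyzeFeatures; returning 0 for an empty denominator is an assumption made here only to avoid a division by zero.

```java
class NerMetricsSketch {

    /** Precision = TP / (TP + FP); returns 0 when the denominator is 0 (assumption). */
    static double precision(int tp, int fp) {
        return tp + fp == 0 ? 0.0 : (double) tp / (tp + fp);
    }

    /** Recall = TP / (TP + FN); returns 0 when the denominator is 0 (assumption). */
    static double recall(int tp, int fn) {
        return tp + fn == 0 ? 0.0 : (double) tp / (tp + fn);
    }

    /** F1 = harmonic mean of precision and recall. */
    static double f1(double precision, double recall) {
        double sum = precision + recall;
        return sum == 0.0 ? 0.0 : 2 * precision * recall / sum;
    }

    public static void main(String[] args) {
        int tp = 8, fp = 2, fn = 4; // hypothetical counts, not real results
        double p = precision(tp, fp);
        double r = recall(tp, fn);
        System.out.printf("P=%.3f R=%.3f F1=%.3f%n", p, r, f1(p, r));
    }
}
```

Note that the true negative count does not enter any of the three formulas; it is only needed if accuracy is reported as well.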
The NERAnnotator class consists of the following methods:
- `initialize` - This method instantiates the feature extractors that are applied to the training and test data for the named entity classification.
- `process` - Processes the JCas created by the `NERReader.class`.
  - When this method is accessed through the `writeModel` method for training purposes, a list of instances of the tokens that match the requirements of the feature extractors is created. The instances are additionally written by the `dataWriter` to a file in src/test/resources/model/, which is then used to train the model (`trainModel`).
  - When this method is accessed through the `classifyTestFile` method, the generated list of instances is annotated with the `NEIOBAnnotation` to set the predicted value of each token.
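To picture what the training instances look like once they are written out, the sketch below renders tokens in a tab-separated layout, assuming the CRFsuite convention of one token per line (label first, then its features) with a blank line between sentences. The class `CrfFormatSketch`, the `TokenRow` type, the labels and the feature strings are all hypothetical; the actual file is produced by the data writer and may differ in detail.

```java
import java.util.Arrays;
import java.util.List;

class CrfFormatSketch {

    /** One token: its gold label followed by its feature strings (hypothetical type). */
    static final class TokenRow {
        final String label;
        final List<String> features;

        TokenRow(String label, List<String> features) {
            this.label = label;
            this.features = features;
        }
    }

    /** Renders each sentence as tab-separated rows and ends it with a blank line. */
    static String render(List<List<TokenRow>> sentences) {
        StringBuilder out = new StringBuilder();
        for (List<TokenRow> sentence : sentences) {
            for (TokenRow row : sentence) {
                out.append(row.label);
                for (String feature : row.features) {
                    out.append('\t').append(feature);
                }
                out.append('\n');
            }
            out.append('\n'); // blank line marks the end of the sequence
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Illustrative labels and features only.
        List<TokenRow> sentence = Arrays.asList(
            new TokenRow("B-PER", Arrays.asList("text=John", "cap=INITIAL_UPPERCASE")),
            new TokenRow("O", Arrays.asList("text=sleeps", "cap=ALL_LOWERCASE")));
        System.out.print(render(Arrays.asList(sentence)));
    }
}
```

Keeping one sentence per block matters for a CRF, because the model learns transitions between the labels of neighbouring tokens within a sequence.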
In this model, the following feature extractors are used to classify the desired named entities:
- `TypePathExtractor` for stems
- `TypePathExtractor` for part-of-speech tags
- `FeatureFunctionExtractor` with the following functions:
  - `CoveredTextExtractor<Token>` - Gets the covered text of the token's annotation.
  - `LowerCaseFeatureFunction()` - Adds the lowercased form of the covered text as a feature.
  - `CapitalTypeFeatureFunction()` - Checks whether the characters in question follow any of these four patterns: all uppercase (`ALL_UPPERCASE`), first character uppercase (`INITIAL_UPPERCASE`), all lowercase (`ALL_LOWERCASE`) or mixed case (`MIXED_CASE`).
  - `NumericTypeFeatureFunction()` - Checks whether the characters in question follow any of these three patterns: digits (`DIGITS`), year digits (`YEAR_DIGITS`) or alphanumeric (`ALPHANUMERIC`).
  - `CharacterNgramFeatureFunction(fromRight, 0, 2)` - With the orientation set to `RIGHT_TO_LEFT` and characters 0 to 2, this takes the last two characters of the token, i.e. we are checking for suffixes.
- `CleartkExtractor<Token, Token>` for `Preceding(1)`: We extract the annotation of the previous word.
- `CleartkExtractor<Token, Token>` with `TypePathExtractor` POS for `Preceding(2)`: Extracts the part-of-speech tags of the two previous words.
- `CleartkExtractor<Token, Token>` for class `SurroundingCapFeature.class`: This feature extractor checks whether the two previous words and the two following words begin with a capital letter.
- `FeatureFunctionExtractor` for class `NameEntityExtractor.class`: Reads a gazetteer list provided in src/main/resources/ner/eng.list and creates a new feature based on each NER tag, as shown below.
  - Tag: MISC - `new Feature("miscName-" + tag)`
  - Tag: PER - `new Feature("perName-" + tag)`
  - Tag: ORG - `new Feature("orgName-" + tag)`
  - Tag: LOC - `new Feature("locName-" + tag)`
- `FeatureFunctionExtractor` for class `CityEntityExtractor.class`: Reads a gazetteer list provided in src/main/resources/ner/worldcitiespop.txt and creates a new feature based on the LOC NER tag, as shown below.
  - Tag: LOC - `new Feature("locName-" + tag)`
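To make three of the extractors above more concrete, here is a minimal plain-Java sketch of a right-to-left character n-gram, a surrounding-capitalization check and a gazetteer lookup. It does not use ClearTK and is not the code of `SurroundingCapFeature` or `NameEntityExtractor`: the class `NerFeatureSketch`, its method names, the return types and the in-memory gazetteer map are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class NerFeatureSketch {

    /** Characters [start, end) counted from the right, or null if the token is too short. */
    static String suffix(String token, int start, int end) {
        if (token.length() < end) {
            return null;
        }
        return token.substring(token.length() - end, token.length() - start);
    }

    /** Offsets among -2, -1, +1, +2 whose token exists and begins with an uppercase letter. */
    static List<Integer> capitalizedNeighbours(List<String> tokens, int focus) {
        List<Integer> offsets = new ArrayList<>();
        for (int offset = -2; offset <= 2; offset++) {
            int i = focus + offset;
            if (offset == 0 || i < 0 || i >= tokens.size()) {
                continue; // skip the focus token and positions outside the sentence
            }
            String neighbour = tokens.get(i);
            if (!neighbour.isEmpty() && Character.isUpperCase(neighbour.charAt(0))) {
                offsets.add(offset);
            }
        }
        return offsets;
    }

    /** Feature name such as "locName-LOC", or null if the token is not in the gazetteer. */
    static String gazetteerFeature(Map<String, String> gazetteer, String token) {
        String tag = gazetteer.get(token);
        return tag == null ? null : tag.toLowerCase(Locale.ROOT) + "Name-" + tag;
    }

    public static void main(String[] args) {
        Map<String, String> gazetteer = new HashMap<>(); // hypothetical entries
        gazetteer.put("Berlin", "LOC");
        List<String> tokens = Arrays.asList("Angela", "Merkel", "visited", "New", "York");

        System.out.println(suffix("Berlin", 0, 2));
        System.out.println(capitalizedNeighbours(tokens, 2));
        System.out.println(gazetteerFeature(gazetteer, "Berlin"));
    }
}
```

In the real pipeline these values would be wrapped in feature objects and attached to the focus token; the sketch only shows how each value could be derived from the token text and its neighbours.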
The following features were tried but either brought no improvement or actually decreased the performance of the classification process:
- POS tags of the tokens following the focus token;
- POS tag of one token directly preceding the focus token.