Stanford CoreNLP
- First, download the Stanford CoreNLP jar from its webpage.
- Navigate to the path of the stanford-corenlp.jar.
- Run command
java -cp stanford-corenlp.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop <file.prop>
- This command trains a CRF model according to file.prop and serializes it to disk.
- file.prop specifies the training file and the features to be used in the training process.
- Run command
java -cp stanford-corenlp.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier <ner-model.ser.gz> -testFile <file_test.txt>
- This command tags file_test.txt using the CRF model generated in the previous step.
- It also reports the Precision, Recall and F1 evaluation for each class.
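The file.prop passed to the training command could look like the following minimal sketch. The property keys are standard CRFClassifier options; the file names are placeholders, and the feature set shown here is only an example, not necessarily the one used for the results below:

```
# Hypothetical file names - adjust to your dataset
trainFile = harem_train.txt
serializeTo = ner-model.ser.gz
# Column layout of the training file: token, then entity type
map = word=0,answer=1

# Example feature set (standard CRFClassifier options)
useClassFeature = true
useWord = true
useNGrams = true
noMidNGrams = true
maxNGramLeng = 6
usePrev = true
useNext = true
useSequences = true
usePrevSequences = true
useDisjunctive = true
```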
Check folder for more information.
A file with one token and its entity type per line, where O marks tokens that are not part of any entity. Example:
"I complained to Microsoft about Bill Gates."
I O
complained O
to O
Microsoft ORGANIZATION
about O
Bill PERSON
Gates PERSON
. O
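Producing this format from tokenized, tagged data is straightforward. A minimal sketch (the helper name and the tab separator are my own choices; the separator must match the `map` setting in the prop file):

```python
def to_two_column(tagged_tokens):
    """Write (token, entity_type) pairs in the one-token-per-line
    format expected by CRFClassifier (tab-separated here)."""
    return "\n".join(f"{tok}\t{tag}" for tok, tag in tagged_tokens)

# The example sentence from above
sentence = [
    ("I", "O"), ("complained", "O"), ("to", "O"),
    ("Microsoft", "ORGANIZATION"), ("about", "O"),
    ("Bill", "PERSON"), ("Gates", "PERSON"), (".", "O"),
]

print(to_two_column(sentence))
```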
In order to evaluate with the conlleval script, the gold data and Stanford's output must share the same tokenization. For that, I used the Stanford CoreNLP tokenizer (edu.stanford.nlp.process.PTBTokenizer) on both the training and testing (gold and output) datasets. I also converted the tokenized text into CoNLL format using this script and added IOB tags using this script.
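The idea behind the IOB conversion can be sketched as follows: a token gets B- when it starts an entity (its type differs from the previous token's type) and I- when it continues one. This is a hypothetical reimplementation for illustration, not the actual script used; note that with raw types alone, two adjacent entities of the same type cannot be told apart:

```python
def add_iob_tags(pairs):
    """Convert (token, type) pairs with raw types (e.g. PERSON)
    into IOB-tagged pairs (B-PERSON / I-PERSON / O)."""
    out, prev = [], "O"
    for tok, tag in pairs:
        if tag == "O":
            out.append((tok, "O"))
        elif tag == prev:
            out.append((tok, "I-" + tag))  # continues the previous entity
        else:
            out.append((tok, "B-" + tag))  # starts a new entity
        prev = tag
    return out

print(add_iob_tags([("Bill", "PERSON"), ("Gates", "PERSON"), (".", "O")]))
# [('Bill', 'B-PERSON'), ('Gates', 'I-PERSON'), ('.', 'O')]
```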
In order to run Stanford NER with the HAREM dataset as input, the dataset has to be converted into the correct format. For the conversion, I used corpus-processor.
Steps:
- Install ruby
- Install corpus-processor ruby-gem
- Change categories to be recognised (example)
- Run command:
corpus-processor process <input-file> <output-file> --categories=<file.yml>
Check folder for more information.
Check all the results here.
Results after 4 repeats:
Level | Precision | Recall | F-measure |
---|---|---|---|
Categories | 58.84% | 53.60% | 56.10% |
Types | - | - | - |
Subtypes | - | - | - |
Filtered | 69.97% | 54.23% | 61.10% |
Note: Since running with types and subtypes was too computationally demanding, a different prop file with fewer features was used to reduce the computational load. However, because the features differ, those results would not be comparable to the other tools, so they are not displayed here.
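As a sanity check, the F-measure column is the harmonic mean of precision and recall, so each row of the table above can be reproduced:

```python
def f_measure(precision, recall):
    """F1: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Categories row: P = 58.84%, R = 53.60%
print(f_measure(0.5884, 0.5360) * 100)  # ~ 56.10
# Filtered row: P = 69.97%, R = 54.23%
print(f_measure(0.6997, 0.5423) * 100)  # ~ 61.10
```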
For this tool, I decided to check the influence of the following hyperparameters: tolerance, epsilon, MaxNGramLeng. The results are the following:
Tolerance (default: 1e-4)
Value | Categories | Filtered |
---|---|---|
1e-5 | 54.07% | 58.94% |
5e-5 | 54.02% | 59.00% |
1e-4 | 54.15% | 58.84% |
5e-4 | 54.02% | 58.72% |
1e-3 | 54.31% | 58.86% |
5e-3 | 54.12% | 58.81% |
Epsilon (default: 0.01)
Value | Categories | Filtered |
---|---|---|
0.005 | 54.15% | 58.84% |
0.01 | 54.15% | 58.84% |
0.015 | 54.15% | 58.84% |
0.02 | 54.15% | 58.84% |
MaxNGramLeng (default: 6)
Value | Categories | Filtered |
---|---|---|
4 | 53.47% | 58.31% |
5 | 53.77% | 58.66% |
6 | 54.15% | 58.84% |
7 | 54.37% | 58.97% |
Repeated holdout
Tolerance | Precision | Recall | F-measure |
---|---|---|---|
1e-4 | 90.09% | 83.41% | 86.62% |
1e-3 | 90.26% | 83.31% | 86.64% |
Repeated 10-fold cross validation
Tolerance | Precision | Recall | F-measure |
---|---|---|---|
1e-4 | 89.80% | 84.10% | 86.86% |
1e-3 | 89.81% | 83.95% | 86.78% |
Get the generated models in the Resources page.