- Introduction
- Main Features
- Setup
- Basic Usage
- Models and Scenarios
- Important Notes
- Training Your Own Model
- Morpheme and Word Embeddings
- Evaluation
- Ben-Mordecai Corpus
- Citations
Code and models for neural modeling of Hebrew NER. Described in the TACL paper "Neural Modeling for Named Entities and Morphology (NEMO2)" along with extensive experiments on the different modeling scenarios provided in this repository.
- Trained on the Hebrew NER and Morphology NEMO corpus of gold annotated Modern Hebrew news articles.
- Multiple modeling options to go from raw Hebrew text to morpheme and/or token-level NER boundaries.
- Neural model implementation of NCRF++
- bclm is used for reading and transforming morpho-syntactic information layers.
- Preferably in a virtual env:
pip install -r requirements.txt
- Install bclm>=1.0.0: http://github.com/OnlpLab/bclm (this step will no longer be needed once bclm is available on pip)
- Install yap: https://github.com/OnlpLab/yap
- Clone this NEMO repo:
git clone https://github.com/OnlpLab/NEMO.git
- Enter the repo directory:
cd NEMO
- Unpack model files:
gunzip data/*.gz
- Change YAP_PATH in config.py to the path of your local yap executable.
- In the YAP folder, run the YAP API server (if you specify a port, change it in config.py):
./yap api
- In the NEMO folder, run the NEMO API server:
uvicorn api_main:app --reload --port 8090
- You can find the available API endpoints with usage examples in api_usage.ipynb.
- Once the API server is up, you can also check out the API documentation by opening http://localhost:8090/docs in your browser.
- All you need to do is run nemo.py with a specific command (scenario) on a text file of Hebrew sentences separated by line breaks.
- You can run a neural NER model directly, or choose a full end-to-end scenario that includes morphological segmentation and alignments (described fully in the next section). For example:
  - the run_ner_model command with the token-single model will tokenize sentences and run the token-single model:
    python nemo.py run_ner_model token-single example.txt example_output.txt
  - the morph_hybrid command runs the end-to-end segmentation and NER pipeline which provided our best performing morpheme-level NER boundaries:
    python nemo.py morph_hybrid example.txt example_output_MORPH.txt
- You can find outputs of different commands on example.txt in: example_output_MORPH_HYBRID_ALIGN_TOKENS.txt, example_output_MORPH_HYBRID.txt, example_output_MORPH_YAP.txt, example_output_MULTI_ALIGN.txt, example_output_SINGLE.txt
- For a full list of the available commands, please consult the next section and the inline documentation at the end of nemo.py.
- Please use only the regular models and not the *_oov models (which contain embeddings only for words that appear in the NEMO corpus). In other words, unless you use the model to replicate our results on the Hebrew treebank, always use e.g. token-multi and not token-multi_oov.
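If you want to drive nemo.py from another Python script, the command line can be assembled programmatically. The following is a minimal sketch and not part of the repository: the helper names build_nemo_command and run_nemo are hypothetical, and the sketch assumes only the argument order of the run_ner_model example above (command, optional model name, input file, output file).

```python
import subprocess
import sys


def build_nemo_command(command, input_path, output_path, model=None):
    """Return the argv list for a nemo.py run (hypothetical helper)."""
    argv = [sys.executable, "nemo.py", command]
    # run_ner_model takes a model name (e.g. token-single) before the
    # file paths, following the example invocation in this README.
    if model is not None:
        argv.append(model)
    argv += [input_path, output_path]
    return argv


def run_nemo(command, input_path, output_path, model=None, repo_dir="."):
    """Run nemo.py inside repo_dir and raise if it exits with an error."""
    argv = build_nemo_command(command, input_path, output_path, model)
    subprocess.run(argv, cwd=repo_dir, check=True)
```

For instance, run_nemo("run_ner_model", "example.txt", "example_output.txt", model="token-single") would launch the same run as the first example, provided the YAP server and model files are set up as described in the setup steps.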
Models are all standard Bi-LSTM-CRF with char encoding (LSTM/CNN) of NCRFpp with pre-trained fastText embeddings. Differences between models lie in:
- Input units: morphemes (morph) vs. tokens (token-*)
- Output label set: token-single single sequence labels (e.g. B-ORG) vs. token-multi multi-labels (atomic labels, e.g. O-ORG^B-ORG^I-ORG) that predict, in order, the labels for the morphemes the token is made of.
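To make the token-multi label format concrete, here is a small sketch that splits a multi-label into its ordered per-morpheme labels. It assumes only what the example label above shows, namely that the per-morpheme labels are joined with ^. The function name is hypothetical and this is not the repository's own code.

```python
def split_multi_label(multi_label, sep="^"):
    """Split a token-multi label into its ordered per-morpheme labels."""
    return multi_label.split(sep)


# The multi-label from the example above yields one label per morpheme.
labels = split_multi_label("O-ORG^B-ORG^I-ORG")
print(labels)  # ['O-ORG', 'B-ORG', 'I-ORG']
```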
Token-based Models | Morpheme-based Model |
---|---|
Morphemes must be predicted. This is done by performing morphological disambiguation (MD). We offer two options to do so:
- Standard pipeline: MD using YAP. This is used in the morph_yap command, which runs our morph NER model on the output of YAP joint segmentation.
- Hybrid pipeline: MD using our best performing Hybrid approach, which uses the output of the token-multi model to reduce the MD option space. This is used in morph_hybrid, multi_align_hybrid and morph_hybrid_align_tokens. We will explain these scenarios next.
MD Approach | Commands |
---|---|
Standard | morph_yap |
Hybrid | morph_hybrid, multi_align_hybrid, morph_hybrid_align_tokens |
Finally, to get our desired output (tokens/morphemes), we can choose between different scenarios, some involving extra post-processing alignments:
- To get morpheme-level labels we have two options:
  - Run our morph NER model on predicted morphemes. Commands: morph_yap or morph_hybrid (better).
  - token-multi labels can be aligned with predicted morphemes to get morpheme-level boundaries. Command: multi_align_hybrid.
Run morph NER on Predicted Morphemes | Multi Predictions Aligned with Predicted Morphemes |
---|---|
morph_yap, morph_hybrid | multi_align_hybrid |
- To get token-level labels we have three options:
  - Run the run_ner_model command with the token-single model.
  - The predicted labels of the token-multi model can be mapped to token-single labels to get standard token-single output. The command multi_to_single does this end-to-end.
  - Morpheme-level output can be aligned back to token-level boundaries. Command: morph_hybrid_align_tokens (this achieved the best token-level results in our experiments).
Run token-single | Map token-multi to token-single | Align morph NER with Tokens |
---|---|---|
run_ner_model token-single | multi_to_single | morph_hybrid_align_tokens |
- Note: while the morph_hybrid* scenarios offer the best performance, they are less efficient since they require running both the morph and token-multi NER models.
- NCRFpp was great for our experiments on the NEMO corpus (which is given, constant, data), but it holds some caveats for real life scenarios of arbitrary text:
- fastText is not used on the fly to obtain vectors for OOV words (i.e. those that were not seen in our Wikipedia corpus). Instead, it is used as a regular embedding matrix. Hence the full generalization capacity of fastText, as shown in our experiments, is not available in the currently provided models, which will perform slightly worse than they could on arbitrary text. In our experiments we created such a matrix in advance with all the words in the NEMO corpus and used it during training. Information on training your own model with your own vocabulary is given in the next section.
- We currently do not provide an API, only file input/outputs. The pipeline works in the background through temp files; you can choose to delete these by default using the DELETE_TEMP_FILES config parameter.
- In the near future we plan to publish a cleaner end-to-end implementation, including use of our new AlephBERT pre-trained Transformer models.
- For archiving and reproducibility purposes, our original code used for experiments and analysis can be found in the following repos: https://github.com/cjer/NCRFpp, https://github.com/cjer/NER (beware - 2 years of Jupyter notebooks).
We provide template NCRF++ config files. These files already contain the hyperparameters we used in our training. To train your own model:
- Copy the config for the variant (token-multi, token-single, morph) you wish to use from the ncrf_train_configs folder.
- Change the parameter word_emb_dir to the path of an embedding vectors file in standard word2vec textual format. You can use the fastText bin models we make available (in the next section) or any other embedding vectors of your choice.
- Run the following in your shell:
python ncrf_main.py --config <path_to_config> --device <gpu_device_number>
- For more information, please consult NCRF++ documentation.
- To evaluate your trained models, please consult the evaluation section.
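Because the models use the embeddings as a fixed matrix, it can be useful to check how much of your own vocabulary an embeddings file covers before training. The sketch below is an illustration only, under stated assumptions: it assumes the usual word2vec textual layout (an optional first line with vocabulary size and dimension, then one word followed by its vector per line), and the function names are hypothetical. It does not reproduce how NCRF++ loads embeddings.

```python
import io


def read_embedding_vocab(lines):
    """Collect the words of an embeddings file in word2vec textual format."""
    vocab = set()
    for i, line in enumerate(lines):
        parts = line.rstrip().split(" ")
        # The optional first line holds "<vocab_size> <dimension>"; skip it.
        if i == 0 and len(parts) == 2 and all(p.isdigit() for p in parts):
            continue
        # A vector line is a word followed by at least one number.
        if len(parts) > 1:
            vocab.add(parts[0])
    return vocab


def coverage(words, vocab):
    """Fraction of word occurrences that have a pre-trained vector."""
    words = list(words)
    if not words:
        return 0.0
    return sum(w in vocab for w in words) / len(words)


# Tiny in-memory stand-in for an embeddings file (3 words, 2 dimensions).
sample = io.StringIO("3 2\nשלום 0.1 0.2\nעולם 0.3 0.4\nבית 0.5 0.6\n")
vocab = read_embedding_vocab(sample)
print(coverage(["שלום", "עולם", "ירושלים", "בית"], vocab))  # 0.75
```

With a real file, pass an open file handle (for example open(path, encoding="utf-8")) instead of the in-memory sample.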
The word embeddings we trained and used in our models are available:
- Space-delimited tokens (traditional word embeddings): fastText (bin, text), GloVe, word2vec
- Morphemes: fastText (bin, text), GloVe, word2vec
These were trained on a 2013 Wiki dump corpus by Yoav Goldberg, which we re-tokenized and then re-parsed using YAP:
- Space-delimited tokens
- Morphemes, automatic YAP segmentation (using the morpheme FORM as the unit for embedding)
- CONLL files of full morpho-syntactic output of YAP
To evaluate your predictions against gold, use the ne_evaluate_mentions.py script. Evaluation looks for an exact match of string and entity category, but differs slightly from the standard CoNLL2003 evaluation commonly used for NER. The reason is that predicted segmentation differs from gold, so positional indexes of sequence labels cannot be used. Instead, we extract multi-sets of entity mentions and use set operations to compute precision, recall and F1-score. You can find a more detailed discussion of evaluation in the NEMO2 paper.
To evaluate an output prediction file against a gold file use:
python ne_evaluate_mentions.py <path_to_gold_ner> <path_to_predicted_ner>
If you're within Python, just call ne_evaluate_mentions.evaluate_files(...) with the same parameters.
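The multi-set idea behind this evaluation can be sketched in a few lines. This is an illustration of the technique described above, not the code of ne_evaluate_mentions.py: the function name is hypothetical, and representing a mention as a (sentence id, mention string, category) tuple is an assumption made for the example.

```python
from collections import Counter


def mention_prf(gold_mentions, pred_mentions):
    """Precision, recall and F1 over multi-sets of entity mentions."""
    gold = Counter(gold_mentions)
    pred = Counter(pred_mentions)
    # Multi-set intersection: each mention counts as many times as it
    # appears in both gold and prediction.
    true_pos = sum((gold & pred).values())
    n_pred = sum(pred.values())
    n_gold = sum(gold.values())
    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Mentions as (sentence id, mention string, category); an assumed format.
gold = [(0, "בנק ישראל", "ORG"), (0, "ירושלים", "LOC"), (1, "ירושלים", "LOC")]
pred = [(0, "בנק ישראל", "ORG"), (1, "ירושלים", "LOC"), (1, "ירושלים", "LOC")]
precision, recall, f1 = mention_prf(gold, pred)
```

Because mentions are compared as strings rather than by position, a prediction with a different segmentation can still be scored against gold, and a duplicated prediction is only credited as often as the mention occurs in gold.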
In our NEMO2 paper we also evaluate our models on the Ben-Mordecai Hebrew NER Corpus (BMC). The 3 random splits we used can be found here.
If you use any of the NEMO2 code, models, embeddings or the NEMO corpus, please cite the NEMO2 paper:
@article{DBLP:journals/corr/abs-2007-15620,
author = {Dan Bareket and
Reut Tsarfaty},
title = {Neural Modeling for Named Entities and Morphology (NEMO{\^{}}2)},
journal = {CoRR},
volume = {abs/2007.15620},
year = {2020},
url = {https://arxiv.org/abs/2007.15620},
archivePrefix = {arXiv},
eprint = {2007.15620},
timestamp = {Mon, 03 Aug 2020 14:32:13 +0200},
biburl = {https://dblp.org/rec/journals/corr/abs-2007-15620.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
If you use NEMO2's NER models, please also cite NCRF++:
@inproceedings{yang2018ncrf,
title={{NCRF}++: An Open-source Neural Sequence Labeling Toolkit},
author={Yang, Jie and Zhang, Yue},
booktitle={Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics},
Url = {http://aclweb.org/anthology/P18-4013},
year={2018}
}