NER Classifier

Description

This project, part of Edison Premium, contains the NER model used to classify entities extracted from email signature blocks.

Project Overview

Architecture

After comparing several architectures, we settled on a Bi-LSTM CRF neural network with word and character embeddings.
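As a rough illustration of what "word and character embeddings" implies for the model input, the sketch below turns a sentence into word ids plus padded per-word character ids. The `encode` function, the `PAD`/`UNK` ids, the lowercasing of words, and the toy vocabularies are all assumptions made for this example; the actual preprocessing lives in bi_lstm.py and may differ.

```python
from typing import Dict, List, Tuple

PAD, UNK = 0, 1  # assumed reserved ids for padding and unknown tokens


def encode(sentence: str,
           word_vocab: Dict[str, int],
           char_vocab: Dict[str, int],
           max_word_len: int = 8) -> Tuple[List[int], List[List[int]]]:
    """Return (word ids, per-word character ids padded to max_word_len)."""
    words = sentence.split()
    # Word-level view: one id per whitespace-separated token.
    word_ids = [word_vocab.get(w.lower(), UNK) for w in words]
    # Character-level view: one fixed-length id list per token.
    char_ids = []
    for w in words:
        ids = [char_vocab.get(c, UNK) for c in w[:max_word_len]]
        ids += [PAD] * (max_word_len - len(ids))
        char_ids.append(ids)
    return word_ids, char_ids


# Toy vocabularies, made up for this example only.
word_vocab = {"po": 2, "box": 3}
char_vocab = {c: i + 2 for i, c in enumerate("POBox0123456789")}

word_ids, char_ids = encode("PO Box 7193", word_vocab, char_vocab)
print(word_ids)     # [2, 3, 1]
print(char_ids[0])  # [2, 3, 0, 0, 0, 0, 0, 0]
```

The unknown word "7193" maps to `UNK` at the word level but still carries signal through its digit characters, which is the usual motivation for combining both embedding types.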

Components

  • Train model and adjust hyperparameters: bi_lstm.py
  • Convert model into TensorFlow Lite for mobile: tf_converter.py
  • Evaluate model using test-set data: batch_test.py
  • Quick-run to see model output: tf_lite_invoke.py

Setup

pip install -r requirements.txt

Quick Run

Open tf_lite_invoke.py and run it with the following code, where sentence is the text to classify:

lite_model.set_sentence(sentence)
label = lite_model.analyze()
print(sentence, ':label is ', label)

Example Output

Jordan McDonald :label is  name
PO Box 7193 :label is  loc
Gujarat, INDIA :label is  loc
Legal, financial, technical translations :label is  tit
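To label a whole signature block rather than one sentence, the same two calls can be wrapped in a loop. This is a minimal sketch: `label_lines` is a hypothetical helper, and `StubModel` is a stand-in that only mimics the `set_sentence()` / `analyze()` interface so the example runs without the .tflite file. Pass the real `lite_model` from tf_lite_invoke.py in its place.

```python
from typing import Iterable, List, Tuple


def label_lines(model, lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Label each non-empty line via the set_sentence()/analyze() calls."""
    results = []
    for line in lines:
        sentence = line.strip()
        if not sentence:
            continue  # skip blank lines in the signature block
        model.set_sentence(sentence)
        results.append((sentence, model.analyze()))
    return results


class StubModel:
    """Hypothetical stand-in for lite_model; not the real classifier."""

    def set_sentence(self, sentence: str) -> None:
        self._sentence = sentence

    def analyze(self) -> str:
        # Toy rule for the demo only; the real model predicts the label.
        return "loc" if any(ch.isdigit() for ch in self._sentence) else "name"


signature = """Jordan McDonald

PO Box 7193"""
labelled = label_lines(StubModel(), signature.splitlines())
for sentence, label in labelled:
    print(sentence, ':label is ', label)
```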

Training the Model

Training Time

  • Training the model locally is not recommended because the dataset is very large
  • Instead, we train the model on a deployed Ubuntu VM
  • Training each model on the VM takes roughly 4-6 hours

Basic Train

Run the script bilstm_onefile/bi_lstm.py directly, or run it from a terminal:

python3 bilstm_onefile/bi_lstm.py

VM (Large Machine) Train

Remember to replace the file directories and SSH key with your own.

  1. Start your VM and SSH into it
aws ec2 start-instances --instance-ids i-029dfa9b95dba8117

ssh -i ~/.ssh/mykey ubuntu@34.212.42.106
  2. Navigate to the nc project
cd nc
  3. Run the predefined shell script via nohup and monitor its progress
nohup sh ~/nc/run.sh >~/nc/script.py.log </dev/null 2>&1 &

tail -f ~/nc/script.py.log
  4. Once training completes, copy three remote files to your local machine: bi-lstm.tflite, charJson, and wordJson
scp -i ~/.ssh/mykey -r ubuntu@34.212.42.106:~/nc/model/Bi-LSTM/bi-lstm.tflite ~/desktop/edison-ai/ner-tflite

scp -i ~/.ssh/mykey -r ubuntu@34.212.42.106:~/nc/model/Bi-LSTM/charJson ~/desktop/edison-ai/ner-tflite

scp -i ~/.ssh/mykey -r ubuntu@34.212.42.106:~/nc/model/Bi-LSTM/wordJson ~/desktop/edison-ai/ner-tflite
  5. Clean the old files and data off the VM
sh clean.sh
  6. Copy bi-lstm.tflite, charJson, and wordJson into model/Bi-LSTM

  7. Afterwards, perform the Quick Run described above
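Before the Quick Run, it can help to confirm that all three artifacts actually landed in model/Bi-LSTM. The check below is a sketch, not part of the repository: `missing_artifacts` is a hypothetical helper, and the file names are taken from the scp commands above. The demo runs against a temporary directory so it does not depend on a real checkout.

```python
import tempfile
from pathlib import Path
from typing import List

# Names taken from the scp commands in the steps above.
EXPECTED_FILES = ("bi-lstm.tflite", "charJson", "wordJson")


def missing_artifacts(model_dir: str) -> List[str]:
    """Return the expected artifacts that are absent (or empty files)."""
    root = Path(model_dir)
    missing = []
    for name in EXPECTED_FILES:
        path = root / name
        # Treat a zero-byte file as missing; directories only need to exist.
        if not path.exists() or (path.is_file() and path.stat().st_size == 0):
            missing.append(name)
    return missing


# Demo: only the .tflite file is present in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "bi-lstm.tflite").write_bytes(b"\x00")
    report = missing_artifacts(tmp)
print(report)  # ['charJson', 'wordJson']
```

In a real checkout, call `missing_artifacts("model/Bi-LSTM")` and proceed only if it returns an empty list.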

Data

Data files are divided into:

  • training set (90%)
  • test set (10%)
  • sample set (1%)
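One way to produce such a split is sketched below. Because the three percentages add up to more than 100%, this example assumes the sample set is a small subset drawn from the full data rather than a disjoint partition; that is an assumption, not something the project states. `split_dataset` is a hypothetical helper and is not taken from process_data.

```python
import random
from typing import List, Sequence, Tuple


def split_dataset(rows: Sequence[str],
                  seed: int = 0) -> Tuple[List[str], List[str], List[str]]:
    """Shuffle rows, then return (train 90%, test 10%, sample ~1%)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 9 // 10  # integer arithmetic avoids float rounding
    train, test = shuffled[:cut], shuffled[cut:]
    # Assumption: the sample set overlaps with train/test (quick smoke tests).
    sample_size = min(len(shuffled), max(1, len(shuffled) // 100))
    sample = random.Random(seed).sample(shuffled, sample_size)
    return train, test, sample


rows = [f"line {i}" for i in range(1000)]
train, test, sample = split_dataset(rows)
print(len(train), len(test), len(sample))  # 900 100 10
```

Fixing the seed keeps the split reproducible between training runs, so evaluation numbers stay comparable.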

Process

Data augmentation, generation, and sorting: process_data/jeff_work

Dataset

Model Evaluation Comparisons

View Performance

List of edge-cases / wrongly classified

Should be LOC

  • wildmoor :label is name

Should be ORG

Should be PER(name)

  • Rob Record :label is loc
  • Carla Roppo-Owczarek :label is org
  • (BRIAN) HEXTER :label is org
  • Senior Engineer - Projects & Services :label is org

Should be TIT

  • IT Support :label is org
  • admin :label is loc
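The cases above can be kept as a small regression list and re-checked after each retraining. The sketch below is illustrative only: `failing_cases` is a hypothetical helper, `predict` is any callable that maps text to a label (for example a thin wrapper around `set_sentence()` and `analyze()`), and the demo predictor simply replays the wrong labels reported above rather than calling the model.

```python
from typing import Callable, List, Sequence, Tuple

# (text, expected label) pairs taken from the list above.
EDGE_CASES = [
    ("wildmoor", "loc"),
    ("Rob Record", "name"),
    ("Carla Roppo-Owczarek", "name"),
    ("IT Support", "tit"),
    ("admin", "tit"),
]


def failing_cases(predict: Callable[[str], str],
                  cases: Sequence[Tuple[str, str]]) -> List[Tuple[str, str, str]]:
    """Return (text, expected, predicted) for every misclassified case."""
    failures = []
    for text, expected in cases:
        predicted = predict(text)
        if predicted != expected:
            failures.append((text, expected, predicted))
    return failures


# Hypothetical predictor that replays the labels reported above.
reported = {
    "wildmoor": "name",
    "Rob Record": "loc",
    "Carla Roppo-Owczarek": "org",
    "IT Support": "org",
    "admin": "loc",
}
failures = failing_cases(lambda text: reported[text], EDGE_CASES)
print(len(failures))  # 5
```

A shrinking `failures` list after retraining is a quick signal that the known edge cases are improving.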
