This project, part of Edison Premium, contains the NER model used for classifying entities extracted from mail signature blocks.
After much comparison, we settled on a Bi-LSTM CRF neural network with word and character embeddings.
- Train the model and adjust hyperparameters: bi_lstm.py
- Convert the model into TensorFlow Lite for mobile: tf_converter.py
- Evaluate the model on test-set data: batch_test.py
- Quick run to see model output: tf_lite_invoke.py
Install the dependencies first:
pip install -r requirements.txt
Navigate to tf_lite_invoke.py and run the script with the following code (lite_model is the model wrapper defined in that script):
lite_model.set_sentence(sentence)
label = lite_model.analyze()
print(sentence, ':label is ', label)
Example Output
Jordan McDonald :label is name
PO Box 7193 :label is loc
Gujarat, INDIA :label is loc
Legal, financial, technical translations :label is tit
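For reference, the short labels in the output correspond to the entity categories covered by the data sections below; a minimal lookup (the exact wording of each description is an assumption, not taken from this repo) might be:

```python
# Assumed mapping from the model's short labels to readable entity types;
# the label set is collected from the examples and data sections in this README.
LABELS = {
    "name": "person name",
    "loc": "location",
    "tit": "job title",
    "tel": "telephone number",
    "org": "organization",
}

def describe(label: str) -> str:
    """Return a readable description for a label, or 'unknown'."""
    return LABELS.get(label, "unknown")
```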
- Training the model locally is not recommended, as the amount of data processed is huge
- Instead, we train the model on a deployed Ubuntu VM
- Estimated time to train each model on the VM is about 4-6 hours
Run the script bilstm_onefile/bi_lstm.py directly
OR
Run it via the terminal:
python3 bilstm_onefile/bi_lstm.py
Remember to change the file directories and key to your own
- Turn on your VM server and SSH into it
aws ec2 start-instances --instance-ids i-029dfa9b95dba8117
ssh -i ~/.ssh/mykey ubuntu@34.212.42.106
- Navigate to our nc project
cd nc
- Run the predefined shell script via nohup and monitor progress (stdout and stderr are redirected to script.py.log, and stdin is detached so training keeps running after you log out)
nohup sh ~/nc/run.sh >~/nc/script.py.log </dev/null 2>&1 &
tail -f ~/nc/script.py.log
- Once training completes, copy the 3 remote files to your local machine: bi-lstm.tflite, charJson, wordJson
scp -i ~/.ssh/mykey -r ubuntu@34.212.42.106:~/nc/model/Bi-LSTM/bi-lstm.tflite ~/desktop/edison-ai/ner-tflite
scp -i ~/.ssh/mykey -r ubuntu@34.212.42.106:~/nc/model/Bi-LSTM/charJson ~/desktop/edison-ai/ner-tflite
scp -i ~/.ssh/mykey -r ubuntu@34.212.42.106:~/nc/model/Bi-LSTM/wordJson ~/desktop/edison-ai/ner-tflite
- Clean the old files and data off the VM
sh clean.sh
- Copy the contents of bi-lstm.tflite, charJson, and wordJson into model/Bi-LSTM
- After that, perform the Quick Run as described above
Data files are divided into:
- training set (90%)
- test set (10%)
- sample set (1%)
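A minimal sketch of such a split (assuming the 1% sample is drawn from the shuffled training portion, which this README does not specify; the actual split code lives under process_data):

```python
import random

def split_dataset(lines, seed=0):
    """Shuffle and split data into train (90%) / test (10%), plus a 1%
    sample of the training set for quick inspection (where the sample
    comes from is an assumption)."""
    rng = random.Random(seed)
    lines = list(lines)
    rng.shuffle(lines)
    cut = int(len(lines) * 0.9)      # 90% train, 10% test
    train, test = lines[:cut], lines[cut:]
    sample = train[: max(1, len(train) // 100)]  # 1% sample
    return train, test, sample
```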
Data augmentation, generation, and sorting: process_data/jeff_work
TIT
- osm 18
- O*NET OnLine
- Google Jobs
- Free Title_skills, with Spanish words removed
- NLPaug generation of job titles
TEL
- Follows the US phone number format
- Edison data + artificial augmentation
- Added miscellaneous country number formats
ORG
- Changed to a 7M company-name corpus; included only the 2,278,866 lines that are US-based
- Crunchbase Companies data
LOC
- Added countries, states, streets
- UK data
- US Kaggle open addresses data [DELETED]
- Added random places of interest
- Generated data with the following example address pattern formats:
  - UK: street, town, county
  - US: street, city, state, postal
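The two pattern formats above can be rendered from field values like so (the field values here are hypothetical placeholders, not project data):

```python
# Templates matching the UK and US address pattern formats listed above.
UK_PATTERN = "{street}, {town}, {county}"
US_PATTERN = "{street}, {city}, {state}, {postal}"

def format_address(pattern: str, **fields: str) -> str:
    """Render one synthetic address line from a pattern and its fields."""
    return pattern.format(**fields)
```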
PER
- Added single names
- Augmented different name formats such as .D
- Edison email
- Baby names, USA names
- wildmoor :label is name
- Rob Record :label is loc
- Carla Roppo-Owczarek :label is org
- (BRIAN) HEXTER :label is org
- Senior Engineer - Projects & Services :label is org
- IT Support :label is org
- admin :label is loc