This project, part of Edison Premium, contains the NER model used for classifying entities extracted from mail signature blocks.
After much comparison, we settled on a Bi-LSTM CRF neural network with word and character embeddings.
- Train the model and adjust hyperparameters: bi_lstm.py
- Convert the model into TensorFlow Lite for mobile: tf_converter.py
- Evaluate the model on test-set data: batch_test.py
- Quick run to see model output: tf_lite_invoke.py
Install the dependencies first:
pip install -r requirements.txt
Navigate to tf_lite_invoke.py and run the script with the following code (lite_model is the model wrapper defined in that script):
lite_model.set_sentence(sentence)
label = lite_model.analyze()
print(sentence, ':label is ', label)
Example Output
Jordan McDonald :label is name
PO Box 7193 :label is loc
Gujarat, INDIA :label is loc
Legal, financial, technical translations :label is tit
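For reference, the short labels in the output correspond to the entity categories covered by the data sections below; a minimal lookup (the exact wording of each description is an assumption, not taken from this repo) might be:

```python
# Assumed mapping from the model's short labels to readable entity types;
# the label set is collected from the examples and data sections in this README.
LABELS = {
    "name": "person name",
    "loc": "location",
    "tit": "job title",
    "tel": "telephone number",
    "org": "organization",
}

def describe(label: str) -> str:
    """Return a readable description for a label, or 'unknown'."""
    return LABELS.get(label, "unknown")
```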
- Training the model locally is not recommended, as the amount of data processed is huge
- Instead, we train the model on a deployed Ubuntu VM
- Estimated time to train each model on the VM is about 4-6 hours
Run the script bilstm_onefile/bi_lstm.py directly
OR
Run it via the terminal:
python3 bilstm_onefile/bi_lstm.py
Remember to change the file directories and key to your own
- Turn on your VM server and SSH into it
aws ec2 start-instances --instance-ids i-029dfa9b95dba8117
ssh -i ~/.ssh/mykey ubuntu@34.212.42.106
- Navigate to our nc project
cd nc
- Run the predefined shell script via nohup and monitor progress (stdout and stderr are redirected to script.py.log, and stdin is detached so training keeps running after you log out)
nohup sh ~/nc/run.sh >~/nc/script.py.log </dev/null 2>&1 &
tail -f ~/nc/script.py.log
- Once training completes, copy the 3 remote files to your local machine: bi-lstm.tflite, charJson, wordJson
scp -i ~/.ssh/mykey -r ubuntu@34.212.42.106:~/nc/model/Bi-LSTM/bi-lstm.tflite ~/desktop/edison-ai/ner-tflite
scp -i ~/.ssh/mykey -r ubuntu@34.212.42.106:~/nc/model/Bi-LSTM/charJson ~/desktop/edison-ai/ner-tflite
scp -i ~/.ssh/mykey -r ubuntu@34.212.42.106:~/nc/model/Bi-LSTM/wordJson ~/desktop/edison-ai/ner-tflite
- Clean the old files and data off the VM
sh clean.sh
- Copy the contents of bi-lstm.tflite, charJson, and wordJson into model/Bi-LSTM
- After that, perform the Quick Run as described above
Data files are divided into:
- training set (90%)
- test set (10%)
- sample set (1%)
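A minimal sketch of such a split (assuming the 1% sample is drawn from the shuffled training portion, which this README does not specify; the actual split code lives under process_data):

```python
import random

def split_dataset(lines, seed=0):
    """Shuffle and split data into train (90%) / test (10%), plus a 1%
    sample of the training set for quick inspection (where the sample
    comes from is an assumption)."""
    rng = random.Random(seed)
    lines = list(lines)
    rng.shuffle(lines)
    cut = int(len(lines) * 0.9)      # 90% train, 10% test
    train, test = lines[:cut], lines[cut:]
    sample = train[: max(1, len(train) // 100)]  # 1% sample
    return train, test, sample
```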
Data augmentation, generation, and sorting: process_data/jeff_work
TIT
- osm 18
- O*NET OnLine
- Google Jobs
- Free Title_skills, with Spanish words removed
- NLPaug generation of job titles
TEL
- Follows the US phone number format
- Edison data + artificial augmentation
- Added miscellaneous country number formats
ORG
- Changed to a 7M company-name corpus; included only the 2,278,866 lines that are US-based
- Crunchbase Companies data
LOC
- Added countries, states, streets
- UK data
- US Kaggle open addresses data [DELETED]
- Added random places of interest
- Generated data with the following example address pattern formats:
  - UK: street, town, county
  - US: street, city, state, postal
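The two pattern formats above can be rendered from field values like so (the field values here are hypothetical placeholders, not project data):

```python
# Templates matching the UK and US address pattern formats listed above.
UK_PATTERN = "{street}, {town}, {county}"
US_PATTERN = "{street}, {city}, {state}, {postal}"

def format_address(pattern: str, **fields: str) -> str:
    """Render one synthetic address line from a pattern and its fields."""
    return pattern.format(**fields)
```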
PER
- Added single names
- Augmented different name formats such as .D
- Edison email
- Baby names, USA names
- wildmoor :label is name
- Rob Record :label is loc
- Carla Roppo-Owczarek :label is org
- (BRIAN) HEXTER :label is org
- Senior Engineer - Projects & Services :label is org
- IT Support :label is org
- admin :label is loc