Language support for person name #385

Merged on Dec 8, 2020 (44 commits)

Commits
b09e431
All datasets to dict
ashutoshsingh0223 Nov 4, 2020
f28d8fd
Add SpacyTagger class and usage in NameDetector
ashutoshsingh0223 Nov 5, 2020
4e48ce8
Add indic translations for badwords, stopwords, common and question w…
ashutoshsingh0223 Nov 5, 2020
bffc7af
Indic unicode ranges
ashutoshsingh0223 Nov 5, 2020
45734fe
Indic unicode ranges
ashutoshsingh0223 Nov 5, 2020
74b146e
Add all indic languages in hindi block
ashutoshsingh0223 Nov 5, 2020
d53e8d1
Remove s from english match regex in detect_hindi_name
ashutoshsingh0223 Nov 5, 2020
8ae9529
NameDetector: Add INDIC_LANGUAGES_SET in detect_entity and replace_de…
ashutoshsingh0223 Nov 5, 2020
ffb1817
Remove spacy tagging code
ashutoshsingh0223 Nov 5, 2020
53f0db1
Download spacy models in requirements.txt
ashutoshsingh0223 Nov 6, 2020
adb7bcb
Add NAME_VARIATIONS and language set for european languages
ashutoshsingh0223 Nov 6, 2020
af41391
Add tokenize method to SpacyUtils class
ashutoshsingh0223 Nov 6, 2020
fece774
Add rule to match tags generated by spacy
ashutoshsingh0223 Nov 6, 2020
a8d5846
use token.tag instead of pos_
ashutoshsingh0223 Nov 6, 2020
3340dfd
use token.tag instead of pos_
ashutoshsingh0223 Nov 6, 2020
3d22aa9
Revert back to pos_
ashutoshsingh0223 Nov 6, 2020
c4758cb
Resolve merge conflicts
ashutoshsingh0223 Nov 6, 2020
becbec8
Resolve merge conflicts
ashutoshsingh0223 Nov 6, 2020
bf42bee
Lint fixes
ashutoshsingh0223 Nov 6, 2020
08ad660
Merge pull request #381 from hellohaptik/Spacy_stuff
ashutoshsingh0223 Nov 6, 2020
b68a58d
Merge branch 'develop' into Language_support_person_name
ashutoshsingh0223 Nov 6, 2020
d0ea4ad
Lint Fixes
ashutoshsingh0223 Nov 6, 2020
ffbf344
Merge branch 'Language_support_person_name' of https://github.com/hel…
ashutoshsingh0223 Nov 6, 2020
831ff90
Remove unsupported test cases for person_name
ashutoshsingh0223 Nov 9, 2020
ea32714
Add bot_message column in person_name test cases
ashutoshsingh0223 Nov 9, 2020
cedf2bb
Add bot_message column in person_name test cases
ashutoshsingh0223 Nov 9, 2020
12a7de2
Consider default language as english in get_name_using_pos_tagger
ashutoshsingh0223 Nov 9, 2020
f849e9a
Consider default language as english in get_name_using_pos_tagger
ashutoshsingh0223 Nov 9, 2020
f123a3e
Fix assert_data in ner_v2.detectors.textual.tests.test_elastic_search…
ashutoshsingh0223 Nov 9, 2020
b954e88
Fix test_queries.py to use env variables for search size
ashutoshsingh0223 Nov 9, 2020
87b5e9a
Fix test_queries.py to use env variables for search size
ashutoshsingh0223 Nov 9, 2020
6f75adf
Fix lint errors
ashutoshsingh0223 Nov 9, 2020
0731b43
Add documentation for SpacyUtils class
ashutoshsingh0223 Nov 10, 2020
b071b93
Add tests for SpacyUtils
ashutoshsingh0223 Nov 10, 2020
dd4110c
Add previous message variation for hindi
ashutoshsingh0223 Nov 12, 2020
45d1e75
Remove hindi regex patterns and methods - get_hindi_names_from_regex,…
ashutoshsingh0223 Nov 24, 2020
f244b2d
Substring check in get_hindi_name_without regex
ashutoshsingh0223 Nov 25, 2020
03650a9
Remove regex test cases
ashutoshsingh0223 Dec 2, 2020
646a223
Merge branch 'develop' into Language_support_person_name
ashutoshsingh0223 Dec 2, 2020
9e5394f
Add bot_message and remove pattern based tests for person_name
ashutoshsingh0223 Dec 4, 2020
00386aa
Merge branch 'Language_support_person_name' of https://github.com/hel…
ashutoshsingh0223 Dec 4, 2020
c699a8f
Remove pdb
ashutoshsingh0223 Dec 4, 2020
2843385
Add bot_message in query_params in ner_collection.json
ashutoshsingh0223 Dec 4, 2020
7003e4a
Merge pull request #380 from hellohaptik/Language_support_person_name
ashutoshsingh0223 Dec 4, 2020
Add SpacyTagger class and usage in NameDetector
ashutoshsingh0223 committed Nov 5, 2020

commit f28d8fd2db64522c8d7752bd3a1c9c3cc90204ac
5 changes: 5 additions & 0 deletions language_utilities/constant.py
@@ -12,5 +12,10 @@
 MALAYALAM_LANG = 'ml'
 PUNJABI_LANG = 'pa'
 
+SPANISH_LANG = 'es'
+DUTCH_LANG = 'nl'
+FRENCH_LANG = 'fr'
+GERMAN_LANG = 'de'
+
 # language translation status
 TRANSLATED_TEXT = 'translated_text'
28 changes: 28 additions & 0 deletions lib/nlp/spacy_utils.py
@@ -0,0 +1,28 @@
+import six
+from lib.singleton import Singleton
+from language_utilities.constant import ENGLISH_LANG, SPANISH_LANG, DUTCH_LANG, GERMAN_LANG, FRENCH_LANG
+
+# import spacy
+
+
+# class SpacyTagger(six.with_metaclass(Singleton, object)):
+#     def __init__(self):
+#         self.spacy_language_to_model = {
+#             ENGLISH_LANG: {'name': 'en_core_web_sm', 'model': None},
+#             GERMAN_LANG: {'name': 'de_core_news_sm', 'model': None},
+#             FRENCH_LANG: {'name': 'fr_core_news_sm', 'model': None},
+#             DUTCH_LANG: {'name': 'nl_core_news_sm', 'model': None},
+#             SPANISH_LANG: {'name': 'es_core_news_sm', 'model': None}
+#         }
+#
+#     def tag(self, text, language):
+#         spacy_model_name = self.spacy_language_to_model[language]['name']
+#         nlp = self.spacy_language_to_model[language]['model']
+#         if not nlp:
+#             nlp = spacy.load(spacy_model_name, disable=['parser', 'ner'])
+#         spacy_doc = nlp(text)
+#         tokens = []
+#         for spacy_token in spacy_doc:
+#             token = (spacy_token.text, spacy_token.tag)
+#             tokens.append(token)
+#         return tokens
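
Two details in the commented-out SpacyTagger above look worth a second pass if it gets re-enabled. The loaded pipeline is never written back into spacy_language_to_model, so spacy.load would run on every call to tag(). Also, spacy_token.tag is the integer hash of the tag, while the name detector calls word[1].startswith(...), which needs the string form spacy_token.tag_. Below is a minimal sketch of one way to handle both, not the code that was merged. The class name CachedSpacyTagger and the loader parameter are hypothetical and exist only so the caching can be exercised without spaCy installed; the language codes and model names are copied from this commit.

```python
# Language code -> spaCy model name, as listed in the commented-out mapping.
SPACY_MODEL_NAMES = {
    'en': 'en_core_web_sm',
    'de': 'de_core_news_sm',
    'fr': 'fr_core_news_sm',
    'nl': 'nl_core_news_sm',
    'es': 'es_core_news_sm',
}


def _load_spacy_model(model_name):
    """Default loader. spaCy is imported here so this module loads without it."""
    import spacy
    return spacy.load(model_name, disable=['parser', 'ner'])


class CachedSpacyTagger(object):
    """Hypothetical variant of SpacyTagger that keeps each loaded pipeline."""

    def __init__(self, loader=_load_spacy_model):
        # loader is an assumption of this sketch: any callable that takes a
        # model name and returns a callable pipeline.
        self._loader = loader
        self._models = {}

    def _get_model(self, language):
        if language not in SPACY_MODEL_NAMES:
            raise ValueError('Unsupported language: {}'.format(language))
        if language not in self._models:
            # Store the pipeline so later calls reuse it instead of reloading.
            self._models[language] = self._loader(SPACY_MODEL_NAMES[language])
        return self._models[language]

    def tag(self, text, language):
        nlp = self._get_model(language)
        # tag_ is the string label; tag (no underscore) is its integer hash.
        return [(token.text, token.tag_) for token in nlp(text)]
```

With the default loader, CachedSpacyTagger().tag(text, 'es') would load es_core_news_sm on the first call only, provided that model has been downloaded. Whether the string tags of the non-English models line up with the 'NN', 'JJ', 'WP' prefixes checked in get_name_using_pos_tagger is a separate question this sketch does not settle.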
21 changes: 15 additions & 6 deletions ner_v1/detectors/textual/name/name_detection.py
@@ -5,7 +5,7 @@

 from language_utilities.constant import ENGLISH_LANG, HINDI_LANG
 from lib.nlp.const import nltk_tokenizer
-from lib.nlp.pos import POS
+from lib.nlp.pos import POS #,SpacyTagger
 from ner_v1.constant import DATASTORE_VERIFIED, MODEL_VERIFIED
 from ner_v1.constant import EMOJI_RANGES, FIRST_NAME, MIDDLE_NAME, LAST_NAME
 from ner_v1.detectors.textual.name.hindi_const import (INDIC_BADWORDS, INDIC_QUESTIONWORDS,
@@ -110,17 +110,26 @@ def get_name_using_pos_tagger(self, text):
"""

         entity_value, original_text = [], []
-        pos_tagger_object = POS()
-        name_tokens = text.split()
-        # Passing empty tokens to tag will cause IndexError
-        tagged_names = pos_tagger_object.tag(name_tokens)
 
+        if self.language == ENGLISH_LANG:
+            pos_tagger_object = POS()
+            name_tokens = text.split()
+            # Passing empty tokens to tag will cause IndexError
+            tagged_names = pos_tagger_object.tag(name_tokens)
+
+        else:
+            pass
+            # spacy_tagger = SpacyTagger()
+            # tagged_names = spacy_tagger.tag(text=text.strip(), language=self.language)
+
+        num_tokens = len(tagged_names)
+
         is_question = [word[0] for word in tagged_names if word[1].startswith('WR') or
                        word[1].startswith('WP') or word[1].startswith('CD')]
         if is_question:
             return entity_value, original_text
 
-        if len(name_tokens) < 4 and self.bot_message:
+        if num_tokens < 4 and self.bot_message:
             pos_words = [word[0] for word in tagged_names if word[1].startswith('NN') or
                          word[1].startswith('JJ')]
             if pos_words: