Build ASR Support for Regex, Email. Enhance Number, Date Entity #475

tanaya-b · 2022-03-25T04:42:41Z

JIRA Ticket Number

JIRA TICKET: ML-2962

Description of change

Add ASR Utils Library for text normalization
Build support for longest fuzzy match
Change API Calls on both Haptik API and Chatbot_NER
Update dictionaries
Updates for Number Entity
- Punctuation filtering (Numeric Entity)
- Scale resolution logic (Numeric Entity)
- Number sorting fix (Numeric Entity)
- Add Double/Triple in Scaling (Numeric Entity)
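As a rough illustration of the "Double/Triple" item, here is a minimal sketch of how a spoken repeat word could be expanded in a token stream. The diff for this logic is not shown in this thread, so `REPEAT_WORDS` and `expand_repeat_words` are hypothetical names and the behaviour is an assumption, not the PR's implementation.

```python
# Hypothetical mapping from a spoken repeat word to a repetition count.
REPEAT_WORDS = {"double": 2, "triple": 3}


def expand_repeat_words(tokens):
    """Expand 'double X' / 'triple X' into X repeated 2 / 3 times."""
    expanded = []
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token in REPEAT_WORDS and i + 1 < len(tokens):
            # Repeat the token that follows the repeat word.
            expanded.extend([tokens[i + 1]] * REPEAT_WORDS[token])
            i += 2
        else:
            expanded.append(token)
            i += 1
    return expanded


tokens = expand_repeat_words("double nine three zero".split())
```

A trailing "double" with nothing after it is kept as-is in this sketch.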

Checklist (OPTIONAL):

My code follows the style guidelines of this project
I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes
Any dependent changes have been merged and published in downstream modules

haptik-deployment · 2022-03-25T04:42:50Z

lib/nlp/text_normalization.py

+import string
+from six.moves import range
+
+from chatbot_ner.config import ner_logger


F401 'chatbot_ner.config.ner_logger' imported but unused

haptik-deployment · 2022-03-25T04:42:51Z

lib/nlp/text_normalization.py

+# Constants
+_re_flags = re.UNICODE | re.V1
+PUNCTUATION_CHARACTERS = list(string.punctuation + '। ')
+CAPTURE_RANGE_RE = "{(?P<minimum>\d+),(?P<maximum>\d+)}"


W605 invalid escape sequence '\d'
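A common way to clear W605 is to make the pattern a raw string. The sketch below uses the standard-library `re` module rather than the `regex` module the diff appears to use (`re.V1`), and it escapes the braces for clarity; both are choices of this example.

```python
import re

# Raw string: backslashes reach the regex engine untouched, so W605 no longer applies.
# The braces are escaped here for clarity; that is a choice of this sketch.
CAPTURE_RANGE_RE = r"\{(?P<minimum>\d+),(?P<maximum>\d+)\}"

# Pull the {min,max} quantifier bounds out of another pattern's source text.
match = re.search(CAPTURE_RANGE_RE, r"\w\d{3,5}")
bounds = (int(match.group("minimum")), int(match.group("maximum")))
```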

haptik-deployment · 2022-03-25T04:42:51Z

lib/nlp/text_normalization.py

+    """
+
+    if not insert_edits:
+        count = lambda l1, l2: sum([1 for x in l1 if x in l2])


E731 do not assign a lambda expression, use a def
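One way to satisfy E731 is to replace the lambda with a small named function. The name `count_common` is hypothetical; the body mirrors the lambda in the diff.

```python
def count_common(items, allowed):
    """Count how many elements of `items` also appear in `allowed`."""
    return sum(1 for item in items if item in allowed)


# Example: count the punctuation characters in a string.
punctuation_hits = count_common("a,b.c", [",", "."])
```

If the lambda is kept instead, note that flake8 and pycodestyle read `# noqa: E731` for suppression; a `# pylint: disable=...` comment only affects pylint.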

haptik-deployment · 2022-03-25T04:42:51Z

lib/nlp/text_normalization.py

+
+    Example procedure:
+        input_text = "बी nine nine three zero"
+        regex = "\w\d{4}"


W605 invalid escape sequence '\w'
W605 invalid escape sequence '\d'

haptik-deployment · 2022-03-25T04:42:51Z

ner_v2/detectors/numeral/number/number_detection.py

+                number_unit = number_value_dict[NUMBER_DETECTION_RETURN_DICT_UNIT]
+                if self.min_digit <= self._num_digits(number_value) <= self.max_digit:
+                    if self.unit_type and (number_unit is None or self.language_number_detector.units_map[
+                        number_unit].type != self.unit_type) and not self.detect_without_unit:


E125 continuation line with same indent as next logical line
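E125 is often easiest to avoid by moving the long condition into a helper, so the `if` fits on one line. The sketch below is an assumption about how that could look: `unit_requirement_fails` and the `Unit` tuple are hypothetical stand-ins, and only the boolean logic is taken from the diff.

```python
from collections import namedtuple

# Hypothetical stand-in for the entries of `units_map`.
Unit = namedtuple("Unit", ["type"])


def unit_requirement_fails(unit_type, number_unit, units_map, detect_without_unit):
    """True when a unit type is required and the detected unit is missing or of another type."""
    if not unit_type or detect_without_unit:
        return False
    if number_unit is None:
        return True
    return units_map[number_unit].type != unit_type


units_map = {"kg": Unit(type="weight")}
missing_unit = unit_requirement_fails("weight", None, units_map, False)
```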

haptik-deployment · 2022-03-25T04:42:51Z

ner_v2/detectors/numeral/number/standard_number_detector.py

+
+    # add re.escape to handle decimal cases in detected original
+    detected_original = re.escape(detected_original)
+    unit_matches = re.search(r'\W+((' + self.unit_choices + r')[.,\s]*' + detected_original + r')\W+|\W+(' +


W504 line break after binary operator
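W504 flags a line that ends with a binary operator. Assuming the project's lint configuration prefers breaks before operators, wrapping the concatenation in parentheses and starting each continuation line with `+` avoids the warning. The pattern below is a simplified, hypothetical version of the first alternative only, with made-up `unit_choices`; it uses the standard-library `re` module.

```python
import re

unit_choices = "rs|usd"                   # hypothetical unit alternatives
detected_original = re.escape("2.5")      # escape so "." matches a literal dot

# Break BEFORE each "+" so no line ends with a binary operator.
pattern = (
    r"\W+((" + unit_choices
    + r")[.,\s]*" + detected_original
    + r")\W+"
)

unit_match = re.search(pattern, " rs 2.5 ")
```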

haptik-deployment · 2022-03-25T04:42:51Z

ner_v2/detectors/numeral/number/standard_number_detector.py

+    # add re.escape to handle decimal cases in detected original
+    detected_original = re.escape(detected_original)
+    unit_matches = re.search(r'\W+((' + self.unit_choices + r')[.,\s]*' + detected_original + r')\W+|\W+(' +
+                             detected_original + r'\s*(' +


W504 line break after binary operator

haptik-deployment · 2022-03-25T04:42:51Z

ner_v2/detectors/numeral/number/standard_number_detector.py

+    end_span = -1
+    spanned_text = self.processed_text
+
+    regex_numeric_patterns = re.compile(r'(([\d,]+\.?[\d]*)\s?(' + self.scale_map_choices + r'))[\s\-\:]' +


W504 line break after binary operator

tanaya-b · 2022-03-31T04:35:20Z

Lint fixes to be pushed with next suggested changes.

ner_v1/api.py

ner_v2/detectors/numeral/number/number_detection.py

lib/nlp/text_normalization.py

ner_v2/detectors/numeral/number/en/data/numerals_constant.csv

ner_v2/detectors/numeral/number/number_detection.py

ner_v2/detectors/numeral/number/standard_number_detector.py

ner_v2/detectors/numeral/utils.py

ner_v1/chatbot/entity_detection.py

ner_v2/detectors/numeral/constant.py

haptik-deployment · 2022-04-13T18:41:20Z

lib/nlp/text_normalization.py

+        input_text (str): modified text
+
+    Example:
+        fit_text_to_format(input_text='1 2 3 45', regex_pattern='\d{5}')


W605 invalid escape sequence '\d'

haptik-deployment · 2022-04-13T18:41:20Z

lib/nlp/text_normalization.py

+
+    if not insert_edits:
+        # A rough heuristic to allow (#_of_punctuations + 2) extra characters during fuzzy matching
+        count = lambda l1, l2: sum([1 for x in l1 if x in l2])  # pylint: disable=E731


E731 do not assign a lambda expression, use a def

haptik-deployment · 2022-04-13T18:41:20Z

lib/nlp/text_normalization.py

+
+    Example procedure:
+        input_text = "बी nine nine three zero"
+        regex = r"\w\d{4}"


W605 invalid escape sequence '\w'
W605 invalid escape sequence '\d'
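A likely reason W605 is still reported here despite the `r` prefix on the example: the line sits inside a docstring, and if the enclosing docstring is an ordinary string, its backslashes are still treated as escape sequences. Making the docstring itself raw is one way to address that. The function name below is hypothetical.

```python
def example_procedure():
    r"""Illustrate a raw docstring.

    Example procedure:
        input_text = "बी nine nine three zero"
        regex = r"\w\d{4}"
    """


# The backslashes survive verbatim in the docstring text.
doc = example_procedure.__doc__
```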

naseem-shaik

LGTM. Please fix the lint errors before merging.

tanaya-b · 2022-04-19T09:28:41Z

retest this please

sonarqubecloud · 2022-04-20T09:00:27Z

Kudos, SonarCloud Quality Gate passed!

0 Bugs
0 Vulnerabilities
0 Security Hotspots
14 Code Smells

No Coverage information
0.0% Duplication

haptik-deployment · 2022-04-20T09:08:12Z

UNIT TESTS HAVE PASSED... Good To Merge

tanaya-b added 11 commits March 16, 2022 14:24

Update Date Dictionary (Hindi)

33e012e

Allow merging separate numbers to reach mark (should this be ASR specific?)

a5c6a6d

Numeric fixes

b323e24

Regex based modifications

f008b42

Update hindi numerals constant

f30be4a

Email changes + Regex fixes

1fab9d2

Regex - Greedy Fuzzy Matching

7fbf499

Fix docstrings

37121ac

minor fix

747283a

Syntax Error fixes

eab01f0

Add fixme

73ef53e

tanaya-b added the dont-test label Mar 25, 2022

haptik-deployment reviewed Mar 25, 2022

tanaya-b added the new-feature Added new functionality label Mar 25, 2022

tanaya-b added 2 commits March 29, 2022 18:18

Revert accidental change

ff61149

Pop span from dictionary before returning

bec704f