initial port #1
code extracted from ovos-classifiers for better separation of concerns
Walkthrough
This update introduces a comprehensive normalization framework for multilingual utterances, enhancing the processing capabilities of the
Changes
Sequence Diagram(s)
sequenceDiagram
    participant User
    participant Normalizer
    participant Tokenizer
    participant LanguageConfig
    User->>Normalizer: Input utterance
    Normalizer->>LanguageConfig: Load language settings
    LanguageConfig-->>Normalizer: Return config
    Normalizer->>Tokenizer: Tokenize utterance
    Tokenizer-->>Normalizer: Return tokens
    Normalizer->>User: Output normalized utterance
Actionable comments posted: 0
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (23)
- MANIFEST.in (1 hunks)
- ovos_utterance_normalizer/__init__.py (1 hunks)
- ovos_utterance_normalizer/normalizer.py (1 hunks)
- ovos_utterance_normalizer/numeric.py (1 hunks)
- ovos_utterance_normalizer/res/az/normalize.json (1 hunks)
- ovos_utterance_normalizer/res/ca/normalize.json (1 hunks)
- ovos_utterance_normalizer/res/cz/normalize.json (1 hunks)
- ovos_utterance_normalizer/res/de/normalize.json (1 hunks)
- ovos_utterance_normalizer/res/en/normalize.json (1 hunks)
- ovos_utterance_normalizer/res/es/normalize.json (1 hunks)
- ovos_utterance_normalizer/res/fr/normalize.json (1 hunks)
- ovos_utterance_normalizer/res/it/normalize.json (1 hunks)
- ovos_utterance_normalizer/res/nl/normalize.json (1 hunks)
- ovos_utterance_normalizer/res/no/normalize.json (1 hunks)
- ovos_utterance_normalizer/res/pt/normalize.json (1 hunks)
- ovos_utterance_normalizer/res/ru/normalize.json (1 hunks)
- ovos_utterance_normalizer/res/sl/normalize.json (1 hunks)
- ovos_utterance_normalizer/res/sv/normalize.json (1 hunks)
- ovos_utterance_normalizer/res/uk/normalize.json (1 hunks)
- ovos_utterance_normalizer/tokenization.py (1 hunks)
- ovos_utterance_normalizer/version.py (1 hunks)
- requirements.txt (1 hunks)
- setup.py (1 hunks)
Files skipped from review due to trivial changes (4)
- MANIFEST.in
- ovos_utterance_normalizer/res/es/normalize.json
- ovos_utterance_normalizer/version.py
- requirements.txt
Additional context used
Ruff
ovos_utterance_normalizer/numeric.py
36-37: Use a single `if` statement instead of nested `if` statements (SIM102)
998-998: Function definition does not bind loop variable `c` (B023)
1045-1045: Function definition does not bind loop variable `c` (B023)
1137-1146: Combine `if` branches using logical `or` operator (SIM114)
1156-1159: Use ternary operator `val = int(word) if word.isdigit() else float(word)` instead of `if`-`else` block (SIM108)
1767-1767: Function definition does not bind loop variable `c` (B023)
1814-1814: Function definition does not bind loop variable `c` (B023)
1891-1900: Combine `if` branches using logical `or` operator (SIM114)
1914-1917: Use ternary operator `val = int(word) if word.isdigit() else float(word)` instead of `if`-`else` block (SIM108)
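The B023 findings are the least self-explanatory of the Ruff hints above, so a small illustration may help. The sketch below is not taken from `numeric.py`; the function names and the `startswith` check are invented for the example. It shows how a function defined inside a loop sees the final value of the loop variable unless that value is bound explicitly.

```python
def make_checkers_buggy(chars):
    # Each lambda looks up `c` when it is called, not when it is defined,
    # so after the loop every checker sees the final value of `c`.
    checkers = []
    for c in chars:
        checkers.append(lambda word: word.startswith(c))
    return checkers


def make_checkers_fixed(chars):
    # Binding `c` as a default argument captures its value per iteration.
    checkers = []
    for c in chars:
        checkers.append(lambda word, c=c: word.startswith(c))
    return checkers


buggy = make_checkers_buggy(["a", "b"])
fixed = make_checkers_fixed(["a", "b"])

print([check("apple") for check in buggy])  # [False, False]
print([check("apple") for check in fixed])  # [True, False]
```

If the inner function is only ever called within the same iteration that defines it, the late lookup does no harm, which is why B023 is sometimes a false positive. Binding the variable still makes the intent explicit.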
Additional comments not posted (75)
ovos_utterance_normalizer/res/fr/normalize.json (1)
1-14: Configuration options for French normalization look good.
The configuration options are appropriate for text normalization. Ensure that the empty objects and arrays (`contractions`, `word_replacements`, `number_replacements`, `stopwords`, `articles`) are populated as needed for effective normalization.
ovos_utterance_normalizer/res/it/normalize.json (1)
1-14: Configuration options for Italian normalization look good.
The configuration options are appropriate for text normalization. Ensure that the empty objects and arrays (`contractions`, `word_replacements`, `number_replacements`, `stopwords`, `articles`) are populated as needed for effective normalization.
ovos_utterance_normalizer/res/nl/normalize.json (1)
1-14: Configuration options for Dutch normalization look good.
The configuration options are appropriate for text normalization. Ensure that the empty objects and arrays (`contractions`, `word_replacements`, `number_replacements`, `stopwords`, `articles`) are populated as needed for effective normalization.
ovos_utterance_normalizer/res/no/normalize.json (1)
1-14: File structure and settings look good.
The JSON configuration file for Norwegian normalization settings is well-structured and includes reasonable default values. Ensure that the settings align with the intended behavior for Norwegian language processing.
ovos_utterance_normalizer/res/sl/normalize.json (1)
1-14: File structure and settings look good.
The JSON configuration file for Slovenian normalization settings is well-structured and includes reasonable default values. Ensure that the settings align with the intended behavior for Slovenian language processing.
ovos_utterance_normalizer/res/sv/normalize.json (1)
1-14: File structure and settings look good.
The JSON configuration file for Swedish normalization settings is well-structured and includes reasonable default values. Ensure that the settings align with the intended behavior for Swedish language processing.
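Several of the comments above ask that empty sections be populated as needed. One way to find them is a small check like the sketch below. The section names come from the review comments; the sample JSON document and the `empty_sections` helper are invented for illustration and do not reproduce any file in this PR.

```python
import json

# Section names as listed in the review comments above.
EXPECTED_SECTIONS = (
    "contractions",
    "word_replacements",
    "number_replacements",
    "stopwords",
    "articles",
)

# Invented sample, not one of the reviewed normalize.json files.
SAMPLE_CONFIG = """
{
    "contractions": {},
    "word_replacements": {},
    "number_replacements": {"one": "1", "two": "2"},
    "stopwords": [],
    "articles": ["a", "an", "the"]
}
"""


def empty_sections(config):
    """Return the expected sections that are missing or empty."""
    return [name for name in EXPECTED_SECTIONS if not config.get(name)]


config = json.loads(SAMPLE_CONFIG)
print(empty_sections(config))  # ['contractions', 'word_replacements', 'stopwords']
```

Running a check of this kind over each `res/<lang>/normalize.json` would show which languages still rely on empty defaults.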
ovos_utterance_normalizer/res/az/normalize.json (1)
1-45: LGTM! Configuration settings are accurate and relevant.
The configuration settings for normalization in the Azerbaijani language are well-structured and appropriate.
ovos_utterance_normalizer/res/cz/normalize.json (1)
1-46: LGTM! Configuration settings are accurate and relevant.
The configuration settings for normalization in the Czech language are well-structured and appropriate.
ovos_utterance_normalizer/res/ru/normalize.json (1)
1-46: LGTM! Configuration settings are accurate and relevant.
The configuration settings for normalization in the Russian language are well-structured and appropriate.
ovos_utterance_normalizer/res/uk/normalize.json (2)
2-8: Ensure consistency in normalization settings.
The normalization settings seem appropriate, but ensure they are consistent with other language configurations.
11-70: Verify number replacements for completeness and accuracy.
The number replacements cover a wide range of values, but ensure that all common numbers are included and accurately mapped.
ovos_utterance_normalizer/res/pt/normalize.json (4)
2-8: Ensure consistency in normalization settings.
The normalization settings seem appropriate, but ensure they are consistent with other language configurations.
11-65: Verify number replacements for completeness and accuracy.
The number replacements cover a wide range of values, but ensure that all common numbers are included and accurately mapped.
67-91: Review stopwords for completeness and relevance.
The list of stopwords appears comprehensive, but verify that it includes all relevant stopwords for the Portuguese language.
93-97: Review articles for completeness and relevance.
The list of articles appears appropriate, but verify that it includes all relevant articles for the Portuguese language.
ovos_utterance_normalizer/res/ca/normalize.json (4)
2-8: Ensure consistency in normalization settings.
The normalization settings seem appropriate, but ensure they are consistent with other language configurations.
11-73: Verify number replacements for completeness and accuracy.
The number replacements cover a wide range of values, but ensure that all common numbers are included and accurately mapped.
75-100: Review stopwords for completeness and relevance.
The list of stopwords appears comprehensive, but verify that it includes all relevant stopwords for the Catalan language.
102-108: Review articles for completeness and relevance.
The list of articles appears appropriate, but verify that it includes all relevant articles for the Catalan language.
ovos_utterance_normalizer/__init__.py (4)
14-15: LGTM!
The `__init__` method correctly initializes the plugin with a name and priority.
17-35: LGTM!
The `get_normalizer` method correctly returns the appropriate language-specific normalizer.
37-39: LGTM!
The `strip_punctuation` method correctly removes punctuation from the beginning and end of the provided utterance.
41-60: LGTM!
The `transform` method correctly normalizes a list of utterances based on the provided context, expands contractions, strips punctuation if configured, and deduplicates the list while preserving order.
setup.py (4)
9-31: LGTM!
The `get_version` function correctly reads the version information from a file and constructs the version string.
34-39: LGTM!
The `package_files` function correctly collects all files in the specified directory.
42-50: LGTM!
The `required` function correctly reads the requirements from a file and processes them based on the environment variable.
61-85: LGTM!
The setup configuration correctly defines the package metadata, dependencies, and entry points.
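Returning to the `strip_punctuation` and `transform` comments on `__init__.py` above: the punctuation stripping and order-preserving deduplication they describe can be sketched as follows. This is a simplified illustration and not the plugin's code. It leaves out contraction expansion, and the choice to keep each original utterance next to its stripped variant is an assumption.

```python
import string


def strip_punctuation(utterance):
    # Remove punctuation and whitespace from both ends only;
    # punctuation inside the utterance (e.g. apostrophes) is kept.
    return utterance.strip(string.punctuation + string.whitespace)


def dedupe_keep_order(utterances):
    # dict preserves insertion order, so the first occurrence wins.
    return list(dict.fromkeys(utterances))


def transform(utterances):
    # Assumed behaviour: keep each utterance and add its stripped variant.
    candidates = []
    for utt in utterances:
        candidates.append(utt)
        candidates.append(strip_punctuation(utt))
    return dedupe_keep_order(candidates)


print(transform(["What's the time?", "what's the time"]))
# ["What's the time?", "What's the time", "what's the time"]
```

Using `dict.fromkeys` rather than `set` is what keeps the original ordering, which matters if downstream consumers treat earlier utterances as more likely.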
ovos_utterance_normalizer/res/de/normalize.json (6)
1-8: LGTM!
The general normalization settings are correctly defined for handling lowercase, numbers to digits, contractions, symbols, accents, articles, and stopwords.
9-25: LGTM!
The contractions settings are correctly defined for expanding common German contractions.
26-71: LGTM!
The word replacements settings are correctly defined for replacing common German abbreviations and units of measurement with their full forms.
72-112: LGTM!
The number replacements settings are correctly defined for replacing German number words with their corresponding digits.
113-113: LGTM!
The stopwords setting is an empty list, so no stopwords are defined for removal.
114-121: LGTM!
The articles settings are correctly defined for handling common German articles.
ovos_utterance_normalizer/res/en/normalize.json (6)
1-8: General settings look good.
The general settings for normalization options are correctly defined and appropriate for the intended tasks.
9-177: Contractions mapping is comprehensive and well-defined.
The contractions mapping covers a wide range of common contractions and is correctly defined.
178-178: Empty word replacements section is acceptable.
The word replacements section is currently empty, which is acceptable for the current implementation and can be expanded in the future.
179-207: Number replacements mapping is comprehensive and well-defined.
The number replacements mapping covers numbers from zero to ninety and is correctly defined.
209-209: Empty stopwords section is acceptable.
The stopwords section is currently empty, which is acceptable for the current implementation and can be expanded in the future.
210-214: Articles section is well-defined.
The articles section includes common English articles and is correctly defined.
ovos_utterance_normalizer/tokenization.py (6)
1-12: Imports and Token namedtuple look good.
The imports are necessary and correctly used. The Token namedtuple is correctly defined to store word tokens and their indices.
15-57: ReplaceableEntity class is well-defined.
The ReplaceableEntity class is correctly defined to store entities found in a string, including their value and tokens. The properties and methods are correctly implemented and adhere to best practices.
64-121: ReplaceableNumber, ReplaceableDate, ReplaceableTime, and ReplaceableTimedelta classes are well-defined.
These classes correctly inherit from ReplaceableEntity and are defined to store specific types of entities found in a string. The constructors and any additional methods are correctly implemented and adhere to best practices.
124-151: partition_list function is well-defined.
The partition_list function is correctly defined to partition a list of items based on a callable that returns a boolean. The logic and return value are correctly implemented and adhere to best practices.
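To make the `partition_list` description concrete, here is a minimal sketch of partitioning a list on a boolean callable. It assumes that each matching item becomes its own partition and that empty partitions are dropped, similar in spirit to `str.partition`; the function under review may differ in those details.

```python
def partition_list(items, split_on):
    """Split items into runs, isolating each item for which split_on is true."""
    partitions = []
    current = []
    for item in items:
        if split_on(item):
            partitions.append(current)   # the run collected so far
            partitions.append([item])    # the separator gets its own partition
            current = []
        else:
            current.append(item)
    partitions.append(current)
    # Drop empty runs, e.g. when a separator comes first, last, or repeats.
    return [part for part in partitions if part]


tokens = ["two", "and", "three", "and", "four"]
print(partition_list(tokens, lambda t: t == "and"))
# [['two'], ['and'], ['three'], ['and'], ['four']]
```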
154-168: sentence_tokenize and word_tokenize functions are well-defined.
The sentence_tokenize function correctly tokenizes a text into sentences. The word_tokenize function correctly tokenizes an utterance into words, with special handling for Portuguese and Catalan languages. The logic and return values are correctly implemented and adhere to best practices.
171-195: word_tokenize_pt and word_tokenize_ca functions are well-defined.
The word_tokenize_pt and word_tokenize_ca functions correctly tokenize an utterance into words for Portuguese and Catalan languages, respectively, with specific handling for certain patterns. The logic and return values are correctly implemented and adhere to best practices.
ovos_utterance_normalizer/normalizer.py (7)
1-69: Imports and Normalizer class look good.
The imports are necessary and correctly used. The Normalizer class is correctly defined with various properties and methods for normalizing utterances.
96-132: Methods for expanding contractions, converting numbers to digits, and removing articles and stopwords are well-defined.
The methods `expand_contractions`, `numbers_to_digits`, `remove_articles`, and `remove_stopwords` correctly perform specific normalization tasks based on the configuration. The logic and return values are correctly implemented and adhere to best practices.
134-149: Methods for removing symbols and accents, and replacing words are well-defined.
The methods `remove_symbols`, `remove_accents`, and `replace_words` correctly perform specific normalization tasks based on the configuration. The logic and return values are correctly implemented and adhere to best practices.
151-172: normalize method is well-defined.
The `normalize` method correctly performs the overall normalization of an utterance based on various settings and configurations. The logic and return value are correctly implemented and adhere to best practices.
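A configuration-driven `normalize` of the kind described above might be organized as in the sketch below. The flag names, the sample configuration, and the order of the steps are assumptions made for illustration; they are not taken from `normalizer.py` or from the shipped `normalize.json` files.

```python
import unicodedata

# Hypothetical configuration; flag and section names are illustrative only.
CONFIG = {
    "lowercase": True,
    "expand_contractions": True,
    "remove_accents": True,
    "remove_articles": True,
    "contractions": {"what's": "what is", "i'm": "i am"},
    "articles": ["a", "an", "the"],
}


def remove_accents(text):
    # Decompose characters (NFD), then drop the combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize(utterance, config=CONFIG):
    # Each step runs only if its flag is enabled in the configuration.
    if config.get("lowercase"):
        utterance = utterance.lower()
    words = utterance.split()
    if config.get("expand_contractions"):
        words = [config["contractions"].get(w, w) for w in words]
    if config.get("remove_articles"):
        words = [w for w in words if w not in config["articles"]]
    text = " ".join(words)
    if config.get("remove_accents"):
        text = remove_accents(text)
    return text


print(normalize("What's the name of the Café"))  # what is name of cafe
```

Keeping every step behind a flag lets a language-specific subclass change behaviour by overriding one method or one configuration entry, which matches the class layout the review describes.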
175-206: CatalanNormalizer, CzechNormalizer, PortugueseNormalizer, RussianNormalizer, and UkrainianNormalizer classes are well-defined.
These classes correctly inherit from the Normalizer class and provide specific configurations for different languages. The constructors and any additional methods are correctly implemented and adhere to best practices.
208-214: EnglishNormalizer class is well-defined.
The EnglishNormalizer class correctly inherits from the Normalizer class and provides specific configurations for English. The constructor and the overridden `numbers_to_digits` method are correctly implemented and adhere to best practices.
216-235: AzerbaijaniNormalizer and GermanNormalizer classes are well-defined.
These classes correctly inherit from the Normalizer class and provide specific configurations for Azerbaijani and German languages. The constructors and the overridden methods are correctly implemented and adhere to best practices.
ovos_utterance_normalizer/numeric.py (23)
9-22: LGTM!
The `is_numeric` function is straightforward and correctly handles the conversion of a string to a float.
25-40: LGTM!
The `look_for_fractions` function is clear and correctly uses the `is_numeric` function to check if both parts of the fraction are numeric.
Tools
Ruff
36-37: Use a single `if` statement instead of nested `if` statements (SIM102)
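As a reference point for the SIM102 hint, the two helpers could be written roughly as below. The bodies are a sketch inferred from the review comments (a `float()` conversion check, and a test that both parts of a split fraction are numeric); they are not copied from `numeric.py`.

```python
def is_numeric(input_str):
    """Return True if the string can be parsed as a float."""
    try:
        float(input_str)
        return True
    except ValueError:
        return False


def look_for_fractions(split_list):
    """Return True for a two-part list whose parts are both numeric."""
    # One combined condition replaces the nested `if` flagged by SIM102.
    return len(split_list) == 2 and all(is_numeric(part) for part in split_list)


print(look_for_fractions("1/2".split("/")))    # True
print(look_for_fractions("a/2".split("/")))    # False
print(look_for_fractions("1/2/3".split("/")))  # False
```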
266-281: LGTM!
The `is_ordinal_de` method correctly checks for ordinals using the `_STRING_LONG_ORDINAL_DE` dictionary.
283-332: LGTM!
The `is_fractional_de` method correctly checks for fractions using the `_STRING_FRACTION_DE` dictionary.
334-348: LGTM!
The `is_number_de` method correctly checks for numeric values, ordinals, and fractions.
350-379: LGTM!
The `convert_words_to_numbers` method correctly tokenizes the input string and replaces words with their numeric equivalents.
382-402: LGTM!
The `extract_numbers` method correctly extracts numbers using the `_extract_numbers_with_text_de` method.
403-445: LGTM!
The `_extract_numbers_with_text_de` method correctly iterates through the tokens and extracts numbers.
448-466: LGTM!
The `_extract_number_with_text_de` method correctly extracts a single number using the `_extract_number_with_text_de_helper` method.
469-488: LGTM!
The `_extract_number_with_text_de_helper` method correctly handles the extraction of fractions, decimals, and whole numbers.
491-627: LGTM!
The `_extract_real_number_with_text_de` method correctly handles the extraction of real numbers, including handling negatives, fractions, and spoken decimals.
979-1054: LGTM!
The `_initialize_number_data_de` method correctly initializes the dictionaries for short scale and long scale numbers.
Tools
Ruff
998-998: Function definition does not bind loop variable `c` (B023)
1045-1045: Function definition does not bind loop variable `c` (B023)
870-897: LGTM!
The `is_fractional` method correctly checks for fractions using the `_FRACTION_STRING_EN` dictionary.
899-929: LGTM!
The `convert_words_to_numbers` method correctly tokenizes the input string and replaces words with their numeric equivalents.
931-947: LGTM!
The `extract_numbers` method correctly extracts numbers using the `_extract_numbers_with_text_en` method.
965-977: LGTM!
The `_extract_numbers_with_text_en` method correctly iterates through the tokens and extracts numbers.
1343-1363: LGTM!
The `_extract_number_with_text_en` method correctly extracts a single number using the `_extract_number_with_text_en_helper` method.
1310-1341: LGTM!
The `_extract_number_with_text_en_helper` method correctly handles the extraction of fractions, decimals, and whole numbers.
979-1018: LGTM!
The `_extract_fraction_with_text_en` method correctly handles the extraction of fractions.
Tools
Ruff
998-998: Function definition does not bind loop variable `c` (B023)
1020-1065: LGTM!
The `_extract_decimal_with_text_en` method correctly handles the extraction of decimals.
Tools
Ruff
1045-1045: Function definition does not bind loop variable `c` (B023)
1067-1309: LGTM!
The `_extract_whole_number_with_text_en` method correctly handles the extraction of whole numbers, including handling negatives, fractions, and spoken decimals.
Tools
Ruff
1137-1146: Combine `if` branches using logical `or` operator (SIM114)
1156-1159: Use ternary operator `val = int(word) if word.isdigit() else float(word)` instead of `if`-`else` block (SIM108)
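The two style hints above (SIM114 and SIM108) are purely structural. A short sketch of what each rewrite looks like follows. Only the conditional expression is quoted from the Ruff message; the surrounding function names and the `half`/`quarter` conditions are invented for the example.

```python
def parse_number(word):
    # SIM108: a conditional expression instead of a four-line if/else block.
    return int(word) if word.isdigit() else float(word)


def classify_before(word):
    # Two branches with identical bodies, the pattern SIM114 flags.
    if word == "half":
        return "fraction"
    elif word == "quarter":
        return "fraction"
    return "other"


def classify_after(word):
    # SIM114: the branches are merged with a logical `or`.
    if word == "half" or word == "quarter":
        return "fraction"
    return "other"


print(parse_number("42"), parse_number("4.2"))  # 42 4.2
print(classify_after("quarter"))                # fraction
```

Neither change alters behaviour, so both can be applied mechanically once the existing tests pass.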
950-977: LGTM!
The `_initialize_number_data_en` method correctly initializes the dictionaries for short scale and long scale numbers.
1580-1609: LGTM!
The `convert_words_to_numbers` method correctly tokenizes the input string and replaces words with their numeric equivalents.
code extracted from ovos-classifiers for better separation of concerns and localization
TODO:
Summary by CodeRabbit
New Features
- `UtteranceNormalizerPlugin` for normalizing utterances by handling numbers, punctuation, and contractions.
Bug Fixes
Documentation
Chores
- `requirements.txt` to manage dependencies and `setup.py` for packaging the module.