Restructured dev and test code #12

Merged: 38 commits, Dec 30, 2022
Changes from 15 commits
Commits
6f45085
Update README.md
PiyushGSlab Nov 28, 2022
3047cf4
Update README.md
PiyushGSlab Nov 29, 2022
8531a1d
updated pip package name
PiyushGSlab Nov 29, 2022
70a816b
Merge branch 'acryldata:main' into main
mardikark-gslab Dec 1, 2022
08984df
Removed unwanted CSV files
mardikark-gslab Dec 6, 2022
15d4043
Restructured unit testing file to load dataset from provided director…
mardikark-gslab Dec 6, 2022
52d3519
cosmetic changes
mardikark-gslab Dec 6, 2022
690f9bf
Refactored code to compute name desc, dtype score into a single function
PiyushGSlab Dec 8, 2022
bc6e6e6
Added function annotations
PiyushGSlab Dec 8, 2022
2737897
added function annotations
PiyushGSlab Dec 8, 2022
c8fb546
added quick test functionality
PiyushGSlab Dec 8, 2022
1c046c5
Removed TODO comment
mardikark-gslab Dec 12, 2022
b9478bc
Removed restriction of loading only 1000 rows in test file
mardikark-gslab Dec 12, 2022
8819477
Renamed the test file
mardikark-gslab Dec 12, 2022
418a7f2
Merge branch 'main' into test_restructure
hsheth2 Dec 13, 2022
0694949
Updated function annotations (list and dict)
PiyushGSlab Dec 14, 2022
b2f0ceb
Merge branch 'test_restructure' of https://github.com/mardikark-gslab…
PiyushGSlab Dec 14, 2022
9b167bb
Updated function annotations and ran gradle sanity checks
PiyushGSlab Dec 14, 2022
062d3e5
Removed the quick test functionality. Separate script will be added l…
PiyushGSlab Dec 20, 2022
ac523d5
add Final qualifier to prevent mypy type checking errors
PiyushGSlab Dec 20, 2022
2f298e9
added a class DebugInfo
PiyushGSlab Dec 20, 2022
e4ab9b5
changed the debug_info from raw dict to TypedDict
PiyushGSlab Dec 20, 2022
3bc6f02
reduced the verbosity of logger messages (some logs moved to debug le…
PiyushGSlab Dec 20, 2022
e0c4866
added typing_extensions library to base requirements
PiyushGSlab Dec 20, 2022
76dcb44
removed the Final qualifier as it is not required any more for mypy t…
PiyushGSlab Dec 23, 2022
eb0125a
changed DebugInfo from TypedDict to dataclass
PiyushGSlab Dec 23, 2022
68c293d
some syntax changes as debug_info is now instance of dataclass and fi…
PiyushGSlab Dec 23, 2022
b19a365
fixed some incorrect function annotations
PiyushGSlab Dec 23, 2022
85736b5
fixed some incorrect function annotations
PiyushGSlab Dec 23, 2022
7228548
removed typing_extensions from base requirements as it is not require…
PiyushGSlab Dec 23, 2022
c22b8ab
class variables of DebugInfo assigned default value None
PiyushGSlab Dec 26, 2022
72cc38e
removed hasattr check
PiyushGSlab Dec 26, 2022
325d5d9
replaced debug_info NoneType check with prediction_factors_weights wi…
PiyushGSlab Dec 26, 2022
c619e6b
Modified the float comparison, also changed the DebugInfo instance va…
mardikark-gslab Dec 29, 2022
cc6d546
Removed unused import
mardikark-gslab Dec 29, 2022
d41a293
Removed cast operation
mardikark-gslab Dec 30, 2022
5b5c0cd
Update datahub-classify/src/datahub_classify/infotype_helper.py
hsheth2 Dec 30, 2022
0bcc6bd
Update datahub-classify/src/datahub_classify/infotype_helper.py
hsheth2 Dec 30, 2022
549 changes: 107 additions & 442 deletions datahub-classify/src/datahub_classify/infotype_helper.py

Large diffs are not rendered by default.

6 changes: 4 additions & 2 deletions datahub-classify/src/datahub_classify/infotype_predictor.py
@@ -8,15 +8,17 @@
 logger = logging.getLogger(__name__)


-def get_infotype_function_mapping(infotypes, global_config):
+def get_infotype_function_mapping(
+    infotypes: Optional[list], global_config: dict
+) -> dict:
     from inspect import getmembers, isfunction

     module_name = "datahub_classify.infotype_helper"
     module = importlib.import_module(module_name)
     module_fn_dict = dict(getmembers(module, isfunction))
     infotype_function_map = {}
     if not infotypes:
-        infotypes = global_config.keys()
+        infotypes = list(global_config.keys())
     for infotype in infotypes:
         if infotype not in global_config.keys():
             logger.warning(f"Configuration is not available for infotype - {infotype}")
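
The switch from `global_config.keys()` to `list(global_config.keys())` in the hunk above lines up with the new `Optional[list]` annotation: `dict.keys()` returns a view object, not a list. The following is a minimal sketch of that fallback in isolation; the helper name `resolve_infotypes` and the sample config are hypothetical and not part of this PR.

```python
from typing import Optional


def resolve_infotypes(infotypes: Optional[list], global_config: dict) -> list:
    """Fall back to every configured infotype when none are requested."""
    if not infotypes:
        # dict.keys() returns a dict_keys view rather than a list, so it would
        # not match the Optional[list] annotation; list() makes a real list.
        infotypes = list(global_config.keys())
    return infotypes


# Hypothetical config, for illustration only.
config = {"Email_Address": {}, "Gender": {}}

print(resolve_infotypes(None, config))        # ['Email_Address', 'Gender']
print(resolve_infotypes(["Gender"], config))  # ['Gender']
print(isinstance(config.keys(), list))        # False
```

A type checker such as mypy would normally reject assigning a `dict_keys` view to a name annotated as `Optional[list]`, which is presumably why the conversion was added alongside the annotations.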
16 changes: 11 additions & 5 deletions datahub-classify/src/datahub_classify/infotype_utils.py
@@ -1,14 +1,16 @@
 import logging
 import re
+from typing import Optional

 from datahub_classify.constants import PREDICTION_FACTORS_AND_WEIGHTS, VALUES
+from datahub_classify.helper_classes import Metadata

 logger = logging.getLogger(__name__)


 # TODO: Exception handling
 # Match regex for Name and Description
-def match_regex(text_to_match, regex_list):
+def match_regex(text_to_match: str, regex_list: list) -> float:
     original_text = text_to_match.lower()
     cleaned_text = "".join(e for e in original_text if e.isalpha())
     match_score: float = 0
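
The first two lines of `match_regex` normalize the input before any matching: the text is lowercased, and a second copy keeps only alphabetic characters. A small sketch of just that normalization step follows; the helper name `clean_text` and the sample column name are hypothetical, and how `cleaned_text` is used afterwards is not visible in this hunk.

```python
def clean_text(text_to_match: str) -> str:
    """Lowercase the text and drop every non-alphabetic character."""
    original_text = text_to_match.lower()
    # Same expression as in match_regex: keep letters only, so separators
    # such as "_", "-", spaces and digits are removed.
    return "".join(e for e in original_text if e.isalpha())


print(clean_text("Customer_Email-ID 2"))  # customeremailid
```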
@@ -36,7 +38,7 @@ def match_regex(text_to_match, regex_list):


 # Match data type
-def match_datatype(dtype_to_match, dtype_list):
+def match_datatype(dtype_to_match: str, dtype_list: list[str]) -> int:
     dtype_list = [str(s).lower() for s in dtype_list]
     dtype_to_match = dtype_to_match.lower()
     if dtype_to_match in dtype_list:
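
The hunk above cuts off at the `if`, so the values `match_datatype` returns are not shown. As a rough sketch only, assuming a score of 1 on a match and 0 otherwise (consistent with the `-> int` annotation but not confirmed by the visible diff), the case-insensitive comparison could look like this. The name `match_datatype_sketch` is hypothetical.

```python
def match_datatype_sketch(dtype_to_match: str, dtype_list: list) -> int:
    """Case-insensitive membership test of a column dtype against a list."""
    # Both sides are lowercased, as in the visible part of the hunk.
    dtype_list = [str(s).lower() for s in dtype_list]
    dtype_to_match = dtype_to_match.lower()
    # Assumption: 1 for a match, 0 otherwise; the real return values are
    # hidden in the collapsed part of the diff.
    return 1 if dtype_to_match in dtype_list else 0


print(match_datatype_sketch("VARCHAR", ["varchar", "str"]))  # 1
print(match_datatype_sketch("int", ["varchar", "str"]))      # 0
```

Note that the `list[str]` annotation used in the PR relies on built-in generics, which require Python 3.9 or later unless annotations are postponed with `from __future__ import annotations`.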
@@ -47,7 +49,7 @@ def match_datatype(dtype_to_match, dtype_list):


 # Match regex for values
-def match_regex_for_values(values, regex_list):
+def match_regex_for_values(values: list, regex_list: list) -> float:
     values_score_list = []
     length_values = len(values)
     values = [str(x).lower() for x in values]
@@ -66,7 +68,9 @@ def match_regex_for_values(values, regex_list):
     return values_score


-def detect_named_entity_spacy(spacy_models_list, entities_of_interest, value):
+def detect_named_entity_spacy(
+    spacy_models_list: list, entities_of_interest: list[str], value: str
+) -> bool:
     for spacy_model in spacy_models_list:
         doc = spacy_model(value)
         for ent in doc.ents:
@@ -75,7 +79,9 @@ def detect_named_entity_spacy(spacy_models_list, entities_of_interest, value):
     return False


-def perform_basic_checks(metadata, values, config_dict, infotype=None):
+def perform_basic_checks(
+    metadata: Metadata, values: list, config_dict: dict, infotype: Optional[str] = None
+) -> bool:
     basic_checks_status = True
     minimum_values_threshold = 50
     if (
51,001 changes: 0 additions & 51,001 deletions datahub-classify/tests/datasets/Customer Segmentation.csv

This file was deleted.

Binary file not shown.
1,001 changes: 0 additions & 1,001 deletions datahub-classify/tests/datasets/Electric_Vehicle_Population_Data.csv

This file was deleted.

1,001 changes: 0 additions & 1,001 deletions datahub-classify/tests/datasets/Electric_Vehicle_Population_Data_2.csv

This file was deleted.

2,500 changes: 0 additions & 2,500 deletions datahub-classify/tests/datasets/USA_cars_datasets.csv

This file was deleted.

2,898 changes: 0 additions & 2,898 deletions datahub-classify/tests/datasets/athletes.csv

This file was deleted.

78 changes: 0 additions & 78 deletions datahub-classify/tests/datasets/coaches.csv

This file was deleted.
