Restructured dev and test code #12

PiyushGSlab · 2022-12-12T07:30:32Z

Restructured the unit testing code
Pulled out common code into one function in infotype_helper.py

added new infotypes to supported infotypes list

Removed libraries specifications


hsheth2 · 2022-12-13T19:16:36Z

datahub-classify/src/datahub_classify/infotype_helper.py

-def inspect_for_email_address(metadata, values, config):
+def compute_name_description_dtype_score(
+    metadata: Metadata, config: dict, debug_info: dict
+) -> dict:


it'd be good to start using dataclasses or TypedDict instead of raw dict types here and elsewhere in the code

Updated the code with this change incorporated.

Not quite what I was looking for. Where possible, please use dataclasses. For example, debug_info shouldn't be a dictionary but instead an instance of a dataclass

We have changed debug_info to be of TypedDict. Does this work?
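To make the difference discussed in this thread concrete, here is a minimal sketch of the two options. The field names `name`, `description`, and `values` appear elsewhere in this review; `datatype` and the class name `DebugInfoDict` are assumptions for illustration, and the real `DebugInfo` in the repository may look different.

```python
from dataclasses import dataclass
from typing import Optional, TypedDict


class DebugInfoDict(TypedDict, total=False):
    """TypedDict variant: still a plain dict at runtime, accessed by string keys."""

    name: float
    description: float
    datatype: float  # assumed field, not confirmed by the PR
    values: float


@dataclass
class DebugInfo:
    """Dataclass variant: attribute access, with None meaning 'not computed'."""

    name: Optional[float] = None
    description: Optional[float] = None
    datatype: Optional[float] = None  # assumed field, not confirmed by the PR
    values: Optional[float] = None


as_dict: DebugInfoDict = {"name": 1.0}
as_dataclass = DebugInfo(name=1.0)

# The dict needs .get() to cope with missing keys; the dataclass always
# has the attribute, so a missing score is simply None.
missing_from_dict = as_dict.get("values")
missing_from_dataclass = as_dataclass.values
```

With the dataclass, a misspelled field such as `debug_info.nmae` is reported by a type checker and raises `AttributeError` at runtime, whereas a misspelled dict key passed to `.get()` silently returns `None`.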

datahub-classify/tests/test_infotype_predictor.py


PiyushGSlab · 2022-12-20T15:22:29Z

As per Mayuri's request, we have modified some logger statements in infotype_predictor.py, to reduce the verbosity of messages

hsheth2

Looking at the code now, a dataclass would definitely be preferable over the TypedDict

hsheth2 · 2022-12-21T04:25:39Z

datahub-classify/src/datahub_classify/infotype_helper.py

-    metadata: Metadata, values: list, config: dict
-) -> tuple:  # noqa: C901
+    metadata: Metadata, values: List[Any], config: Dict[str, Any]
+) -> Tuple[float, Any]:  # noqa: C901


Suggested change

-) -> Tuple[float, Any]:  # noqa: C901
+) -> Tuple[float, DebugInfo]:  # noqa: C901

Changed DebugInfo to dataclass

hsheth2 · 2022-12-21T04:25:50Z

datahub-classify/src/datahub_classify/infotype_helper.py

@@ -181,7 +183,7 @@ def inspect_for_gender(
    try:
        if (
            debug_info.get(NAME, None)
-            and int(debug_info[NAME]) == 1
+            and int(debug_info[NAME]) == 1  # type: ignore


why was this required?

removed this, not required anymore

hsheth2 · 2022-12-21T04:26:00Z

datahub-classify/src/datahub_classify/infotype_helper.py

-    metadata: Metadata, values: list, config: dict
-) -> tuple:
+    metadata: Metadata, values: List[Any], config: Dict[str, Any]
+) -> Tuple[float, Any]:


use debuginfo

fixed this wherever applicable

hsheth2 · 2022-12-21T04:26:14Z

datahub-classify/src/datahub_classify/infotype_helper.py

@@ -198,10 +200,10 @@ def inspect_for_gender(


 def inspect_for_credit_debit_card_number(
-    metadata: Metadata, values: list, config: dict
-) -> tuple:
+    metadata: Metadata, values: List[Any], config: Dict[str, Any]


what is the type of config here?

config is a dict holding the configuration for that particular infotype. For example:
{
    'Prediction_Factors_and_Weights': {
        'Name': 0.4,
        'Description': 0,
        'Datatype': 0,
        'Values': 0.6
    },
    'Name': { 'regex': [] },
    'Description': { 'regex': [] },
    'Datatype': { 'type': [] },
    'Values': {
        'prediction_type': 'regex/library',
        'regex': [],
        'library': []
    }
}

Hi @hsheth2, I discussed changing this dict to a dataclass with @mayurinehate; we will work on it in a future PR.
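As a rough idea of what that follow-up could look like, the sketch below maps the example dict above onto dataclasses. All class names, field names, and the `from_dict` helper are hypothetical; only the dict keys come from the example in this thread.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class PredictionFactorsAndWeights:
    name: float = 0.0
    description: float = 0.0
    datatype: float = 0.0
    values: float = 0.0


@dataclass
class ValuesFactorConfig:
    prediction_type: str = "regex"
    regex: List[str] = field(default_factory=list)
    library: List[str] = field(default_factory=list)


@dataclass
class InfoTypeConfig:
    weights: PredictionFactorsAndWeights
    name_regex: List[str] = field(default_factory=list)
    description_regex: List[str] = field(default_factory=list)
    datatype_types: List[str] = field(default_factory=list)
    values: ValuesFactorConfig = field(default_factory=ValuesFactorConfig)

    @classmethod
    def from_dict(cls, raw: Dict[str, Any]) -> "InfoTypeConfig":
        """Build a typed config from the raw dict layout shown in this thread."""
        weights = raw.get("Prediction_Factors_and_Weights", {})
        values = raw.get("Values", {})
        return cls(
            weights=PredictionFactorsAndWeights(
                name=float(weights.get("Name", 0)),
                description=float(weights.get("Description", 0)),
                datatype=float(weights.get("Datatype", 0)),
                values=float(weights.get("Values", 0)),
            ),
            name_regex=list(raw.get("Name", {}).get("regex", [])),
            description_regex=list(raw.get("Description", {}).get("regex", [])),
            datatype_types=list(raw.get("Datatype", {}).get("type", [])),
            values=ValuesFactorConfig(
                prediction_type=values.get("prediction_type", "regex"),
                regex=list(values.get("regex", [])),
                library=list(values.get("library", [])),
            ),
        )
```

Callers would then read `config.weights.name` instead of `config['Prediction_Factors_and_Weights']['Name']`, which lets a type checker catch misspelled keys.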


hsheth2

mostly looking good, had one tiny comment remaining

hsheth2 · 2022-12-23T21:13:31Z

datahub-classify/src/datahub_classify/infotype_helper.py

-            and int(debug_info[NAME]) == 1  # type: ignore
-            and VALUES in debug_info.keys()
-            and 0.5 > cast(float, debug_info[VALUES]) > 0.1
+            hasattr(debug_info, "name")


I don't think this hasattr check is necessary anymore, now that we have the dataclass?

Removed the hasattr check, and also changed the condition to check prediction_factors_weights instead, which makes the code clearer.
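The commit message for this change also notes that the earlier condition always failed because 0 is falsy in Python. The exact earlier expression is not shown in this thread, so the following is only a sketch of one plausible form of that pitfall; both function names are hypothetical. The exact `== 0` comparison is kept on purpose to isolate the truthiness issue from the separate question of float tolerance.

```python
from typing import Optional


def zero_score_truthy(score: Optional[float]) -> bool:
    # Truthiness guard: None and 0.0 are both falsy, so the comparison on
    # the right is only evaluated for non-zero scores and can never hold.
    return bool(score and score == 0)


def zero_score_explicit(score: Optional[float]) -> bool:
    # Explicit None guard: a score of 0.0 passes the guard and is compared.
    return score is not None and score == 0
```

The first function returns `False` for every input, including `0.0`; the second distinguishes "no score" (`None`) from "score is zero".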


hsheth2

Approving for now

Note that I'm still not super happy with where we are on this codebase.

config parsing and handling - generally we should be using pydantic models instead of raw dicts and constants for indexes
class and variable naming / comments - for example, it's very unclear what DebugInfo's description actually contains (does it contain the confidence or something else?), especially since it can be a string or a float. another example: reference_input.py's input1 is actually the default config, but that fact is totally unclear from the code
code redundancy / duplication in the infotype_helper.py file - the inspect_* methods all still look extremely similar and should be refactored more. Ideally we should have a single large map of type -> function with the "library" implementation, and the regex and other handling should be common
The tests shouldn't have separate UNIT_TESTING and normal json files with expected values - instead we should just have a parameterized test that uses pytest marks to conditionally skip non-unit test cases
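The "single map of type -> function" idea from the third point could be sketched roughly as follows. The infotype keys, the two validator functions, and `values_score` are all hypothetical stand-ins, not the repository's actual code; the point is only that the regex handling and scoring live in one shared function while the map supplies the per-infotype "library" check.

```python
import re
from typing import Callable, Dict, List


def _looks_like_email(value: str) -> bool:
    # Deliberately simple stand-in for a real email validation library.
    return re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", value) is not None


def _looks_like_gender(value: str) -> bool:
    return value.strip().lower() in {"male", "female", "m", "f", "other"}


# One map from infotype to its "library" implementation (keys are assumed names).
LIBRARY_CHECKS: Dict[str, Callable[[str], bool]] = {
    "Email_Address": _looks_like_email,
    "Gender": _looks_like_gender,
}


def values_score(infotype: str, values: List[str], regexes: List[str]) -> float:
    """Return the fraction of values matched by a regex or the library check."""
    if not values:
        return 0.0
    compiled = [re.compile(pattern) for pattern in regexes]
    library_check = LIBRARY_CHECKS.get(infotype)
    hits = 0
    for value in values:
        matches_regex = any(p.fullmatch(value) for p in compiled)
        matches_library = library_check is not None and library_check(value)
        if matches_regex or matches_library:
            hits += 1
    return hits / len(values)
```

Adding a new infotype then means adding one entry to `LIBRARY_CHECKS` rather than writing another near-identical `inspect_*` function.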

hsheth2 · 2022-12-29T05:57:00Z

datahub-classify/src/datahub_classify/infotype_helper.py

-            and hasattr(debug_info, "values")
-            and debug_info.values == 0
+            prediction_factors_weights.get(NAME, 0) > 0
+            and debug_info.name == 1.0


comparing floats with == is unreliable

Modified the code so that it no longer uses "==" for float comparison. I have also changed the DebugInfo instance variables to be float only, instead of float or str. Added a couple of TODOs about putting warning/error messages in the error flag, which will be passed in the ColumnInfo object in a future PR.
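For context on why `==` is unreliable here, the sketch below shows a sum that should equal 1.0 but does not, and a tolerance-based check using the standard library's `math.isclose`. The helper name `is_full_match` and the tolerance value are assumptions for illustration.

```python
import math


def is_full_match(score: float, tolerance: float = 1e-10) -> bool:
    # Compare against 1.0 with an absolute tolerance instead of ==.
    return math.isclose(score, 1.0, rel_tol=0.0, abs_tol=tolerance)


# Ten additions of 0.1 accumulate rounding error and land just below 1.0.
total = sum([0.1] * 10)

exact_equal = total == 1.0           # False
close_enough = is_full_match(total)  # True
```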

mardikark-gslab · 2022-12-29T08:37:41Z

Thanks @hsheth2 for the suggestions, I have noted it. We will take care of it in the next PR.


hsheth2 · 2022-12-29T19:09:20Z

datahub-classify/src/datahub_classify/infotype_helper.py

@@ -186,7 +192,7 @@ def inspect_for_gender(
    try:
        if (
            prediction_factors_weights.get(NAME, 0) > 0
-            and debug_info.name == 1.0
+            and abs(1 - cast(float, debug_info.name)) < 1e-10


should be using debug_info.name is not None to avoid the call to cast

Modified the code
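A sketch of what the suggestion amounts to: an `is not None` test narrows `Optional[float]` to `float` for type checkers such as mypy, so the `cast` becomes unnecessary. The `DebugInfo` definition below is a simplified assumption based on this review, and `name_is_exact_match` is a hypothetical helper, not the merged code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DebugInfo:
    # Simplified stand-in; fields default to None when no score was computed.
    name: Optional[float] = None
    values: Optional[float] = None


def name_is_exact_match(debug_info: DebugInfo, tolerance: float = 1e-10) -> bool:
    # After the `is not None` test, type checkers treat debug_info.name as
    # float on the right-hand side, so no typing.cast is required.
    return debug_info.name is not None and abs(1 - debug_info.name) < tolerance
```

Unlike `cast`, which only silences the type checker, the `is not None` test also protects against a `TypeError` at runtime when the score is missing.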

datahub-classify/src/datahub_classify/infotype_helper.py

PiyushGSlab and others added 15 commits November 28, 2022 18:38

Update README.md

6f45085

added new infotypes to supported infotypes list

Update README.md

3047cf4

Removed libraries specifications

updated pip package name

8531a1d

Merge branch 'acryldata:main' into main

70a816b

Removed unwanted CSV files

08984df

Restructured unit testing file to load dataset from provided directory instead of mentioning each dataset name

15d4043

cosmetic changes

52d3519

Refactored code to compute name, desc, dtype score into a single function

690f9bf

Added function annotations

bc6e6e6

added function annotations

2737897

added quick test functionality

c8fb546

Removed TODO comment

1c046c5

Removed restriction of loading only 1000 rows in test file

b9478bc

Renamed the test file

8819477

Merge branch 'main' into test_restructure

418a7f2

hsheth2 requested changes Dec 13, 2022

PiyushGSlab added 9 commits December 14, 2022 11:30

Updated function annotations (list and dict)

0694949

Merge branch 'test_restructure' of https://github.com/mardikark-gslab/datahub-classify into test_restructure

b2f0ceb

Updated function annotations and ran gradle sanity checks

9b167bb

Removed the quick test functionality. Separate script will be added later on for quick test

062d3e5

add Final qualifier to prevent mypy type checking errors

ac523d5

added a class DebugInfo

2f298e9

changed the debug_info from raw dict to TypedDict

e4ab9b5

reduced the verbosity of logger messages (some logs moved to debug level)

3bc6f02

added typing_extensions library to base requirements

e0c4866

hsheth2 requested changes Dec 21, 2022

PiyushGSlab added 3 commits December 23, 2022 16:10

removed the Final qualifier as it is not required any more for mypy type checks

76dcb44

changed DebugInfo from TypedDict to dataclass

eb0125a

some syntax changes as debug_info is now instance of dataclass and fixed some incorrect function annotations

68c293d

PiyushGSlab added 3 commits December 23, 2022 16:16

fixed some incorrect function annotations

b19a365

fixed some incorrect function annotations

85736b5

removed typing_extensions from base requirements as it is not required any more

7228548

hsheth2 reviewed Dec 23, 2022

PiyushGSlab added 3 commits December 26, 2022 12:24

class variables of DebugInfo assigned default value None

c22b8ab

removed hasattr check

72cc38e

replaced debug_info NoneType check with prediction_factors_weights which applies same check with more code clarity. Also earlier condition was always failing as 0 equals False in python

325d5d9

hsheth2 approved these changes Dec 29, 2022

mardikark-gslab added 2 commits December 29, 2022 16:46

Modified the float comparison, also changed the DebugInfo instance variable types to float only instead of float/str

c619e6b

Removed unused import

cc6d546

hsheth2 reviewed Dec 29, 2022

Removed cast operation

d41a293

hsheth2 approved these changes Dec 30, 2022

hsheth2 added 2 commits December 30, 2022 01:36

Update datahub-classify/src/datahub_classify/infotype_helper.py

5b5c0cd

Update datahub-classify/src/datahub_classify/infotype_helper.py

0bcc6bd

hsheth2 merged commit d34b8d7 into acryldata:main Dec 30, 2022
