Restructured dev and test code #12

Merged 38 commits on Dec 30, 2022

PiyushGSlab
Contributor

  1. Restructured the unit testing code
  2. Pulled out common code into one function in infotype_helper.py

def inspect_for_email_address(metadata, values, config):
def compute_name_description_dtype_score(
metadata: Metadata, config: dict, debug_info: dict
) -> dict:
Contributor

it'd be good to start using dataclasses or TypedDict instead of raw dict types here and elsewhere in the code

Contributor Author

Updated the code with this change incorporated.

Contributor

Not quite what I was looking for. Where possible, please use dataclasses. For example, debug_info shouldn't be a dictionary but instead an instance of a dataclass
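
To illustrate the request above, here is a minimal sketch of a dataclass taking the place of the dictionary. The field names (name, description, datatype, values) are assumptions inferred from the config keys and from attribute accesses such as debug_info.name and debug_info.values that appear later in this conversation; the project's actual DebugInfo may be defined differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DebugInfo:
    """Per-factor scores for one column (hypothetical field set)."""

    name: Optional[float] = None
    description: Optional[float] = None
    datatype: Optional[float] = None
    values: Optional[float] = None


# Attribute access replaces string-keyed lookups such as debug_info[NAME],
# so a misspelled field is caught by the type checker instead of at runtime.
info = DebugInfo(name=1.0, values=0.3)
```

Unset factors default to None here, which keeps "not computed" distinct from a score of 0.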

Contributor Author

We have changed debug_info to a TypedDict. Does this work?

@PiyushGSlab
Contributor Author

As per Mayuri's request, we have modified some logger statements in infotype_predictor.py, to reduce the verbosity of messages

Contributor

hsheth2 left a comment

Looking at the code now, a dataclass would definitely be preferable over the TypedDict

-    metadata: Metadata, values: list, config: dict
-) -> tuple:  # noqa: C901
+    metadata: Metadata, values: List[Any], config: Dict[str, Any]
+) -> Tuple[float, Any]:  # noqa: C901
Contributor

Suggested change
-) -> Tuple[float, Any]:  # noqa: C901
+) -> Tuple[float, DebugInfo]:  # noqa: C901

Contributor Author

Changed DebugInfo to dataclass

@@ -181,7 +183,7 @@ def inspect_for_gender(
     try:
         if (
             debug_info.get(NAME, None)
-            and int(debug_info[NAME]) == 1
+            and int(debug_info[NAME]) == 1  # type: ignore
Contributor

why was this required?

Contributor Author

removed this, not required anymore

-    metadata: Metadata, values: list, config: dict
-) -> tuple:
+    metadata: Metadata, values: List[Any], config: Dict[str, Any]
+) -> Tuple[float, Any]:
Contributor

Use DebugInfo here instead of Any.

Contributor Author

fixed this wherever applicable

@@ -198,10 +200,10 @@ def inspect_for_gender(


 def inspect_for_credit_debit_card_number(
-    metadata: Metadata, values: list, config: dict
-) -> tuple:
+    metadata: Metadata, values: List[Any], config: Dict[str, Any]
Contributor

what is the type of config here?

Contributor Author

config is a dict holding the configuration for that particular infotype. For example:
{
    'Prediction_Factors_and_Weights': {
        'Name': 0.4,
        'Description': 0,
        'Datatype': 0,
        'Values': 0.6
    },
    'Name': { 'regex': [] },
    'Description': { 'regex': [] },
    'Datatype': { 'type': [] },
    'Values': {
        'prediction_type': 'regex/library',
        'regex': [],
        'library': []
    }
},

Contributor

Hi @hsheth2, I discussed changing this dict to a dataclass with @mayurinehate. We will work on it in a future PR.
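
As a rough idea of what that future change could look like, the sketch below maps the example config shown earlier in this thread onto dataclasses. The class names (FactorWeights, InfoTypeConfig), the field names, and the parse_config helper are all hypothetical and are not taken from the project; only the dictionary keys come from the example.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class FactorWeights:
    """Weights from 'Prediction_Factors_and_Weights' (hypothetical class)."""

    name: float = 0.0
    description: float = 0.0
    datatype: float = 0.0
    values: float = 0.0


@dataclass
class InfoTypeConfig:
    """Typed view of one infotype's configuration (hypothetical class)."""

    weights: FactorWeights
    name_regex: List[str] = field(default_factory=list)
    description_regex: List[str] = field(default_factory=list)
    datatypes: List[str] = field(default_factory=list)
    values_prediction_type: str = "regex"
    values_regex: List[str] = field(default_factory=list)
    values_library: List[str] = field(default_factory=list)


def parse_config(raw: Dict[str, Any]) -> InfoTypeConfig:
    """Convert the raw dict shape from the example into typed objects."""
    w = raw["Prediction_Factors_and_Weights"]
    return InfoTypeConfig(
        weights=FactorWeights(
            name=w["Name"],
            description=w["Description"],
            datatype=w["Datatype"],
            values=w["Values"],
        ),
        name_regex=list(raw["Name"]["regex"]),
        description_regex=list(raw["Description"]["regex"]),
        datatypes=list(raw["Datatype"]["type"]),
        values_prediction_type=raw["Values"]["prediction_type"],
        values_regex=list(raw["Values"]["regex"]),
        values_library=list(raw["Values"]["library"]),
    )
```

With this shape, the inspect_* functions could accept an InfoTypeConfig instead of Dict[str, Any], so a missing or misspelled key would fail once at parse time rather than deep inside a prediction.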

Contributor

hsheth2 left a comment

mostly looking good, had one tiny comment remaining

-            and int(debug_info[NAME]) == 1  # type: ignore
-            and VALUES in debug_info.keys()
-            and 0.5 > cast(float, debug_info[VALUES]) > 0.1
+            hasattr(debug_info, "name")
Contributor

I don't think this hasattr check is necessary anymore, now that we have the dataclass?

Contributor Author

Removed the hasattr check and changed the condition to check prediction_factors_weights instead, which makes the code clearer.

…th applies same check with more code clarity. Also, the earlier condition was always failing because 0 is falsy in Python
Contributor

hsheth2 left a comment

Approving for now

Note that I'm still not super happy with where we are on this codebase.

  1. config parsing and handling - generally we should be using pydantic models instead of raw dicts and constants for indexes
  2. class and variable naming / comments - for example, it's very unclear what DebugInfo's description actually contains (does it contain the confidence or something else?), especially since it can be a string or a float. another example: reference_input.py's input1 is actually the default config, but that fact is totally unclear from the code
  3. code redundancy / duplication in the infotype_helper.py file - the inspect_* methods all still look extremely similar and should be refactored more. Ideally we should have a single large map of type -> function with the "library" implementation, and the regex and other handling should be common
  4. The tests shouldn't have separate UNIT_TESTING and normal json files with expected values - instead we should just have a parameterized test that uses pytest marks to conditionally skip non-unit test cases
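
Point 3 above could be sketched roughly as follows: one map from infotype to its "library" check, with the regex handling and scoring shared by every infotype. Everything here is illustrative. LIBRARY_CHECKS, values_score, and the simplistic _looks_like_email stand-in are hypothetical names and are not the project's code.

```python
import re
from typing import Callable, Dict, List, Sequence


def _looks_like_email(value: str) -> bool:
    """Hypothetical stand-in for a real library-based email check."""
    return "@" in value and "." in value.rsplit("@", 1)[-1]


# Single map of infotype -> library check; only this part differs per infotype.
LIBRARY_CHECKS: Dict[str, Callable[[str], bool]] = {
    "email": _looks_like_email,
}


def values_score(infotype: str, values: Sequence[str], regexes: List[str]) -> float:
    """Fraction of values matched by the library check or by any regex."""
    if not values:
        return 0.0
    check = LIBRARY_CHECKS.get(infotype)
    patterns = [re.compile(p) for p in regexes]
    hits = 0
    for value in values:
        matched_by_library = check is not None and check(value)
        matched_by_regex = any(p.fullmatch(value) for p in patterns)
        if matched_by_library or matched_by_regex:
            hits += 1
    return hits / len(values)
```

Adding a new infotype would then mean adding one entry to the map rather than another near-identical inspect_* function.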

-            and hasattr(debug_info, "values")
-            and debug_info.values == 0
+            prediction_factors_weights.get(NAME, 0) > 0
+            and debug_info.name == 1.0
Contributor

comparing floats with == is unreliable
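
A short illustration of the concern, using only the standard library. The tolerance values are arbitrary choices for this sketch, not values taken from the project.

```python
import math

# Binary floating point cannot represent 0.1 or 0.2 exactly,
# so their sum is not exactly equal to the literal 0.3.
score = 0.1 + 0.2

exact = score == 0.3  # exact comparison is sensitive to rounding error
close = math.isclose(score, 0.3, rel_tol=1e-9, abs_tol=1e-12)  # tolerance-based
```

Here exact is False while close is True, which is why a tolerance-based comparison is the safer choice for computed scores.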

Contributor

Modified the code to avoid using "==" for float comparison. I have also changed the type of the DebugInfo instance variables to float, instead of float or str, and added a couple of TODOs about putting warning/error messages in the error flag, which will be passed in the ColumnInfo object in a future PR.

@mardikark-gslab
Contributor

Thanks @hsheth2 for the suggestions, I have noted them. We will take care of them in the next PR.

@@ -186,7 +192,7 @@ def inspect_for_gender(
     try:
         if (
             prediction_factors_weights.get(NAME, 0) > 0
-            and debug_info.name == 1.0
+            and abs(1 - cast(float, debug_info.name)) < 1e-10
Contributor

should be using debug_info.name is not None to avoid the call to cast
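
A minimal sketch of that suggestion, assuming DebugInfo.name is declared as Optional[float] (an assumption; the real declaration is not shown in this conversation). The helper name name_is_exact_match is hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DebugInfo:
    # Assumed declaration; the project's actual field type may differ.
    name: Optional[float] = None


def name_is_exact_match(debug_info: DebugInfo) -> bool:
    """True when the name score is within 1e-10 of 1.0."""
    # The `is not None` test narrows Optional[float] to float for type
    # checkers, so the subtraction below needs no typing.cast.
    return debug_info.name is not None and abs(1 - debug_info.name) < 1e-10
```

The explicit None check also guards against a TypeError at runtime when the name score was never computed, which cast alone would not do.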

Contributor

Modified the code

@hsheth2 hsheth2 merged commit d34b8d7 into acryldata:main Dec 30, 2022