Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Correctly infer bool and object types in autoML #2765

Merged
merged 13 commits into from
Nov 18, 2022

Conversation

arnavgarg1
Copy link
Contributor

@arnavgarg1 arnavgarg1 commented Nov 16, 2022

Fixes an issue where is_field_boolean always return True if there are greater than 3 distinct values. This should return False by default if there are greater than 3 values.

Additionally, some of the type checking fails because source.get_distinct_values() actually gets unique values after dropping NaNs. This prevents us from catching the case we're looking for, i.e., 3 distinct values of which one is a None/NaN. This logic is modified to check for None and Nans, and return True if either of those are found in the case that there are 3 distinct values.

Co-authored-by: @jppgks

@arnavgarg1 arnavgarg1 changed the title Change logic to correctly infer bool and object types Correctly infer bool and object types in autoML Nov 16, 2022
@github-actions
Copy link

github-actions bot commented Nov 16, 2022

Unit Test Results

         6 files  ±  0           6 suites  ±0   3h 56m 42s ⏱️ + 35m 41s
  3 527 tests +11    3 445 ✔️ +10    82 💤 +1  0 ±0 
10 581 runs  +33  10 317 ✔️ +30  264 💤 +3  0 ±0 

Results for commit 8c19ad1. ± Comparison against base commit 87ca887.

♻️ This comment has been updated with latest results.

Copy link
Contributor

@magdyksaleh magdyksaleh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would have assumed that the fix would be

def is_field_boolean(source: DataSource, field: str) -> bool:
    num_unique_values, unique_values, _ = source.get_distinct_values(field, max_values_to_return=4)
    if num_unique_values <= 3:
        for entry in unique_values:
            try:
                if np.isnan(entry):
                    continue
            except TypeError:
                # For some field types such as object arrays np.isnan throws a TypeError
                # we catch it since we know in this case it is not a bool.
                return False
            if isinstance(entry, bool):
                continue
            return False
        return True   
    return False

ludwig/automl/base_config.py Outdated Show resolved Hide resolved
ludwig/automl/base_config.py Outdated Show resolved Hide resolved
@arnavgarg1
Copy link
Contributor Author

I would have assumed that the fix would be

def is_field_boolean(source: DataSource, field: str) -> bool:
    num_unique_values, unique_values, _ = source.get_distinct_values(field, max_values_to_return=4)
    if num_unique_values <= 3:
        for entry in unique_values:
            try:
                if np.isnan(entry):
                    continue
            except TypeError:
                # For some field types such as object arrays np.isnan throws a TypeError
                # we catch it since we know in this case it is not a bool.
                return False
            if isinstance(entry, bool):
                continue
            return False
        return True   
    return False

So this is what I initially did, but the problem is that the unique_values returned from source.get_distinct_values always drops the NaNs before returning the unique values. So if you have a column with 3 distinct values ['a', 'b', np.nan], we're only going to get 2 values back. The result of that is that np.isnan() will always fail and it'll return False in the except block, but IMO, this is actually a valid bool type field.

@arnavgarg1 arnavgarg1 requested a review from jppgks November 17, 2022 15:23
@arnavgarg1 arnavgarg1 merged commit faeba6f into master Nov 18, 2022
@arnavgarg1 arnavgarg1 deleted the fix_bool_type_inference branch November 18, 2022 11:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants