Correctly infer bool and object types in autoML #2765

arnavgarg1 · 2022-11-16T11:18:18Z

Fixes an issue where is_field_boolean always returns True if there are more than 3 distinct values. It should return False by default when there are more than 3 values.

Additionally, some of the type checking fails because source.get_distinct_values() returns the unique values only after dropping NaNs. This prevents us from catching the case we're looking for, i.e., 3 distinct values of which one is None/NaN. The logic is modified to check for None and NaN, and to return True if either is found when there are 3 distinct values.
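To make the NaN-dropping effect concrete, here is a minimal pandas sketch. It assumes the distinct values are computed roughly like `dropna().unique()`; that is an assumption made for illustration, not a copy of what source.get_distinct_values() does internally, and the column is made up.

```python
import numpy as np
import pandas as pd

# Hypothetical column: two real values plus a missing value.
column = pd.Series([True, False, np.nan, True], dtype=object)

# Counting distinct values with and without the missing marker.
distinct_with_nan = column.nunique(dropna=False)  # NaN counted as a value
distinct_without_nan = column.nunique()           # dropna=True is the default

# Once NaN is dropped, the returned values carry no trace of it,
# so a later None/NaN check on these values can never fire.
values_after_drop = column.dropna().unique().tolist()

# The missing marker has to be detected on the column itself.
has_missing = bool(column.isna().any())
```

The point of the sketch is only that the presence of None/NaN must be checked separately from the list of distinct values.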

Co-authored-by: @jppgks


github-actions · 2022-11-16T12:30:30Z

Unit Test Results

6 files (±0), 6 suites (±0), duration 3h 56m 42s (+35m 41s)
3 527 tests (+11): 3 445 ✔️ (+10), 82 💤 (+1), 0 ❌ (±0)
10 581 runs (+33): 10 317 ✔️ (+30), 264 💤 (+3), 0 ❌ (±0)

Results for commit 8c19ad1. ± Comparison against base commit 87ca887.

♻️ This comment has been updated with latest results.

magdyksaleh

I would have assumed that the fix would be

def is_field_boolean(source: DataSource, field: str) -> bool:
    num_unique_values, unique_values, _ = source.get_distinct_values(field, max_values_to_return=4)
    if num_unique_values <= 3:
        for entry in unique_values:
            try:
                if np.isnan(entry):
                    continue
            except TypeError:
                # For some field types, such as object arrays, np.isnan raises a TypeError.
                # We catch it because in that case the entry is not a bool.
                return False
            if isinstance(entry, bool):
                continue
            return False
        return True
    return False

ludwig/automl/base_config.py
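For reference, a small sketch of how np.isnan treats the kinds of entries discussed in this thread. The helper name `isnan_outcome` is hypothetical and exists only for this illustration.

```python
import numpy as np


def isnan_outcome(entry):
    """Describe what np.isnan does with a single Python object."""
    try:
        return "nan" if np.isnan(entry) else "not nan"
    except TypeError:
        # np.isnan has no implementation for non-numeric inputs.
        return "TypeError"


# Map each sample entry to the outcome of np.isnan on it.
samples = [float("nan"), True, 1.5, "a", None]
outcomes = {repr(value): isnan_outcome(value) for value in samples}
```

Strings and None both land in the except branch, while bools and floats pass through np.isnan without raising.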

arnavgarg1 · 2022-11-16T13:18:12Z

This is what I did initially. The problem is that source.get_distinct_values always drops NaNs before returning the unique values, so for a column with 3 distinct values ['a', 'b', np.nan] we only get 2 values back. As a result, np.isnan() never sees a NaN: it raises a TypeError on the remaining string values and the function returns False from the except block. IMO, this is actually a valid bool-type field.
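A rough sketch of the rule described in the PR description, written against a plain pandas Series rather than Ludwig's DataSource. The helper `looks_boolean` is hypothetical and is not the merged is_field_boolean; in particular, treating columns with at most 2 distinct values as boolean candidates is an assumption of this sketch.

```python
import numpy as np
import pandas as pd


def looks_boolean(column: pd.Series) -> bool:
    """Hypothetical check: at most 3 distinct values, the third being None/NaN."""
    has_missing = bool(column.isna().any())
    # Count the missing marker explicitly as one extra distinct value.
    distinct_total = column.dropna().nunique() + int(has_missing)

    if distinct_total > 3:
        # More than 3 distinct values: not boolean by default.
        return False
    if distinct_total == 3:
        # Three distinct values only qualify when one of them is None/NaN.
        return has_missing
    # Assumption: 2 or fewer distinct values are treated as boolean candidates.
    return True


with_missing = looks_boolean(pd.Series(["a", "b", np.nan], dtype=object))
three_real = looks_boolean(pd.Series(["a", "b", "c"], dtype=object))
```

Under these assumptions ['a', 'b', np.nan] is accepted while ['a', 'b', 'c'] is rejected, which matches the case argued for above.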

ludwig/automl/base_config.py

Change logic to correctly infer bool and object types

46b3162

arnavgarg1 changed the title ~~Change logic to correctly infer bool and object types~~ Correctly infer bool and object types in autoML Nov 16, 2022

arnavgarg1 requested review from magdyksaleh, dantreiman, hungcs and justinxzhao November 16, 2022 11:18

pre-commit-ci bot and others added 3 commits November 16, 2022 11:19

[pre-commit.ci] auto fixes from pre-commit.com hooks

0e3786e

for more information, see https://pre-commit.ci

update test names

6381e2d

Merge branch 'fix_bool_type_inference' of https://github.com/ludwig-a…

0039def

…i/ludwig into fix_bool_type_inference

mark tests as distributed

c53776f

magdyksaleh reviewed Nov 16, 2022

ludwig/automl/base_config.py (outdated, resolved)

ludwig/automl/base_config.py (outdated, resolved)

change location of test

a95adc7

arnavgarg1 added 2 commits November 16, 2022 13:44

fix logic and update tests

9bb49d9

revert to old variable naming

d1c3570

arnavgarg1 requested a review from magdyksaleh November 16, 2022 13:45

magdyksaleh approved these changes Nov 16, 2022

jeffreyftang reviewed Nov 16, 2022

ludwig/automl/base_config.py (outdated, resolved)

arnavgarg1 added 2 commits November 16, 2022 19:38

Fix comment

dd3044c

better comment

12bac48

arnavgarg1 requested a review from jeffreyftang November 16, 2022 19:40

justinxzhao approved these changes Nov 17, 2022

consolidate Joppe's tests with my tests

8e2aa63

arnavgarg1 requested a review from jppgks November 17, 2022 15:23

arnavgarg1 added 2 commits November 18, 2022 09:06

force test checks

d1c3b9e

remove dummy test

8c19ad1

arnavgarg1 merged commit faeba6f into master Nov 18, 2022

arnavgarg1 deleted the fix_bool_type_inference branch November 18, 2022 11:15
