Correctly infer bool and object types in autoML #2765
I would have assumed that the fix would be:
def is_field_boolean(source: DataSource, field: str) -> bool:
    num_unique_values, unique_values, _ = source.get_distinct_values(field, max_values_to_return=4)
    if num_unique_values <= 3:
        for entry in unique_values:
            try:
                if np.isnan(entry):
                    continue
            except TypeError:
                # For some field types such as object arrays np.isnan throws a TypeError
                # we catch it since we know in this case it is not a bool.
                return False
            if isinstance(entry, bool):
                continue
            return False
        return True
    return False
So this is what I initially did, but the problem is that the |
Fixes an issue where is_field_boolean always returned True if there were more than 3 distinct values. It should return False by default when there are more than 3 distinct values.

Additionally, some of the type checking fails because source.get_distinct_values() actually gets unique values after dropping NaNs. This prevents us from catching the case we're looking for, i.e., 3 distinct values of which one is a None/NaN. The logic is modified to check for None and NaNs, and to return True if either of those is found in the case that there are 3 distinct values.

Co-authored-by: @jppgks
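To make the rules in the description concrete, here is a minimal standalone sketch. It is not the code in this PR: it works on a plain iterable of raw column values rather than a Ludwig DataSource, and the names is_missing and looks_boolean are hypothetical helpers introduced only for illustration. It counts missing entries (None or NaN) as one distinct value, rejects anything with more than 3 distinct values, and accepts exactly 3 only when one of them is the missing value.

```python
import math
from typing import Any, Iterable, List


def is_missing(value: Any) -> bool:
    """Return True for None or a float NaN (hypothetical helper).

    The isinstance guard means strings and other objects are never passed
    to math.isnan, so no TypeError can be raised here.
    """
    return value is None or (isinstance(value, float) and math.isnan(value))


def looks_boolean(values: Iterable[Any]) -> bool:
    """Hypothetical stand-in for the check described above, on raw values."""
    distinct: List[Any] = []  # distinct non-missing values; a list tolerates unhashable objects
    has_missing = False
    for value in values:
        if is_missing(value):
            has_missing = True
            continue
        # Compare type as well as value so that 1 and True stay distinct.
        already_seen = any(
            value is seen or (type(value) is type(seen) and value == seen)
            for seen in distinct
        )
        if not already_seen:
            distinct.append(value)

    # Missing entries count as a single extra distinct value.
    num_distinct = len(distinct) + (1 if has_missing else 0)
    if not distinct or num_distinct > 3:
        return False  # all-missing column, or too many values to be boolean
    if num_distinct == 3 and not has_missing:
        return False  # three real values cannot be a boolean field
    return all(isinstance(value, bool) for value in distinct)
```

Under these assumptions, looks_boolean([True, False, None]) is accepted, while a column of four distinct strings or a column such as [1, 0, None] is rejected. Note that isinstance(value, bool) does not match numpy.bool_ values, so a real implementation reading from a DataFrame may need a broader type check.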