This repository has been archived by the owner on Nov 16, 2023. It is now read-only.

Handler does not process large integers correctly #274

Open
pieths opened this issue Sep 23, 2019 · 1 comment

Comments

@pieths
Collaborator

pieths commented Sep 23, 2019

import numpy as np
import pandas as pd
from nimbusml import FileDataStream
from nimbusml.preprocessing.missing_values import Handler

# 1 less than the maximum positive int32 value
# See: https://docs.scipy.org/doc/numpy/user/basics.types.html
large_int = 2147483646
with_nans = pd.DataFrame(data=dict(c1=[3, large_int, 5, 4])).astype(np.int32)

nahandle = Handler(replace_with='Mean') << 'c1'

result = nahandle.fit_transform(with_nans)
result = result.astype(np.int32)

print(result)
print(result.dtypes)
print(result.loc[1, 'c1.c1'])
print(result.loc[1, 'c1.c1'] == large_int)

The last line prints False, and the value returned for row 1 is -2147483648 rather than 2147483646.

This does work if the number is small enough to be represented exactly in a float32 (e.g. large_int = 21474836).

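This looks like it fails because the Handler transform implicitly converts its inputs to float32. A float32 has only a 24-bit significand, so 2147483646 rounds to 2147483648 (2^31), which is outside the int32 range and comes back as -2147483648 after the cast to int32 in the repro. Any value that float32 cannot represent exactly will not survive this transform unchanged.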
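The rounding itself can be checked without nimbusml. The following is a minimal sketch, not the Handler's actual code path: it uses the standard library `struct` module as a stand-in for a float32 conversion, and the helper names `to_float32` and `survives_float32` are made up for illustration.

```python
import struct

INT32_MAX = 2**31 - 1


def to_float32(value):
    """Round a number to the nearest float32 and return it as a Python float."""
    # Packing with the "f" format stores the value as an IEEE 754 single.
    return struct.unpack("f", struct.pack("f", value))[0]


def survives_float32(value):
    """Return True if an integer is unchanged by a float32 round trip."""
    return int(to_float32(value)) == value


large_int = 2147483646  # value from the repro above
small_int = 21474836    # value reported to work

print(to_float32(large_int))              # 2147483648.0
print(to_float32(large_int) > INT32_MAX)  # True: no longer fits in int32
print(survives_float32(large_int))        # False
print(survives_float32(small_int))        # True
```

Under this assumption, the large value is already out of int32 range before it is cast back, which is consistent with the result reported above.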

@pnshinde

Hi! I’m new to open source and I’d like to take on this task along with #269 over the next couple of weeks. Is that alright?
