
Add support for pandas nullable types in feature matrix #6851

Closed
freddyaboulton opened this issue Apr 13, 2021 · 6 comments · Fixed by #7760

Comments


freddyaboulton commented Apr 13, 2021

I would like to use pandas nullable types with xgboost. Currently, xgboost will raise a ValueError if any of the features use the new nullable pandas types.

Repro with xgb 1.2.1 and pandas 1.2.3:

from xgboost import XGBClassifier
import pandas as pd
import pytest

for dtype in ['Int64', 'Float64', 'boolean']:
    xgb = XGBClassifier()

    # ValueError is raised when any feature column uses a nullable dtype
    df = pd.DataFrame({"a": pd.Series([True, False, True, False], dtype=dtype),
                       "b": pd.Series([4, 5, 6, 7])})
    y = pd.Series([1, 0, 1, 0])

    with pytest.raises(ValueError):
        xgb.fit(df, y)


# No ValueError when using the "old" dtypes
for dtype in ['int', 'float', 'bool']:
    df = pd.DataFrame({"a": pd.Series([True, False, True, False], dtype=dtype)})
    y = pd.Series([1, 0, 1, 0])

    xgb = XGBClassifier()
    xgb.fit(df, y)

As a user, I would like this feature so I can build a data processing pipeline using the latest pandas capabilities.

Thank you!

@freddyaboulton freddyaboulton changed the title Add support for pandas nullable types in features Add support for pandas nullable types in feature matrix Apr 13, 2021
@trivialfis (Member)

Took a quick look. I think a better way is to work on #5667 so that we can have masked arrays.

@freddyaboulton (Author)

Thanks for taking a look @trivialfis ! So a feature matrix with pandas nullable types will be converted to a masked array? What's the status of #5667 ?

@trivialfis (Member)

> So a feature matrix with pandas nullable types will be converted to a masked array?

That's what I'm hoping and I suspect that's how it's implemented for nullable arrays.

> What's the status of #5667 ?

I assigned myself to work on it based on the existing work by @zhangzhang10. But Arrow can chunk the data in arbitrary ways, so it's a bit messy; I will try to follow up.
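For reference, a pandas nullable array already exposes its values and validity mask separately through public API, which is roughly the shape a masked-array code path would consume. A minimal sketch (the variable names are illustrative, not from the thread):

```python
import numpy as np
import pandas as pd

# A nullable integer column is conceptually values plus a boolean validity mask.
s = pd.Series([1, None, 3], dtype="Int64")

mask = s.isna().to_numpy()                             # True where missing
values = s.to_numpy(dtype="float64", na_value=np.nan)  # NaN-filled copy

print(mask.tolist())   # [False, True, False]
print(values[0], values[2])
```

`Series.to_numpy(dtype=..., na_value=...)` is the supported way to materialize a nullable column without tripping over `pd.NA`.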

@freddyaboulton (Author)

@trivialfis @hcho3 Are there any updates on this?

I see that xgboost can handle missing data. If I have a dataframe with nullable integers, e.g. Int64, I would have to first convert to double, correct?

from xgboost import XGBClassifier
import pandas as pd
import pytest

    
xgb = XGBClassifier()

# ValueError is raised because column "b" uses a nullable dtype
df = pd.DataFrame({"a": pd.Series([True, False, True, False]),
                   "b": pd.Series([4, 5, 6, None], dtype='Int64'),
                   "c": pd.Series([1.23, 4.58, 3.14, 83.12])})
y = pd.Series([1, 0, 1, 0])

with pytest.raises(ValueError):
    xgb.fit(df, y)

Thank you!


Ark-kun commented Aug 31, 2022

XGBoost still fails when encountering nullable floating-point dtypes.


gsheni commented Jan 27, 2023

@Ark-kun I believe this now works (xgboost supports nullable types).

from xgboost import XGBClassifier
import pandas as pd
import pytest

xgb = XGBClassifier()

# No ValueError: nullable dtypes are now accepted as features
df = pd.DataFrame({"a": pd.Series([True, False, True, False], dtype='boolean'),
                   "b": pd.Series([4, 5, 6, None], dtype='Int64'),
                   "c": pd.Series([1.23, 4.58, 3.14, 83.12], dtype='Float64')})
y = pd.Series([1, 0, 1, 0])
xgb.fit(df, y)
