Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Warn when _validate_features sets feature_names #4937

Closed
wants to merge 1 commit into from

Conversation

awbirdsall
Copy link

- Current behavior silently sets feature_names of Booster to
  match data when Booster.predict(X) is called on a loaded model
- Add warning to relevant section of Booster._validate_features
  so user is aware it is happening.
- Warning text: "Booster's feature_names did not exist. Booster's
  feature_names and feature_types are being set to match the data."
- Closes dmlc#4854
@trivialfis
Copy link
Member

Looks good. Thanks. I wonder if there's an easy way to store utf-8 feature names in c++ ...

@trivialfis
Copy link
Member

A little hesitate to merge it since the message seems inevitable for most use cases.

@trivialfis
Copy link
Member

May be able to remove the validation if we can resolve #4594

@awbirdsall
Copy link
Author

Yes, I agree that would be ideal. In the meantime no matter how common the use case (or even because of how common the use case!) I think it’s important for there to be some kind of guardrail against a mismatch between the trained model and the data. Displaying a warning seems like one reasonable way to prevent a “validate features” step from actually silently coercing features to match, but I could also imagine a more involved fix.

@trivialfis
Copy link
Member

trivialfis commented Oct 15, 2019

@awbirdsall You are definitely right about the warning. But let me give a shot at fixing the above mentioned issue first. It might take a few days if I weren't able to do so I will merge this one. ;-) Thanks for the patience.

@awbirdsall
Copy link
Author

No problem, thanks for the work on the library! :)

@trivialfis
Copy link
Member

Progress on #4954 .

@hcho3
Copy link
Collaborator

hcho3 commented Dec 16, 2020

@trivialfis Should we consider saving the feature names in the Booster, like how we save them in the DMatrix? Otherwise, we can't do a reliable feature validation after saving and loading the model.

@trivialfis
Copy link
Member

Thought about it before, but not sure yet. I don't really like xgboost generating pseudo feature names.

@hcho3
Copy link
Collaborator

hcho3 commented Dec 16, 2020

@trivialfis We may want to consider removing the feature name validation option, since due to the lack of serialization of feature names, the option is only half working. Either we should get it to work reliably, or remove it.

@trivialfis
Copy link
Member

Got it. We may remove the feature name generation, and save valid feature names obtained from pandas and alike. WDYT?

@hcho3
Copy link
Collaborator

hcho3 commented Dec 16, 2020

@trivialfis Sounds good, if your proposal also entails saving feature names in the C++ layer.

@trivialfis
Copy link
Member

Yup. Always prefer doing things in c++ when possible.

@hcho3 hcho3 closed this Dec 16, 2020
@hcho3
Copy link
Collaborator

hcho3 commented Dec 16, 2020

@trivialfis I opened #6520 to keep track of the proposal.

@awbirdsall I am closing this pull request in favor of #6520. We want a more robust solution to store the feature names in the model files.

@awbirdsall
Copy link
Author

@hcho3 thanks for the update and chasing after this! The improvements in the new issue look like they should be very helpful 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[python-package] running .predict() on a loaded model silently sets the feature_names attribute
3 participants