Warn when _validate_features sets feature_names #4937
- Current behavior silently sets feature_names of Booster to match data when Booster.predict(X) is called on a loaded model.
- Add warning to relevant section of Booster._validate_features so user is aware it is happening.
- Warning text: "Booster's feature_names did not exist. Booster's feature_names and feature_types are being set to match the data."
- Closes dmlc#4854
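The behavior described above can be sketched without xgboost itself. This is a minimal illustration under stated assumptions, not the library's implementation: `SketchBooster` and its `validate_features` method are hypothetical stand-ins for `Booster` and `Booster._validate_features`, and the mismatch branch is an assumed simplification. Only the warning text is taken from this PR.

```python
import warnings


class SketchBooster:
    """Hypothetical stand-in for a booster; not xgboost's real class."""

    def __init__(self, feature_names=None, feature_types=None):
        self.feature_names = feature_names
        self.feature_types = feature_types

    def validate_features(self, data_names, data_types=None):
        # A model loaded from disk may carry no feature names at all.
        if self.feature_names is None:
            # Warn instead of silently adopting the names of the data.
            warnings.warn(
                "Booster's feature_names did not exist. Booster's "
                "feature_names and feature_types are being set to match "
                "the data.",
                UserWarning,
            )
            self.feature_names = list(data_names)
            self.feature_types = (
                None if data_types is None else list(data_types)
            )
            return
        # Assumed simplification: any difference in names or order is an error.
        if list(data_names) != list(self.feature_names):
            raise ValueError(
                f"feature_names mismatch: {self.feature_names} {list(data_names)}"
            )
```

With this sketch, the first validation on a name-less booster emits the warning and adopts the names from the data; later calls with different names or a different column order raise `ValueError`.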
Looks good. Thanks. I wonder if there's an easy way to store UTF-8 feature names in C++ ...
I'm a little hesitant to merge it, since the message seems inevitable for most use cases.
We may be able to remove the validation if we can resolve #4594.
Yes, I agree that would be ideal. In the meantime no matter how common the use case (or even because of how common the use case!) I think it’s important for there to be some kind of guardrail against a mismatch between the trained model and the data. Displaying a warning seems like one reasonable way to prevent a “validate features” step from actually silently coercing features to match, but I could also imagine a more involved fix.
@awbirdsall You are definitely right about the warning. But let me take a shot at fixing the above-mentioned issue first. It might take a few days; if I'm not able to do so, I will merge this one. ;-) Thanks for your patience.
No problem, thanks for the work on the library! :)
Progress on #4954.
@trivialfis Should we consider saving the feature names in the Booster, like how we save them in the
Thought about it before, but not sure yet. I don't really like xgboost generating pseudo feature names.
@trivialfis We may want to consider removing the feature name validation option, since due to the lack of serialization of feature names, the option is only half working. Either we should get it to work reliably, or remove it.
Got it. We may remove the feature name generation, and save valid feature names obtained from pandas and the like. WDYT?
@trivialfis Sounds good, if your proposal also entails saving feature names in the C++ layer.
Yup. Always prefer doing things in C++ when possible.
@trivialfis I opened #6520 to keep track of the proposal. @awbirdsall I am closing this pull request in favor of #6520. We want a more robust solution to store the feature names in the model files.
@hcho3 thanks for the update and chasing after this! The improvements in the new issue look like they should be very helpful 👍
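For readers following the thread, the idea of storing feature names with the model can be illustrated roughly as below. This is only a sketch under assumptions: the JSON layout and the helpers `dump_model`, `load_model`, and `check_features` are hypothetical and say nothing about xgboost's actual model format or about what #6520 ended up implementing. The point is that names saved with the model, including non-ASCII ones, survive a save/load round trip and can be checked strictly instead of being coerced.

```python
import json


def dump_model(params, feature_names):
    """Serialize hypothetical model parameters together with feature names."""
    doc = {"params": params, "feature_names": list(feature_names)}
    # ensure_ascii=False keeps non-ASCII names readable in the output text.
    return json.dumps(doc, ensure_ascii=False)


def load_model(text):
    """Return (params, feature_names); names are None if never saved."""
    doc = json.loads(text)
    return doc["params"], doc.get("feature_names")


def check_features(saved_names, data_names):
    """Raise if the data's columns differ from the names saved with the model."""
    if saved_names is None:
        return  # nothing was saved, so there is nothing to compare against
    if list(saved_names) != list(data_names):
        raise ValueError(
            f"expected features {list(saved_names)}, got {list(data_names)}"
        )
```

Because the names travel with the model, a loaded model can reject mismatched data rather than adopting whatever names the data happens to carry.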