Originally posted by @GoldenGoldy August 3, 2024
I found that PySR warns about spaces in column names when the data passed to the .fit function contains them. It replaces the spaces in the column names with underscores and prints a warning about this. You can then proceed with fitting the data as normal.
The .predict function, however, does not attempt the same replacement of spaces with underscores in the column names when it is called later.
So, if we have a fitted model and want to use it to make predictions, and we pass data to the .predict function in the same format that we used for the .fit function, we can run into the following issue:
The predict function (in sr.py) contains the following line of code: "X = X.reindex(columns=self.feature_names_in_)". When the column names contain spaces, this produces NaN values, because it tries to match the column names (with spaces) against the feature names stored in the model, where the spaces were replaced by underscores. The reindex finds no match for those columns and fills them with NaN.
We then get the somewhat confusing message "ValueError: Input X contains NaN.", which leads one to believe that there are NaN values in the data even though there are none. They are only introduced by the reindex, which can't match the column names.
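The effect can be reproduced with pandas alone, without PySR. This is only a minimal sketch: the column names are made up, and the `feature_names_in_` list here is a stand-in for what the fitted model stores after replacing spaces, not the model attribute itself.

```python
import pandas as pd

# Data with a space in one column name; it contains no missing values.
X = pd.DataFrame({"my feature": [1.0, 2.0, 3.0], "other": [4.0, 5.0, 6.0]})

# Stand-in (assumption) for the feature names stored at fit time,
# after spaces were replaced with underscores.
feature_names_in_ = [name.replace(" ", "_") for name in X.columns]

# Same kind of call as the line quoted from predict: "my_feature" does not
# exist in X, so reindex creates that column and fills it with NaN.
X_reindexed = X.reindex(columns=feature_names_in_)

has_nan_before = bool(X.isna().any().any())
has_nan_after = bool(X_reindexed.isna().any().any())
print("NaN before reindex:", has_nan_before)
print("NaN after reindex:", has_nan_after)
```

The column without a space ("other") keeps its values, so only the renamed features end up as NaN.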
All this can of course be avoided once you are aware of the problem, by not using spaces in the column names from the beginning. However, it might be more consistent, and give a better user experience, if the .predict function also replaced spaces in the column names with underscores?
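Until then, a user-side workaround might look like the sketch below. The helper `replace_spaces_in_columns` is hypothetical (it is not part of PySR), and it assumes that replacing spaces with underscores is the only renaming applied at fit time, as described above.

```python
import pandas as pd


def replace_spaces_in_columns(X: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of X whose column names use underscores instead of spaces.

    Hypothetical helper, not a PySR function.
    """
    return X.rename(columns=lambda name: str(name).replace(" ", "_"))


# New data in the same format that was originally passed to fit.
X_new = pd.DataFrame({"my feature": [1.0, 2.0], "other": [3.0, 4.0]})
X_clean = replace_spaces_in_columns(X_new)

# Stand-in (assumption) for the feature names stored by the fitted model.
feature_names = ["my_feature", "other"]

# With matching names, the reindex no longer introduces NaN values.
X_aligned = X_clean.reindex(columns=feature_names)
print(X_aligned)
```

Passing `X_clean` instead of `X_new` to .predict should then avoid the mismatch, since the column names already agree with the ones the model stored.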
Discussed in #689