Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix mixed numeric datatypes for optimizers #667

Merged
merged 15 commits into from
Feb 7, 2024

Conversation

ephoris
Copy link
Contributor

@ephoris ephoris commented Feb 5, 2024

Addressing issue discussed in #666

@ephoris
Copy link
Contributor Author

ephoris commented Feb 5, 2024

@microsoft-github-policy-service agree

This reverts commit 55bf6b3.
Realized this is required as again, df.iloc[0] will convert all items to a similar type because pandas.Series cannot be mixed types. In the case of numeric values, everything is implicitly translated into numpy.float64 types.
Copy link
Member

@motus motus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks nice! Next step, we should make sure LlamaTune does not choke on ConfigSpace instances that have conditionals (like, with the tunables that have special values). For starters, LlamaTune should be able to support that input: tunable_to_configspace_test.py:32

mlos_core/mlos_core/spaces/adapters/llamatune.py Outdated Show resolved Hide resolved
mlos_core/mlos_core/tests/optimizers/optimizer_test.py Outdated Show resolved Hide resolved
@motus motus added the mlos-core label Feb 6, 2024
@ephoris
Copy link
Contributor Author

ephoris commented Feb 6, 2024

Hm okay I have noticed that the llamatune_opt_test.py:49 failed. Looking into it, I see that some test in mlos_core_opt_df_test.py:45 define parameters with special characters. For example kernel_sched_migration_cost_ns!type. Normally this column name is fine if we're converting from pandas.DataFrame to dict converting, but using a NamedTuples received from itertuples() will parse these columns strings with undefined behavior, which is why I am getting a test failure. I think we might have to use a different solution instead of pandas.DataFrame.itertuples().

@ephoris
Copy link
Contributor Author

ephoris commented Feb 6, 2024

Since I am assuming MLOS wants to support generic strings for hyper parameters, we can revert back to iterrows(). Then by typecasting the dataframe to the object type, we can preserve numeric dtypes when calling iterrows() (see smac_optimizers.py:339). It's not as elegant unfortunately, but this prevents pandas from upcasting integers to floats.

@bpkroth bpkroth linked an issue Feb 6, 2024 that may be closed by this pull request
@bpkroth bpkroth added the bug Something isn't working label Feb 6, 2024
@ephoris ephoris marked this pull request as ready for review February 6, 2024 20:24
@ephoris ephoris requested a review from a team as a code owner February 6, 2024 20:24
@ephoris
Copy link
Contributor Author

ephoris commented Feb 6, 2024

Linter failed on 983763f. Fixed then reran linter locally, sorry for the spam. I think workflows should pass this time around.

@bpkroth
Copy link
Contributor

bpkroth commented Feb 7, 2024

Hm okay I have noticed that the llamatune_opt_test.py:49 failed. Looking into it, I see that some test in mlos_core_opt_df_test.py:45 define parameters with special characters. For example kernel_sched_migration_cost_ns!type. Normally this column name is fine if we're converting from pandas.DataFrame to dict converting, but using a NamedTuples received from itertuples() will parse these columns strings with undefined behavior, which is why I am getting a test failure. I think we might have to use a different solution instead of pandas.DataFrame.itertuples().

You could also try using itertuples(name=None) to get non-NamedTuples back.

@bpkroth
Copy link
Contributor

bpkroth commented Feb 7, 2024

Linter failed on 983763f. Fixed then reran linter locally, sorry for the spam. I think workflows should pass this time around.

No worries. I generally suggest using the devcontainer and just locally running make check and make test when you're doing things. It should run the same tests we did in CI then.

@bpkroth
Copy link
Contributor

bpkroth commented Feb 7, 2024

Final thought: I think there's a few places in mlos_bench where we do a to_numeric that might need re-examine with this.

ephoris and others added 3 commits February 7, 2024 12:39
Co-authored-by: Brian Kroth <bpkroth@users.noreply.github.com>
Co-authored-by: Brian Kroth <bpkroth@users.noreply.github.com>
Copy link
Contributor

@bpkroth bpkroth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks!

@bpkroth
Copy link
Contributor

bpkroth commented Feb 7, 2024

Final thought: I think there's a few places in mlos_bench where we do a to_numeric that might need re-examine with this.

We can take this up elsewhere I think.

@ephoris
Copy link
Contributor Author

ephoris commented Feb 7, 2024

Hm okay I have noticed that the llamatune_opt_test.py:49 failed. Looking into it, I see that some test in mlos_core_opt_df_test.py:45 define parameters with special characters. For example kernel_sched_migration_cost_ns!type. Normally this column name is fine if we're converting from pandas.DataFrame to dict converting, but using a NamedTuples received from itertuples() will parse these columns strings with undefined behavior, which is why I am getting a test failure. I think we might have to use a different solution instead of pandas.DataFrame.itertuples().

You could also try using itertuples(name=None) to get non-NamedTuples back.

Unfortunately, itertuples preserves dtypes, but it also preserves pandas.NA values. Normally I would say this is good, however, pandas.NA value is not treated the same as None values. This results in situations where the ConfigSpace package will throw errors as it only handles None values. For now I guess pandas.DataFrame.astype('O') might be the most elegant solution for the time being as iterrows() will implicitly convert pandas.NA to None values which most of the code paths were set up to deal with.

@ephoris
Copy link
Contributor Author

ephoris commented Feb 7, 2024

Thanks @bpkroth and @motus. Great learning experience on my end, I hope this bug fix will be helpful and prevent errors in the future.

@bpkroth bpkroth merged commit 9175d18 into microsoft:main Feb 7, 2024
11 of 12 checks passed
@ephoris ephoris deleted the ephoris/smac/iter_bug branch February 7, 2024 21:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working mlos-core
Projects
None yet
Development

Successfully merging this pull request may close these issues.

SMAC optimizer does not support mixed input space
3 participants