Add example #791
Conversation
* adds example for Feurer et al. (2015)
* removes the stub for Fusi et al. (2018) as they actually perform the same task. I can't create an example, though, as they used regression datasets for classification (and OpenML by now forbids creating such tasks).
Codecov Report
@@             Coverage Diff             @@
##           develop     #791      +/-   ##
===========================================
- Coverage    87.71%   87.68%   -0.03%
===========================================
  Files           36       36
  Lines         4208     4248      +40
===========================================
+ Hits          3691     3725      +34
- Misses         517      523       +6
Continue to review full report at Codecov.
]

####################################################################################################
# The dataset IDs could be used directly to load the dataset and split the data into a training
Are you sure you want to start with the dataset IDs, rather than the task IDs?
If the answer is yes, this clearly signals that we do not have any good procedure for "getting tasks that belong to a given set of datasets". We should either extend the API to support this better, provide the functions below as convenience functions, or do both.
The reasoning here is to stay close to the Auto-sklearn paper, where only dataset IDs are given. What kind of convenience function would you like to have? Something like:
from typing import List

def get_tasks_for_dataset(
    dataset_id: int,
    task_type_id: int,
    estimation_procedure: str,
    status: str,
    check_target_attribute: bool,
) -> List[int]:
    pass
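For concreteness, here is one way such a helper could be sketched: filter the dictionary returned by `openml.tasks.list_tasks` (which is keyed by task ID) down to the tasks that run on a given dataset. The field names `did` and `estimation_procedure` follow the usual `list_tasks` output, but this is an illustration of the proposal, not an existing library API:

```python
from typing import Dict, List


def get_tasks_for_dataset(
    tasks: Dict[int, dict],
    dataset_id: int,
    estimation_procedure: str,
) -> List[int]:
    # Hypothetical helper: `tasks` is expected to look like the output of
    # openml.tasks.list_tasks (task ID -> dict of task properties).
    return sorted(
        task_id
        for task_id, task in tasks.items()
        if task.get("did") == dataset_id
        and task.get("estimation_procedure") == estimation_procedure
    )


# Usage with a toy stand-in for list_tasks output:
toy_tasks = {
    1: {"did": 3, "estimation_procedure": "10-fold Crossvalidation"},
    2: {"did": 3, "estimation_procedure": "33% Holdout set"},
    3: {"did": 6, "estimation_procedure": "10-fold Crossvalidation"},
}
matching = get_tasks_for_dataset(toy_tasks, 3, "10-fold Crossvalidation")
```

In the real version one would first call `list_tasks` with the desired `task_type_id` and `status`, then apply a filter like this on the result.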
I will also make the note more drastic.
# deactivated tasks
tasks_d = openml.tasks.list_tasks(
    task_type_id=1,
    status='deactivated',
why not search for status "all" ?
Lack of knowledge, I'll update the example.
task_ids.sort()

# These are the tasks to work with:
print(task_ids)
logging.info?
I think print is fine for examples. logging is only important for the library itself to make the amount of output controllable.
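For what it's worth, the difference can be sketched in a few lines: logging lets the consumer redirect or silence output, which `print` cannot. This is a stand-alone illustration, not code from this PR:

```python
import io
import logging

# Route log records into an in-memory buffer; a caller could equally raise
# the level to logging.WARNING to silence informational output.
# (force=True, which resets previously installed handlers, needs Python 3.8+.)
buffer = io.StringIO()
logging.basicConfig(stream=buffer, level=logging.INFO, force=True)

task_ids = [3, 6, 11]
logging.info("These are the tasks to work with: %s", task_ids)
```

With `print`, the output would go unconditionally to stdout; here the consumer owns both destination and verbosity.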
Looks good, I could not find anything big.
Thanks for the review, I hope I could address your comments.
@@ -10,9 +10,80 @@
~~~~~~~~~~~

| Efficient and Robust Automated Machine Learning
| Matthias Feurer, Aaron Klein, Katharina Eggensperger, Jost Springenberg, Manuel Blum and Frank Hutter
| Matthias Feurer, Aaron Klein, Katharina Eggensperger, Jost Springenberg, Manuel Blum and Frank Hutter # noqa F401
@mfeurer wrong usage of `# noqa F401` in the text? It is not interpreted as a comment there.
Maybe you meant it here:
import pandas as pd
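For reference, flake8 only honours `# noqa: F401` when it trails the import statement itself; inside rST prose it is rendered as literal text. A minimal stand-alone illustration using a stdlib module in place of pandas:

```python
# "# noqa: F401" suppresses flake8's "imported but unused" warning, but only
# when it sits on the import line itself; in surrounding text it does nothing.
import json  # noqa: F401

# The import otherwise behaves normally:
data = json.loads("[1, 2]")
```

So in this example the marker belongs on the `import pandas as pd` line, not in the docstring.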
# The dataset IDs could be used directly to load the dataset and split the data into a training |
*training set
# It is discouraged to work directly on datasets and only provide dataset IDs in a paper as
# this does not allow reproducibility (unclear splitting). Please do not use datasets but the
# respective tasks as basis for a paper and publish task IDS. This example is only given to
# showcase the use OpenML-Python for a published paper and as a warning on how not to do it.
the use of OpenML-Python*