Fit_pipeline honoring api constraints #173
Conversation
```python
        include_components: Optional[Dict] = None,
        exclude_components: Optional[Dict] = None,
        search_space_updates: Optional[HyperparameterSearchSpaceUpdates] = None
    ) -> TabularClassificationPipeline:
        return TabularClassificationPipeline(dataset_properties=dataset_properties)
```
Should `include_components`, `exclude_components`, and the search space updates also be present in the return on line 116? They are there for tabular regression. Otherwise, they do not seem to be used.
Oops, I seem to have missed this. Thanks for pointing it out.
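For reference, the fix presumably looks something like the following sketch; the `include`/`exclude` parameter names on the pipeline constructor are an assumption based on this diff, not confirmed by the PR:

```python
# Hypothetical sketch of the fix: forward the constraints instead of
# dropping them when building the pipeline.
from typing import Any, Dict, Optional

from autoPyTorch.pipeline.tabular_classification import TabularClassificationPipeline
from autoPyTorch.utils.hyperparameter_search_space_update import HyperparameterSearchSpaceUpdates


def build_pipeline(
    dataset_properties: Dict[str, Any],
    include_components: Optional[Dict] = None,
    exclude_components: Optional[Dict] = None,
    search_space_updates: Optional[HyperparameterSearchSpaceUpdates] = None,
) -> TabularClassificationPipeline:
    return TabularClassificationPipeline(
        dataset_properties=dataset_properties,
        include=include_components,        # assumed constructor parameter name
        exclude=exclude_components,        # assumed constructor parameter name
        search_space_updates=search_space_updates,
    )
```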
@ravinkohli As an additional note, please add this to the
autoPyTorch/api/base_task.py
Outdated
```
@@ -144,6 +153,26 @@ def __init__(
        search_space_updates: Optional[HyperparameterSearchSpaceUpdates] = None,
        task_type: Optional[str] = None
    ) -> None:
        """

        Args:
```
I think this docstring is WIP :)?
Oh, it's not important. We already have the documentation for BaseTask. I'll remove this.
Looking more into it, please also add `task_type` in the docs for base_task.
autoPyTorch/api/base_task.py
Outdated
```python
        resampling_strategy: Union[CrossValTypes, HoldoutValTypes] = HoldoutValTypes.holdout_validation,
        resampling_strategy_args: Optional[Dict[str, Any]] = None,
        dataset_name: Optional[str] = None,
        return_only: Optional[bool] = False
```
Instead of having a `return_only` flag, maybe create it once? With `return_only`, every time you call this method you have to create and save a new dataset. Do you think it would be better to:
- First check whether `load_datamanager` from the backend can load a dataset, and if so, load it rather than creating a new one.
- Otherwise, create a new dataset.

What do you think?
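For illustration, a minimal sketch of the suggested load-or-create pattern; the backend's `load_datamanager`/`save_datamanager` calls follow the discussion here, while the exact failure mode when nothing is stored yet is an assumption:

```python
# Hypothetical sketch on the task class: reuse a persisted dataset if the
# backend already has one, otherwise create and save a new one.
def _get_dataset(self, X_train, y_train, X_test=None, y_test=None):
    try:
        dataset = self._backend.load_datamanager()
    except FileNotFoundError:
        # Assumed failure mode when no datamanager has been saved yet.
        dataset = None
    if dataset is None:
        dataset = self._create_dataset(X_train=X_train, y_train=y_train,
                                       X_test=X_test, y_test=y_test)
        self._backend.save_datamanager(dataset)
    return dataset
```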
The reason I had it always create and save a dataset is that we may want to fit a pipeline on a different dataset. For example, once a search has finished, we can fit the best incumbent configuration on a larger dataset (I see this as a very common use case: someone searches on a subsampled version of the dataset and then fits the best found configuration on the whole dataset) or on a different, related dataset. `return_only` lets us use the same function to create a dataset whenever we get new data, i.e., in search, fit, and refit. As we also return the dataset from the `fit_pipeline` method, the same dataset can be used to predict and score with the pipeline.
Do you still think it is better to load a datamanager from the backend and use it? If so, do you know of a way to achieve both?
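To illustrate the use case described above, a hypothetical end-to-end sketch; the `fit_pipeline` signature and return values are still under discussion in this PR, so every argument name and the returned tuple shape are assumptions:

```python
# Hypothetical workflow: search on a subsample, then fit the incumbent
# configuration on the full dataset via fit_pipeline.
from autoPyTorch.api.tabular_classification import TabularClassificationTask

api = TabularClassificationTask()

# Search on a subsampled version of the data to save time.
api.search(X_train=X_sub, y_train=y_sub, total_walltime_limit=300)

# Then fit the best found configuration on the whole dataset. The returned
# dataset can be reused for predict/score, per the discussion above.
pipeline, dataset = api.fit_pipeline(
    X_train=X_full, y_train=y_full,
    configuration=incumbent_configuration,  # e.g. the best found config
)
predictions = pipeline.predict(X_test)
```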
autoPyTorch/api/base_task.py
Outdated
```python
        resampling_strategy_args = resampling_strategy_args if resampling_strategy_args is not None else \
            self.resampling_strategy_args

        dataset = self._create_dataset(X_train=X_train,
```
If we enhance `_create_dataset`, we can call `fit_pipeline` multiple times without needing a new dataset each time. Plus, the training split will not change between calls. Also, the name should then probably be `get_dataset`.
autoPyTorch/api/base_task.py
Outdated
```python
        # get dataset properties
        dataset_requirements = get_dataset_requirements(
            info=self._get_required_dataset_properties(dataset))
        dataset_properties = dataset.get_dataset_properties(dataset_requirements)
        self._backend.save_datamanager(dataset)

        self._backend._make_internals_directory()
```
If it is troublesome to have this here, then let us move it to the constructor.
I think it is already called in the backend constructor, so I'll remove it.
autoPyTorch/api/base_task.py
Outdated
```python
                                       exclude_components=exclude_components,
                                       search_space_updates=search_space_updates)
        if configuration is None:
            configuration = pipeline.get_hyperparameter_search_space().get_default_configuration()
```
Do you think it makes sense to make this more flexible? Why the default and not a random configuration? Maybe we should make `configuration` a required argument? Shouldn't this be passed to `build_pipeline`?
So I think I'll remove this part and make `configuration` a required argument. It will then be passed to the TAE, which will take care of the rest.
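For context, both the default and a random configuration are one call away on a ConfigSpace search space, which is why requiring the caller to pass a `Configuration` keeps things flexible. A standalone ConfigSpace illustration (a toy space, not autoPyTorch code):

```python
from ConfigSpace.configuration_space import ConfigurationSpace
from ConfigSpace.hyperparameters import UniformFloatHyperparameter

# Toy stand-in for pipeline.get_hyperparameter_search_space().
cs = ConfigurationSpace()
cs.add_hyperparameter(
    UniformFloatHyperparameter("learning_rate", 1e-4, 1e-1, default_value=1e-2))

default_config = cs.get_default_configuration()  # the deterministic default
random_config = cs.sample_configuration()        # a random configuration instead
```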
autoPyTorch/api/base_task.py
Outdated
```python
        if search_space_updates is None:
            search_space_updates = self.search_space_updates

        pipeline = self.build_pipeline(dataset_properties=dataset_properties,
```
I wonder if we need this `build_pipeline` at all? Can we rely only on `ExecuteTaFuncWithQueue`? That way `configuration` can be an int or a string, and then we get a traditional pipeline, for example.
That makes sense.
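A rough sketch of what such type-based dispatch could look like; the mapping of types to pipeline kinds is an assumption, and `ExecuteTaFuncWithQueue`'s actual handling is not shown:

```python
from typing import Union

from ConfigSpace.configuration_space import Configuration


def describe_configuration(configuration: Union[int, str, Configuration]) -> str:
    # Hypothetical dispatch: the type of `configuration` selects the
    # kind of pipeline the evaluator would build.
    if isinstance(configuration, Configuration):
        return "regular autoPyTorch pipeline with these hyperparameters"
    if isinstance(configuration, str):
        return f"traditional pipeline, e.g. {configuration!r}"
    return f"baseline/dummy pipeline number {configuration}"
```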
test/test_api/test_api.py
Outdated
```python
@pytest.mark.parametrize("disable_file_output", [True, False])
@pytest.mark.parametrize('openml_id', (40981,))
```
Can we test another configuration :)? I think we always test Australian... Maybe add:
Why are very fast?
What do you mean by "Why are very fast"? Yes, I'll put in one of these configurations.
```python
        data_id=int(openml_id),
        return_X_y=True, as_frame=True
    )
    X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
```
Let us make this function faster by using 20% of the data for training and 80% for testing.
Okay.
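The suggested split in sklearn terms, as a sketch; dataset id 40981 is taken from the test above, and `random_state` is an arbitrary choice:

```python
import sklearn.datasets
import sklearn.model_selection

X, y = sklearn.datasets.fetch_openml(data_id=40981, return_X_y=True, as_frame=True)

# 20% for training, 80% for testing keeps the fit cheap, per the suggestion.
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    X, y, train_size=0.2, random_state=1)
```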
Thank you very much. I believe this function is especially critical for debugging. I left a few comments to make it even more useful.
Great, looks good. You can merge it as soon as the unit tests pass.
Merged 421005f into automl:refactor_development_regularization_cocktails
This PR fixes #149. The changes are as follows:
- `build_pipeline`, where it was not taking the include, exclude, and search space updates into account.