
Fit_pipeline honoring api constraints #173

Conversation

ravinkohli (Contributor)

This PR fixes #149.

The changes are as follows:

  1. Renames api.fit to api.fit_pipeline to avoid ambiguity (a rough usage sketch follows the list).
  2. The pipeline is now fit using the TAE, which honors the memory and time limit constraints via pynisher.
  3. Fixes a bug in build_pipeline where it was not taking the include, exclude and search space updates into account (see here).
  4. Adds tests ensuring we can always fit a pipeline and also use it to predict and score on the data.
  5. Aims to reduce the ambiguity regarding disable_file_output at various places in the API.
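
As a rough usage sketch of the renamed call (illustrative only; the argument names run_time_limit_secs / memory_limit and the returned tuple are assumptions and may not match the merged signature):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

from autoPyTorch.api.tabular_classification import TabularClassificationTask

X, y = make_classification(n_samples=200, n_features=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

api = TabularClassificationTask()

# fit_pipeline evaluates a single configuration through the TAE, so the actual
# fit runs inside pynisher and is aborted if it exceeds the given limits.
# Argument names below are illustrative, not the confirmed signature.
pipeline, run_info, run_value, dataset = api.fit_pipeline(
    X_train=X_train, y_train=y_train,
    X_test=X_test, y_test=y_test,
    run_time_limit_secs=60,   # wall-clock limit enforced via pynisher
    memory_limit=4096,        # in MB, also enforced via pynisher
)
```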

ravinkohli added the bug (Something isn't working) label on Apr 15, 2021
include_components: Optional[Dict] = None,
exclude_components: Optional[Dict] = None,
search_space_updates: Optional[HyperparameterSearchSpaceUpdates] = None
) -> TabularClassificationPipeline:
return TabularClassificationPipeline(dataset_properties=dataset_properties)
ArlindKadra Apr 19, 2021

Should the include_components, exclude_components and search space updates also be present in the return at line 116? Those are for tabular regression. Otherwise, they do not seem to be used.
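
For illustration, the fix could forward the arguments along these lines (just a sketch; whether the pipeline constructor takes them as include / exclude / search_space_updates is an assumption):

```python
# Sketch of the suggested fix: forward the arguments instead of dropping them.
# The constructor parameter names are assumptions and may differ in
# TabularClassificationPipeline.
return TabularClassificationPipeline(
    dataset_properties=dataset_properties,
    include=include_components,
    exclude=exclude_components,
    search_space_updates=search_space_updates,
)
```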

ravinkohli (Contributor, Author)

Oops, I seem to have missed this. Thanks for pointing it out.

@ArlindKadra

@ravinkohli As an additional note, please add this to the regularization_cocktails branch first. We need it there ASAP.

@@ -144,6 +153,26 @@ def __init__(
search_space_updates: Optional[HyperparameterSearchSpaceUpdates] = None,
task_type: Optional[str] = None
) -> None:
"""

Args:
Contributor

I think this docstring is WIP :)?

ravinkohli (Contributor, Author)

Oh, it's not important. We already have the documentation for BaseTask. I'll remove this.

Contributor

Looking more into it, please also add task_type to the docs for base_task.

resampling_strategy: Union[CrossValTypes, HoldoutValTypes] = HoldoutValTypes.holdout_validation,
resampling_strategy_args: Optional[Dict[str, Any]] = None,
dataset_name: Optional[str] = None,
return_only: Optional[bool] = False
Contributor

Instead of having a return_only, maybe create it once?

So with return_only, every time you call this method you will have to create and save a new dataset. Do you think it would be better if:

  1. The first thing this function does is check whether load_datamanager from the backend is able to load a dataset and, if so, loads it rather than creating a new one.
  2. Else, a dataset is created.

What do you think?

ravinkohli (Contributor, Author) Apr 19, 2021

The reason I had it create and always save a dataset is that we may want to fit a pipeline on a different dataset. For example, once we have made a search, we can try to fit the best incumbent configuration on a larger dataset (I see this as a very common use case, where someone searches on a subsampled version of the full dataset and then fits the best found configuration on the whole dataset) or on a different, related dataset. return_only allows us to use the same function to create a dataset whenever we get new data, i.e. in search, fit and refit. As we also return the dataset from the fit_pipeline method, the same dataset can be used to predict and score with the pipeline.

Do you still think it is better to load a datamanager from the backend and use it? If so, do you know of a way to achieve both?

resampling_strategy_args = resampling_strategy_args if resampling_strategy_args is not None else \
self.resampling_strategy_args

dataset = self._create_dataset(X_train=X_train,
Contributor

If we enhance _create_dataset, we can call fit_pipeline multiple times without needing a new dataset each time, and the training split will not change. Also, the name should then probably be get_dataset.
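
Roughly, that load-or-create behaviour could look like this (a sketch with assumed names; _get_dataset and the exception handling around load_datamanager are illustrative, not the actual implementation):

```python
# Sketch of the load-or-create pattern discussed above (names are assumptions).
def _get_dataset(self, X_train, y_train, X_test=None, y_test=None, **dataset_kwargs):
    try:
        # Reuse the dataset (and therefore the same training split) that an
        # earlier call already saved through the backend.
        dataset = self._backend.load_datamanager()
        if dataset is not None:
            return dataset
    except FileNotFoundError:
        # Assumed failure mode when nothing has been saved yet.
        pass
    # Otherwise create a fresh dataset and persist it for subsequent calls.
    dataset = self._create_dataset(X_train=X_train, y_train=y_train,
                                   X_test=X_test, y_test=y_test, **dataset_kwargs)
    self._backend.save_datamanager(dataset)
    return dataset
```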


# get dataset properties
dataset_requirements = get_dataset_requirements(
info=self._get_required_dataset_properties(dataset))
dataset_properties = dataset.get_dataset_properties(dataset_requirements)
self._backend.save_datamanager(dataset)

self._backend._make_internals_directory()
Contributor

If it is troublesome to have this here, then let us move it to the constructor.

ravinkohli (Contributor, Author)

I think it is already in the backend constructor, so I'll remove it.

exclude_components=exclude_components,
search_space_updates=search_space_updates)
if configuration is None:
configuration = pipeline.get_hyperparameter_search_space().get_default_configuration()
Contributor

Do you think it makes sense to make this more flexible? Why the default and not a random configuration? Maybe we should make configuration a required argument? Shouldn't this be passed to build_pipeline?

ravinkohli (Contributor, Author)

So I think I'll remove this part and make configuration a required argument. This will then be passed to the TAE, which will take care of the rest.
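
For context, the limits the TAE applies come from pynisher; conceptually the evaluation of one configuration is wrapped roughly like this (a simplified sketch using the pynisher 0.x API, not the actual ExecuteTaFuncWithQueue code):

```python
import pynisher
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_train, y_train = make_classification(n_samples=200, n_features=10, random_state=1)

def fit_single_configuration(X, y):
    # Stand-in for the work the evaluator does for one configuration.
    model = LogisticRegression(max_iter=500)
    model.fit(X, y)
    return model

# The fit runs in a subprocess and is aborted if it exceeds the limits.
limited_fit = pynisher.enforce_limits(
    wall_time_in_s=60,   # time limit for fitting one configuration
    mem_in_mb=4096,      # memory limit for the subprocess
)(fit_single_configuration)

model = limited_fit(X_train, y_train)
if limited_fit.exit_status is pynisher.TimeoutException:
    print("fit hit the time limit")
```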

if search_space_updates is None:
search_space_updates = self.search_space_updates

pipeline = self.build_pipeline(dataset_properties=dataset_properties,
Contributor

I wonder if we need this build_pipeline at all. Can we rely only on the ExecuteTaFuncWithQueue? That way, configuration can be an int or a string, and then we get a traditional pipeline, for example.

ravinkohli (Contributor, Author)

That makes sense.



@pytest.mark.parametrize("disable_file_output", [True, False])
@pytest.mark.parametrize('openml_id', (40981,))
Contributor

Can we test another configuration :)? I think we always test Australian... Maybe add:

Why are very fast?

ravinkohli (Contributor, Author)

What do you mean by "why are very fast"? Yes, I'll put in one of these configurations.

data_id=int(openml_id),
return_X_y=True, as_frame=True
)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
Contributor

Let us make this function faster by using 20% for training and 80% for testing.
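
For example, something along these lines (a sketch of the suggested split; random_state is illustrative):

```python
import sklearn.datasets
import sklearn.model_selection

# Keep the test fast: train on 20% of the data and score on the remaining 80%.
X, y = sklearn.datasets.fetch_openml(data_id=40981, return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    X, y, train_size=0.2, random_state=42)
```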

ravinkohli (Contributor, Author)

Okay.

franchuterivera (Contributor) left a comment

Thank you very much, I believe this function is especially critical for debugging. I left a few comments to make it more useful.

ravinkohli changed the base branch from refactor_development to refactor_development_regularization_cocktails on April 19, 2021 09:43
ArlindKadra left a comment

Great, looks good. You can merge it as soon as the unit tests pass.

ArlindKadra merged commit 421005f into automl:refactor_development_regularization_cocktails on Apr 22, 2021