
Fit_pipeline honoring api constraints #173

Conversation

ravinkohli (Contributor)

This PR fixes #149.

The changes are as follows:

  1. Renames api.fit to api.fit_pipeline to avoid ambiguity (a rough usage sketch follows the list).
  2. The pipeline is now fit using the TAE, which honors the memory and time limit constraints via pynisher.
  3. Fixes a bug in build_pipeline where it was not taking the include, exclude and search space updates into account (see here).
  4. Adds tests ensuring we can always fit a pipeline and also use it to predict and score on the data.
  5. Aims to reduce the ambiguity regarding disable_file_output at various places in the API.
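
As a rough usage sketch of the renamed call (illustrative only; the argument names run_time_limit_secs / memory_limit and the returned tuple are assumptions and may not match the merged signature):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

from autoPyTorch.api.tabular_classification import TabularClassificationTask

X, y = make_classification(n_samples=200, n_features=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

api = TabularClassificationTask()

# fit_pipeline evaluates a single configuration through the TAE, so the actual
# fit runs inside pynisher and is aborted if it exceeds the given limits.
# Argument names below are illustrative, not the confirmed signature.
pipeline, run_info, run_value, dataset = api.fit_pipeline(
    X_train=X_train, y_train=y_train,
    X_test=X_test, y_test=y_test,
    run_time_limit_secs=60,   # wall-clock limit enforced via pynisher
    memory_limit=4096,        # in MB, also enforced via pynisher
)
```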

ravinkohli added the bug (Something isn't working) label on Apr 15, 2021
include_components: Optional[Dict] = None,
exclude_components: Optional[Dict] = None,
search_space_updates: Optional[HyperparameterSearchSpaceUpdates] = None
) -> TabularClassificationPipeline:
return TabularClassificationPipeline(dataset_properties=dataset_properties)
ArlindKadra Apr 19, 2021

Should the include_components, exclude_components and search space updates also be present in the return at line 116? Those are for tabular regression. Otherwise, they do not seem to be used.
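
For illustration, the fix could forward the arguments along these lines (just a sketch; whether the pipeline constructor takes them as include / exclude / search_space_updates is an assumption):

```python
# Sketch of the suggested fix: forward the arguments instead of dropping them.
# The constructor parameter names are assumptions and may differ in
# TabularClassificationPipeline.
return TabularClassificationPipeline(
    dataset_properties=dataset_properties,
    include=include_components,
    exclude=exclude_components,
    search_space_updates=search_space_updates,
)
```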

ravinkohli (Contributor, Author)

Oops, I seem to have missed this. Thanks for pointing it out.

@ArlindKadra

@ravinkohli As an additional note, please add this to the regularization_cocktails branch first. We need it there ASAP.

@@ -144,6 +153,26 @@ def __init__(
search_space_updates: Optional[HyperparameterSearchSpaceUpdates] = None,
task_type: Optional[str] = None
) -> None:
"""

Args:
Contributor

I think this docstring is WIP :)?

ravinkohli (Contributor, Author)

Oh, it's not important. We already have the documentation for BaseTask. I'll remove this.

Contributor

Looking more into it, please also add task_type to the docs for base_task.

resampling_strategy: Union[CrossValTypes, HoldoutValTypes] = HoldoutValTypes.holdout_validation,
resampling_strategy_args: Optional[Dict[str, Any]] = None,
dataset_name: Optional[str] = None,
return_only: Optional[bool] = False
Contributor

Instead of having a return_only, maybe create it once?

So with return_only, every time you call this method you will have to create and save a new dataset. Do you think it would be better if:

  1. The first thing this function does is check whether load_datamanager from the backend is able to load a dataset and, if so, loads it rather than creating a new one.
  2. Else, a dataset is created.

What do you think?

ravinkohli (Contributor, Author) Apr 19, 2021

The reason I had it create and always save a dataset is that we may want to fit a pipeline on a different dataset. For example, once we have made a search, we can try to fit the best incumbent configuration on a larger dataset (I see this as a very common use case, where someone searches on a subsampled version of the full dataset and then fits the best found configuration on the whole dataset) or on a different, related dataset. return_only allows us to use the same function to create a dataset whenever we get new data, i.e. in search, fit and refit. As we also return the dataset from the fit_pipeline method, the same dataset can be used to predict and score with the pipeline.

Do you still think it is better to load a datamanager from the backend and use it? If so, do you know of a way to achieve both?

resampling_strategy_args = resampling_strategy_args if resampling_strategy_args is not None else \
self.resampling_strategy_args

dataset = self._create_dataset(X_train=X_train,
Contributor

If we enhance _create_dataset, we can call fit_pipeline multiple times without needing a new dataset each time, and the training split will not change. Also, the name should then probably be get_dataset.
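
Roughly, that load-or-create behaviour could look like this (a sketch with assumed names; _get_dataset and the exception handling around load_datamanager are illustrative, not the actual implementation):

```python
# Sketch of the load-or-create pattern discussed above (names are assumptions).
def _get_dataset(self, X_train, y_train, X_test=None, y_test=None, **dataset_kwargs):
    try:
        # Reuse the dataset (and therefore the same training split) that an
        # earlier call already saved through the backend.
        dataset = self._backend.load_datamanager()
        if dataset is not None:
            return dataset
    except FileNotFoundError:
        # Assumed failure mode when nothing has been saved yet.
        pass
    # Otherwise create a fresh dataset and persist it for subsequent calls.
    dataset = self._create_dataset(X_train=X_train, y_train=y_train,
                                   X_test=X_test, y_test=y_test, **dataset_kwargs)
    self._backend.save_datamanager(dataset)
    return dataset
```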


# get dataset properties
dataset_requirements = get_dataset_requirements(
info=self._get_required_dataset_properties(dataset))
dataset_properties = dataset.get_dataset_properties(dataset_requirements)
self._backend.save_datamanager(dataset)

self._backend._make_internals_directory()
Contributor

If it is troublesome to have this here, then let us move it to the constructor.

ravinkohli (Contributor, Author)

I think it is already in the backend constructor, so I'll remove it.

exclude_components=exclude_components,
search_space_updates=search_space_updates)
if configuration is None:
configuration = pipeline.get_hyperparameter_search_space().get_default_configuration()
Contributor

Do you think it makes sense to make this more flexible? Why the default and not a random configuration? Maybe we should make configuration a required argument? Shouldn't this be passed to build_pipeline?

ravinkohli (Contributor, Author)

So I think I'll remove this part and make configuration a required argument. This will then be passed to the TAE, which will take care of the rest.
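
For context, the limits the TAE applies come from pynisher; conceptually the evaluation of one configuration is wrapped roughly like this (a simplified sketch using the pynisher 0.x API, not the actual ExecuteTaFuncWithQueue code):

```python
import pynisher
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_train, y_train = make_classification(n_samples=200, n_features=10, random_state=1)

def fit_single_configuration(X, y):
    # Stand-in for the work the evaluator does for one configuration.
    model = LogisticRegression(max_iter=500)
    model.fit(X, y)
    return model

# The fit runs in a subprocess and is aborted if it exceeds the limits.
limited_fit = pynisher.enforce_limits(
    wall_time_in_s=60,   # time limit for fitting one configuration
    mem_in_mb=4096,      # memory limit for the subprocess
)(fit_single_configuration)

model = limited_fit(X_train, y_train)
if limited_fit.exit_status is pynisher.TimeoutException:
    print("fit hit the time limit")
```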

if search_space_updates is None:
search_space_updates = self.search_space_updates

pipeline = self.build_pipeline(dataset_properties=dataset_properties,
Contributor

I wonder if we need this build_pipeline at all. Can we rely only on the ExecuteTaFuncWithQueue? That way, configuration can be an int or a string, and then we get a traditional pipeline, for example.

ravinkohli (Contributor, Author)

That makes sense.



@pytest.mark.parametrize("disable_file_output", [True, False])
@pytest.mark.parametrize('openml_id', (40981,))
Contributor

Can we test another configuration :)? I think we always test Australian... Maybe add:

Why are very fast?

ravinkohli (Contributor, Author)

What do you mean by "why are very fast"? Yes, I'll put in one of these configurations.

data_id=int(openml_id),
return_X_y=True, as_frame=True
)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
Contributor

Let us make this function faster by using 20% for training and 80% for testing.
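
For example, something along these lines (a sketch of the suggested split; random_state is illustrative):

```python
import sklearn.datasets
import sklearn.model_selection

# Keep the test fast: train on 20% of the data and score on the remaining 80%.
X, y = sklearn.datasets.fetch_openml(data_id=40981, return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    X, y, train_size=0.2, random_state=42)
```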

ravinkohli (Contributor, Author)

Okay.

franchuterivera (Contributor) left a comment

Thank you very much, I believe this function is especially critical for debugging. I left a few comments to make it more useful.

ravinkohli changed the base branch from refactor_development to refactor_development_regularization_cocktails on April 19, 2021 09:43
ArlindKadra left a comment

Great, looks good. You can merge it as soon as the unit tests pass.

ArlindKadra merged commit 421005f into automl:refactor_development_regularization_cocktails on Apr 22, 2021