Simplify InputValidator: Allows pandas frame to directly reach the pipeline #1135

franchuterivera · 2021-04-23T22:58:05Z

Moves the encoder that translates pandas dataframes to numpy into the pipeline
Enhances Auto-Sklearn to work with pandas internally, rather than numpy
Feature type list is internally translated to a dictionary of column->data type to be robust against different pandas column ordering
Adds extra checks to make sure a pandas frame can produce a pipeline (in other words, this new set of checks makes sure that (a) a pandas frame reaches the base pipeline without being translated to numpy, and (b) we can fit the pipeline with a pandas frame)
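
The feature-type point above can be illustrated with a minimal sketch. This is not the code from this PR: the helper names `feat_type_to_dict` and `feat_type_for` are hypothetical, and plain lists of column names stand in for `DataFrame.columns`. It only shows why a name-keyed dictionary survives a change in column order where a positional list would not.

```python
from typing import Dict, List, Sequence


def feat_type_to_dict(columns: Sequence[str], feat_type: Sequence[str]) -> Dict[str, str]:
    """Pair each column name with its feature type (hypothetical helper)."""
    if len(columns) != len(feat_type):
        raise ValueError("feat_type must have one entry per column")
    return dict(zip(columns, feat_type))


def feat_type_for(columns: Sequence[str], mapping: Dict[str, str]) -> List[str]:
    """Look the types up by name, so the result follows the given column order."""
    return [mapping[name] for name in columns]


# Column order as seen during fit (toy names, not taken from the PR)
train_columns = ["age", "city", "income"]
mapping = feat_type_to_dict(train_columns, ["numerical", "categorical", "numerical"])

# The same columns in a different order, e.g. a frame passed at predict time
test_columns = ["city", "income", "age"]
reordered = feat_type_for(test_columns, mapping)
print(reordered)
```

A positional list would have labelled `city` as numerical in the reordered frame; the name-based lookup keeps each column attached to its own type.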

codecov · 2021-04-23T23:34:16Z

Codecov Report

Merging #1135 (3a31f01) into development (0982410) will increase coverage by 0.13%.
The diff coverage is 97.60%.

@@               Coverage Diff               @@
##           development    #1135      +/-   ##
===============================================
+ Coverage        85.83%   85.96%   +0.13%     
===============================================
  Files              137      138       +1     
  Lines            10625    10703      +78     
===============================================
+ Hits              9120     9201      +81     
+ Misses            1505     1502       -3

Impacted Files	Coverage Δ
autosklearn/data/validation.py	`97.14% <ø> (ø)`
autosklearn/estimators.py	`93.47% <ø> (ø)`
autosklearn/data/xy_data_manager.py	`84.84% <83.33%> (+1.51%)`	⬆️
...onents/data_preprocessing/rescaling/standardize.py	`95.23% <88.88%> (-4.77%)`	⬇️
...ents/data_preprocessing/rescaling/robust_scaler.py	`96.87% <90.90%> (-3.13%)`	⬇️
...data_preprocessing/rescaling/abstract_rescaling.py	`91.30% <92.30%> (-1.01%)`	⬇️
...omponents/data_preprocessing/data_preprocessing.py	`90.09% <94.28%> (-0.23%)`	⬇️
autosklearn/evaluation/train_evaluator.py	`73.58% <94.73%> (+0.11%)`	⬆️
...ata_preprocessing/categorical_encoding/encoding.py	`96.42% <96.42%> (ø)`
autosklearn/data/feature_validator.py	`97.50% <98.38%> (+1.14%)`	⬆️
... and 37 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 0982410...3a31f01.

mfeurer

Sorry, just a few high-level comments so far. I'll do my best to give more comments in a timely manner.

autosklearn/data/abstract_data_manager.py

autosklearn/pipeline/components/data_preprocessing/imputation/categorical_imputation.py

autosklearn/metalearning/metafeatures/metafeatures.py

autosklearn/data/feature_validator.py

mfeurer

I can't add anything to the comment about the OrdinalEncoder, but according to the docs it can handle NaN: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html

autosklearn/pipeline/components/data_preprocessing/rescaling/none.py

autosklearn/pipeline/components/data_preprocessing/rescaling/abstract_rescaling.py

autosklearn/pipeline/components/data_preprocessing/rescaling/__init__.py

autosklearn/pipeline/components/data_preprocessing/categorical_encoding/one_hot_encoding.py

autosklearn/pipeline/components/data_preprocessing/rescaling/robust_scaler.py

autosklearn/pipeline/components/data_preprocessing/rescaling/standardize.py

autosklearn/pipeline/components/data_preprocessing/variance_threshold/variance_threshold.py

mfeurer

Alright, just checked everything that's not a test.

autosklearn/data/xy_data_manager.py

autosklearn/smbo.py

autosklearn/pipeline/components/data_preprocessing/data_preprocessing.py

autosklearn/evaluation/train_evaluator.py

mfeurer

And the last part.

Could you please also check whether we need to unit test the meta-feature calculation (i.e. add a new set of checks for pandas in addition to the current tests which check for sparse and ndarray)?

test/test_automl/test_estimators.py

mfeurer

Finished the review of the changes. I'll now think a bit more about whether we can include more tests for obscure datasets.

autosklearn/metalearning/metafeatures/metafeatures.py

autosklearn/pipeline/base.py

autosklearn/pipeline/components/data_preprocessing/categorical_encoding/no_encoding.py

test/test_metalearning/pyMetaLearn/test_meta_features.py

autosklearn/metalearning/metafeatures/metafeatures.py

test/test_automl/test_estimators.py

test/test_metalearning/pyMetaLearn/test_meta_features.py

autosklearn/data/xy_data_manager.py

mfeurer

Hey, I just took the liberty to debug and change the metafeature calculation, I hope that's okay.

I do have two minor questions left :)

autosklearn/data/feature_validator.py

test/test_data/test_feature_validator.py

franchuterivera · 2021-05-28T09:34:26Z

Thanks a lot for the help, that makes a lot of sense. Since yesterday I had been wondering why KNN would care about the order of the columns, and it was just a dumb error. Sorry about that.

mfeurer · 2021-05-28T09:35:28Z

> Thanks a lot for the help, that makes a lot of sense. Since yesterday I had been wondering why KNN would care about the order of the columns, and it was just a dumb error. Sorry about that.

No worries, I was looking at DT, for which it also makes sense that it depends on the order; but then I realized that the numbers are crazy different, so I thought there must be something else to it.
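
The intuition that KNN should not care about column order can be checked with a small sketch. It assumes plain Euclidean distance and uses toy numbers; it is not the landmarking code touched in this PR. Permuting the columns of the query and of every candidate point in the same way leaves all distances, and therefore the nearest neighbours, unchanged.

```python
import math
from typing import List, Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two rows of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def permute(row: Sequence[float], order: Sequence[int]) -> List[float]:
    """Reorder the columns of a single row."""
    return [row[i] for i in order]


query = [1.0, 5.0, 2.0]
points = [[0.0, 4.0, 2.0], [3.0, 5.0, 7.0], [1.0, 1.0, 1.0]]
order = [2, 0, 1]  # an arbitrary column permutation

original = [euclidean(query, p) for p in points]
shuffled = [euclidean(permute(query, order), permute(p, order)) for p in points]

# The distances agree pairwise, so the neighbour ranking is identical
print(all(math.isclose(a, b) for a, b in zip(original, shuffled)))
```

If a distance-based landmarker reports very different scores after a column reorder, that points to a bug elsewhere (such as columns and feature types getting out of sync) rather than to the model itself.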

franchuterivera added 2 commits April 23, 2021 18:03

[ADD] Move encoder to pipeline

153831e

[Fix] Unit Tests

7163228

mfeurer reviewed May 1, 2021

franchuterivera added 2 commits May 3, 2021 21:34

[ADD] mypy support for preprocessing

28199b0

[FIX] unit test

09f5e7d

mfeurer reviewed May 4, 2021

test/test_automl/test_estimators.py (outdated)

test/test_automl/test_estimators.py

mfeurer mentioned this pull request May 4, 2021

Predict fails with category error #1141

Closed

franchuterivera added 3 commits May 7, 2021 16:16

Feedback from PR

66a5419

[Fix] unit test

35e6ae5

[FIx] pre-commit

93e963c

franchuterivera marked this pull request as ready for review May 7, 2021 15:47

franchuterivera requested a review from mfeurer May 7, 2021 15:47

mfeurer reviewed May 21, 2021

franchuterivera added 4 commits May 26, 2021 22:45

Add better unit testing for anneal

9ef9117

Fix unit testing

306fd27

Fix metalearning script

cab18e5

Fix feat check

0fd0267

mfeurer reviewed May 27, 2021

franchuterivera added 3 commits May 27, 2021 14:30

Feedback from pr

83af0ef

feat_type in testing

8b188b0

sparse dataframe

ab7fc5e

franchuterivera requested a review from mfeurer May 27, 2021 17:24

fix pandas landmarking meta-features

1530ccc

mfeurer reviewed May 28, 2021

autosklearn/data/feature_validator.py (outdated)

test/test_data/test_feature_validator.py

Feedback from comments

c1c953f

franchuterivera added 2 commits May 28, 2021 14:47

np.nan columns to category

9bfe360

[Fix] Mypy

3a31f01

mfeurer merged commit 4a482de into automl:development Jun 25, 2021

eddiebergman mentioned this pull request Jul 19, 2021

Enhancement: Make the Ordinal Encoder a encoder choice #1150

Open

simonprovost mentioned this pull request Aug 3, 2021

Categories should be non-negative numbers ERROR #1203

Closed
