Dev sklearn preprocess #157

kstrzala · 2018-07-25T11:54:17Z

Major issues

- Cleaning in sklearn models: moved normalization, fillna, and one-hot encoding into the sklearn_preprocess block
- Added a stacking_normalization block for log_reg stacking

Minor issues

- Parameter update

* Update README.md * Update README.md * Update README.md * Update README.md * Update README.md * Features - family - test * Features - family - aggregate * Features - family - aggregate 2 * Features - family - aggregate 3 * Features - family - aggregate 4 * Update pipeline_config.py

* age/employment dummies (minerva-ml#104) * added diff features * New handcrafted features (minerva-ml#102) * Dynamic features * Smart features (minerva-ml#61) * Update README.md * Update README.md * Update * Smart features update * More descriptive transformer name * Reading all data in main * More application features * Transformer for cleaning * Multiinput data dictionary * Fix (minerva-ml#63) * fixed configs * dropped redundand steps, moved stuff to cleaning, refactored groupby (minerva-ml#64) * dropped redundand steps, moved stuff to cleanining, refactored groupby * restructured, added stacking + CV * Fix format string * Update pipeline_manager.py clipped prediction -> prediction * added stratified kfold option (minerva-ml#77) * Update config (minerva-ml#79) * dropped redundand steps, moved stuff to cleanining, refactored groupby * restructured, added stacking + CV * Update pipeline_config.py * Dev review (minerva-ml#81) * dropped feature by type split, refactored pipleine_config * dropped feature by type split method * explored application features * trash * reverted refactor of aggs * fixed/updated bureau features * cleared notebooks * agg features added to notebook bureau * credit card cleaned * added other feature notebooks * added rank mean * updated model arch * reverted to old params * fixed rank mean calculations * ApplicationCleaning update (minerva-ml#84) * Cleaning - application * Clear output in notebook * clenaed names in steps, refactored mergeaggregate transformer, changed caching/saving specs (minerva-ml#85) * local trash * External sources notebook (minerva-ml#86) * Update * External sources notebook * Dev lgbm params (minerva-ml#88) * local trash * updated configs * dropped comment * updated lgb params * Dev app agg fix (minerva-ml#90) * dropped app_aggs * app agg features fixed * cleaned leftovers * dropped fast read-in for debug * External_sources statistics (minerva-ml#89) * Speed-up ext_src notebook * exernal_sources statistics * Weighted 
mean and notebook fix * application notebook update * clear notebook output * Fix auto submission (minerva-ml#95) * CreditCardBalance monthly diff mean * POSCASH remaining installments * POSCASH completed_contracts * notebook update * Resolve conflicts * Fix * Update neptune.yaml * Update neptune_random_search.yaml * Split static and dynamic features - credit card balance * Dev nan count (minerva-ml#105) * added nan_count * added nan count with parameter * Dev fe installments (minerva-ml#106) * added simple features, parallel groupby, last-installment features * refactored last_installment features * added features for the very last installment * Dev fe instalments dynamic (minerva-ml#107) * added dynamic-trend features * formated configs * added skew/iqr features * added number of credit agreement change features (minerva-ml#109) * added number of credit agreement change features * reverted sample size * Dynamic features - previous application (minerva-ml#108) * previous_application handcrafted features * previous application cleaning * Update neptune.yaml * code improvement * Update notebook * Notebook - feature importance (minerva-ml#112) * Dev speed up (minerva-ml#111) * refactored aggs to calculate only once per training, sped up installment and credit card (only single index groupby) * sped up all hand crafted * fixed bureau worker errors * fixed isntallment names * fixed isntallment names * fixed bureau and prev_app naming bugs * reverted to vectorized where possible * updated hyperparams * updated early stopping params to meet convergence * reverted to old fallback neptune file * updated paths * updated paths, explored prev-app features * dropped duplicated agg * POS_CASH added features * POS CASH features added * POS_CASH_balance feature cleaning * Yaml adjustment * Path change

jakubczakon · 2018-07-26T08:36:36Z

src/pipeline_blocks.py

                       **kwargs):
-    config, model_params, rs_config = full_config
+    model_name = '{}{}'.format(clf_name, suffix)
+    model_params = getattr(config, clf_name)


@kstrzala Why not simply go with config[clf_name] and config.random_search[clf_name]?

kstrzala · 2018-07-26

Because then you lose the AttrDict behaviour (sad but true), and the code to fix that would probably be similarly complex.
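
To illustrate the trade-off discussed in this thread, here is a minimal sketch. MiniAttrDict is a hypothetical stand-in written for this example, not the project's AttrDict; it assumes, as the reply suggests, that nested mappings keep dot access only when reached through attribute lookup. The config keys and values are made up.

```python
class MiniAttrDict(dict):
    """Hypothetical stand-in for an AttrDict-style config (not the project's class).

    Attribute access wraps nested dicts so that chained dots keep working;
    item access returns the stored value untouched.
    """

    def __getattr__(self, name):
        try:
            value = self[name]
        except KeyError:
            raise AttributeError(name)
        # Wrap nested mappings only on attribute access.
        return MiniAttrDict(value) if isinstance(value, dict) else value


# Made-up configuration values, for illustration only.
config = MiniAttrDict({
    'log_reg': {'C': 1.0},
    'random_search': {'log_reg': {'n_runs': 0}},
})
clf_name = 'log_reg'

# Attribute-style lookup by a runtime name keeps dot access on the result.
model_params = getattr(config, clf_name)
random_search_params = getattr(config.random_search, clf_name)
print(model_params.C, random_search_params.n_runs)

# Item-style lookup hands back a plain dict, so dot access is lost.
plain_params = config[clf_name]
print(type(plain_params).__name__, hasattr(plain_params, 'C'))
```

Under that assumption, getattr(config, clf_name) is the shortest way to look up a section by a runtime name while keeping later calls such as model_params.C working.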

jakubczakon · 2018-07-26T08:37:36Z

src/pipeline_blocks.py

-                           PersistResults(**rs_config.callbacks.persist_results)]
-            )
+        features_train, features_valid = features
+        if getattr(config.random_search, clf_name).n_runs:


@kstrzala Since we have already fetched random_search_params, let's just use random_search_params.n_runs.

jakubczakon · 2018-07-26T08:51:43Z

src/pipeline_blocks.py

+                                        experiment_directory=config.pipeline.experiment_directory,
+                                        **kwargs
+                                        )
+        return sklearn_preprocess, sklearn_preprocess_valid


@kstrzala I believe it is cleaner to have an if/else here rather than just an if.

jakubczakon · 2018-07-26T08:54:34Z

src/pipeline_blocks.py

@@ -470,6 +532,51 @@ def stacking_features(config, train_mode, suffix, **kwargs):
        return feature_combiner


+def stacking_normalization(features, config, train_mode, suffix, **kwargs):


@kstrzala Since we are working on predictions at this point, I am wondering which normalization strategy (if any) one should choose. I guess that for our problem a normalized rank could be a better choice than normalizing the raw predictions. What do you think?
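
The two options raised in this comment can be compared with a small sketch. The function names normalized_rank and min_max and the sample predictions are assumptions made for illustration; neither function is taken from the PR. The rank transform keeps only the ordering of the predictions, which may suit a ranking-based metric such as ROC AUC, while min-max scaling keeps the relative distances and is therefore sensitive to outliers.

```python
def normalized_rank(predictions):
    """Map each prediction to its average rank, scaled into (0, 1]."""
    n = len(predictions)
    order = sorted(range(n), key=lambda i: predictions[i])
    ranks = [0.0] * n
    start = 0
    while start < n:
        end = start
        # Extend the group while the next value ties with the current one.
        while end + 1 < n and predictions[order[end + 1]] == predictions[order[start]]:
            end += 1
        # 1-based average rank shared by every member of the tie group.
        average_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = average_rank
        start = end + 1
    return [rank / n for rank in ranks]


def min_max(predictions):
    """Scale predictions linearly into [0, 1]; constant input maps to 0.5."""
    lo, hi = min(predictions), max(predictions)
    if hi == lo:
        return [0.5] * len(predictions)
    return [(p - lo) / (hi - lo) for p in predictions]


# Made-up first-level predictions with one outlier.
predictions = [0.01, 0.02, 0.03, 0.95]
print(normalized_rank(predictions))  # evenly spaced, ordering preserved
print(min_max(predictions))          # first three values squeezed near 0
```

With an outlier present, min-max scaling compresses the remaining predictions into a narrow band, whereas the rank transform spreads them evenly, which is one reason a second-level model might behave differently on the two inputs.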

jakubczakon · 2018-07-26T08:56:50Z

src/pipeline_manager.py

@@ -410,6 +410,7 @@ def _read_data(dev_mode, read_train=True, read_test=False):
                                                                  on='SK_ID_BUREAU', how='right')
    if dev_mode:


@kstrzala I don't understand what this does, since we have already loaded only nrows rows from bureau via .read_csv(...nrows).

jakubczakon · 2018-07-26T08:58:18Z

src/pipelines.py

@@ -99,9 +106,6 @@ def xgboost(config, train_mode, suffix=''):


 def sklearn_main(config, ClassifierClass, clf_name, train_mode, suffix='', normalize=False):


@kstrzala Let's just call it sklearn_pipeline, sklearn_classifier, or something along those lines. sklearn_main could mean a lot of things.

jakubczakon and others added 30 commits July 4, 2018 09:45

age/employment dummies (minerva-ml#104)

44812ab

added diff features

ea3cff4

Dev nan count (minerva-ml#105)

f06d19a

* added nan_count * added nan count with parameter

Dev fe installments (minerva-ml#106)

3aecd9d

* added simple features, parallel groupby, last-installment features * refactored last_installment features * added features for the very last installment

Dev fe instalments dynamic (minerva-ml#107)

5b738d4

* added dynamic-trend features * formated configs * added skew/iqr features

added number of credit agreement change features (minerva-ml#109)

f3cd0b6

* added number of credit agreement change features * reverted sample size

Dynamic features - previous application (minerva-ml#108)

7cd2071

* previous_application handcrafted features * previous application cleaning * Update neptune.yaml * code improvement * Update notebook

Notebook - feature importance (minerva-ml#112)

5ec73e2

dropped duplicated agg

04f047d

POS_CASH added features

8d12c3d

Merge remote-tracking branch 'upstream/dev' into dev

4d07dae

added second level models (minerva-ml#126)

3872a94

POS CASH features added

2b18bf8

Merge remote-tracking branch 'upstream/dev' into dev

53c28fe

POS_CASH_balance feature cleaning

28cf42b

Yaml adjustment

45affca

Data cleaning and two new features (previous application) (minerva-ml…

7bd7b6a

…#129) * new previous application features * Data cleaning * update application notebook * credit card cleaning * Data cleaning - groupby agg * Include suggested changes

Data cleaning - fix (minerva-ml#130)

38bbabd

* new previous application features * Data cleaning * update application notebook * credit card cleaning * Data cleaning - groupby agg * Include suggested changes * Fix

Dev fractions (minerva-ml#132)

68ca3be

* added fraction features to eda and feature extraction, updated configs * updated hyperparams

Path change

afb01b1

Merge remote-tracking branch 'upstream/dev' into dev

debe4c2

fix misinterpretations

e0c061e

'<' instead of '>'

fix misinterpretations

49eaad3

'<' instead of '>'

Add cleaning in application_groupby_agg (minerva-ml#137)

54c3d4a

* application agg cleaning * update neptune.yaml

Merge remote-tracking branch 'upstream/dev' into dev

0f036ad

New branch

3a9a316

Karol Strzałkowski added 14 commits July 23, 2018 11:17

Notebook dev

2ab1d56

Merge remote-tracking branch 'upstream/dev' into dev_sklearn_preprocess

e87e2bc

q

b9a996f

Sklearn models modified

0ad1458

Minor bug fix

9f5eda6

Whatever

6c2973e

Merge branch 'dev' into dev_sklearn_preprocess

b651fbb

Merge remote-tracking branch 'upstream/dev' into dev_sklearn_preprocess

cb2e701

Space refactor

67ca9aa

Old forgotten merge

4d2edc8

Final refactor

806462c

Minor update

e316310

last k features with fraction removal

771fbc3

Merge remote-tracking branch 'upstream/dev' into dev_sklearn_preprocess

bc2de9a

kstrzala requested review from Ninoko, jakubczakon and kamil-kaczmarek July 25, 2018 11:54

Ninoko approved these changes Jul 25, 2018

kstrzala requested a review from pknut July 25, 2018 13:37

jakubczakon reviewed Jul 26, 2018

Karol Strzałkowski added 2 commits July 26, 2018 11:28

Fix PR isuuses

70cd8ae

Merge remote-tracking branch 'upstream/dev' into dev_sklearn_preprocess

9e3ce31

jakubczakon approved these changes Jul 26, 2018

jakubczakon merged commit 64466d9 into minerva-ml:dev Jul 26, 2018

jakubczakon mentioned this pull request Jul 26, 2018

Refactor/adjust sklearn model to refactore pipeline #76

Closed
