This repository has been archived by the owner on Jun 22, 2022. It is now read-only.

Smart features #61

Merged 9 commits on Jun 25, 2018

Conversation

@pknut pknut commented Jun 22, 2018

No description provided.

Kamil A. Kaczmarek and others added 3 commits June 20, 2018 15:27
        super().__init__()

    @property
    def application_names(self):
Contributor

@pknut If there is no calculation in the method, why not just have

self.application_names = ['A', 'B']
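
A minimal sketch of the two options in this comment, assuming the property only returns a constant list. The class names `WithProperty` and `WithAttribute` and the list values are placeholders for illustration, not the PR's actual code.

```python
class WithProperty:
    """Variant from the diff: a read-only property that returns a constant."""

    @property
    def application_names(self):
        return ['A', 'B']


class WithAttribute:
    """Suggested variant: a plain attribute assigned once in __init__."""

    def __init__(self):
        self.application_names = ['A', 'B']


with_property = WithProperty()
with_attribute = WithAttribute()

# Both variants expose the same value to callers.
same_value = with_property.application_names == with_attribute.application_names

# The property builds a new list on every access; the attribute keeps one list.
property_is_fresh = with_property.application_names is not with_property.application_names
attribute_is_stable = with_attribute.application_names is with_attribute.application_names
```

One difference worth weighing: a property without a setter cannot be reassigned from outside, while the plain attribute can.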

@@ -140,3 +140,215 @@ def transform(self, X):
                     how='left')

        return {'numerical_features': X[self.groupby_aggregations_names].astype(np.float32)}


class Application(BaseTransformer):
Contributor

@pknut this name is not very descriptive. What is the operation (logic) that this transformer performs?

'PAYMENT_RATE']

    def transform(self, X, y=None):
        X['DAYS_EMPLOYED'].replace(365243, np.nan, inplace=True)
Contributor

@pknut I think the clean-up where we deal with missing values, outliers, etc. should be done in a separate step (or steps).
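
A rough sketch of what such a split could look like, assuming pandas and NumPy. `ApplicationCleaning`, `ApplicationFeatures` and the `employed_to_birth_ratio` feature are hypothetical names used only for illustration; the placeholder value 365243 and the dict-style return value are taken from the diff.

```python
import numpy as np
import pandas as pd


class ApplicationCleaning:
    """Hypothetical cleaning step: only repairs values, creates no features."""

    # 365243 is the placeholder the diff replaces in DAYS_EMPLOYED.
    DAYS_EMPLOYED_PLACEHOLDER = 365243

    def transform(self, X):
        X = X.copy()  # leave the caller's frame untouched
        X['DAYS_EMPLOYED'] = X['DAYS_EMPLOYED'].replace(
            self.DAYS_EMPLOYED_PLACEHOLDER, np.nan)
        return {'X': X}


class ApplicationFeatures:
    """Hypothetical feature step: assumes its input is already cleaned."""

    def transform(self, X):
        features = pd.DataFrame(index=X.index)
        features['employed_to_birth_ratio'] = X['DAYS_EMPLOYED'] / X['DAYS_BIRTH']
        return {'numerical_features': features}


raw = pd.DataFrame({'DAYS_EMPLOYED': [-1200, 365243],
                    'DAYS_BIRTH': [-12000, -20000]})
cleaned = ApplicationCleaning().transform(raw)['X']
features = ApplicationFeatures().transform(cleaned)['numerical_features']
```

With the steps separated, the feature code never has to know about the placeholder value, and the cleaning rule can be tested on its own.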


    def fit(self, X):
        bureau = pd.read_csv(self.filepath)
        bureau['AMT_CREDIT_SUM'].fillna(0, inplace=True)
Contributor

@pknut I'd rather put NA handling in a separate transformer

        bureau['bureau_active_loans_percentage'] = bureau.groupby(
            by=['SK_ID_CURR'])['bureau_credit_active_binary'].agg('mean').reset_index()['bureau_credit_active_binary']

        # AVERAGE NUMBER OF DAYS BETWEEN SUCCESSIVE PAST APPLICATIONS FOR EACH CUSTOMER
Contributor

@pknut I prefer putting logic like this in a method with a descriptive name. That makes the comments obsolete and the code easier to read.
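
As an illustration of the suggestion, the upper-case comment from the diff could become a function name. This is only a sketch, assuming pandas and assuming that `DAYS_CREDIT` is the column that orders past applications; the function name is made up for this example.

```python
import pandas as pd


def mean_days_between_successive_applications(bureau):
    """Average gap, in days, between consecutive past applications per customer."""
    ordered = bureau.sort_values(['SK_ID_CURR', 'DAYS_CREDIT'])
    # Difference to the previous application of the same customer (NaN for the first).
    gaps = ordered.groupby('SK_ID_CURR')['DAYS_CREDIT'].diff()
    return gaps.groupby(ordered['SK_ID_CURR']).mean()


bureau = pd.DataFrame({'SK_ID_CURR': [1, 1, 1, 2],
                       'DAYS_CREDIT': [-300, -100, -250, -40]})
mean_gaps = mean_days_between_successive_applications(bureau)
```

A customer with a single application has no gap to average, so the result for that customer is NaN.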

]

    def fit(self, X):
        bureau = pd.read_csv(self.filepath)
Contributor

@pknut I don't like that we read data in the fit method. It goes very much against the single responsibility principle. I don't mind reading data in a separate transformer, but I would rather read it in main.py and pass the objects to the pipeline.
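
A small sketch of the preferred layout, assuming pandas. The `BureauFeatures` class below, its `fit`/`transform` signatures and the `bureau_credit_sum_mean` feature are hypothetical and only illustrate the idea; in-memory CSV text stands in for the real files so the example is self-contained.

```python
import io

import pandas as pd


class BureauFeatures:
    """Hypothetical transformer that receives a DataFrame instead of a file path."""

    def fit(self, X, bureau):
        # No I/O here: the step only derives per-customer statistics.
        self.credit_sum_mean_ = bureau.groupby('SK_ID_CURR')['AMT_CREDIT_SUM'].mean()
        return self

    def transform(self, X, bureau=None):
        feature = X['SK_ID_CURR'].map(self.credit_sum_mean_)
        return {'numerical_features': feature.to_frame('bureau_credit_sum_mean')}


def main(application_source, bureau_source):
    # All reading happens in one place; in a real run these would be file paths.
    application = pd.read_csv(application_source)
    bureau = pd.read_csv(bureau_source)
    step = BureauFeatures().fit(application, bureau)
    return step.transform(application)['numerical_features']


application_csv = io.StringIO("SK_ID_CURR\n1\n2\n")
bureau_csv = io.StringIO("SK_ID_CURR,AMT_CREDIT_SUM\n1,100\n1,300\n2,50\n")
features = main(application_csv, bureau_csv)
```

Because the transformer never touches the file system, it can be exercised in tests with small hand-built frames.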

]

    def fit(self, X):
        credit_card = pd.read_csv(self.filepath)
Contributor

@pknut same here

 def _bureau(config, train_mode, **kwargs):
     if train_mode:
         bureau = Step(name='bureau',
-                      transformer=fe.GroupbyAggregationFromFile(**config.bureau),
+                      transformer=fe.Bureau(**config.bureau),
Contributor

@pknut BureauAggregations or BureauFeatures is more descriptive

pipelines.py Outdated
@@ -76,6 +76,7 @@ def sklearn_main(config, ClassifierClass, clf_name, train_mode, normalize=False)
cache_output=True,
load_persisted_output=True)


Member

@pknut you can drop this line :)

@@ -140,40 +140,89 @@ def classifier_sklearn(sklearn_features, ClassifierClass, full_config, clf_name,
def feature_extraction(config, train_mode, **kwargs):
Member

@pknut @jakubczakon I think we need a feature_extraction refactor. Right now the user cannot try multiple models with fewer features. In solution-1 we had 122 features, in solution-2 we have 2.5k features, and here we are adding even more. IMHO it should be parametrizable which feature sets the user wants to use in training, for example picking only basic_features and bureau features.
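
One possible shape for such a refactor, sketched in plain Python. Everything here is hypothetical: the registry, the three toy feature-set functions and the `enabled_feature_sets` argument are illustrations of the idea, not the project's code.

```python
def basic_features(data):
    return {'basic_income': data['income']}


def bureau_features(data):
    return {'bureau_loan_count': len(data['bureau_loans'])}


def credit_card_features(data):
    return {'credit_card_balance_mean': sum(data['balances']) / len(data['balances'])}


# Registry of every available feature set, keyed by the name used in the config.
FEATURE_SETS = {
    'basic_features': basic_features,
    'bureau_features': bureau_features,
    'credit_card_features': credit_card_features,
}


def feature_extraction(data, enabled_feature_sets):
    """Build only the feature sets listed in the config, in the order given."""
    unknown = [name for name in enabled_feature_sets if name not in FEATURE_SETS]
    if unknown:
        raise ValueError('Unknown feature sets: {}'.format(unknown))
    features = {}
    for name in enabled_feature_sets:
        features.update(FEATURE_SETS[name](data))
    return features


data = {'income': 1000, 'bureau_loans': ['a', 'b'], 'balances': [10, 30]}
selected = feature_extraction(data, ['basic_features', 'bureau_features'])
```

The list of enabled names could then live in the experiment config, so trying a model on a smaller feature set needs no code change.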

@@ -22,6 +22,7 @@
TARGET_COLUMN = 'TARGET'

TIMESTAMP_COLUMNS = []
Member

@pknut @jakubczakon I think we should drop it, since we do not use it.

main.py Outdated
@@ -79,8 +79,20 @@ def _train(pipeline_name, dev_mode):
    if dev_mode:
        logger.info('running in "dev-mode". Sample size is: {}'.format(cfg.DEV_SAMPLE_SIZE))
Contributor

@pknut since there is a lot of repetition here, I would probably go with nrows = cfg.DEV_SAMPLE_SIZE if dev_mode else None and then just pass nrows=nrows.
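
The suggestion can be sketched as follows, assuming pandas. The constant `DEV_SAMPLE_SIZE` stands in for the config value, and `read_table` is a hypothetical helper; in-memory CSV text replaces the real files.

```python
import io

import pandas as pd

DEV_SAMPLE_SIZE = 2  # stands in for the value kept in the project config


def read_table(source, dev_mode):
    # nrows=None tells pandas to read the whole file.
    nrows = DEV_SAMPLE_SIZE if dev_mode else None
    return pd.read_csv(source, nrows=nrows)


CSV_TEXT = "SK_ID_CURR,TARGET\n1,0\n2,1\n3,0\n"
dev_frame = read_table(io.StringIO(CSV_TEXT), dev_mode=True)
full_frame = read_table(io.StringIO(CSV_TEXT), dev_mode=False)
```

The branch on dev_mode is decided once, and every read call shares the same `nrows` argument.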

 def _bureau(config, train_mode, **kwargs):
     if train_mode:
         bureau = Step(name='bureau',
-                      transformer=fe.GroupbyAggregationFromFile(**config.bureau),
+                      transformer=fe.BureauFeatures(**config.bureau),
                       input_data=['input'],
Contributor

@pknut I think it would be better to have input_data=['bureau', 'main'] and then, in the adapter, X: ('main', 'X') and bureau: ('bureau', 'X'). It shows the benefit of the multi-input data dictionary that is passed to the step in main.py.
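
To make the idea concrete, here is a plain-Python sketch of how an adapter could pick a step's arguments out of a multi-input data dictionary. The `resolve_adapter` helper and the data layout are assumptions made for illustration; this is not the pipeline library's actual implementation or API.

```python
def resolve_adapter(adapter, data):
    """Pick each argument of a step from a nested {input_name: {key: value}} dict."""
    return {argument: data[input_name][key]
            for argument, (input_name, key) in adapter.items()}


# Data dictionary as it might be assembled in main.py (values shortened to strings).
data = {
    'main': {'X': 'application frame', 'y': 'target'},
    'bureau': {'X': 'bureau frame'},
}

# Mapping from the transformer's argument names to (input name, key) pairs.
adapter = {'X': ('main', 'X'), 'bureau': ('bureau', 'X')}

step_inputs = resolve_adapter(adapter, data)
```

Each table enters the pipeline under its own name, and the adapter states explicitly which one feeds which argument of the transformer.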

@jakubczakon jakubczakon merged commit 5cbbe74 into minerva-ml:dev-solution-3 Jun 25, 2018
kamil-kaczmarek pushed a commit that referenced this pull request Jul 3, 2018
* Smart features (#61)

* Update README.md

* Update README.md

* Update

* Smart features update

* More descriptive transformer name

* Reading all data in main

* More application features

* Transformer for cleaning

* Multiinput data dictionary

* Fix (#63)

* fixed configs

* dropped redundand steps, moved stuff to cleaning, refactored groupby (#64)

* dropped redundand steps, moved stuff to cleanining, refactored groupby

* restructured, added stacking + CV

* Fix format string

* Update pipeline_manager.py

clipped prediction -> prediction

* added stratified kfold option (#77)

* Update config (#79)

* dropped redundand steps, moved stuff to cleanining, refactored groupby

* restructured, added stacking + CV

* Update pipeline_config.py

* Dev review (#81)

* dropped feature by type split, refactored pipleine_config

* dropped feature by type split method

* explored application features

* trash

* reverted refactor of aggs

* fixed/updated bureau features

* cleared notebooks

* agg features added to notebook bureau

* credit card cleaned

* added other feature notebooks

* added rank mean

* updated model arch

* reverted to old params

* fixed rank mean calculations

* ApplicationCleaning update (#84)

* Cleaning - application

* Clear output in notebook

* clenaed names in steps, refactored mergeaggregate transformer, changed caching/saving specs (#85)

* local trash

* External sources notebook (#86)

* Update

* External sources notebook

* Dev lgbm params (#88)

* local trash

* updated configs

* dropped comment

* updated lgb params

* Dev app agg fix (#90)

* dropped app_aggs

* app agg features fixed

* cleaned leftovers

* dropped fast read-in for debug

* External_sources statistics (#89)

* Speed-up ext_src notebook

* exernal_sources statistics

* Weighted mean and notebook fix

* application notebook update

* clear notebook output

* Fix auto submission (#95)

* updated best model name

* changed best model path

* corrections
jakubczakon added a commit that referenced this pull request Jul 3, 2018
* Smart features (#61)

* Update README.md

* Update README.md

* Update

* Smart features update

* More descriptive transformer name

* Reading all data in main

* More application features

* Transformer for cleaning

* Multiinput data dictionary

* Fix (#63)

* fixed configs

* dropped redundand steps, moved stuff to cleaning, refactored groupby (#64)

* dropped redundand steps, moved stuff to cleanining, refactored groupby

* restructured, added stacking + CV

* Fix format string

* Update pipeline_manager.py

clipped prediction -> prediction

* added stratified kfold option (#77)

* Update config (#79)

* dropped redundand steps, moved stuff to cleanining, refactored groupby

* restructured, added stacking + CV

* Update pipeline_config.py

* Dev review (#81)

* dropped feature by type split, refactored pipleine_config

* dropped feature by type split method

* explored application features

* trash

* reverted refactor of aggs

* fixed/updated bureau features

* cleared notebooks

* agg features added to notebook bureau

* credit card cleaned

* added other feature notebooks

* added rank mean

* updated model arch

* reverted to old params

* fixed rank mean calculations

* ApplicationCleaning update (#84)

* Cleaning - application

* Clear output in notebook

* clenaed names in steps, refactored mergeaggregate transformer, changed caching/saving specs (#85)

* local trash

* External sources notebook (#86)

* Update

* External sources notebook

* Dev lgbm params (#88)

* local trash

* updated configs

* dropped comment

* updated lgb params

* Dev app agg fix (#90)

* dropped app_aggs

* app agg features fixed

* cleaned leftovers

* dropped fast read-in for debug

* External_sources statistics (#89)

* Speed-up ext_src notebook

* exernal_sources statistics

* Weighted mean and notebook fix

* application notebook update

* clear notebook output

* Fix auto submission (#95)

* updated best model name

* changed best model path

* added groupby diff features

* dropped unreasonable agg diffs
pknut added a commit to pknut/open-solution-home-credit that referenced this pull request Jul 3, 2018
* Smart features (minerva-ml#61)

* Update README.md

* Update README.md

* Update

* Smart features update

* More descriptive transformer name

* Reading all data in main

* More application features

* Transformer for cleaning

* Multiinput data dictionary

* Fix (minerva-ml#63)

* fixed configs

* dropped redundand steps, moved stuff to cleaning, refactored groupby (minerva-ml#64)

* dropped redundand steps, moved stuff to cleanining, refactored groupby

* restructured, added stacking + CV

* Fix format string

* Update pipeline_manager.py

clipped prediction -> prediction

* added stratified kfold option (minerva-ml#77)

* Update config (minerva-ml#79)

* dropped redundand steps, moved stuff to cleanining, refactored groupby

* restructured, added stacking + CV

* Update pipeline_config.py

* Dev review (minerva-ml#81)

* dropped feature by type split, refactored pipleine_config

* dropped feature by type split method

* explored application features

* trash

* reverted refactor of aggs

* fixed/updated bureau features

* cleared notebooks

* agg features added to notebook bureau

* credit card cleaned

* added other feature notebooks

* added rank mean

* updated model arch

* reverted to old params

* fixed rank mean calculations

* ApplicationCleaning update (minerva-ml#84)

* Cleaning - application

* Clear output in notebook

* clenaed names in steps, refactored mergeaggregate transformer, changed caching/saving specs (minerva-ml#85)

* local trash

* External sources notebook (minerva-ml#86)

* Update

* External sources notebook

* Dev lgbm params (minerva-ml#88)

* local trash

* updated configs

* dropped comment

* updated lgb params

* Dev app agg fix (minerva-ml#90)

* dropped app_aggs

* app agg features fixed

* cleaned leftovers

* dropped fast read-in for debug

* External_sources statistics (minerva-ml#89)

* Speed-up ext_src notebook

* exernal_sources statistics

* Weighted mean and notebook fix

* application notebook update

* clear notebook output

* Fix auto submission (minerva-ml#95)

* CreditCardBalance monthly diff mean

* POSCASH remaining installments

* POSCASH completed_contracts

* notebook update

* Resolve conflicts

* Fix
jakubczakon pushed a commit that referenced this pull request Jul 4, 2018
* Dynamic features

* Smart features (#61)

* Update README.md

* Update README.md

* Update

* Smart features update

* More descriptive transformer name

* Reading all data in main

* More application features

* Transformer for cleaning

* Multiinput data dictionary

* Fix (#63)

* fixed configs

* dropped redundand steps, moved stuff to cleaning, refactored groupby (#64)

* dropped redundand steps, moved stuff to cleanining, refactored groupby

* restructured, added stacking + CV

* Fix format string

* Update pipeline_manager.py

clipped prediction -> prediction

* added stratified kfold option (#77)

* Update config (#79)

* dropped redundand steps, moved stuff to cleanining, refactored groupby

* restructured, added stacking + CV

* Update pipeline_config.py

* Dev review (#81)

* dropped feature by type split, refactored pipleine_config

* dropped feature by type split method

* explored application features

* trash

* reverted refactor of aggs

* fixed/updated bureau features

* cleared notebooks

* agg features added to notebook bureau

* credit card cleaned

* added other feature notebooks

* added rank mean

* updated model arch

* reverted to old params

* fixed rank mean calculations

* ApplicationCleaning update (#84)

* Cleaning - application

* Clear output in notebook

* clenaed names in steps, refactored mergeaggregate transformer, changed caching/saving specs (#85)

* local trash

* External sources notebook (#86)

* Update

* External sources notebook

* Dev lgbm params (#88)

* local trash

* updated configs

* dropped comment

* updated lgb params

* Dev app agg fix (#90)

* dropped app_aggs

* app agg features fixed

* cleaned leftovers

* dropped fast read-in for debug

* External_sources statistics (#89)

* Speed-up ext_src notebook

* exernal_sources statistics

* Weighted mean and notebook fix

* application notebook update

* clear notebook output

* Fix auto submission (#95)

* CreditCardBalance monthly diff mean

* POSCASH remaining installments

* POSCASH completed_contracts

* notebook update

* Resolve conflicts

* Fix

* Update neptune.yaml

* Update neptune_random_search.yaml

* Split static and dynamic features - credit card balance
kamil-kaczmarek pushed a commit that referenced this pull request Jul 10, 2018
* age/employment dummies (#104)

* added diff features

* New handcrafted features (#102)

* Dynamic features

* Smart features (#61)

* Update README.md

* Update README.md

* Update

* Smart features update

* More descriptive transformer name

* Reading all data in main

* More application features

* Transformer for cleaning

* Multiinput data dictionary

* Fix (#63)

* fixed configs

* dropped redundand steps, moved stuff to cleaning, refactored groupby (#64)

* dropped redundand steps, moved stuff to cleanining, refactored groupby

* restructured, added stacking + CV

* Fix format string

* Update pipeline_manager.py

clipped prediction -> prediction

* added stratified kfold option (#77)

* Update config (#79)

* dropped redundand steps, moved stuff to cleanining, refactored groupby

* restructured, added stacking + CV

* Update pipeline_config.py

* Dev review (#81)

* dropped feature by type split, refactored pipleine_config

* dropped feature by type split method

* explored application features

* trash

* reverted refactor of aggs

* fixed/updated bureau features

* cleared notebooks

* agg features added to notebook bureau

* credit card cleaned

* added other feature notebooks

* added rank mean

* updated model arch

* reverted to old params

* fixed rank mean calculations

* ApplicationCleaning update (#84)

* Cleaning - application

* Clear output in notebook

* clenaed names in steps, refactored mergeaggregate transformer, changed caching/saving specs (#85)

* local trash

* External sources notebook (#86)

* Update

* External sources notebook

* Dev lgbm params (#88)

* local trash

* updated configs

* dropped comment

* updated lgb params

* Dev app agg fix (#90)

* dropped app_aggs

* app agg features fixed

* cleaned leftovers

* dropped fast read-in for debug

* External_sources statistics (#89)

* Speed-up ext_src notebook

* exernal_sources statistics

* Weighted mean and notebook fix

* application notebook update

* clear notebook output

* Fix auto submission (#95)

* CreditCardBalance monthly diff mean

* POSCASH remaining installments

* POSCASH completed_contracts

* notebook update

* Resolve conflicts

* Fix

* Update neptune.yaml

* Update neptune_random_search.yaml

* Split static and dynamic features - credit card balance

* Dev nan count (#105)

* added nan_count

* added nan count with parameter

* Dev fe installments (#106)

* added simple features, parallel groupby, last-installment features

* refactored last_installment features

* added features for the very last installment

* Dev fe instalments dynamic (#107)

* added dynamic-trend features

* formated configs

* added skew/iqr features

* added number of credit agreement change features (#109)

* added number of credit agreement change features

* reverted sample size

* Dynamic features - previous application (#108)

* previous_application handcrafted features

* previous application cleaning

* Update neptune.yaml

* code improvement

* Update notebook

* Notebook - feature importance (#112)

* Dev speed up (#111)

* refactored aggs to calculate only once per training, sped up installment and credit card (only single index groupby)

* sped up all hand crafted

* fixed bureau worker errors

* fixed isntallment names

* fixed isntallment names

* fixed bureau and prev_app naming bugs

* reverted to vectorized where possible

* updated hyperparams

* updated early stopping params to meet convergence

* reverted to old fallback neptune file

* updated paths

* updated paths, explored prev-app features

* dropped duplicated agg

* notebook - feature importance - small fixes (#124)

* Notebook - feature importance

* Notebook - feature importance - search by text

* Notebook - feature importance - search by text

* Notebook - feature importance - Plots description

* fixed typo in feature adding (affected installments)
jakubczakon pushed a commit that referenced this pull request Jul 16, 2018
* age/employment dummies (#104)

* added diff features

* New handcrafted features (#102)

* Dynamic features

* Smart features (#61)

* Update README.md

* Update README.md

* Update

* Smart features update

* More descriptive transformer name

* Reading all data in main

* More application features

* Transformer for cleaning

* Multiinput data dictionary

* Fix (#63)

* fixed configs

* dropped redundand steps, moved stuff to cleaning, refactored groupby (#64)

* dropped redundand steps, moved stuff to cleanining, refactored groupby

* restructured, added stacking + CV

* Fix format string

* Update pipeline_manager.py

clipped prediction -> prediction

* added stratified kfold option (#77)

* Update config (#79)

* dropped redundand steps, moved stuff to cleanining, refactored groupby

* restructured, added stacking + CV

* Update pipeline_config.py

* Dev review (#81)

* dropped feature by type split, refactored pipleine_config

* dropped feature by type split method

* explored application features

* trash

* reverted refactor of aggs

* fixed/updated bureau features

* cleared notebooks

* agg features added to notebook bureau

* credit card cleaned

* added other feature notebooks

* added rank mean

* updated model arch

* reverted to old params

* fixed rank mean calculations

* ApplicationCleaning update (#84)

* Cleaning - application

* Clear output in notebook

* clenaed names in steps, refactored mergeaggregate transformer, changed caching/saving specs (#85)

* local trash

* External sources notebook (#86)

* Update

* External sources notebook

* Dev lgbm params (#88)

* local trash

* updated configs

* dropped comment

* updated lgb params

* Dev app agg fix (#90)

* dropped app_aggs

* app agg features fixed

* cleaned leftovers

* dropped fast read-in for debug

* External_sources statistics (#89)

* Speed-up ext_src notebook

* exernal_sources statistics

* Weighted mean and notebook fix

* application notebook update

* clear notebook output

* Fix auto submission (#95)

* CreditCardBalance monthly diff mean

* POSCASH remaining installments

* POSCASH completed_contracts

* notebook update

* Resolve conflicts

* Fix

* Update neptune.yaml

* Update neptune_random_search.yaml

* Split static and dynamic features - credit card balance

* Dev nan count (#105)

* added nan_count

* added nan count with parameter

* Dev fe installments (#106)

* added simple features, parallel groupby, last-installment features

* refactored last_installment features

* added features for the very last installment

* Dev fe instalments dynamic (#107)

* added dynamic-trend features

* formated configs

* added skew/iqr features

* added number of credit agreement change features (#109)

* added number of credit agreement change features

* reverted sample size

* Dynamic features - previous application (#108)

* previous_application handcrafted features

* previous application cleaning

* Update neptune.yaml

* code improvement

* Update notebook

* Notebook - feature importance (#112)

* Dev speed up (#111)

* refactored aggs to calculate only once per training, sped up installment and credit card (only single index groupby)

* sped up all hand crafted

* fixed bureau worker errors

* fixed isntallment names

* fixed isntallment names

* fixed bureau and prev_app naming bugs

* reverted to vectorized where possible

* updated hyperparams

* updated early stopping params to meet convergence

* reverted to old fallback neptune file

* updated paths

* updated paths, explored prev-app features

* dropped duplicated agg

* POS_CASH added features

* POS CASH features added

* POS_CASH_balance feature cleaning

* Yaml adjustment

* Path change
kamil-kaczmarek pushed a commit that referenced this pull request Jul 18, 2018
* added second level models (#126)

* Family features (#128)

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Features - family - test

* Features - family - aggregate

* Features - family - aggregate 2

* Features - family - aggregate 3

* Features - family - aggregate 4

* Update pipeline_config.py

* Data cleaning and two new features (previous application) (#129)

* new previous application features

* Data cleaning

* update application notebook

* credit card cleaning

* Data cleaning - groupby agg

* Include suggested changes

* Data cleaning - fix (#130)

* new previous application features

* Data cleaning

* update application notebook

* credit card cleaning

* Data cleaning - groupby agg

* Include suggested changes

* Fix

* Dev fractions (#132)

* added fraction features to eda and feature extraction, updated configs

* updated hyperparams

* Dev (#134)

* age/employment dummies (#104)

* added diff features

* New handcrafted features (#102)

* Dynamic features

* Smart features (#61)

* Update README.md

* Update README.md

* Update

* Smart features update

* More descriptive transformer name

* Reading all data in main

* More application features

* Transformer for cleaning

* Multiinput data dictionary

* Fix (#63)

* fixed configs

* dropped redundand steps, moved stuff to cleaning, refactored groupby (#64)

* dropped redundand steps, moved stuff to cleanining, refactored groupby

* restructured, added stacking + CV

* Fix format string

* Update pipeline_manager.py

clipped prediction -> prediction

* added stratified kfold option (#77)

* Update config (#79)

* dropped redundand steps, moved stuff to cleanining, refactored groupby

* restructured, added stacking + CV

* Update pipeline_config.py

* Dev review (#81)

* dropped feature by type split, refactored pipleine_config

* dropped feature by type split method

* explored application features

* trash

* reverted refactor of aggs

* fixed/updated bureau features

* cleared notebooks

* agg features added to notebook bureau

* credit card cleaned

* added other feature notebooks

* added rank mean

* updated model arch

* reverted to old params

* fixed rank mean calculations

* ApplicationCleaning update (#84)

* Cleaning - application

* Clear output in notebook

* clenaed names in steps, refactored mergeaggregate transformer, changed caching/saving specs (#85)

* local trash

* External sources notebook (#86)

* Update

* External sources notebook

* Dev lgbm params (#88)

* local trash

* updated configs

* dropped comment

* updated lgb params

* Dev app agg fix (#90)

* dropped app_aggs

* app agg features fixed

* cleaned leftovers

* dropped fast read-in for debug

* External_sources statistics (#89)

* Speed-up ext_src notebook

* exernal_sources statistics

* Weighted mean and notebook fix

* application notebook update

* clear notebook output

* Fix auto submission (#95)

* CreditCardBalance monthly diff mean

* POSCASH remaining installments

* POSCASH completed_contracts

* notebook update

* Resolve conflicts

* Fix

* Update neptune.yaml

* Update neptune_random_search.yaml

* Split static and dynamic features - credit card balance

* Dev nan count (#105)

* added nan_count

* added nan count with parameter

* Dev fe installments (#106)

* added simple features, parallel groupby, last-installment features

* refactored last_installment features

* added features for the very last installment

* Dev fe instalments dynamic (#107)

* added dynamic-trend features

* formated configs

* added skew/iqr features

* added number of credit agreement change features (#109)

* added number of credit agreement change features

* reverted sample size

* Dynamic features - previous application (#108)

* previous_application handcrafted features

* previous application cleaning

* Update neptune.yaml

* code improvement

* Update notebook

* Notebook - feature importance (#112)

* Dev speed up (#111)

* refactored aggs to calculate only once per training, sped up installment and credit card (only single index groupby)

* sped up all hand crafted

* fixed bureau worker errors

* fixed isntallment names

* fixed isntallment names

* fixed bureau and prev_app naming bugs

* reverted to vectorized where possible

* updated hyperparams

* updated early stopping params to meet convergence

* reverted to old fallback neptune file

* updated paths

* updated paths, explored prev-app features

* dropped duplicated agg

* POS_CASH added features

* POS CASH features added

* POS_CASH_balance feature cleaning

* Yaml adjustment

* Path change

* fix misinterpretations

'<' instead of '>'

* fix misinterpretations

'<' instead of '>'

* Add cleaning in application_groupby_agg (#137)

* application agg cleaning

* update neptune.yaml

* Interaction features (#139)

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Features - family - test

* Features - family - aggregate

* Features - family - aggregate 2

* Features - family - aggregate 3

* Features - family - aggregate 4

* Update pipeline_config.py

* Features - family - added new cols to agg

* Features - interaction features

* Features - interaction features - fix

* Added is_unbalance to configs

* updated paths, added corr prints in pos cash balance

* dropped unused dependencies

* updated pandas version
jakubczakon pushed a commit that referenced this pull request Jul 20, 2018
* age/employment dummies (#104)

* added diff features

* New handcrafted features (#102)

* Dynamic features

* Smart features (#61)

* Update README.md

* Update README.md

* Update

* Smart features update

* More descriptive transformer name

* Reading all data in main

* More application features

* Transformer for cleaning

* Multiinput data dictionary

* Fix (#63)

* fixed configs

* dropped redundand steps, moved stuff to cleaning, refactored groupby (#64)

* dropped redundand steps, moved stuff to cleanining, refactored groupby

* restructured, added stacking + CV

* Fix format string

* Update pipeline_manager.py

clipped prediction -> prediction

* added stratified kfold option (#77)

* Update config (#79)

* dropped redundand steps, moved stuff to cleanining, refactored groupby

* restructured, added stacking + CV

* Update pipeline_config.py

* Dev review (#81)

* dropped feature by type split, refactored pipleine_config

* dropped feature by type split method

* explored application features

* trash

* reverted refactor of aggs

* fixed/updated bureau features

* cleared notebooks

* agg features added to notebook bureau

* credit card cleaned

* added other feature notebooks

* added rank mean

* updated model arch

* reverted to old params

* fixed rank mean calculations

* ApplicationCleaning update (#84)

* Cleaning - application

* Clear output in notebook

* clenaed names in steps, refactored mergeaggregate transformer, changed caching/saving specs (#85)

* local trash

* External sources notebook (#86)

* Update

* External sources notebook

* Dev lgbm params (#88)

* local trash

* updated configs

* dropped comment

* updated lgb params

* Dev app agg fix (#90)

* dropped app_aggs

* app agg features fixed

* cleaned leftovers

* dropped fast read-in for debug

* External_sources statistics (#89)

* Speed-up ext_src notebook

* exernal_sources statistics

* Weighted mean and notebook fix

* application notebook update

* clear notebook output

* Fix auto submission (#95)

* CreditCardBalance monthly diff mean

* POSCASH remaining installments

* POSCASH completed_contracts

* notebook update

* Resolve conflicts

* Fix

* Update neptune.yaml

* Update neptune_random_search.yaml

* Split static and dynamic features - credit card balance

* Dev nan count (#105)

* added nan_count

* added nan count with parameter

* Dev fe installments (#106)

* added simple features, parallel groupby, last-installment features

* refactored last_installment features

* added features for the very last installment

* Dev fe instalments dynamic (#107)

* added dynamic-trend features

* formated configs

* added skew/iqr features

* added number of credit agreement change features (#109)

* added number of credit agreement change features

* reverted sample size

* Dynamic features - previous application (#108)

* previous_application handcrafted features

* previous application cleaning

* Update neptune.yaml

* code improvement

* Update notebook

* Notebook - feature importance (#112)

* Dev speed up (#111)

* refactored aggs to calculate only once per training, sped up installment and credit card (only single index groupby)

* sped up all hand crafted

* fixed bureau worker errors

* fixed isntallment names

* fixed isntallment names

* fixed bureau and prev_app naming bugs

* reverted to vectorized where possible

* updated hyperparams

* updated early stopping params to meet convergence

* reverted to old fallback neptune file

* updated paths

* updated paths, explored prev-app features

* dropped duplicated agg

* POS_CASH added features

* added second level models (#126)

* POS CASH features added

* Family features (#128)

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Features - family - test

* Features - family - aggregate

* Features - family - aggregate 2

* Features - family - aggregate 3

* Features - family - aggregate 4

* Update pipeline_config.py

* POS_CASH_balance feature cleaning

* Yaml adjustment

* Data cleaning and two new features (previous application) (#129)

* new previous application features

* Data cleaning

* update application notebook

* credit card cleaning

* Data cleaning - groupby agg

* Include suggested changes

* Data cleaning - fix (#130)

* new previous application features

* Data cleaning

* update application notebook

* credit card cleaning

* Data cleaning - groupby agg

* Include suggested changes

* Fix

* Initial bureau_balance features

* Dev fractions (#132)

* added fraction features to eda and feature extraction, updated configs

* updated hyperparams

* Dev (#134)

* age/employment dummies (#104)

* added diff features

* New handcrafted features (#102)

* Dynamic features

* Smart features (#61)

* Update README.md

* Update README.md

* Update

* Smart features update

* More descriptive transformer name

* Reading all data in main

* More application features

* Transformer for cleaning

* Multiinput data dictionary

* Fix (#63)

* fixed configs

* dropped redundant steps, moved stuff to cleaning, refactored groupby (#64)

* dropped redundant steps, moved stuff to cleaning, refactored groupby

* restructured, added stacking + CV

* Fix format string

* Update pipeline_manager.py

clipped prediction -> prediction

* added stratified kfold option (#77)

* Update config (#79)

* dropped redundant steps, moved stuff to cleaning, refactored groupby

* restructured, added stacking + CV

* Update pipeline_config.py

* Dev review (#81)

* dropped feature by type split, refactored pipeline_config

* dropped feature by type split method

* explored application features

* trash

* reverted refactor of aggs

* fixed/updated bureau features

* cleared notebooks

* agg features added to notebook bureau

* credit card cleaned

* added other feature notebooks

* added rank mean

* updated model arch

* reverted to old params

* fixed rank mean calculations

* ApplicationCleaning update (#84)

* Cleaning - application

* Clear output in notebook

* cleaned names in steps, refactored mergeaggregate transformer, changed caching/saving specs (#85)

* local trash

* External sources notebook (#86)

* Update

* External sources notebook

* Dev lgbm params (#88)

* local trash

* updated configs

* dropped comment

* updated lgb params

* Dev app agg fix (#90)

* dropped app_aggs

* app agg features fixed

* cleaned leftovers

* dropped fast read-in for debug

* External_sources statistics (#89)

* Speed-up ext_src notebook

* external_sources statistics

* Weighted mean and notebook fix

* application notebook update

* clear notebook output

* Fix auto submission (#95)

* CreditCardBalance monthly diff mean

* POSCASH remaining installments

* POSCASH completed_contracts

* notebook update

* Resolve conflicts

* Fix

* Update neptune.yaml

* Update neptune_random_search.yaml

* Split static and dynamic features - credit card balance

* Dev nan count (#105)

* added nan_count

* added nan count with parameter

* Dev fe installments (#106)

* added simple features, parallel groupby, last-installment features

* refactored last_installment features

* added features for the very last installment

* Dev fe instalments dynamic (#107)

* added dynamic-trend features

* formatted configs

* added skew/iqr features

* added number of credit agreement change features (#109)

* added number of credit agreement change features

* reverted sample size

* Dynamic features - previous application (#108)

* previous_application handcrafted features

* previous application cleaning

* Update neptune.yaml

* code improvement

* Update notebook

* Notebook - feature importance (#112)

* Dev speed up (#111)

* refactored aggs to calculate only once per training, sped up installment and credit card (only single index groupby)

* sped up all hand crafted

* fixed bureau worker errors

* fixed installment names

* fixed installment names

* fixed bureau and prev_app naming bugs

* reverted to vectorized where possible

* updated hyperparams

* updated early stopping params to meet convergence

* reverted to old fallback neptune file

* updated paths

* updated paths, explored prev-app features

* dropped duplicated agg

* POS_CASH added features

* POS CASH features added

* POS_CASH_balance feature cleaning

* Yaml adjustment

* Path change

* fix misinterpretations

'<' instead of '>'

* fix misinterpretations

'<' instead of '>'

* Code cleanup

* Bug fix

* NaN handling

* Add cleaning in application_groupby_agg (#137)

* application agg cleaning

* update neptune.yaml

* Interaction features (#139)

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Features - family - test

* Features - family - aggregate

* Features - family - aggregate 2

* Features - family - aggregate 3

* Features - family - aggregate 4

* Update pipeline_config.py

* Features - family - added new cols to agg

* Features - interaction features

* Features - interaction features - fix

* Added is_unbalance to configs

* Time correction

* Full time count correction

* Time features correction and bureau_balance features

* Bug fixing

* Bug fixing
jakubczakon pushed a commit that referenced this pull request Jul 26, 2018
* age/employment dummies (#104)

* added diff features

* New handcrafted features (#102)

* Dynamic features

* Smart features (#61)

* Update README.md

* Update README.md

* Update

* Smart features update

* More descriptive transformer name

* Reading all data in main

* More application features

* Transformer for cleaning

* Multiinput data dictionary

* Fix (#63)

* fixed configs

* dropped redundant steps, moved stuff to cleaning, refactored groupby (#64)

* dropped redundant steps, moved stuff to cleaning, refactored groupby

* restructured, added stacking + CV

* Fix format string

* Update pipeline_manager.py

clipped prediction -> prediction

* added stratified kfold option (#77)

* Update config (#79)

* dropped redundant steps, moved stuff to cleaning, refactored groupby

* restructured, added stacking + CV

* Update pipeline_config.py

* Dev review (#81)

* dropped feature by type split, refactored pipeline_config

* dropped feature by type split method

* explored application features

* trash

* reverted refactor of aggs

* fixed/updated bureau features

* cleared notebooks

* agg features added to notebook bureau

* credit card cleaned

* added other feature notebooks

* added rank mean

* updated model arch

* reverted to old params

* fixed rank mean calculations

* ApplicationCleaning update (#84)

* Cleaning - application

* Clear output in notebook

* cleaned names in steps, refactored mergeaggregate transformer, changed caching/saving specs (#85)

* local trash

* External sources notebook (#86)

* Update

* External sources notebook

* Dev lgbm params (#88)

* local trash

* updated configs

* dropped comment

* updated lgb params

* Dev app agg fix (#90)

* dropped app_aggs

* app agg features fixed

* cleaned leftovers

* dropped fast read-in for debug

* External_sources statistics (#89)

* Speed-up ext_src notebook

* external_sources statistics

* Weighted mean and notebook fix

* application notebook update

* clear notebook output

* Fix auto submission (#95)

* CreditCardBalance monthly diff mean

* POSCASH remaining installments

* POSCASH completed_contracts

* notebook update

* Resolve conflicts

* Fix

* Update neptune.yaml

* Update neptune_random_search.yaml

* Split static and dynamic features - credit card balance

* Dev nan count (#105)

* added nan_count

* added nan count with parameter

* Dev fe installments (#106)

* added simple features, parallel groupby, last-installment features

* refactored last_installment features

* added features for the very last installment

* Dev fe instalments dynamic (#107)

* added dynamic-trend features

* formatted configs

* added skew/iqr features

* added number of credit agreement change features (#109)

* added number of credit agreement change features

* reverted sample size

* Dynamic features - previous application (#108)

* previous_application handcrafted features

* previous application cleaning

* Update neptune.yaml

* code improvement

* Update notebook

* Notebook - feature importance (#112)

* Dev speed up (#111)

* refactored aggs to calculate only once per training, sped up installment and credit card (only single index groupby)

* sped up all hand crafted

* fixed bureau worker errors

* fixed installment names

* fixed installment names

* fixed bureau and prev_app naming bugs

* reverted to vectorized where possible

* updated hyperparams

* updated early stopping params to meet convergence

* reverted to old fallback neptune file

* updated paths

* updated paths, explored prev-app features

* dropped duplicated agg

* POS_CASH added features

* added second level models (#126)

* POS CASH features added

* Family features (#128)

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Features - family - test

* Features - family - aggregate

* Features - family - aggregate 2

* Features - family - aggregate 3

* Features - family - aggregate 4

* Update pipeline_config.py

* POS_CASH_balance feature cleaning

* Yaml adjustment

* Data cleaning and two new features (previous application) (#129)

* new previous application features

* Data cleaning

* update application notebook

* credit card cleaning

* Data cleaning - groupby agg

* Include suggested changes

* Data cleaning - fix (#130)

* new previous application features

* Data cleaning

* update application notebook

* credit card cleaning

* Data cleaning - groupby agg

* Include suggested changes

* Fix

* Dev fractions (#132)

* added fraction features to eda and feature extraction, updated configs

* updated hyperparams

* Path change

* Dev (#134)

* age/employment dummies (#104)

* added diff features

* New handcrafted features (#102)

* Dynamic features

* Smart features (#61)

* Update README.md

* Update README.md

* Update

* Smart features update

* More descriptive transformer name

* Reading all data in main

* More application features

* Transformer for cleaning

* Multiinput data dictionary

* Fix (#63)

* fixed configs

* dropped redundant steps, moved stuff to cleaning, refactored groupby (#64)

* dropped redundant steps, moved stuff to cleaning, refactored groupby

* restructured, added stacking + CV

* Fix format string

* Update pipeline_manager.py

clipped prediction -> prediction

* added stratified kfold option (#77)

* Update config (#79)

* dropped redundant steps, moved stuff to cleaning, refactored groupby

* restructured, added stacking + CV

* Update pipeline_config.py

* Dev review (#81)

* dropped feature by type split, refactored pipeline_config

* dropped feature by type split method

* explored application features

* trash

* reverted refactor of aggs

* fixed/updated bureau features

* cleared notebooks

* agg features added to notebook bureau

* credit card cleaned

* added other feature notebooks

* added rank mean

* updated model arch

* reverted to old params

* fixed rank mean calculations

* ApplicationCleaning update (#84)

* Cleaning - application

* Clear output in notebook

* cleaned names in steps, refactored mergeaggregate transformer, changed caching/saving specs (#85)

* local trash

* External sources notebook (#86)

* Update

* External sources notebook

* Dev lgbm params (#88)

* local trash

* updated configs

* dropped comment

* updated lgb params

* Dev app agg fix (#90)

* dropped app_aggs

* app agg features fixed

* cleaned leftovers

* dropped fast read-in for debug

* External_sources statistics (#89)

* Speed-up ext_src notebook

* external_sources statistics

* Weighted mean and notebook fix

* application notebook update

* clear notebook output

* Fix auto submission (#95)

* CreditCardBalance monthly diff mean

* POSCASH remaining installments

* POSCASH completed_contracts

* notebook update

* Resolve conflicts

* Fix

* Update neptune.yaml

* Update neptune_random_search.yaml

* Split static and dynamic features - credit card balance

* Dev nan count (#105)

* added nan_count

* added nan count with parameter

* Dev fe installments (#106)

* added simple features, parallel groupby, last-installment features

* refactored last_installment features

* added features for the very last installment

* Dev fe instalments dynamic (#107)

* added dynamic-trend features

* formatted configs

* added skew/iqr features

* added number of credit agreement change features (#109)

* added number of credit agreement change features

* reverted sample size

* Dynamic features - previous application (#108)

* previous_application handcrafted features

* previous application cleaning

* Update neptune.yaml

* code improvement

* Update notebook

* Notebook - feature importance (#112)

* Dev speed up (#111)

* refactored aggs to calculate only once per training, sped up installment and credit card (only single index groupby)

* sped up all hand crafted

* fixed bureau worker errors

* fixed installment names

* fixed installment names

* fixed bureau and prev_app naming bugs

* reverted to vectorized where possible

* updated hyperparams

* updated early stopping params to meet convergence

* reverted to old fallback neptune file

* updated paths

* updated paths, explored prev-app features

* dropped duplicated agg

* POS_CASH added features

* POS CASH features added

* POS_CASH_balance feature cleaning

* Yaml adjustment

* Path change

* fix misinterpretations

'<' instead of '>'

* fix misinterpretations

'<' instead of '>'

* Add cleaning in application_groupby_agg (#137)

* application agg cleaning

* update neptune.yaml

* New branch

* Notebook dev

* q

* Sklearn models modified

* Minor bug fix

* Whatever

* Space refactor

* Old forgotten merge

* Final refactor

* Minor update

* last k features with fraction removal

* Fix PR issues
jakubczakon pushed a commit that referenced this pull request Jul 26, 2018
* age/employment dummies (#104)

* added diff features

* New handcrafted features (#102)

* Dynamic features

* Smart features (#61)

* Update README.md

* Update README.md

* Update

* Smart features update

* More descriptive transformer name

* Reading all data in main

* More application features

* Transformer for cleaning

* Multiinput data dictionary

* Fix (#63)

* fixed configs

* dropped redundant steps, moved stuff to cleaning, refactored groupby (#64)

* dropped redundant steps, moved stuff to cleaning, refactored groupby

* restructured, added stacking + CV

* Fix format string

* Update pipeline_manager.py

clipped prediction -> prediction

* added stratified kfold option (#77)

* Update config (#79)

* dropped redundant steps, moved stuff to cleaning, refactored groupby

* restructured, added stacking + CV

* Update pipeline_config.py

* Dev review (#81)

* dropped feature by type split, refactored pipeline_config

* dropped feature by type split method

* explored application features

* trash

* reverted refactor of aggs

* fixed/updated bureau features

* cleared notebooks

* agg features added to notebook bureau

* credit card cleaned

* added other feature notebooks

* added rank mean

* updated model arch

* reverted to old params

* fixed rank mean calculations

* ApplicationCleaning update (#84)

* Cleaning - application

* Clear output in notebook

* cleaned names in steps, refactored mergeaggregate transformer, changed caching/saving specs (#85)

* local trash

* External sources notebook (#86)

* Update

* External sources notebook

* Dev lgbm params (#88)

* local trash

* updated configs

* dropped comment

* updated lgb params

* Dev app agg fix (#90)

* dropped app_aggs

* app agg features fixed

* cleaned leftovers

* dropped fast read-in for debug

* External_sources statistics (#89)

* Speed-up ext_src notebook

* external_sources statistics

* Weighted mean and notebook fix

* application notebook update

* clear notebook output

* Fix auto submission (#95)

* CreditCardBalance monthly diff mean

* POSCASH remaining installments

* POSCASH completed_contracts

* notebook update

* Resolve conflicts

* Fix

* Update neptune.yaml

* Update neptune_random_search.yaml

* Split static and dynamic features - credit card balance

* Dev nan count (#105)

* added nan_count

* added nan count with parameter

* Dev fe installments (#106)

* added simple features, parallel groupby, last-installment features

* refactored last_installment features

* added features for the very last installment

* Dev fe instalments dynamic (#107)

* added dynamic-trend features

* formatted configs

* added skew/iqr features

* added number of credit agreement change features (#109)

* added number of credit agreement change features

* reverted sample size

* Dynamic features - previous application (#108)

* previous_application handcrafted features

* previous application cleaning

* Update neptune.yaml

* code improvement

* Update notebook

* Notebook - feature importance (#112)

* Dev speed up (#111)

* refactored aggs to calculate only once per training, sped up installment and credit card (only single index groupby)

* sped up all hand crafted

* fixed bureau worker errors

* fixed installment names

* fixed installment names

* fixed bureau and prev_app naming bugs

* reverted to vectorized where possible

* updated hyperparams

* updated early stopping params to meet convergence

* reverted to old fallback neptune file

* updated paths

* updated paths, explored prev-app features

* dropped duplicated agg

* POS_CASH added features

* added second level models (#126)

* POS CASH features added

* Family features (#128)

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Features - family - test

* Features - family - aggregate

* Features - family - aggregate 2

* Features - family - aggregate 3

* Features - family - aggregate 4

* Update pipeline_config.py

* POS_CASH_balance feature cleaning

* Yaml adjustment

* Data cleaning and two new features (previous application) (#129)

* new previous application features

* Data cleaning

* update application notebook

* credit card cleaning

* Data cleaning - groupby agg

* Include suggested changes

* Data cleaning - fix (#130)

* new previous application features

* Data cleaning

* update application notebook

* credit card cleaning

* Data cleaning - groupby agg

* Include suggested changes

* Fix

* Dev fractions (#132)

* added fraction features to eda and feature extraction, updated configs

* updated hyperparams

* Path change

* Dev (#134)

* age/employment dummies (#104)

* added diff features

* New handcrafted features (#102)

* Dynamic features

* Smart features (#61)

* Update README.md

* Update README.md

* Update

* Smart features update

* More descriptive transformer name

* Reading all data in main

* More application features

* Transformer for cleaning

* Multiinput data dictionary

* Fix (#63)

* fixed configs

* dropped redundant steps, moved stuff to cleaning, refactored groupby (#64)

* dropped redundant steps, moved stuff to cleaning, refactored groupby

* restructured, added stacking + CV

* Fix format string

* Update pipeline_manager.py

clipped prediction -> prediction

* added stratified kfold option (#77)

* Update config (#79)

* dropped redundant steps, moved stuff to cleaning, refactored groupby

* restructured, added stacking + CV

* Update pipeline_config.py

* Dev review (#81)

* dropped feature by type split, refactored pipeline_config

* dropped feature by type split method

* explored application features

* trash

* reverted refactor of aggs

* fixed/updated bureau features

* cleared notebooks

* agg features added to notebook bureau

* credit card cleaned

* added other feature notebooks

* added rank mean

* updated model arch

* reverted to old params

* fixed rank mean calculations

* ApplicationCleaning update (#84)

* Cleaning - application

* Clear output in notebook

* cleaned names in steps, refactored mergeaggregate transformer, changed caching/saving specs (#85)

* local trash

* External sources notebook (#86)

* Update

* External sources notebook

* Dev lgbm params (#88)

* local trash

* updated configs

* dropped comment

* updated lgb params

* Dev app agg fix (#90)

* dropped app_aggs

* app agg features fixed

* cleaned leftovers

* dropped fast read-in for debug

* External_sources statistics (#89)

* Speed-up ext_src notebook

* external_sources statistics

* Weighted mean and notebook fix

* application notebook update

* clear notebook output

* Fix auto submission (#95)

* CreditCardBalance monthly diff mean

* POSCASH remaining installments

* POSCASH completed_contracts

* notebook update

* Resolve conflicts

* Fix

* Update neptune.yaml

* Update neptune_random_search.yaml

* Split static and dynamic features - credit card balance

* Dev nan count (#105)

* added nan_count

* added nan count with parameter

* Dev fe installments (#106)

* added simple features, parallel groupby, last-installment features

* refactored last_installment features

* added features for the very last installment

* Dev fe instalments dynamic (#107)

* added dynamic-trend features

* formatted configs

* added skew/iqr features

* added number of credit agreement change features (#109)

* added number of credit agreement change features

* reverted sample size

* Dynamic features - previous application (#108)

* previous_application handcrafted features

* previous application cleaning

* Update neptune.yaml

* code improvement

* Update notebook

* Notebook - feature importance (#112)

* Dev speed up (#111)

* refactored aggs to calculate only once per training, sped up installment and credit card (only single index groupby)

* sped up all hand crafted

* fixed bureau worker errors

* fixed installment names

* fixed installment names

* fixed bureau and prev_app naming bugs

* reverted to vectorized where possible

* updated hyperparams

* updated early stopping params to meet convergence

* reverted to old fallback neptune file

* updated paths

* updated paths, explored prev-app features

* dropped duplicated agg

* POS_CASH added features

* POS CASH features added

* POS_CASH_balance feature cleaning

* Yaml adjustment

* Path change

* fix misinterpretations

'<' instead of '>'

* fix misinterpretations

'<' instead of '>'

* Add cleaning in application_groupby_agg (#137)

* application agg cleaning

* update neptune.yaml

* New branch

* Notebook dev

* q

* Sklearn models modified

* Minor bug fix

* Whatever

* Space refactor

* Old forgotten merge

* Final refactor

* Minor update

* last k features with fraction removal

* Fix PR issues

* Fillna bug fix

3 participants