This repository has been archived by the owner on Jun 22, 2022. It is now read-only.

Commit

Dev (#134)
* age/employment dummies (#104)

* added diff features

* New handcrafted features (#102)

* Dynamic features

* Smart features (#61)

* Update README.md

* Update README.md

* Update

* Smart features update

* More descriptive transformer name

* Reading all data in main

* More application features

* Transformer for cleaning

* Multi-input data dictionary

* Fix (#63)

* fixed configs

* dropped redundant steps, moved stuff to cleaning, refactored groupby (#64)

* dropped redundant steps, moved stuff to cleaning, refactored groupby

* restructured, added stacking + CV

* Fix format string

* Update pipeline_manager.py

clipped prediction -> prediction

* added stratified kfold option (#77)

* Update config (#79)

* dropped redundant steps, moved stuff to cleaning, refactored groupby

* restructured, added stacking + CV

* Update pipeline_config.py

* Dev review (#81)

* dropped feature by type split, refactored pipeline_config

* dropped feature by type split method

* explored application features

* trash

* reverted refactor of aggs

* fixed/updated bureau features

* cleared notebooks

* agg features added to notebook bureau

* credit card cleaned

* added other feature notebooks

* added rank mean

* updated model arch

* reverted to old params

* fixed rank mean calculations

* ApplicationCleaning update (#84)

* Cleaning - application

* Clear output in notebook

* cleaned names in steps, refactored mergeaggregate transformer, changed caching/saving specs (#85)

* local trash

* External sources notebook (#86)

* Update

* External sources notebook

* Dev lgbm params (#88)

* local trash

* updated configs

* dropped comment

* updated lgb params

* Dev app agg fix (#90)

* dropped app_aggs

* app agg features fixed

* cleaned leftovers

* dropped fast read-in for debug

* External_sources statistics (#89)

* Speed-up ext_src notebook

* external_sources statistics

* Weighted mean and notebook fix

* application notebook update

* clear notebook output

* Fix auto submission (#95)

* CreditCardBalance monthly diff mean

* POSCASH remaining installments

* POSCASH completed_contracts

* notebook update

* Resolve conflicts

* Fix

* Update neptune.yaml

* Update neptune_random_search.yaml

* Split static and dynamic features - credit card balance

* Dev nan count (#105)

* added nan_count

* added nan count with parameter

* Dev fe installments (#106)

* added simple features, parallel groupby, last-installment features

* refactored last_installment features

* added features for the very last installment

* Dev fe instalments dynamic (#107)

* added dynamic-trend features

* formatted configs

* added skew/iqr features

* added number of credit agreement change features (#109)

* added number of credit agreement change features

* reverted sample size

* Dynamic features - previous application (#108)

* previous_application handcrafted features

* previous application cleaning

* Update neptune.yaml

* code improvement

* Update notebook

* Notebook - feature importance (#112)

* Dev speed up (#111)

* refactored aggs to calculate only once per training, sped up installment and credit card (only single index groupby)

* sped up all hand crafted

* fixed bureau worker errors

* fixed installment names

* fixed installment names

* fixed bureau and prev_app naming bugs

* reverted to vectorized where possible

* updated hyperparams

* updated early stopping params to meet convergence

* reverted to old fallback neptune file

* updated paths

* updated paths, explored prev-app features

* dropped duplicated agg

* POS_CASH added features

* POS CASH features added

* POS_CASH_balance feature cleaning

* Yaml adjustment

* Path change
karol.strzalkowski authored and jakubczakon committed Jul 16, 2018
1 parent 68ca3be commit c9c9c69
Showing 6 changed files with 370 additions and 18 deletions.
4 changes: 3 additions & 1 deletion configs/neptune.yaml
@@ -1,7 +1,7 @@
project: ORGANIZATION/home-credit

name: home-credit-default-risk
tags: [solution-4, dev]
tags: [solution-5, dev]

metric:
channel: 'ROC_AUC'
@@ -54,6 +54,8 @@ parameters:
installments__last_k_trend_periods: '[10, 50, 100, 500]'
installments__last_k_agg_periods: '[1, 5, 10, 20, 50, 100]'
installments__last_k_agg_period_fractions: '[(5,20),(5,50),(10,50),(10,100),(20,100)]'
pos_cash__last_k_trend_periods: '[6, 12]'
pos_cash__last_k_agg_periods: '[6, 12, 30]'
application_aggregation__use_diffs_only: True
use_nan_count: True
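The list-valued parameters above (e.g. `pos_cash__last_k_agg_periods: '[6, 12, 30]'`) are stored as YAML strings. A minimal sketch of how such strings can be recovered as Python objects — the parameter values are from the diff, but the parsing helper here is an assumption, not the repository's code:

```python
# Sketch: turning the quoted YAML parameter strings back into Python
# lists/tuples. ast.literal_eval safely evaluates literal expressions.
from ast import literal_eval

pos_cash_last_k_agg_periods = literal_eval('[6, 12, 30]')
agg_period_fractions = literal_eval('[(5,20),(5,50),(10,50),(10,100),(20,100)]')

print(pos_cash_last_k_agg_periods)  # [6, 12, 30]
```

`literal_eval` rejects anything that is not a literal, which makes it a safer choice than `eval` for config values.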

2 changes: 2 additions & 0 deletions configs/neptune_random_search.yaml
@@ -54,6 +54,8 @@ parameters:
installments__last_k_trend_periods: '[10, 50, 100, 500]'
installments__last_k_agg_periods: '[1, 5, 10, 50, 100, 500]'
installments__last_k_agg_period_fractions: '[(5,20),(5,50),(10,50),(10,100),(20,100)]'
pos_cash__last_k_trend_periods: '[6, 12]'
pos_cash__last_k_agg_periods: '[6, 12, 30]'
application_aggregation__use_diffs_only: True
use_nan_count: True

2 changes: 2 additions & 0 deletions configs/neptune_stacking.yaml
@@ -54,6 +54,8 @@ parameters:
installments__last_k_trend_periods: None
installments__last_k_agg_periods: None
installments__last_k_agg_period_fractions: None
pos_cash__last_k_trend_periods: None
pos_cash__last_k_agg_periods: None
application_aggregation__use_diffs_only: True
use_nan_count: True

254 changes: 248 additions & 6 deletions notebooks/eda-pos_cash_balance.ipynb
@@ -7,13 +7,21 @@
"outputs": [],
"source": [
"import os\n",
"import sys\n",
"import pandas as pd\n",
"import numpy as np\n",
"from tqdm import tqdm_notebook as tqdm\n",
"from functools import partial\n",
"from sklearn.externals import joblib\n",
"%matplotlib inline\n",
"import seaborn as sns\n",
"from sklearn.linear_model import LinearRegression\n",
"\n",
"DIR = '/mnt/ml-team/minerva/open-solutions/home-credit'\n",
"sys.path.append('../')\n",
"from src.utils import parallel_apply\n",
"from src.feature_extraction import add_features_in_group\n",
"\n",
"DIR = 'PATH/TO/YOUR/DATA'\n",
"description = pd.read_csv(os.path.join(DIR,'data/HomeCredit_columns_description.csv'),encoding = 'latin1')\n",
"application = pd.read_csv(os.path.join(DIR, 'files/unzipped_data/application_train.csv'))\n",
"pos_cash_balance = pd.read_csv(os.path.join(DIR, 'files/unzipped_data/POS_CASH_balance.csv'))"
@@ -170,11 +178,10 @@
"metadata": {},
"outputs": [],
"source": [
"application = application.merge(features,\n",
" left_on=['SK_ID_CURR'],\n",
" right_on=['SK_ID_CURR'],\n",
"X = application.merge(features, left_on=['SK_ID_CURR'], right_on=['SK_ID_CURR'],\n",
" how='left',\n",
" validate='one_to_one')"
" validate='one_to_one')\n",
"X = X[features.columns.tolist()+['TARGET']]"
]
},
{
@@ -185,7 +192,7 @@
"source": [
"engineered_numerical_columns = list(features.columns)\n",
"engineered_numerical_columns.remove('SK_ID_CURR')\n",
"credit_eng = application[engineered_numerical_columns + ['TARGET']]\n",
"credit_eng = X[engineered_numerical_columns + ['TARGET']]\n",
"credit_eng_corr = abs(credit_eng.corr())"
]
},
@@ -209,6 +216,241 @@
" yticklabels=credit_eng_corr.columns)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"features.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Solution 5\n",
"\n",
"### Hand crafted features"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pos_cash_balance['pos_cash_paid_late'] = (pos_cash_balance['SK_DPD'] > 0).astype(int)\n",
"pos_cash_balance['pos_cash_paid_late_with_tolerance'] = (pos_cash_balance['SK_DPD_DEF'] > 0).astype(int)\n",
"groupby = pos_cash_balance.groupby(['SK_ID_CURR'])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def last_k_installment_features(gr, periods):\n",
" gr_ = gr.copy()\n",
" gr_.sort_values(['MONTHS_BALANCE'], ascending=False, inplace=True)\n",
"\n",
" features = {}\n",
" for period in periods:\n",
" if period > 10e10:\n",
" period_name = 'all_installment_'\n",
" gr_period = gr_.copy()\n",
" else:\n",
" period_name = 'last_{}_'.format(period)\n",
" gr_period = gr_.iloc[:period]\n",
"\n",
" features = add_features_in_group(features, gr_period, 'pos_cash_paid_late',\n",
" ['count', 'mean'],\n",
" period_name)\n",
" features = add_features_in_group(features, gr_period, 'pos_cash_paid_late_with_tolerance',\n",
" ['count', 'mean'],\n",
" period_name)\n",
" features = add_features_in_group(features, gr_period, 'SK_DPD',\n",
" ['sum', 'mean', 'max', 'min', 'median'],\n",
" period_name)\n",
" features = add_features_in_group(features, gr_period, 'SK_DPD_DEF',\n",
" ['sum', 'mean', 'max', 'min','median'],\n",
" period_name)\n",
" return features"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"features = pd.DataFrame({'SK_ID_CURR': pos_cash_balance['SK_ID_CURR'].unique()})\n",
"func = partial(last_k_installment_features, periods=[1, 10, 50, 10e16])\n",
"g = parallel_apply(groupby, func, index_name='SK_ID_CURR', num_workers=10, chunk_size=10000).reset_index()\n",
"features = features.merge(g, on='SK_ID_CURR', how='left')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"features.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"X = X.merge(features, on='SK_ID_CURR',how='left')\n",
"X.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Last loan features"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def last_loan_features(gr):\n",
" gr_ = gr.copy()\n",
" gr_.sort_values(['MONTHS_BALANCE'], ascending=False, inplace=True)\n",
" last_installment_id = gr_['SK_ID_PREV'].iloc[0]\n",
" gr_ = gr_[gr_['SK_ID_PREV'] == last_installment_id]\n",
"\n",
" features={}\n",
" features = add_features_in_group(features, gr_, 'pos_cash_paid_late',\n",
" ['count', 'sum', 'mean'],\n",
" 'last_loan_')\n",
" features = add_features_in_group(features, gr_, 'pos_cash_paid_late_with_tolerance',\n",
" ['sum', 'mean'],\n",
" 'last_loan_')\n",
" features = add_features_in_group(features, gr_, 'SK_DPD',\n",
" ['sum', 'mean', 'max', 'min', 'std'],\n",
" 'last_loan_')\n",
" features = add_features_in_group(features, gr_, 'SK_DPD_DEF',\n",
" ['sum', 'mean', 'max', 'min', 'std'],\n",
" 'last_loan_')\n",
" return features"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"features = pd.DataFrame({'SK_ID_CURR': pos_cash_balance['SK_ID_CURR'].unique()})\n",
"g = parallel_apply(groupby, last_loan_features, index_name='SK_ID_CURR', num_workers=10, chunk_size=10000).reset_index()\n",
"features = features.merge(g, on='SK_ID_CURR', how='left')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"features.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"X = X.merge(features, on='SK_ID_CURR',how='left')\n",
"X.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Trend features"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def trend_in_last_k_installment_features(gr, periods):\n",
" gr_ = gr.copy()\n",
" gr_.sort_values(['MONTHS_BALANCE'], ascending=False, inplace=True)\n",
"\n",
" features = {}\n",
" for period in periods:\n",
" gr_period = gr_.iloc[:period]\n",
"\n",
" features = add_trend_feature(features, gr_period,\n",
" 'SK_DPD', '{}_period_trend_'.format(period)\n",
" )\n",
" features = add_trend_feature(features, gr_period,\n",
" 'SK_DPD_DEF', '{}_period_trend_'.format(period)\n",
" )\n",
" return features\n",
"\n",
"def add_trend_feature(features, gr, feature_name, prefix):\n",
" y = gr[feature_name].values\n",
" try:\n",
" x = np.arange(0, len(y)).reshape(-1, 1)\n",
" lr = LinearRegression()\n",
" lr.fit(x, y)\n",
" trend = lr.coef_[0]\n",
" except:\n",
" trend = np.nan\n",
" features['{}{}'.format(prefix, feature_name)] = trend\n",
" return features"
]
},
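The trend cell above extracts the slope of a least-squares line fitted through the last `k` values with `LinearRegression`. For reference, the same slope can be computed more cheaply with `np.polyfit` — a sketch of the idea, not the repository's code:

```python
import numpy as np

def trend(values):
    # Slope of a least-squares line through the sequence; equivalent to
    # LinearRegression().fit(x, y).coef_[0] in the notebook cell above.
    y = np.asarray(values, dtype=float)
    if len(y) < 2:
        return np.nan  # a single point has no defined trend
    x = np.arange(len(y))
    return np.polyfit(x, y, deg=1)[0]
```

A strictly increasing series like `[0, 1, 2, 3]` yields a slope of 1.0; degenerate groups fall back to NaN, mirroring the `try/except` in the notebook.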
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"features = pd.DataFrame({'SK_ID_CURR': pos_cash_balance['SK_ID_CURR'].unique()})\n",
"func = partial(trend_in_last_k_installment_features, periods=[1,6,12,30,60])\n",
"g = parallel_apply(groupby, func, index_name='SK_ID_CURR', num_workers=10, chunk_size=10000).reset_index()\n",
"features = features.merge(g, on='SK_ID_CURR', how='left')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"features.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"X = X.merge(features, on='SK_ID_CURR',how='left')\n",
"X_corr = abs(X.corr())\n",
"X_corr.sort_values('TARGET', ascending=False)['TARGET']"
]
},
{
"cell_type": "code",
"execution_count": null,
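The notebook imports `add_features_in_group` from `src.feature_extraction`, which is not shown in this diff. Judging only from its call sites above, it plausibly behaves like the following sketch (signature and behavior inferred from usage, not the repository's exact implementation):

```python
import pandas as pd

def add_features_in_group(features, gr, feature_name, aggs, prefix):
    # Inferred sketch: compute each requested aggregate of one column
    # over a group and store it under a prefixed key, e.g.
    # 'last_10_SK_DPD_mean'. Keys and aggregates match the call sites.
    for agg in aggs:
        key = '{}{}_{}'.format(prefix, feature_name, agg)
        if agg == 'sum':
            features[key] = gr[feature_name].sum()
        elif agg == 'mean':
            features[key] = gr[feature_name].mean()
        elif agg == 'max':
            features[key] = gr[feature_name].max()
        elif agg == 'min':
            features[key] = gr[feature_name].min()
        elif agg == 'median':
            features[key] = gr[feature_name].median()
        elif agg == 'std':
            features[key] = gr[feature_name].std()
        elif agg == 'count':
            features[key] = gr[feature_name].count()
    return features
```

Returning the mutated dict matches the `features = add_features_in_group(features, ...)` chaining style used throughout the notebook cells.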
