This repository has been archived by the owner on Jun 22, 2022. It is now read-only.

Commit

Dev (#134)
* age/employment dummies (#104)

* added diff features

* New handcrafted features (#102)

* Dynamic features

* Smart features (#61)

* Update README.md

* Update README.md

* Update

* Smart features update

* More descriptive transformer name

* Reading all data in main

* More application features

* Transformer for cleaning

* Multi-input data dictionary

* Fix (#63)

* fixed configs

* dropped redundant steps, moved stuff to cleaning, refactored groupby (#64)

* dropped redundant steps, moved stuff to cleaning, refactored groupby

* restructured, added stacking + CV

* Fix format string

* Update pipeline_manager.py

clipped prediction -> prediction

* added stratified kfold option (#77)

* Update config (#79)

* dropped redundant steps, moved stuff to cleaning, refactored groupby

* restructured, added stacking + CV

* Update pipeline_config.py

* Dev review (#81)

* dropped feature by type split, refactored pipeline_config

* dropped feature by type split method

* explored application features

* trash

* reverted refactor of aggs

* fixed/updated bureau features

* cleared notebooks

* agg features added to notebook bureau

* credit card cleaned

* added other feature notebooks

* added rank mean

* updated model arch

* reverted to old params

* fixed rank mean calculations

* ApplicationCleaning update (#84)

* Cleaning - application

* Clear output in notebook

* cleaned names in steps, refactored mergeaggregate transformer, changed caching/saving specs (#85)

* local trash

* External sources notebook (#86)

* Update

* External sources notebook

* Dev lgbm params (#88)

* local trash

* updated configs

* dropped comment

* updated lgb params

* Dev app agg fix (#90)

* dropped app_aggs

* app agg features fixed

* cleaned leftovers

* dropped fast read-in for debug

* External_sources statistics (#89)

* Speed-up ext_src notebook

* external_sources statistics

* Weighted mean and notebook fix

* application notebook update

* clear notebook output

* Fix auto submission (#95)

* CreditCardBalance monthly diff mean

* POSCASH remaining installments

* POSCASH completed_contracts

* notebook update

* Resolve conflicts

* Fix

* Update neptune.yaml

* Update neptune_random_search.yaml

* Split static and dynamic features - credit card balance

* Dev nan count (#105)

* added nan_count

* added nan count with parameter

* Dev fe installments (#106)

* added simple features, parallel groupby, last-installment features

* refactored last_installment features

* added features for the very last installment

* Dev fe instalments dynamic (#107)

* added dynamic-trend features

* formatted configs

* added skew/iqr features

* added number of credit agreement change features (#109)

* added number of credit agreement change features

* reverted sample size

* Dynamic features - previous application (#108)

* previous_application handcrafted features

* previous application cleaning

* Update neptune.yaml

* code improvement

* Update notebook

* Notebook - feature importance (#112)

* Dev speed up (#111)

* refactored aggs to calculate only once per training, sped up installment and credit card (only single index groupby)

* sped up all hand crafted

* fixed bureau worker errors

* fixed installment names

* fixed installment names

* fixed bureau and prev_app naming bugs

* reverted to vectorized where possible

* updated hyperparams

* updated early stopping params to meet convergence

* reverted to old fallback neptune file

* updated paths

* updated paths, explored prev-app features

* dropped duplicated agg

* POS_CASH added features

* POS CASH features added

* POS_CASH_balance feature cleaning

* Yaml adjustment

* Path change
karol.strzalkowski authored and jakubczakon committed Jul 16, 2018
1 parent 68ca3be commit c9c9c69
Showing 6 changed files with 370 additions and 18 deletions.
4 changes: 3 additions & 1 deletion configs/neptune.yaml
@@ -1,7 +1,7 @@
project: ORGANIZATION/home-credit

name: home-credit-default-risk
tags: [solution-4, dev]
tags: [solution-5, dev]

metric:
channel: 'ROC_AUC'
@@ -54,6 +54,8 @@ parameters:
installments__last_k_trend_periods: '[10, 50, 100, 500]'
installments__last_k_agg_periods: '[1, 5, 10, 20, 50, 100]'
installments__last_k_agg_period_fractions: '[(5,20),(5,50),(10,50),(10,100),(20,100)]'
pos_cash__last_k_trend_periods: '[6, 12]'
pos_cash__last_k_agg_periods: '[6, 12, 30]'
application_aggregation__use_diffs_only: True
use_nan_count: True
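The list-valued parameters above (e.g. `pos_cash__last_k_agg_periods: '[6, 12, 30]'`) are stored as YAML strings. A minimal sketch of how such strings can be recovered as Python objects — the parameter values are from the diff, but the parsing helper here is an assumption, not the repository's code:

```python
# Sketch: turning the quoted YAML parameter strings back into Python
# lists/tuples. ast.literal_eval safely evaluates literal expressions.
from ast import literal_eval

pos_cash_last_k_agg_periods = literal_eval('[6, 12, 30]')
agg_period_fractions = literal_eval('[(5,20),(5,50),(10,50),(10,100),(20,100)]')

print(pos_cash_last_k_agg_periods)  # [6, 12, 30]
```

`literal_eval` rejects anything that is not a literal, which makes it a safer choice than `eval` for config values.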

2 changes: 2 additions & 0 deletions configs/neptune_random_search.yaml
@@ -54,6 +54,8 @@ parameters:
installments__last_k_trend_periods: '[10, 50, 100, 500]'
installments__last_k_agg_periods: '[1, 5, 10, 50, 100, 500]'
installments__last_k_agg_period_fractions: '[(5,20),(5,50),(10,50),(10,100),(20,100)]'
pos_cash__last_k_trend_periods: '[6, 12]'
pos_cash__last_k_agg_periods: '[6, 12, 30]'
application_aggregation__use_diffs_only: True
use_nan_count: True

2 changes: 2 additions & 0 deletions configs/neptune_stacking.yaml
@@ -54,6 +54,8 @@ parameters:
installments__last_k_trend_periods: None
installments__last_k_agg_periods: None
installments__last_k_agg_period_fractions: None
pos_cash__last_k_trend_periods: None
pos_cash__last_k_agg_periods: None
application_aggregation__use_diffs_only: True
use_nan_count: True

254 changes: 248 additions & 6 deletions notebooks/eda-pos_cash_balance.ipynb
@@ -7,13 +7,21 @@
"outputs": [],
"source": [
"import os\n",
"import sys\n",
"import pandas as pd\n",
"import numpy as np\n",
"from tqdm import tqdm_notebook as tqdm\n",
"from functools import partial\n",
"from sklearn.externals import joblib\n",
"%matplotlib inline\n",
"import seaborn as sns\n",
"from sklearn.linear_model import LinearRegression\n",
"\n",
"DIR = '/mnt/ml-team/minerva/open-solutions/home-credit'\n",
"sys.path.append('../')\n",
"from src.utils import parallel_apply\n",
"from src.feature_extraction import add_features_in_group\n",
"\n",
"DIR = 'PATH/TO/YOUR/DATA'\n",
"description = pd.read_csv(os.path.join(DIR,'data/HomeCredit_columns_description.csv'),encoding = 'latin1')\n",
"application = pd.read_csv(os.path.join(DIR, 'files/unzipped_data/application_train.csv'))\n",
"pos_cash_balance = pd.read_csv(os.path.join(DIR, 'files/unzipped_data/POS_CASH_balance.csv'))"
@@ -170,11 +178,10 @@
"metadata": {},
"outputs": [],
"source": [
"application = application.merge(features,\n",
" left_on=['SK_ID_CURR'],\n",
" right_on=['SK_ID_CURR'],\n",
"X = application.merge(features, left_on=['SK_ID_CURR'], right_on=['SK_ID_CURR'],\n",
" how='left',\n",
" validate='one_to_one')"
" validate='one_to_one')\n",
"X = X[features.columns.tolist()+['TARGET']]"
]
},
{
@@ -185,7 +192,7 @@
"source": [
"engineered_numerical_columns = list(features.columns)\n",
"engineered_numerical_columns.remove('SK_ID_CURR')\n",
"credit_eng = application[engineered_numerical_columns + ['TARGET']]\n",
"credit_eng = X[engineered_numerical_columns + ['TARGET']]\n",
"credit_eng_corr = abs(credit_eng.corr())"
]
},
@@ -209,6 +216,241 @@
" yticklabels=credit_eng_corr.columns)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"features.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Solution 5\n",
"\n",
"### Hand crafted features"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pos_cash_balance['pos_cash_paid_late'] = (pos_cash_balance['SK_DPD'] > 0).astype(int)\n",
"pos_cash_balance['pos_cash_paid_late_with_tolerance'] = (pos_cash_balance['SK_DPD_DEF'] > 0).astype(int)\n",
"groupby = pos_cash_balance.groupby(['SK_ID_CURR'])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def last_k_installment_features(gr, periods):\n",
" gr_ = gr.copy()\n",
" gr_.sort_values(['MONTHS_BALANCE'], ascending=False, inplace=True)\n",
"\n",
" features = {}\n",
" for period in periods:\n",
" if period > 10e10:\n",
" period_name = 'all_installment_'\n",
" gr_period = gr_.copy()\n",
" else:\n",
" period_name = 'last_{}_'.format(period)\n",
" gr_period = gr_.iloc[:period]\n",
"\n",
" features = add_features_in_group(features, gr_period, 'pos_cash_paid_late',\n",
" ['count', 'mean'],\n",
" period_name)\n",
" features = add_features_in_group(features, gr_period, 'pos_cash_paid_late_with_tolerance',\n",
" ['count', 'mean'],\n",
" period_name)\n",
" features = add_features_in_group(features, gr_period, 'SK_DPD',\n",
" ['sum', 'mean', 'max', 'min', 'median'],\n",
" period_name)\n",
" features = add_features_in_group(features, gr_period, 'SK_DPD_DEF',\n",
" ['sum', 'mean', 'max', 'min','median'],\n",
" period_name)\n",
" return features"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"features = pd.DataFrame({'SK_ID_CURR': pos_cash_balance['SK_ID_CURR'].unique()})\n",
"func = partial(last_k_installment_features, periods=[1, 10, 50, 10e16])\n",
"g = parallel_apply(groupby, func, index_name='SK_ID_CURR', num_workers=10, chunk_size=10000).reset_index()\n",
"features = features.merge(g, on='SK_ID_CURR', how='left')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"features.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"X = X.merge(features, on='SK_ID_CURR',how='left')\n",
"X.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Last loan features"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def last_loan_features(gr):\n",
" gr_ = gr.copy()\n",
" gr_.sort_values(['MONTHS_BALANCE'], ascending=False, inplace=True)\n",
" last_installment_id = gr_['SK_ID_PREV'].iloc[0]\n",
" gr_ = gr_[gr_['SK_ID_PREV'] == last_installment_id]\n",
"\n",
" features={}\n",
" features = add_features_in_group(features, gr_, 'pos_cash_paid_late',\n",
" ['count', 'sum', 'mean'],\n",
" 'last_loan_')\n",
" features = add_features_in_group(features, gr_, 'pos_cash_paid_late_with_tolerance',\n",
" ['sum', 'mean'],\n",
" 'last_loan_')\n",
" features = add_features_in_group(features, gr_, 'SK_DPD',\n",
" ['sum', 'mean', 'max', 'min', 'std'],\n",
" 'last_loan_')\n",
" features = add_features_in_group(features, gr_, 'SK_DPD_DEF',\n",
" ['sum', 'mean', 'max', 'min', 'std'],\n",
" 'last_loan_')\n",
" return features"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"features = pd.DataFrame({'SK_ID_CURR': pos_cash_balance['SK_ID_CURR'].unique()})\n",
"g = parallel_apply(groupby, last_loan_features, index_name='SK_ID_CURR', num_workers=10, chunk_size=10000).reset_index()\n",
"features = features.merge(g, on='SK_ID_CURR', how='left')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"features.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"X = X.merge(features, on='SK_ID_CURR',how='left')\n",
"X.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Trend features"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def trend_in_last_k_installment_features(gr, periods):\n",
" gr_ = gr.copy()\n",
" gr_.sort_values(['MONTHS_BALANCE'], ascending=False, inplace=True)\n",
"\n",
" features = {}\n",
" for period in periods:\n",
" gr_period = gr_.iloc[:period]\n",
"\n",
" features = add_trend_feature(features, gr_period,\n",
" 'SK_DPD', '{}_period_trend_'.format(period)\n",
" )\n",
" features = add_trend_feature(features, gr_period,\n",
" 'SK_DPD_DEF', '{}_period_trend_'.format(period)\n",
" )\n",
" return features\n",
"\n",
"def add_trend_feature(features, gr, feature_name, prefix):\n",
" y = gr[feature_name].values\n",
" try:\n",
" x = np.arange(0, len(y)).reshape(-1, 1)\n",
" lr = LinearRegression()\n",
" lr.fit(x, y)\n",
" trend = lr.coef_[0]\n",
" except:\n",
" trend = np.nan\n",
" features['{}{}'.format(prefix, feature_name)] = trend\n",
" return features"
]
},
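The trend cell above extracts the slope of a least-squares line fitted through the last `k` values with `LinearRegression`. For reference, the same slope can be computed more cheaply with `np.polyfit` — a sketch of the idea, not the repository's code:

```python
import numpy as np

def trend(values):
    # Slope of a least-squares line through the sequence; equivalent to
    # LinearRegression().fit(x, y).coef_[0] in the notebook cell above.
    y = np.asarray(values, dtype=float)
    if len(y) < 2:
        return np.nan  # a single point has no defined trend
    x = np.arange(len(y))
    return np.polyfit(x, y, deg=1)[0]
```

A strictly increasing series like `[0, 1, 2, 3]` yields a slope of 1.0; degenerate groups fall back to NaN, mirroring the `try/except` in the notebook.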
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"features = pd.DataFrame({'SK_ID_CURR': pos_cash_balance['SK_ID_CURR'].unique()})\n",
"func = partial(trend_in_last_k_installment_features, periods=[1,6,12,30,60])\n",
"g = parallel_apply(groupby, func, index_name='SK_ID_CURR', num_workers=10, chunk_size=10000).reset_index()\n",
"features = features.merge(g, on='SK_ID_CURR', how='left')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"features.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"X = X.merge(features, on='SK_ID_CURR',how='left')\n",
"X_corr = abs(X.corr())\n",
"X_corr.sort_values('TARGET', ascending=False)['TARGET']"
]
},
{
"cell_type": "code",
"execution_count": null,
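The notebook imports `add_features_in_group` from `src.feature_extraction`, which is not shown in this diff. Judging only from its call sites above, it plausibly behaves like the following sketch (signature and behavior inferred from usage, not the repository's exact implementation):

```python
import pandas as pd

def add_features_in_group(features, gr, feature_name, aggs, prefix):
    # Inferred sketch: compute each requested aggregate of one column
    # over a group and store it under a prefixed key, e.g.
    # 'last_10_SK_DPD_mean'. Keys and aggregates match the call sites.
    for agg in aggs:
        key = '{}{}_{}'.format(prefix, feature_name, agg)
        if agg == 'sum':
            features[key] = gr[feature_name].sum()
        elif agg == 'mean':
            features[key] = gr[feature_name].mean()
        elif agg == 'max':
            features[key] = gr[feature_name].max()
        elif agg == 'min':
            features[key] = gr[feature_name].min()
        elif agg == 'median':
            features[key] = gr[feature_name].median()
        elif agg == 'std':
            features[key] = gr[feature_name].std()
        elif agg == 'count':
            features[key] = gr[feature_name].count()
    return features
```

Returning the mutated dict matches the `features = add_features_in_group(features, ...)` chaining style used throughout the notebook cells.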
