This repository has been archived by the owner on Jun 22, 2022. It is now read-only.

New handcrafted features (#102)
* Dynamic features

* Smart features (#61)

* Update README.md

* Update README.md

* Update

* Smart features update

* More descriptive transformer name

* Reading all data in main

* More application features

* Transformer for cleaning

* Multiinput data dictionary

* Fix (#63)

* fixed configs

* dropped redundant steps, moved stuff to cleaning, refactored groupby (#64)

* dropped redundant steps, moved stuff to cleaning, refactored groupby

* restructured, added stacking + CV

* Fix format string

* Update pipeline_manager.py

clipped prediction -> prediction

* added stratified kfold option (#77)

* Update config (#79)

* dropped redundant steps, moved stuff to cleaning, refactored groupby

* restructured, added stacking + CV

* Update pipeline_config.py

* Dev review (#81)

* dropped feature by type split, refactored pipeline_config

* dropped feature by type split method

* explored application features

* trash

* reverted refactor of aggs

* fixed/updated bureau features

* cleared notebooks

* agg features added to notebook bureau

* credit card cleaned

* added other feature notebooks

* added rank mean

* updated model arch

* reverted to old params

* fixed rank mean calculations

* ApplicationCleaning update (#84)

* Cleaning - application

* Clear output in notebook

* cleaned names in steps, refactored mergeaggregate transformer, changed caching/saving specs (#85)

* local trash

* External sources notebook (#86)

* Update

* External sources notebook

* Dev lgbm params (#88)

* local trash

* updated configs

* dropped comment

* updated lgb params

* Dev app agg fix (#90)

* dropped app_aggs

* app agg features fixed

* cleaned leftovers

* dropped fast read-in for debug

* External_sources statistics (#89)

* Speed-up ext_src notebook

* external_sources statistics

* Weighted mean and notebook fix

* application notebook update

* clear notebook output

* Fix auto submission (#95)

* CreditCardBalance monthly diff mean

* POSCASH remaining installments

* POSCASH completed_contracts

* notebook update

* Resolve conflicts

* Fix

* Update neptune.yaml

* Update neptune_random_search.yaml

* Split static and dynamic features - credit card balance
pknut authored and jakubczakon committed Jul 4, 2018
1 parent ea3cff4 commit 9fc12eb
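
The three hand-crafted features named in the commit message (credit card monthly diff mean, POS_CASH remaining installments, POS_CASH completed contracts) can be sketched on toy data roughly as follows. This is a minimal illustration, not the repository's code: the helper function names and all data values are hypothetical, and only the column names and output feature names are taken from the diff in this commit.

```python
import pandas as pd

# Toy stand-ins for the Home Credit tables; all values are invented for illustration.
credit_card = pd.DataFrame({
    'SK_ID_CURR': [1, 1, 1, 2, 2],
    'MONTHS_BALANCE': [-1, -3, -2, -1, -2],
    'AMT_BALANCE': [130.0, 100.0, 150.0, 40.0, 0.0],
})
pos_cash = pd.DataFrame({
    'SK_ID_CURR': [1, 1, 1, 2, 2],
    'MONTHS_BALANCE': [-1, -3, -2, -1, -2],
    'CNT_INSTALMENT_FUTURE': [0.0, 2.0, 1.0, 5.0, 6.0],
    'NAME_CONTRACT_STATUS': ['Completed', 'Active', 'Active', 'Active', 'Active'],
})


def monthly_diff_mean(credit_card):
    """Mean month-over-month change of AMT_BALANCE per client (hypothetical helper)."""
    ordered = credit_card.sort_values(['SK_ID_CURR', 'MONTHS_BALANCE'])
    # diff() leaves NaN on each client's first month; mean() skips it.
    diffs = ordered.groupby('SK_ID_CURR')['AMT_BALANCE'].diff()
    return (diffs.groupby(ordered['SK_ID_CURR']).mean()
            .rename('credit_card_monthly_diff_mean').reset_index())


def remaining_installments(pos_cash):
    """CNT_INSTALMENT_FUTURE from each client's most recent month (hypothetical helper)."""
    ordered = pos_cash.sort_values(['SK_ID_CURR', 'MONTHS_BALANCE'])
    return (ordered.groupby('SK_ID_CURR')['CNT_INSTALMENT_FUTURE'].last()
            .rename('pos_cash_remaining_installments').reset_index())


def completed_contracts(pos_cash):
    """Number of monthly rows with status 'Completed' per client (hypothetical helper)."""
    is_completed = pos_cash['NAME_CONTRACT_STATUS'] == 'Completed'
    return (is_completed.groupby(pos_cash['SK_ID_CURR']).sum()
            .rename('pos_cash_completed_contracts').reset_index())


# One row per SK_ID_CURR; validate= raises if a key is unexpectedly duplicated.
features = (remaining_installments(pos_cash)
            .merge(completed_contracts(pos_cash), on='SK_ID_CURR', validate='one_to_one')
            .merge(monthly_diff_mean(credit_card), on='SK_ID_CURR', how='left',
                   validate='one_to_one'))
print(features)
```

Unlike `POSCASHBalanceFeatures.fit` in this commit, which writes an `is_contract_status_completed` column into the `pos_cash` frame it receives, the sketch keeps the boolean mask in a local variable so the input tables are left unchanged.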
Showing 7 changed files with 215 additions and 15 deletions.
6 changes: 3 additions & 3 deletions neptune.yaml
@@ -1,7 +1,7 @@
project: ORGANIZATION/home-credit

name: home-credit-default-risk
-tags: [solution-3, dev]
+tags: [solution-4, dev]

metric:
channel: 'ROC_AUC'
@@ -32,7 +32,7 @@ parameters:

# Kaggle
kaggle_api: 0
-kaggle_message: 'solution-3'
+kaggle_message: 'solution-4'

# Data preparation
n_cv_splits: 5
@@ -122,4 +122,4 @@ parameters:
svc__max_iter: -1

# Postprocessing
-aggregation_method: rank_mean
+aggregation_method: rank_mean
6 changes: 3 additions & 3 deletions neptune_random_search.yaml
@@ -1,7 +1,7 @@
project: ORGANIZATION/home-credit

name: home-credit-default-risk
-tags: [solution-3]
+tags: [solution-4]

metric:
channel: 'ROC_AUC'
@@ -32,7 +32,7 @@ parameters:

# Kaggle
kaggle_api: 0
-kaggle_message: 'solution-3'
+kaggle_message: 'solution-4'

# Data preparation
n_cv_splits: 5
@@ -122,4 +122,4 @@ parameters:
svc__max_iter: '[-1, 100, 1000, 10000, 50000, "list"]'

# Postprocessing
-aggregation_method: rank_mean
+aggregation_method: rank_mean
19 changes: 15 additions & 4 deletions notebooks/eda-credit_card.ipynb
@@ -345,7 +345,7 @@
"source": [
"groupby_aggregate_names = []\n",
"for groupby_cols, specs in tqdm(CREDIT_CARD_BALANCE_AGGREGATION_RECIPIES):\n",
-" group_object = credit.groupby(groupby_cols)\n",
+" group_object = credit_card.groupby(groupby_cols)\n",
" for select, agg in tqdm(specs):\n",
" groupby_aggregate_name = '{}_{}_{}'.format('_'.join(groupby_cols), agg, select)\n",
" application = application.merge(group_object[select]\n",
@@ -424,14 +424,25 @@
"execution_count": null,
"metadata": {},
"outputs": [],
-"source": []
+"source": [
+"# credit_card_sorted = credit_card.sort_values(['SK_ID_CURR', 'MONTHS_BALANCE'])\n",
+"# credit_card_sorted['credit_card_monthly_diff'] = credit_card_sorted.groupby(\n",
+"# by='SK_ID_CURR')['AMT_BALANCE'].diff()\n",
+"# group_object = credit_card_sorted.groupby(['SK_ID_CURR'])['credit_card_monthly_diff'].agg('mean').reset_index()\n",
+"# group_object.rename(index=str,\n",
+"# columns={'credit_card_monthly_diff': 'credit_card_monthly_diff_mean'},\n",
+"# inplace=True)\n",
+"\n",
+"# features = features.merge(group_object, on=['SK_ID_CURR'], how='left')\n",
+"# features.head()"
+]
}
],
"metadata": {
"kernelspec": {
-"display_name": "cpu py3",
+"display_name": "Python 3",
"language": "python",
-"name": "cpu_py3"
+"name": "python3"
},
"language_info": {
"codemirror_mode": {
98 changes: 95 additions & 3 deletions notebooks/eda-pos_cash_balance.ipynb
@@ -48,7 +48,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
-"## Aggregations"
+"### Aggregations"
]
},
{
@@ -117,6 +117,98 @@
"application_agg_corr.sort_values('TARGET', ascending=False)['TARGET']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Solution 4\n",
"### Hand crafted features"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"features = pd.DataFrame({'SK_ID_CURR': pos_cash_balance['SK_ID_CURR'].unique()})"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pos_cash_sorted = pos_cash_balance.sort_values(['SK_ID_CURR', 'MONTHS_BALANCE'])\n",
"group_object = pos_cash_sorted.groupby('SK_ID_CURR')['CNT_INSTALMENT_FUTURE'].last().reset_index()\n",
"group_object.rename(index=str,\n",
" columns={'CNT_INSTALMENT_FUTURE': 'pos_cash_remaining_installments'},\n",
" inplace=True)\n",
"\n",
"features = features.merge(group_object, on=['SK_ID_CURR'], how='left')\n",
"features.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pos_cash_balance['is_contract_status_completed'] = pos_cash_balance['NAME_CONTRACT_STATUS'] == 'Completed'\n",
"group_object = pos_cash_balance.groupby(['SK_ID_CURR'])['is_contract_status_completed'].sum().reset_index()\n",
"group_object.rename(index=str,\n",
" columns={'is_contract_status_completed': 'pos_cash_completed_contracts'},\n",
" inplace=True)\n",
"features = features.merge(group_object, on=['SK_ID_CURR'], how='left')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"application = application.merge(features,\n",
" left_on=['SK_ID_CURR'],\n",
" right_on=['SK_ID_CURR'],\n",
" how='left',\n",
" validate='one_to_one')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"engineered_numerical_columns = list(features.columns)\n",
"engineered_numerical_columns.remove('SK_ID_CURR')\n",
"credit_eng = application[engineered_numerical_columns + ['TARGET']]\n",
"credit_eng_corr = abs(credit_eng.corr())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"credit_eng_corr.sort_values('TARGET', ascending=False)['TARGET']"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sns.heatmap(credit_eng_corr, \n",
" xticklabels=credit_eng_corr.columns,\n",
" yticklabels=credit_eng_corr.columns)"
]
},
{
"cell_type": "code",
"execution_count": null,
@@ -127,9 +219,9 @@
],
"metadata": {
"kernelspec": {
-"display_name": "cpu py3",
+"display_name": "Python 3",
"language": "python",
-"name": "cpu_py3"
+"name": "python3"
},
"language_info": {
"codemirror_mode": {
4 changes: 2 additions & 2 deletions notebooks/model_exploration.ipynb
@@ -72,9 +72,9 @@
],
"metadata": {
"kernelspec": {
-"display_name": "cpu py3",
+"display_name": "Python 3",
"language": "python",
-"name": "cpu_py3"
+"name": "python3"
},
"language_info": {
"codemirror_mode": {
70 changes: 70 additions & 0 deletions src/feature_extraction.py
@@ -331,6 +331,32 @@ def feature_names(self):
        return feature_names

    def fit(self, X, credit_card, **kwargs):
        static_features = self._static_features(X, credit_card, **kwargs)
        dynamic_features = self._dynamic_features(X, credit_card, **kwargs)

        self.features = pd.merge(static_features,
                                 dynamic_features,
                                 on=['SK_ID_CURR'],
                                 validate='one_to_one')
        return self

    def transform(self, X, **kwargs):
        X = X.merge(self.features,
                    left_on=['SK_ID_CURR'],
                    right_on=['SK_ID_CURR'],
                    how='left',
                    validate='one_to_one')

        return {'numerical_features': X[self.feature_names]}

    def load(self, filepath):
        self.features = joblib.load(filepath)
        return self

    def persist(self, filepath):
        joblib.dump(self.features, filepath)

    def _static_features(self, X, credit_card, **kwargs):
        credit_card['number_of_instalments'] = credit_card.groupby(
            by=['SK_ID_CURR', 'SK_ID_PREV'])['CNT_INSTALMENT_MATURE_CUM'].agg('max').reset_index()[
            'CNT_INSTALMENT_MATURE_CUM']
@@ -374,6 +400,50 @@ def fit(self, X, credit_card, **kwargs):
        features['credit_card_cash_card_ratio'] = features['credit_card_drawings_atm'] / features[
            'credit_card_drawings_total']

        return features

    def _dynamic_features(self, X, credit_card, **kwargs):
        features = pd.DataFrame({'SK_ID_CURR': credit_card['SK_ID_CURR'].unique()})

        credit_card_sorted = credit_card.sort_values(['SK_ID_CURR', 'MONTHS_BALANCE'])
        credit_card_sorted['credit_card_monthly_diff'] = credit_card_sorted.groupby(
            by='SK_ID_CURR')['AMT_BALANCE'].diff()
        group_object = credit_card_sorted.groupby(['SK_ID_CURR'])['credit_card_monthly_diff'].agg('mean').reset_index()
        group_object.rename(index=str,
                            columns={'credit_card_monthly_diff': 'credit_card_monthly_diff_mean'},
                            inplace=True)
        features = features.merge(group_object, on=['SK_ID_CURR'], how='left')

        return features


class POSCASHBalanceFeatures(BaseTransformer):
    def __init__(self, **kwargs):
        self.features = None

    @property
    def feature_names(self):
        feature_names = list(self.features.columns)
        feature_names.remove('SK_ID_CURR')
        return feature_names

    def fit(self, X, pos_cash, **kwargs):
        features = pd.DataFrame({'SK_ID_CURR': pos_cash['SK_ID_CURR'].unique()})

        pos_cash_sorted = pos_cash.sort_values(['SK_ID_CURR', 'MONTHS_BALANCE'])
        group_object = pos_cash_sorted.groupby('SK_ID_CURR')['CNT_INSTALMENT_FUTURE'].last().reset_index()
        group_object.rename(index=str,
                            columns={'CNT_INSTALMENT_FUTURE': 'pos_cash_remaining_installments'},
                            inplace=True)
        features = features.merge(group_object, on=['SK_ID_CURR'], how='left')

        pos_cash['is_contract_status_completed'] = pos_cash['NAME_CONTRACT_STATUS'] == 'Completed'
        group_object = pos_cash.groupby(['SK_ID_CURR'])['is_contract_status_completed'].sum().reset_index()
        group_object.rename(index=str,
                            columns={'is_contract_status_completed': 'pos_cash_completed_contracts'},
                            inplace=True)
        features = features.merge(group_object, on=['SK_ID_CURR'], how='left')

        self.features = features
        return self

27 changes: 27 additions & 0 deletions src/pipeline_blocks.py
@@ -150,6 +150,7 @@ def feature_extraction(config, train_mode, suffix, **kwargs):
application, application_valid = _application(config, train_mode, suffix, **kwargs)
bureau, bureau_valid = _bureau(config, train_mode, suffix, **kwargs)
credit_card_balance, credit_card_balance_valid = _credit_card_balance(config, train_mode, suffix, **kwargs)
pos_cash_balance, pos_cash_balance_valid = _pos_cash_balance(config, train_mode, suffix, **kwargs)

application_agg, application_agg_valid = _application_groupby_agg(config, train_mode, suffix, **kwargs)
bureau_agg, bureau_agg_valid = _bureau_groupby_agg(config, train_mode, suffix, **kwargs)
@@ -181,6 +182,7 @@ def feature_extraction(config, train_mode, suffix, **kwargs):
credit_card_balance,
credit_card_balance_agg,
installments_payments_agg,
pos_cash_balance,
pos_cash_balance_agg,
],
numerical_features_valid=[application_valid,
@@ -191,6 +193,7 @@ def feature_extraction(config, train_mode, suffix, **kwargs):
credit_card_balance_valid,
credit_card_balance_agg_valid,
installments_payments_agg_valid,
pos_cash_balance_valid,
pos_cash_balance_agg_valid,
],
categorical_features=[categorical_encoder
@@ -207,6 +210,7 @@ def feature_extraction(config, train_mode, suffix, **kwargs):
application = _application(config, train_mode, suffix, **kwargs)
bureau = _bureau(config, train_mode, suffix, **kwargs)
credit_card_balance = _credit_card_balance(config, train_mode, suffix, **kwargs)
pos_cash_balance = _pos_cash_balance(config, train_mode, suffix, **kwargs)

application_agg = _application_groupby_agg(config, train_mode, suffix, **kwargs)
bureau_agg = _bureau_groupby_agg(config, train_mode, suffix, **kwargs)
@@ -223,6 +227,7 @@ def feature_extraction(config, train_mode, suffix, **kwargs):
credit_card_balance,
credit_card_balance_agg,
installments_payments_agg,
pos_cash_balance,
pos_cash_balance_agg,
],
numerical_features_valid=[],
@@ -571,6 +576,28 @@ def _credit_card_balance(config, train_mode, suffix, **kwargs):
return credit_card_balance


def _pos_cash_balance(config, train_mode, suffix, **kwargs):
    pos_cash_balance = Step(name='pos_cash_balance_hand_crafted{}'.format(suffix),
                            transformer=fe.POSCASHBalanceFeatures(**config.pos_cash_balance),
                            input_data=['application', 'pos_cash_balance'],
                            adapter=Adapter({'X': E('application', 'X'),
                                             'pos_cash': E('pos_cash_balance', 'X')}),
                            experiment_directory=config.pipeline.experiment_directory,
                            **kwargs)
    if train_mode:
        pos_cash_balance_valid = Step(name='pos_cash_balance__hand_crafted_valid{}'.format(suffix),
                                      transformer=pos_cash_balance,
                                      input_data=['application'],
                                      adapter=Adapter({'X': E('application', 'X_valid')}),
                                      experiment_directory=config.pipeline.experiment_directory,
                                      **kwargs)

        return pos_cash_balance, pos_cash_balance_valid

    else:
        return pos_cash_balance


def _fillna(fillna_value):
    def _inner_fillna(X, X_valid=None):
        if X_valid is None: