The intention of this analysis is to understand the programmatic process of analyzing publicly available credit data. We will proceed with basic EDA and preprocessing of the data.
The models we have chosen are deliberately very different in nature so that we can gain insight into how each performs on this dataset. They are a Light Gradient Boosting Machine (LightGBM) and an artificial neural network.
The performance metrics we use are primarily the Area Under the ROC Curve (AUC), the Gini coefficient (which equals 2 × AUC − 1), and the Kolmogorov-Smirnov (KS) statistic. We also include the rank-order chart produced by each model, and a final set of visualizations per model so that the output metrics can be compared visually.
When processing data through Python and Jupyter, we can customize our plotting output, and we also need to import the correct library dependencies. We have done this below; only the color-scale configuration is shown.
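A minimal sketch of that setup cell, assuming the usual pandas/matplotlib/seaborn stack; the specific palette and figure size are illustrative, not the notebook's exact settings.

```python
# Core dependencies for the analysis (the exact set used is assumed here).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Customize the plotting output; the palette and figure size below are
# placeholders rather than the original notebook's settings.
sns.set_theme(style="whitegrid")
sns.set_palette("viridis")
plt.rcParams["figure.figsize"] = (10, 6)
```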
Below, we are loading in specific functions for this analysis.
We also review the top 5 rows of the training and testing data. There appear to be 191 attributes, including the target variable, which indicates whether the observation defaulted (1 = yes, 0 = no).
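A sketch of that loading-and-preview step, assuming CSV inputs; the file paths below are placeholders, not the actual data locations.

```python
# Load the training and testing data; these file paths are placeholders.
train = pd.read_csv("credit_train.csv")
test = pd.read_csv("credit_test.csv")

# Confirm the attribute counts and review the top 5 rows of each set.
print(train.shape, test.shape)
display(train.head())
display(test.head())
```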
Starting our EDA, we review the sparsity of the data as well as any superficial information we can gain without going too deep into the analysis process.
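A sketch of a simple sparsity check along those lines, assuming the training frame is named `train`:

```python
# Share of missing values per attribute, sparsest columns first.
missing_share = train.isnull().mean().sort_values(ascending=False)
print(missing_share.head(20))

# Overall fraction of missing cells in the training data.
print(f"Overall missing share: {train.isnull().mean().mean():.2%}")
```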
Looking at the distribution of credit amount below, we initially notice that the data is right-skewed, with most observations having credit amounts below 1,000,000.
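A sketch of that plot; the column name `AMT_CREDIT` is an assumption about the schema:

```python
# Histogram of credit amount; expect a long right tail.
sns.histplot(train["AMT_CREDIT"], bins=50)
plt.title("Distribution of credit amount")
plt.xlabel("Credit amount")
plt.show()
```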
Target variable: below we can see that ~92% of observations do not show default, while only ~8% do. This is broadly in line with the national average of roughly 10%. It also means the dataset is imbalanced: the negative class is about 11.5 times larger than the positive class and therefore dominates the training signal.
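A sketch of the class-balance check; the column name `TARGET` is an assumption:

```python
# Class balance of the target (expect roughly 92% non-default vs 8% default).
print(train["TARGET"].value_counts(normalize=True))

sns.countplot(x="TARGET", data=train)
plt.title("Default (1) vs non-default (0)")
plt.show()
```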
Before running the data through the models, we first need to preprocess it; this is particularly important for the neural network.
The preprocessing steps we undertake here include encoding of the categorical variables followed by embedding.
We can see that there are 16 categorical variables and 173 strictly numeric variables.
For simplicity, the categorical variables are listed below.
The counts of unique values per attribute are also listed. Some attributes, such as organization type, have 58 possible categories, while attributes such as gender, education type, and housing type have only between 3 and 5.
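A sketch of how the categorical attributes can be identified, profiled, and integer-encoded ahead of the embedding step; the dtype-based selection and `LabelEncoder` approach are assumptions about the original code:

```python
from sklearn.preprocessing import LabelEncoder

# Identify categorical attributes by dtype (16 expected) and list their
# unique-value counts, e.g. organization type with 58 categories.
cat_cols = train.select_dtypes(include="object").columns.tolist()
print(len(cat_cols), "categorical attributes")
print(train[cat_cols].nunique().sort_values(ascending=False))

# Integer-encode each categorical column so it can feed an embedding layer;
# fitting on train and test together keeps the category indices consistent.
encoders = {}
for col in cat_cols:
    enc = LabelEncoder()
    enc.fit(pd.concat([train[col], test[col]]).astype(str))
    train[col] = enc.transform(train[col].astype(str))
    test[col] = enc.transform(test[col].astype(str))
    encoders[col] = enc
```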
We are using an 80/20 train/test split for this analysis. Because this is only an overview analysis, we will not be running a full pipeline with cross-validation. The output of the train/test split can be seen below.
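A sketch of that split, stratified on the target to preserve the ~92/8 class balance; the column name `TARGET` and the random seed are assumptions:

```python
from sklearn.model_selection import train_test_split

X = train.drop(columns=["TARGET"])
y = train["TARGET"]

# 80/20 split, stratified so both partitions keep the same default rate.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)
```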
This section is a rough duplicate of the section above, with one change: while the train/test split above uses the full set of attributes, the data below incorporates only the 25 most important features as ranked by Shapley (SHAP) values. These are listed in the code below.
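A sketch of how that top-25 selection can be derived with the shap library, assuming a LightGBM model fitted on the full attribute set is already available (called `lgbm` here); the variable names are placeholders:

```python
import shap

# Rank features by mean absolute SHAP value from an already-fitted LightGBM
# model and keep the 25 most important ones.
explainer = shap.TreeExplainer(lgbm)
shap_values = explainer.shap_values(X_train)

# For binary classifiers, older shap versions return a list [class 0, class 1].
vals = shap_values[1] if isinstance(shap_values, list) else shap_values
importance = np.abs(vals).mean(axis=0)
top25 = X_train.columns[np.argsort(importance)[::-1][:25]].tolist()
print(top25)

X_train_top, X_test_top = X_train[top25], X_test[top25]
```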
For the neural network we decided to use a four-layer dense architecture with ReLU activations.
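A minimal Keras sketch of such a network; the layer widths, optimizer, and epoch count are assumptions, and for simplicity the encoded features are fed straight into the dense layers rather than through per-column embedding layers:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Four dense hidden layers with ReLU activations and a sigmoid output for the
# binary default target. The widths and training settings are placeholders.
nn = keras.Sequential([
    layers.Input(shape=(X_train.shape[1],)),
    layers.Dense(256, activation="relu"),
    layers.Dense(128, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dense(32, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
nn.compile(
    optimizer="adam",
    loss="binary_crossentropy",
    metrics=[keras.metrics.AUC(name="auc")],
)
history = nn.fit(
    X_train, y_train,
    validation_data=(X_test, y_test),
    epochs=20, batch_size=1024, verbose=1,
)
```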
We achieve fairly close convergence of the train and test loss, which differ by approximately 0.003. The preliminary AUC for this test run is 73.5%, and the model took roughly 1 minute to train.
Below we build the LightGBM model. This model trains very quickly, taking only 0.18 minutes (roughly 11 seconds).
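A sketch of the LightGBM fit; the hyperparameters shown are placeholders rather than the tuned values behind the reported run:

```python
import lightgbm as lgb

# Gradient-boosted trees on the same train/test split used above.
lgbm = lgb.LGBMClassifier(
    n_estimators=500,
    learning_rate=0.05,
    num_leaves=31,
    objective="binary",
    random_state=42,
)
lgbm.fit(X_train, y_train, eval_set=[(X_test, y_test)], eval_metric="auc")
lgb_pred = lgbm.predict_proba(X_test)[:, 1]
```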
The results are better than the neural network's by roughly 3 percentage points, with an AUC of about 76.1%.
Before moving on to the main metric visualizations, we print the metrics below. The first set is for the train and test data on the neural network; the second set is for the LightGBM model.
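A sketch of how these metrics can be computed from the predicted probabilities; the prediction variable names (`nn`, `lgb_pred`) follow the earlier sketches and are assumptions:

```python
from sklearn.metrics import roc_auc_score
from scipy.stats import ks_2samp

def summarize_metrics(y_true, y_prob, label):
    """Print AUC, Gini (2*AUC - 1), and the KS statistic for one set of scores."""
    auc = roc_auc_score(y_true, y_prob)
    gini = 2 * auc - 1
    ks = ks_2samp(y_prob[y_true == 1], y_prob[y_true == 0]).statistic
    print(f"{label}: AUC={auc:.3f}  Gini={gini:.3f}  KS={ks:.3f}")

nn_pred = nn.predict(X_test).ravel()
summarize_metrics(y_test.to_numpy(), nn_pred, "Neural network (test)")
summarize_metrics(y_test.to_numpy(), lgb_pred, "LightGBM (test)")
```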
The features below are those used in the second iteration of the models.