The intention of this analysis is to understand the programmatic process of analyzing publicly available credit data. We will proceed with basic EDA and preprocessing of the data.
The models we have chosen are deliberately very different in nature so that we can gain insight into how each performs on this dataset. They are a Light Gradient Boosting Machine (LightGBM) and an artificial neural network.
The performance metrics we use are primarily the Area Under the ROC Curve (AUC), the Gini coefficient (which equals 2 × AUC − 1), and the Kolmogorov-Smirnov (KS) statistic. We also include the rank-order chart produced by each model, and a final set of visualizations per model so that the output metrics can be compared visually.
When processing data through Python and Jupyter, we can customize our plotting output, and we also need to import the correct library dependencies. We have done this below; only the color-scale configuration is shown.
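A minimal sketch of that setup cell, assuming the usual pandas/matplotlib/seaborn stack; the specific palette and figure size are illustrative, not the notebook's exact settings.

```python
# Core dependencies for the analysis (the exact set used is assumed here).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Customize the plotting output; the palette and figure size below are
# placeholders rather than the original notebook's settings.
sns.set_theme(style="whitegrid")
sns.set_palette("viridis")
plt.rcParams["figure.figsize"] = (10, 6)
```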
Below, we are loading in specific functions for this analysis.
We also review the top 5 rows of the training and testing data. There appear to be 191 attributes, including the target variable, which indicates whether the observation defaulted (1 = yes, 0 = no).
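A sketch of that loading-and-preview step, assuming CSV inputs; the file paths below are placeholders, not the actual data locations.

```python
# Load the training and testing data; these file paths are placeholders.
train = pd.read_csv("credit_train.csv")
test = pd.read_csv("credit_test.csv")

# Confirm the attribute counts and review the top 5 rows of each set.
print(train.shape, test.shape)
display(train.head())
display(test.head())
```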
Starting our EDA, we review the sparsity of the data as well as any superficial information we can gain without going too deep into the analysis process.
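A sketch of a simple sparsity check along those lines, assuming the training frame is named `train`:

```python
# Share of missing values per attribute, sparsest columns first.
missing_share = train.isnull().mean().sort_values(ascending=False)
print(missing_share.head(20))

# Overall fraction of missing cells in the training data.
print(f"Overall missing share: {train.isnull().mean().mean():.2%}")
```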
Looking at the distribution of credit amount below, we initially notice that the data is right-skewed, with most observations having credit amounts below 1,000,000.
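A sketch of that plot; the column name `AMT_CREDIT` is an assumption about the schema:

```python
# Histogram of credit amount; expect a long right tail.
sns.histplot(train["AMT_CREDIT"], bins=50)
plt.title("Distribution of credit amount")
plt.xlabel("Credit amount")
plt.show()
```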
Target variable: below we can see that ~92% of observations do not show default, while only ~8% do. This is broadly in line with the national average of roughly 10%. It also means the dataset is imbalanced: the negative class is about 11.5 times larger than the positive class and therefore dominates the training signal.
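A sketch of the class-balance check; the column name `TARGET` is an assumption:

```python
# Class balance of the target (expect roughly 92% non-default vs 8% default).
print(train["TARGET"].value_counts(normalize=True))

sns.countplot(x="TARGET", data=train)
plt.title("Default (1) vs non-default (0)")
plt.show()
```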
Before running the data through the models, we first need to preprocess it; this is particularly important for the neural network.
The preprocessing steps we undertake here include encoding of the categorical variables followed by embedding.
We can see that there are 16 categorical variables and 173 strictly numeric variables.
For simplicity, the categorical variables are listed below.
The counts of unique values per attribute are also listed. Some attributes, such as organization type, have 58 possible categories, while attributes such as gender, education type, and housing type have only between 3 and 5.
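A sketch of how the categorical attributes can be identified, profiled, and integer-encoded ahead of the embedding step; the dtype-based selection and `LabelEncoder` approach are assumptions about the original code:

```python
from sklearn.preprocessing import LabelEncoder

# Identify categorical attributes by dtype (16 expected) and list their
# unique-value counts, e.g. organization type with 58 categories.
cat_cols = train.select_dtypes(include="object").columns.tolist()
print(len(cat_cols), "categorical attributes")
print(train[cat_cols].nunique().sort_values(ascending=False))

# Integer-encode each categorical column so it can feed an embedding layer;
# fitting on train and test together keeps the category indices consistent.
encoders = {}
for col in cat_cols:
    enc = LabelEncoder()
    enc.fit(pd.concat([train[col], test[col]]).astype(str))
    train[col] = enc.transform(train[col].astype(str))
    test[col] = enc.transform(test[col].astype(str))
    encoders[col] = enc
```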
We are using an 80/20 train/test split for this analysis. Because this is only an overview analysis, we will not be running a full pipeline with cross-validation. The output of the train/test split can be seen below.
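A sketch of that split, stratified on the target to preserve the ~92/8 class balance; the column name `TARGET` and the random seed are assumptions:

```python
from sklearn.model_selection import train_test_split

X = train.drop(columns=["TARGET"])
y = train["TARGET"]

# 80/20 split, stratified so both partitions keep the same default rate.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)
```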
This section is a rough duplicate of the section above, with one change: while the train/test split above uses the full set of attributes, the data below incorporates only the 25 most important features as ranked by Shapley (SHAP) values. These are listed in the code below.
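A sketch of how that top-25 selection can be derived with the shap library, assuming a LightGBM model fitted on the full attribute set is already available (called `lgbm` here); the variable names are placeholders:

```python
import shap

# Rank features by mean absolute SHAP value from an already-fitted LightGBM
# model and keep the 25 most important ones.
explainer = shap.TreeExplainer(lgbm)
shap_values = explainer.shap_values(X_train)

# For binary classifiers, older shap versions return a list [class 0, class 1].
vals = shap_values[1] if isinstance(shap_values, list) else shap_values
importance = np.abs(vals).mean(axis=0)
top25 = X_train.columns[np.argsort(importance)[::-1][:25]].tolist()
print(top25)

X_train_top, X_test_top = X_train[top25], X_test[top25]
```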
For the neural network we decided to use a four-layer dense architecture with ReLU activations.
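A minimal Keras sketch of such a network; the layer widths, optimizer, and epoch count are assumptions, and for simplicity the encoded features are fed straight into the dense layers rather than through per-column embedding layers:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Four dense hidden layers with ReLU activations and a sigmoid output for the
# binary default target. The widths and training settings are placeholders.
nn = keras.Sequential([
    layers.Input(shape=(X_train.shape[1],)),
    layers.Dense(256, activation="relu"),
    layers.Dense(128, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dense(32, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
nn.compile(
    optimizer="adam",
    loss="binary_crossentropy",
    metrics=[keras.metrics.AUC(name="auc")],
)
history = nn.fit(
    X_train, y_train,
    validation_data=(X_test, y_test),
    epochs=20, batch_size=1024, verbose=1,
)
```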
We achieve fairly close convergence of the train and test loss, which differ by approximately 0.003. The preliminary AUC for this test run is 73.5%, and the model took roughly 1 minute to train.
Below we build the LightGBM model. This model trains very quickly, taking only 0.18 minutes (roughly 11 seconds).
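A sketch of the LightGBM fit; the hyperparameters shown are placeholders rather than the tuned values behind the reported run:

```python
import lightgbm as lgb

# Gradient-boosted trees on the same train/test split used above.
lgbm = lgb.LGBMClassifier(
    n_estimators=500,
    learning_rate=0.05,
    num_leaves=31,
    objective="binary",
    random_state=42,
)
lgbm.fit(X_train, y_train, eval_set=[(X_test, y_test)], eval_metric="auc")
lgb_pred = lgbm.predict_proba(X_test)[:, 1]
```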
The results are better than the neural network's by roughly 3 percentage points, with an AUC of about 76.1%.
Before moving on to the main metric visualizations, we print the metrics below. The first set is for the train and test data on the neural network; the second set is for the LightGBM model.
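A sketch of how these metrics can be computed from the predicted probabilities; the prediction variable names (`nn`, `lgb_pred`) follow the earlier sketches and are assumptions:

```python
from sklearn.metrics import roc_auc_score
from scipy.stats import ks_2samp

def summarize_metrics(y_true, y_prob, label):
    """Print AUC, Gini (2*AUC - 1), and the KS statistic for one set of scores."""
    auc = roc_auc_score(y_true, y_prob)
    gini = 2 * auc - 1
    ks = ks_2samp(y_prob[y_true == 1], y_prob[y_true == 0]).statistic
    print(f"{label}: AUC={auc:.3f}  Gini={gini:.3f}  KS={ks:.3f}")

nn_pred = nn.predict(X_test).ravel()
summarize_metrics(y_test.to_numpy(), nn_pred, "Neural network (test)")
summarize_metrics(y_test.to_numpy(), lgb_pred, "LightGBM (test)")
```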
The features below are those used in the second iteration of the models.