Feel free to skip ahead to the VIZ section for a quick overview and then come back to the weedier details (the link is in the TOC: EDA -> VIZ). Here's a brief glimpse of what you'll find...
This analysis focuses on Human Resources analytics. The main intent is to review the historical dataset and surface insights through modern programming and visualization techniques. I also include a rudimentary predictive model that can be used to analyze important attributes and predict churn cases.
Questions we will answer include:
- In what states do most of our employees reside?
- What is the organizational composition by position?
- What are the top reasons employees leave?
- Are employees more or less likely to churn under certain managers?
- Which positions have the highest associated churn?
The dataset used for this analysis is available online; the link for this is included below.
Source: https://www.kaggle.com/davidepolizzi/hr-data-set-based-on-human-resources-data-set
In starting this analysis, the main libraries that we use are pandas, numpy, and plotly. There is also a simple function used specifically for jupyter.
To initiate the analysis we import the data. The filenames within the data directory are visible below.
Also, as with an Excel document, the shape (rows & columns) and the column/attribute names are visible. There are a total of 36 initial attributes within this dataset.
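As a minimal sketch, loading and inspecting the data might look like the following. The file path is a hypothetical stand-in, and the tiny frame below uses only a few illustrative columns rather than the full 36:

```python
import pandas as pd

# Hypothetical path -- the actual filename in the data directory may differ:
# df = pd.read_csv('data/HRDataset.csv')

# Tiny stand-in frame with a few illustrative attributes.
df = pd.DataFrame({
    'Employee_Name': ['A. Smith', 'B. Jones'],
    'State': ['MA', 'TX'],
    'PayRate': [23.0, 55.0],
})

print(df.shape)         # (rows, columns) -- the real data is (3310, 36)
print(list(df.columns))
```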
Within most programming-oriented analyses, it is often much easier to declare certain functions up front, for both ease of use and consistency of use. While this analysis uses only one function, which is visible below, there would usually be several functions in play.
In starting the analysis process, we see that on the bottom left there are 3310 observations or employees in this dataset. There are also 36 columns/attributes. We see employee names, marriage status, date of hire, and date of termination.
Given the information on the left we can also see the null value count and the data types of each attribute.
Looking further at the null values, we see that some attributes have a very high null count, such as zip code, performance review date, and days late. The percentage of missing values in these columns is 90% and above. Null values are common and can be dealt with through blanket imputation, column similarity, or the consultation of a subject matter expert (SME).
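A quick way to surface both the raw null counts and the percentage missing per column is sketched below, on a small made-up frame (the column names mirror the dataset's but are stand-ins here):

```python
import numpy as np
import pandas as pd

# Illustrative frame; the real dataset has ~3,310 rows and 36 columns.
df = pd.DataFrame({
    'Zip': [np.nan, np.nan, '02134', np.nan],
    'PerformanceReviewDate': [np.nan, np.nan, np.nan, np.nan],
    'PayRate': [23.0, 55.0, 31.0, 24.0],
})

null_counts = df.isnull().sum()
null_pct = (null_counts / len(df) * 100).round(1)
print(pd.DataFrame({'nulls': null_counts, 'pct_missing': null_pct}))
```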
One of the high-null-value columns is 'DateofTermination'. This suggests that only 71% of employees in this dataset are still 'Active'. We will investigate this further.
Below, we look at the reported reasoning for why employees leave or are terminated. The most notable are:
1. Another Position
2. Unhappy
3. More Money
Given this information, the company could put countermeasures in place that combat the top reasons for leaving. These could include higher pay, training, or letting employees take on work that better utilizes their core skill sets.
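Counting reasons only among employees who actually left can be done with a filter plus `value_counts`. The column names ('TermReason', 'EmploymentStatus') and the values below are assumptions for illustration:

```python
import pandas as pd

# Hypothetical rows mirroring the dataset's termination fields.
df = pd.DataFrame({
    'TermReason': ['Another position', 'unhappy', 'more money',
                   'Another position', None, 'Another position'],
    'EmploymentStatus': ['Voluntarily Terminated'] * 4
                        + ['Active', 'Voluntarily Terminated'],
})

# Count reasons among employees who actually left.
left = df[~df['EmploymentStatus'].isin(['Active', 'Leave of Absence'])]
reason_counts = left['TermReason'].value_counts()
print(reason_counts)
```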
While reviewing the available termination values, it became apparent that there are some data consistency issues. Firstly, we can see above that there are two illustrations counting reasons for why employees leave. The one on the left is derived from counts that exclude currently 'Active' or 'Leave of Absence' employees. It's clear that this is different from the illustration on the right, which is derived when setting the 'Termd' column equal to 1. There must be a set reason for this inconsistency, but since we don't have someone to directly clarify this situation, we need to figure out which is 'right' to use in our analysis.
Furthermore, when looking at the dataframe to the left, we see employees marked as terminated (Termd=1) that nonetheless have no termination date, or whose employment status remains 'Active'.
Because of this ambiguity, when classifying later in this analysis, we derive the 'Churn' attribute using the logic noted above and to the left.
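The exact rule lives in the figures referenced above, but one plausible sketch of such a derivation is to require all three signals to agree before flagging churn. The column names and rule below are assumptions for illustration:

```python
import pandas as pd

# Illustrative rows showing the inconsistency: Termd=1 but still 'Active'.
df = pd.DataFrame({
    'Termd': [1, 1, 0, 1],
    'DateofTermination': ['2016-05-01', None, None, '2015-09-12'],
    'EmploymentStatus': ['Voluntarily Terminated', 'Active', 'Active',
                         'Terminated for Cause'],
})

# One plausible derivation (an assumption -- the notebook's exact rule is
# shown in its figures): flag churn only when all three signals agree.
df['Churn'] = (
    (df['Termd'] == 1)
    & df['DateofTermination'].notna()
    & (df['EmploymentStatus'] != 'Active')
).astype(int)

print(df[['Termd', 'EmploymentStatus', 'Churn']])
```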
Data types are important to know when running an analysis, since each data type needs to be handled differently. For instance, we cannot run statistics on text values, and we cannot perform time series analysis if our date values are in an unfamiliar format. Throughout this analysis we will run through the data conversions necessary for analysis.
Below, we can see the number of unique values per categorical type attribute.
After looking at the null values in certain columns, it is possible to replace them with a baseline value. This is carried out in the code below.
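A baseline imputation typically uses `fillna` with a sensible default per column. The column names and the chosen baselines here are assumptions for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'DaysLateLast30': [np.nan, 2, np.nan],
    'TermReason': [np.nan, 'more money', np.nan],
})

# Baseline imputation: 0 for a numeric count, a sentinel label for text.
df['DaysLateLast30'] = df['DaysLateLast30'].fillna(0)
df['TermReason'] = df['TermReason'].fillna('N/A - still employed')
print(df)
```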
In this section we are simply converting date and time attributes.
We can see that after setting the category data types that we now need to further convert to datetime data types.
We can also see that after converting the datetimes, there was an issue: some dates landed in the future. Since time travel has not yet been invented, this has been corrected.
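This kind of issue usually comes from two-digit years, which `pd.to_datetime` can roll into the 2000s (e.g. '68' becoming 2068). One common fix, sketched here on made-up dates, is to subtract a century from anything in the future:

```python
import pandas as pd

df = pd.DataFrame({'DOB': ['05/14/68', '11/02/85', '07/21/01']})

# Two-digit years can parse into the future (e.g. '68' -> 2068);
# since time travel is unavailable, roll those back a century.
dob = pd.to_datetime(df['DOB'], format='%m/%d/%y')
future = dob > pd.Timestamp.now()
df['DOB'] = dob.where(~future, dob - pd.DateOffset(years=100))
print(df['DOB'].dt.year.tolist())
```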
By looking at the initial data, we see two duplicated names. Upon reviewing the attributes associated with these duplicates, we can see they are not exact copies; for instance, the date-of-birth values differ. Because of this, we decide not to drop these rows.
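The check can be sketched with `duplicated`: flag repeated names, then confirm whether the full rows actually match. The column names below are assumptions:

```python
import pandas as pd

df = pd.DataFrame({
    'Employee_Name': ['J. Doe', 'J. Doe', 'K. Lee'],
    'DOB': ['1980-01-01', '1975-06-30', '1990-03-15'],
})

# Flag repeated names, then check whether the full rows really match.
name_dupes = df[df.duplicated(subset='Employee_Name', keep=False)]
exact_dupes = df.duplicated().sum()
print(len(name_dupes), exact_dupes)   # 2 name matches, 0 exact copies
```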
For this analysis I decided to create two additional columns, 'Churn' and 'Age'. This can be seen below.
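For the 'Age' column, a common derivation is whole years elapsed since date of birth. The reference date and column name below are assumptions (the notebook may compute age relative to 'now' instead):

```python
import pandas as pd

df = pd.DataFrame({'DOB': pd.to_datetime(['1980-06-15', '1955-01-02'])})

# Age in whole years as of a fixed reference date (an assumption).
ref = pd.Timestamp('2018-01-01')
df['Age'] = ((ref - df['DOB']).dt.days // 365.25).astype(int)
print(df['Age'].tolist())
```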
Several columns were dropped, including Zip, EmployeeName, and ID.
We also drop the few rows indicating that an employee had not yet started working with the company.
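Both drops can be sketched in a couple of lines; the column names and the "not yet started" signal (a missing hire date) are assumptions for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'Employee_Name': ['A', 'B', 'C'],
    'Zip': ['02134', None, '73301'],
    'EmpID': [1, 2, 3],
    'DateofHire': pd.to_datetime(['2015-01-05', '2016-03-10', None]),
    'PayRate': [23.0, 55.0, 31.0],
})

# Drop identifying / mostly-null columns (names are assumptions).
df = df.drop(columns=['Zip', 'Employee_Name', 'EmpID'])

# Drop rows for employees who never actually started.
df = df[df['DateofHire'].notna()].reset_index(drop=True)
print(df.shape)
```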
There are several categorical attributes in this dataset. In order to pass these into the model, we first need to encode the categorical values. There are several ways to do this, such as simple label encoding and one-hot encoding. For this analysis we use label encoding. The columns to be encoded are listed below.
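A lightweight way to label-encode in pandas is via categorical codes; sklearn's `LabelEncoder` would yield equivalent integer labels. The column names and values here are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    'Department': ['Production', 'IT/IS', 'Production', 'Sales'],
    'Position': ['Technician', 'DB Admin', 'Technician', 'Area Sales Manager'],
})

# Simple label encoding via pandas categorical codes.
for col in ['Department', 'Position']:
    df[col] = df[col].astype('category').cat.codes

print(df)
```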
The correlation plot below shows that attributes like payrate and project count are positively correlated. We also see a weak negative correlation between churn and payrate, and again between churn and special project count.
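The underlying correlation matrix behind such a plot is just `df.corr()`; the tiny frame below is made up, but its signs mirror the relationships described above:

```python
import pandas as pd

# Illustrative numbers chosen so the signs match the text above.
df = pd.DataFrame({
    'PayRate': [20, 35, 50, 65],
    'SpecialProjectsCount': [0, 1, 4, 6],
    'Churn': [1, 1, 0, 0],
})

corr = df.corr()
print(corr.round(2))
```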
Does it pay to perform better? ...below we can see the median payrate per performance score.
We can also see below that the median payrate for Sales exceeds that of Software Engineers, which in turn exceeds that of Executive Office position holders.
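Median pay per group is a one-line `groupby`; the scores and rates below are stand-ins:

```python
import pandas as pd

df = pd.DataFrame({
    'PerformanceScore': ['Exceeds', 'Fully Meets', 'Exceeds', 'Needs Improvement'],
    'PayRate': [55.0, 31.0, 60.0, 24.0],
})

# Median payrate per performance score, highest first.
median_pay = (df.groupby('PerformanceScore')['PayRate']
                .median()
                .sort_values(ascending=False))
print(median_pay)
```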
We see that the lowest rates are in technician positions, early accountants, and even data analysts; however, senior management and IT management positions carry the highest payrates. IT Manager - Support sits at the top, which could indicate a consultant.
Below are the summary statistics for the numeric data. We can see a mean age of 41 years, with the youngest at 28 and the oldest at 69. We can see that people generally like the company, with a mean employment satisfaction score of 3.8 out of 5. We see that the mean payrate is 31, with a median of 24, meaning that the payrate distribution is skewed and more employees are paid less than the average.
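These summary statistics come from `describe()`; the tiny frame below is made up, but it shows the same mean-above-median skew signal described above:

```python
import pandas as pd

df = pd.DataFrame({
    'Age': [28, 41, 69, 35],
    'PayRate': [14.0, 24.0, 24.0, 62.0],
})

stats = df.describe()
print(stats)

# Mean pay above median pay signals a right-skewed distribution.
skewed = stats.loc['mean', 'PayRate'] > stats.loc['50%', 'PayRate']
print(skewed)
```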
For data visualization there are several techniques available within the python environment; however, for this analysis plotly seemed the best fit.
Regarding the bivariate analysis, since our dataset has roughly a 70/30 imbalance between the negative and positive classes, we will run the bivariate analysis on the oversampled dataset. Directly below we can see this imbalance. The plot on the left shows the raw data 'as is', while the one on the right shows the oversampled data.
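A naive random oversampler can be sketched with `sample(replace=True)`; libraries like imbalanced-learn offer more principled samplers. The 70/30 toy frame below is made up to mirror the imbalance described:

```python
import pandas as pd

# Toy 70/30 imbalanced frame mirroring the churn split described above.
df = pd.DataFrame({'Churn': [0] * 7 + [1] * 3, 'PayRate': range(10)})

# Naive random oversampling of the minority class with replacement.
minority = df[df['Churn'] == 1]
majority = df[df['Churn'] == 0]
upsampled = minority.sample(len(majority), replace=True, random_state=42)
balanced = pd.concat([majority, upsampled]).reset_index(drop=True)
print(balanced['Churn'].value_counts())
```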
We can see that churn is not equally distributed between states. For instance, every employee from Pennsylvania has left the company, and most of those from Texas are churners.
Looking at churn by age group, we see that the proportions are fairly close. The 45-49 group has the most churn, while 50-54 has the least.
The same is not true for each department. Looking at churn by department, we see that 58% of employees in Production and 52% in Software Engineering are noted as leaving.
There is no churn in the Executive Office and a low 3% churn within the Admin Offices.
Below we see churn by position and churn by pay. Data architects, CIOs, and DB admins have the highest proportion of churners.
For pay, there is a large gap in the payrate distribution between the high 60s and 80; payrates of 40-44 and 25-29 have the highest proportion of churners, while 30-34 and 70-74 have the lowest.
It looks as though some managers have a higher associated percentage of employee churn than others. For instance, D. Houlihan, W. Butler, and Amy Dunn have the highest managed-employee churn rates, while B. LeBlanc, L. Daneault, and E. Dougall have no associated churn.
Before we can pass the data into the model, several preprocessing steps are necessary, like encoding, scaling, and the imputation/removal of null values. We will carry out these steps quickly before we move into the modeling.
Since this analysis is mainly concerned with an initial dataset review, we will only run two basic models: logistic regression for a baseline, followed by LightGBM.
In this section, I am simply reviewing the data types and the null value counts. We need to impute or remove null values and encode any categorical attributes.
We use pipeline objects to run both models and the scaling. While we don't take full advantage of the pipeline, if we need to expand this analysis later the essential structures are already in place. Additionally, the aim of this analysis is not to run predictions, so the classifiers are left mostly in their baseline state.
We use python to handle the train/test split. The train and test data lengths and the percentage split are visible below.
Below are the confusion matrices for each model. Given the considerable class imbalance, it was decided not to rely on precision, accuracy, or recall alone; the confusion matrix is included as a basic overview of performance, along with the AUC.
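The modeling steps above can be sketched end to end with sklearn; the features and target here are synthetic stand-ins (the real notebook fits the encoded HR data, with LightGBM swapped in as the pipeline's final step the same way):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in features; the real notebook uses the encoded HR data.
rng = np.random.default_rng(0)
X = pd.DataFrame({'PayRate': rng.normal(31, 10, 200),
                  'Age': rng.integers(28, 69, 200)})
y = (X['PayRate'] < 28).astype(int)  # toy target tied to pay

# 70/30 split, stratified to preserve the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# Scaling + classifier in one pipeline; a LightGBM classifier
# could replace the final step with no other changes.
logreg = Pipeline([('scale', StandardScaler()),
                   ('clf', LogisticRegression())])
logreg.fit(X_train, y_train)

preds = logreg.predict(X_test)
proba = logreg.predict_proba(X_test)[:, 1]
print(confusion_matrix(y_test, preds))
print(round(roc_auc_score(y_test, proba), 3))
```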
We can see that the LightGBM model is performing better, with a massive 25% margin in AUC over the baseline logistic regression model.
Looking at the correlations below, we can see that PayRate and SpecialProjectsCount are the most negatively correlated attributes. This means that as pay or project count decreases, there is a slight increase in churn. The opposite is true for Position and FromDiversityJobFair: as the representative numbers for these values increase, there is a slight increase in churn.
Looking at the LightGBM model, we see that Position, PayRate, and ManagerName are the top three attributes.
Looking at the logreg model, we see that PayRate, Position, and Age are the top three attributes.
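For a fitted logistic regression, attribute rankings like these can be read from the absolute coefficient sizes on scaled features (a tree model exposes `feature_importances_` instead). The data below is a toy stand-in, not the real HR features:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Toy data; a fitted LightGBM model would expose
# model.feature_importances_ for the same kind of ranking.
rng = np.random.default_rng(1)
X = pd.DataFrame({'PayRate': rng.normal(31, 10, 300),
                  'Age': rng.integers(28, 69, 300).astype(float),
                  'Position': rng.integers(0, 30, 300).astype(float)})
y = (X['PayRate'] < 28).astype(int)

clf = LogisticRegression().fit(StandardScaler().fit_transform(X), y)

# Rank attributes by absolute coefficient size on the scaled features.
importance = pd.Series(np.abs(clf.coef_[0]), index=X.columns)
print(importance.sort_values(ascending=False))
```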
Looking at the LightGBM model, the top indicators of people leaving are shown below.
Taking both models together, we see that Position, Pay, Manager, and Recruitment Source are the top attributes associated with churn for this company.
Knowing this, HR can further investigate positions where pay is not sufficient for certain employees. They can also investigate management issues. This can be done through surveys or during general reviews.