Human Resources Data Set - Analysis

HR Analytics. Image Credit: analyticsinhr.com

Introduction

Feel free to skip ahead to the VIZ section for a quick overview, then come back for the weedier details (the link is in the TOC: EDA -> VIZ). Here's a brief glimpse of what you'll find...

  This analysis is focused on Human Resources analytics. The main intent is to review the historical dataset and surface insights through modern programming and visualization techniques. I also include a rudimentary predictive model that can be used to analyze important attributes and predict churn cases.

Questions we will answer include:

  • In what states do most of our employees reside?
  • What is the organizational composition by position?
  • What are the top reasons employees leave?
  • Are some employees more or less likely to churn under certain managers?
  • Which positions have the highest associated churn?

The dataset used for this analysis is available online; the link for this is included below.

Source: https://www.kaggle.com/davidepolizzi/hr-data-set-based-on-human-resources-data-set

Dict: https://rpubs.com/rhuebner/HRCodebook-13

IMPORTS

IMPORT - LIBS

In starting this analysis, the main libraries we use are pandas, numpy, and plotly. There is also a simple helper function used specifically for Jupyter.
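
A minimal sketch of the imports this implies (the exact Jupyter helper isn't shown in the write-up; widening the column display is a common equivalent):

    import pandas as pd
    import numpy as np
    import plotly.express as px
    import plotly.graph_objects as go

    # Jupyter nicety so wide dataframes display all of their columns
    pd.set_option('display.max_columns', None)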

IMPORT - DATA

  To initiate the analysis we import the data. The filenames within the data directory are visible below.

  Also, as with an Excel document, the shape (rows & columns) and the column/attribute names are visible. There are a total of 36 initial attributes within this dataset.
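
The loading step likely looks something like the sketch below; the directory name and which file feeds the main dataframe are assumptions, since only the filenames are shown:

    import os
    import pandas as pd

    DATA_DIR = 'data'  # hypothetical directory name
    print(os.listdir(DATA_DIR))

    df = pd.read_csv(os.path.join(DATA_DIR, 'datasets_1632_774340_HRDataset_v13.csv'))
    print(df.shape)             # expecting (3310, 36) per the review below
    print(df.columns.tolist())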

datasets_1632_774340_HRDataset_v13.csv
HRData.csv
HRData1.txt

IMPORT - SET FUNCS

  Within most programming-oriented analyses, it is often much easier to declare certain functions up front, for both ease and consistency of use. While this analysis uses only one function, which is visible below, there would usually be several.
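
The function itself isn't reproduced in this write-up, but given the null-value table in the NULL VALS section, a plausible reconstruction is a small null-report helper:

    import pandas as pd

    def null_report(df: pd.DataFrame) -> pd.DataFrame:
        # Null count and percentage per column; hypothetical reconstruction
        # of the single helper used in this analysis
        rep = pd.DataFrame({'Attr': df.columns,
                            'Len': len(df),
                            'Nullcnt': df.isna().sum().values})
        rep['Nullpct'] = (rep['Nullcnt'] / rep['Len'] * 100).round(2)
        return rep[rep['Nullcnt'] > 0].sort_values('Nullpct', ascending=False)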

ROUGH REVIEW

SAMPLING AND INFO

  In starting the analysis process, we see on the bottom left that there are 3310 observations, or employees, in this dataset. There are also 36 columns/attributes. We see employee names, marital status, date of hire, and date of termination.

  Given the information on the left we can also see the null value count and the data types of each attribute.
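
In notebook form, this review is just two cells run against the df loaded above:

    df.sample(3)   # spot-check a few random employees
    df.info()      # dtypes plus non-null counts, shown below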

Employee_Name EmpID MarriedID MaritalStatusID GenderID EmpStatusID DeptID PerfScoreID FromDiversityJobFairID PayRate ... ManagerName ManagerID RecruitmentSource PerformanceScore EngagementSurvey EmpSatisfaction SpecialProjectsCount LastPerformanceReview_Date DaysLateLast30 Original DS
879 Lucas, Marshall 9028182431 0 0 0 5 5 3 0 22.0 ... Ketsia Liebig 19.0 Newspager/Magazine Fully Meets 2.48 3 0 NaN NaN New
2681 Villanueva, Abagail 9015720629 1 1 1 1 6 1 0 55.0 ... John Smith 17.0 Diversity Job Fair PIP 3.60 5 0 NaN NaN New
1471 Lang, Kale 9006911839 0 0 0 1 4 3 0 55.0 ... Alex Sweetwater 10.0 Billboard Fully Meets 3.79 3 4 NaN NaN New

3 rows × 36 columns

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3310 entries, 0 to 3309
Data columns (total 36 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Employee_Name               3310 non-null   object 
 1   EmpID                       3310 non-null   int64  
 2   MarriedID                   3310 non-null   int64  
 3   MaritalStatusID             3310 non-null   int64  
 4   GenderID                    3310 non-null   int64  
 5   EmpStatusID                 3310 non-null   int64  
 6   DeptID                      3310 non-null   int64  
 7   PerfScoreID                 3310 non-null   int64  
 8   FromDiversityJobFairID      3310 non-null   int64  
 9   PayRate                     3310 non-null   float64
 10  Termd                       3310 non-null   int64  
 11  PositionID                  3310 non-null   int64  
 12  Position                    3310 non-null   object 
 13  State                       3310 non-null   object 
 14  Zip                         310 non-null    float64
 15  DOB                         3310 non-null   object 
 16  Sex                         3310 non-null   object 
 17  MaritalDesc                 3310 non-null   object 
 18  CitizenDesc                 3310 non-null   object 
 19  HispanicLatino              3310 non-null   object 
 20  RaceDesc                    3310 non-null   object 
 21  DateofHire                  3310 non-null   object 
 22  DateofTermination           959 non-null    object 
 23  TermReason                  3299 non-null   object 
 24  EmploymentStatus            3310 non-null   object 
 25  Department                  3310 non-null   object 
 26  ManagerName                 3232 non-null   object 
 27  ManagerID                   3302 non-null   float64
 28  RecruitmentSource           3310 non-null   object 
 29  PerformanceScore            3310 non-null   object 
 30  EngagementSurvey            3310 non-null   float64
 31  EmpSatisfaction             3310 non-null   int64  
 32  SpecialProjectsCount        3310 non-null   int64  
 33  LastPerformanceReview_Date  207 non-null    object 
 34  DaysLateLast30              207 non-null    float64
 35  Original DS                 3310 non-null   object 
dtypes: float64(5), int64(12), object(19)
memory usage: 931.1+ KB

NULL VALS

  Looking further at the null values, we see that some attributes have a very high null count: zip code, last performance review date, and days late in the last 30 days. The percentage of missing values in these columns is 90% or above. Null values are common and can be dealt with through blanket imputation, column similarity, or consultation with a subject matter expert (SME).

  One of the high-null-count columns is 'DateofTermination'. This possibly indicates that only 71% of employees in this dataset are still 'Active'. We will investigate this further.

Attr Len Nullcnt Nullpct
33 LastPerformanceReview_Date 3310 3103 93.75
34 DaysLateLast30 3310 3103 93.75
14 Zip 3310 3000 90.63
22 DateofTermination 3310 2351 71.03
26 ManagerName 3310 78 2.36
23 TermReason 3310 11 0.33
27 ManagerID 3310 8 0.24

  Below, we look at the reported reasoning for why employees leave or are terminated. The most notable are:

1. Another Position
2. Unhappy
3. More Money 

  Given this information, the company could put countermeasures in place that address the top reasons for leaving. This could include higher pay, training, or the opportunity for employees to take on work that better utilizes their core skill sets.

  While reviewing the available termination values, it became apparent that there are some data consistency issues. Firstly, we can see above that there are two illustrations counting the reasons employees leave. The one on the left is derived from counts that exclude currently 'Active' or 'Leave of Absence' employees. It is clearly different from the illustration on the right, which is derived by setting the 'Termd' column equal to 1. There may well be a reason for this inconsistency, but since we don't have someone to clarify the situation directly, we need to figure out which is 'right' to use in our analysis.

Termd EmploymentStatus
2686 1 Active
2747 1 Leave of Absence
2996 1 Active
3057 1 Leave of Absence
3306 1 Active

  Furthermore, when looking at the dataframe to the left, we see that there are employees marked as terminated (Termd=1), yet there is no termination date, or the employment status remains 'Active'.

  Because of this ambiguity, when building the classifier later in this analysis, we derive the 'Churn' attribute using the logic noted above and to the left.
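
A sketch of that derivation, trusting EmploymentStatus over the inconsistent Termd flag (the exact code isn't shown in the write-up, but the crosstab in DROPPING COLUMNS is consistent with this logic):

    df['Churn'] = df['EmploymentStatus'].isin(
        ['Voluntarily Terminated', 'Terminated for Cause']).astype(int)
    df['Churn-Yes/No'] = df['Churn'].map({1: 'Yes', 0: 'No'})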

DATA MANIPULATION

CATEGORICAL FEATURES

  Data types are important to know when running an analysis, since each data type needs to be handled in a different way. For instance, we cannot run statistics on text values, and we cannot perform time series analysis if our date values are in an unfamiliar format. Throughout this analysis we will work through the data conversions necessary for analysis.

  Below, we can see the number of unique values per categorical type attribute.
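
A quick way to produce this count, assuming the categorical columns are still plain object dtype at this point:

    cat_cols = df.select_dtypes(include='object').columns
    print('Each attribute has the following amount of unique values:')
    for i, col in enumerate(cat_cols):
        print(f'   {i:<3}{col:<29}Values: {df[col].nunique()}')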

Each attribute has the following amount of unique values:  
   0  Employee_Name                Values: 3308
   1  Position                     Values: 32
   2  State                        Values: 28
   3  DOB                          Values: 306
   4  Sex                          Values: 2
   5  MaritalDesc                  Values: 5
   6  CitizenDesc                  Values: 3
   7  HispanicLatino               Values: 4
   8  RaceDesc                     Values: 6
   9  DateofHire                   Values: 99
   10 DateofTermination            Values: 105
   11 TermReason                   Values: 18
   12 EmploymentStatus             Values: 5
   13 Department                   Values: 6
   14 ManagerName                  Values: 22
   15 RecruitmentSource            Values: 23
   16 PerformanceScore             Values: 4
   17 LastPerformanceReview_Date   Values: 43
   18 Original DS                  Values: 2

REPLACE NULL VALUES

After looking at the null values in certain columns, it is possible to replace some of them with a baseline value. This is carried out in the code below.
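
A minimal sketch of the imputation, with placeholder baseline values assumed (an SME could suggest better ones):

    df['ManagerName'] = df['ManagerName'].fillna('Unknown')
    df['TermReason']  = df['TermReason'].fillna('Unknown')
    df['ManagerID']   = df['ManagerID'].fillna(df['ManagerID'].median())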

CORRECTING THE DATE COLUMNS

In this section we are simply converting date and time attributes.

We can see that, after setting the category data types, we still need to convert further to datetime data types.

DOB DateofHire DateofTermination LastPerformanceReview_Date
244 25-11-87 26-10-09 08-04-15 NaN
2431 06-11-86 05-07-11 NaN NaN
3305 19-05-88 01-05-15 NaN NaN
DOB                           category
DateofHire                    category
DateofTermination             category
LastPerformanceReview_Date    category
dtype: object

We can also see that converting to datetimes introduced an issue. Since time travel has not yet been invented, this has also been corrected.

Before: 2051-01-02 00:00:00
After: 1951-01-02 00:00:00
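
A sketch of the conversion and the century fix, assuming the day-first, two-digit-year format shown in the sample above:

    date_cols = ['DOB', 'DateofHire', 'DateofTermination',
                 'LastPerformanceReview_Date']
    for col in date_cols:
        df[col] = pd.to_datetime(df[col], format='%d-%m-%y', errors='coerce')

    # Two-digit years can be parsed into the future (e.g. '51' -> 2051);
    # roll anything past today back one century
    for col in date_cols:
        future = df[col] > pd.Timestamp.today()
        df.loc[future, col] = df.loc[future, col] - pd.DateOffset(years=100)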

REVIEW DUPLICATES

By looking at the initial data, we see that there are two duplicated names. Upon reviewing the attributes associated with these duplicates, we can see that they are not exact copies; for instance, the dates of birth are different. Because of this, we decide not to drop these duplicates.
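
The duplicate check boils down to something like:

    dupes = df[df.duplicated(subset='Employee_Name', keep=False)]
    dupes[['Employee_Name', 'EmpID', 'DOB']].sort_values('Employee_Name')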

Employee_Name EmpID DOB
1133 Young, Darien 9071302177 1964-06-01
1992 Young, Darien 9049431318 1973-11-28
2758 Warner, Larissa 9024100552 1985-04-20
2910 Warner, Larissa 9001310400 1970-12-22

CREATING NEW ATTRIBUTES

For this analysis I decided to create two additional columns, 'Churn' and 'Age'. These can be seen below.
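
With Churn already derived above, the Age column is a simple datetime subtraction (the exact reference date used isn't shown; today's date is assumed):

    df['Age'] = ((pd.Timestamp.today() - df['DOB']).dt.days // 365).astype(int)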

Churn Age
1993 1 37
2599 0 37
1096 0 36

DROPPING COLUMNS

There are several columns that were dropped, including Zip, Employee_Name, and EmpID.

We also drop the rows indicating that an employee has not yet started working with the company.
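
A sketch of these steps; the crosstab below is what flags the 'Future Start' rows for removal:

    pd.crosstab(df['Churn'], df['EmploymentStatus'])

    df = df.drop(columns=['Zip', 'Employee_Name', 'EmpID'])
    df = df[df['EmploymentStatus'] != 'Future Start']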

EmploymentStatus  Active  Future Start  Leave of Absence  Terminated for Cause  Voluntarily Terminated
Churn
0                   1984           180               187                     0                       0
1                      0             0                 0                   162                     797

OVERSAMPLING
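
The exact oversampling scheme isn't shown in the write-up (the final row count of 3474 suggests the minority class was only partially upsampled). A common approach is resampling the churn class with replacement, sketched here as upsampling to full balance:

    from sklearn.utils import resample

    majority = df[df['Churn'] == 0]
    minority = df[df['Churn'] == 1]
    minority_up = resample(minority, replace=True,
                           n_samples=len(majority),  # assumption: full balance
                           random_state=42)
    df_over = pd.concat([majority, minority_up]).reset_index(drop=True)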

ENCODING

  There are several categorical attributes in this dataset. In order to pass these into the model, we first need to encode the categorical values. There are several methods for this, such as simple label encoding and one-hot encoding. For this analysis we will use label encoding. The columns for encoding are listed below.
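
A sketch of the drop-then-encode step; the column names are taken from the output below, though the loop itself is a reconstruction:

    from sklearn.preprocessing import LabelEncoder

    # Drop redundant ID/leakage columns first
    for col in ['EmpStatusID', 'EmploymentStatus', 'TermReason', 'PositionID',
                'DateofHire', 'DeptID', 'PerfScoreID', 'Churn-Yes/No', 'GenderID']:
        df = df.drop(columns=col)
        print('Dropped Column:', col)

    le = LabelEncoder()
    for col in df.select_dtypes(include='object').columns:
        df[col] = le.fit_transform(df[col].astype(str))
        print('Encoded: ', col)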

Dropped Column: EmpStatusID
Dropped Column: EmploymentStatus
Dropped Column: TermReason
Dropped Column: PositionID
Dropped Column: DateofHire
Dropped Column: DeptID
Dropped Column: PerfScoreID
Dropped Column: Churn-Yes/No
Dropped Column: GenderID
0 Position
1 State
2 Sex
3 MaritalDesc
4 CitizenDesc
5 HispanicLatino
6 RaceDesc
7 Department
8 ManagerName
9 RecruitmentSource
10 PerformanceScore
11 Original DS
Encoded:  Position
Encoded:  State
Encoded:  Sex
Encoded:  MaritalDesc
Encoded:  CitizenDesc
Encoded:  HispanicLatino
Encoded:  RaceDesc
Encoded:  Department
Encoded:  ManagerName
Encoded:  RecruitmentSource
Encoded:  PerformanceScore
Encoded:  Original DS

EDA

NUMERICAL

The correlation plot below shows that attributes like payrate and special project count are positively correlated. We also see a weak negative correlation between churn and payrate, and again between churn and special project count.
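
The heatmap can be reproduced with plotly along these lines:

    import plotly.express as px

    corr = df.corr(numeric_only=True)
    fig = px.imshow(corr, zmin=-1, zmax=1, color_continuous_scale='RdBu')
    fig.show()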

Does it pay to perform better? ...below we can see the median payrate per performance score.

We can also see below how the median payrate for Sales exceeds that of Software Engineers, which in turn exceeds that of Executive Office position holders.

We see that the lowest rates are in technician positions, entry-level accountant roles, and even data analysts; senior management and IT management positions, however, carry the highest payrates. IT Manager - Support sits at the very top, which could suggest a consultant-type role.

Below are the summary statistics for the numeric data. We can see a mean age of 41 years, with the youngest employee at 28 and the oldest at 69. People generally seem to like the company, with a mean employment satisfaction score of 3.8 out of 5. The mean payrate is 31, with a median of 24, meaning the payrate distribution is skewed and more employees are paid less than the average.
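
These summary statistics come straight from pandas, e.g.:

    df[['Age', 'PayRate', 'EngagementSurvey', 'EmpSatisfaction']].describe().round(2)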

VIZ - UNIVARIATE

For data visualizations there are several techniques available within the Python environment; for this analysis, plotly seemed the best fit.
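
As one example, a univariate count plot takes a single line in plotly express (the exact figures in this section aren't reproduced here):

    import plotly.express as px

    fig = px.histogram(df, x='State', title='Employee Count by State')
    fig.show()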

VIZ - BIVARIATE

  Regarding the bivariate analysis, since our dataset has roughly a 70/30 imbalance between the negative and positive class, we run the bivariate analysis on the oversampled dataset. Directly below we can see this imbalance. The plot on the left shows the raw data as is, while the one on the right shows the oversampled data.

Churn-Yes/No
No 69.36
Yes 30.64
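
The bivariate plots that follow share one pattern: a histogram split by the churn flag. A representative sketch, assuming the df_over frame built in the OVERSAMPLING section:

    fig = px.histogram(df_over, x='Department', color='Churn-Yes/No',
                       barnorm='percent', title='Churn by Department')
    fig.show()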

  We can see that churn is not equally distributed between states. For instance, every employee from Pennsylvania has left the company, and most of those from Texas are churners.

  Looking at churn by age group, we see that the proportions are fairly close. The 45-49 group has the most churn, and the 50-54 group the least.

  The same is not true for each department. Looking at churn by department, we see that 58% of employees from Production and 52% of employees from Software Engineering are noted as leaving.

  There is no churn at all in the Executive Office, and a low 3% churn within the Admin Offices.

Below we see Churn by Position and Churn by Pay. We see that data architects, CIOs, and DB admins have the highest proportion of churners.

For pay, firstly, there is a huge payrate gap between the high 60s and 80; beyond that, payrates of 40-44 and 25-29 have the highest proportion of churners, with 30-34 and 70-74 the lowest.

It looks as though some managers have a higher associated percentage of employee churn than others. For instance, D. Houlihan, W. Butler, and Amy Dunn have the highest managed-employee churn rates, while B. LeBlanc, L. Daneault, and E. Dougall have no associated churn at all.

MODEL - DATA PREPROCESSING

Before we can pass the data into the model, there are several preprocessing steps that are necessary, like encoding, scaling, and the imputation/removal of null values. We will carry out these steps quickly before we move into the modeling.

Since this analysis is mainly concerned with an initial dataset review, we will only run two basic models: logistic regression as a baseline, followed by LightGBM.

Reviewing summary info: 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3474 entries, 0 to 3473
Data columns (total 21 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   MarriedID               3474 non-null   int64  
 1   MaritalStatusID         3474 non-null   int64  
 2   FromDiversityJobFairID  3474 non-null   int64  
 3   PayRate                 3474 non-null   float64
 4   Position                3474 non-null   int32  
 5   State                   3474 non-null   int32  
 6   Sex                     3474 non-null   int32  
 7   MaritalDesc             3474 non-null   int32  
 8   CitizenDesc             3474 non-null   int32  
 9   HispanicLatino          3474 non-null   int32  
 10  RaceDesc                3474 non-null   int32  
 11  Department              3474 non-null   int32  
 12  ManagerName             3474 non-null   int32  
 13  RecruitmentSource       3474 non-null   int32  
 14  PerformanceScore        3474 non-null   int32  
 15  EngagementSurvey        3474 non-null   float64
 16  EmpSatisfaction         3474 non-null   int64  
 17  SpecialProjectsCount    3474 non-null   int64  
 18  Original DS             3474 non-null   int32  
 19  Age                     3474 non-null   int64  
 20  Churn                   3474 non-null   int32  
dtypes: float64(2), int32(13), int64(6)
memory usage: 393.7 KB

In this section, I am simply reviewing the data types and the null value counts. We need to impute or remove null values and encode any categorical attributes.

Reviewing null values:
MarriedID                 0
MaritalStatusID           0
FromDiversityJobFairID    0
PayRate                   0
Position                  0
State                     0
Sex                       0
MaritalDesc               0
CitizenDesc               0
HispanicLatino            0
RaceDesc                  0
Department                0
ManagerName               0
RecruitmentSource         0
PerformanceScore          0
EngagementSurvey          0
EmpSatisfaction           0
SpecialProjectsCount      0
Original DS               0
Age                       0
Churn                     0
dtype: int64

MODEL - PIPELINE SCALE AND LOAD MODEL

We use pipeline objects to run both models along with the scaling. While we don't take full advantage of the pipelines, the essential structure is already in place if we need to expand on this analysis later. Additionally, since the aim of this analysis is not prediction, the classifiers are left mostly in their baseline state.
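
A minimal sketch of the two pipelines, with both classifiers close to their defaults:

    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LogisticRegression
    from lightgbm import LGBMClassifier

    pipe_lr  = Pipeline([('scale', StandardScaler()),
                         ('clf', LogisticRegression(max_iter=1000))])
    pipe_gbm = Pipeline([('scale', StandardScaler()),
                         ('clf', LGBMClassifier(random_state=42))])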

TRAIN - TEST

We use Python's tooling to handle the train/test split. The train and test data lengths and the percentage split are visible below.
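
An 80/20 stratified split reproduces the lengths below (the random_state is an assumption):

    from sklearn.model_selection import train_test_split

    X = df.drop(columns='Churn')
    y = df['Churn']
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42)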

Sanity check on train/test:

 Train x length:  2779 
 Train y length:  2779 
 Test x  length:  695 
 Test y  length:  695 
 
 Split pct     :  79.99
FIT MODEL
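
Fitting is a one-liner per pipeline:

    pipe_lr.fit(X_train, y_train)
    pipe_gbm.fit(X_train, y_train)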

MODEL - METRICS

Below are the two confusion matrices, one per model. Since there is considerable class imbalance, it was decided not to lean on single summary metrics like accuracy, precision, or recall; instead, the confusion matrix is included as a basic overview of performance, along with the AUC.

We can see that the LightGBM model performs better, by a massive 25% margin in AUC over the baseline logistic regression model.
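
A sketch of how these metrics are produced from the fitted pipelines:

    from sklearn.metrics import confusion_matrix, roc_auc_score

    for name, pipe in [('LogReg', pipe_lr), ('LightGBM', pipe_gbm)]:
        preds = pipe.predict(X_test)
        proba = pipe.predict_proba(X_test)[:, 1]
        print(name)
        print(confusion_matrix(y_test, preds))
        print('AUC:', round(roc_auc_score(y_test, proba), 3))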

POST MODEL REVIEW

Looking at the correlations below, we can see that PayRate and SpecialProjectsCount are the most negatively correlated attributes. This means that as pay or project count decreases, there is a slight increase in churn. The opposite is true for Position and FromDiversityJobFairID: as the encoded values for these attributes increase, there is a slight increase in churn.

ATTRIBUTE IMPORTANCE

Light GBM Model - Attributes by Importance
Log Reg Model - Attributes by Importance

Looking at the LightGBM model, we see that Position, PayRate, and ManagerName are the top three attributes.

Looking at the logistic regression model, we see that PayRate, Position, and Age are the top three attributes.
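
Both rankings can be pulled from the fitted pipelines; a sketch:

    import pandas as pd

    gbm_imp = pd.Series(pipe_gbm.named_steps['clf'].feature_importances_,
                        index=X.columns).sort_values(ascending=False)
    lr_imp  = pd.Series(pipe_lr.named_steps['clf'].coef_[0],
                        index=X.columns).abs().sort_values(ascending=False)
    print(gbm_imp.head(5))
    print(lr_imp.head(5))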

CONCLUDING REMARKS

The top indicators of people leaving include (looking at the LightGBM model):

  • Position
  • Payrate
  • Manager
  • Recruitment Sources
  • State

When looking at both models, we see that Position, Pay, Manager, and Recruitment source are the top attributes associated with Churn for this company.

Knowing this, HR can further investigate positions where pay is not sufficient for certain employees. They can also investigate management issues. This can be done through surveys or during general reviews.

Returning to the questions posed at the start:

  • In what states do most of our employees reside?
    • The company appears to be based in MA, since 90% of employees reside there.
  • What is the organizational composition by position?
    • 45% of the company is staffed by Production Technician I's, then 17% Production Technician II's, and 9% Area Sales Managers.
  • What are the top reasons employees leave?
    • The top reasons employees leave are for another position, unhappiness, and for more money.
    • There is also a high churn rate under the Production department.
  • Are some Employees more or less likely to churn under some managers?
    • Under some managers, like D. Houlihan, W. Butler, and A. Dunn, there is an alarming churn rate. Under normal circumstances, this would certainly need to be investigated further.
  • Which positions have the highest associated churn?
    • There is high churn for data architects, production technicians, CIOs, BI developers, and DB admins.