Feel free to skip ahead to the VIZ section for a quick overview and then come back to the weedier details (the link is in the TOC: EDA -> VIZ). Here's a brief glimpse of what you'll find...
This analysis focuses on Human Resources analytics. The main intent is to review the historical dataset and surface insights through modern programming and visualization techniques. I also include a rudimentary predictive model that can be used to analyze important attributes and predict churn cases.
Questions we will answer include:
- In what states do most of our employees reside?
- What is the organizational composition by position?
- What are the top reasons employees leave?
- Are employees more or less likely to churn under certain managers?
- Which positions have the highest associated churn?
The dataset used for this analysis is available online; the link for this is included below.
Source: https://www.kaggle.com/davidepolizzi/hr-data-set-based-on-human-resources-data-set
In starting this analysis, the main libraries that we use are pandas, numpy, and plotly. There is also a simple function used specifically for jupyter.
To initiate the analysis we import the data. The filenames within the data directory are visible below.
Also, as with an Excel document, the shape (rows & columns) and the column/attribute names are visible. There are a total of 36 initial attributes within this dataset.
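As a minimal sketch, loading and inspecting the data might look like the following. The file path is a hypothetical stand-in, and the tiny frame below uses only a few illustrative columns rather than the full 36:

```python
import pandas as pd

# Hypothetical path -- the actual filename in the data directory may differ:
# df = pd.read_csv('data/HRDataset.csv')

# Tiny stand-in frame with a few illustrative attributes.
df = pd.DataFrame({
    'Employee_Name': ['A. Smith', 'B. Jones'],
    'State': ['MA', 'TX'],
    'PayRate': [23.0, 55.0],
})

print(df.shape)         # (rows, columns) -- the real data is (3310, 36)
print(list(df.columns))
```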
Within most programming-oriented analyses, it is often much easier to declare certain functions up front, for both ease of use and consistency of use. While this analysis uses only one function, which is visible below, there would usually be several functions in play.
In starting the analysis process, we see that on the bottom left there are 3310 observations or employees in this dataset. There are also 36 columns/attributes. We see employee names, marriage status, date of hire, and date of termination.
Given the information on the left we can also see the null value count and the data types of each attribute.
Looking further at the null values, we see that some attributes have a very high null count, such as zip code, performance review date, and days late. The percentage of missing values in these columns is 90% and above. Null values are common and can be dealt with through blanket imputation, column similarity, or the consultation of a subject matter expert (SME).
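A quick way to surface both the raw null counts and the percentage missing per column is sketched below, on a small made-up frame (the column names mirror the dataset's but are stand-ins here):

```python
import numpy as np
import pandas as pd

# Illustrative frame; the real dataset has ~3,310 rows and 36 columns.
df = pd.DataFrame({
    'Zip': [np.nan, np.nan, '02134', np.nan],
    'PerformanceReviewDate': [np.nan, np.nan, np.nan, np.nan],
    'PayRate': [23.0, 55.0, 31.0, 24.0],
})

null_counts = df.isnull().sum()
null_pct = (null_counts / len(df) * 100).round(1)
print(pd.DataFrame({'nulls': null_counts, 'pct_missing': null_pct}))
```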
One of the high-null-value columns is 'DateofTermination'. This suggests that only 71% of employees in this dataset are still 'Active'. We will investigate this further.
Below, we look at the reported reasoning for why employees leave or are terminated. The most notable are:
1. Another Position
2. Unhappy
3. More Money
Given this information, the company could put countermeasures in place that combat the top reasons for leaving. These could include higher pay, training, or letting employees take on work that better utilizes their core skill sets.
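Counting reasons only among employees who actually left can be done with a filter plus `value_counts`. The column names ('TermReason', 'EmploymentStatus') and the values below are assumptions for illustration:

```python
import pandas as pd

# Hypothetical rows mirroring the dataset's termination fields.
df = pd.DataFrame({
    'TermReason': ['Another position', 'unhappy', 'more money',
                   'Another position', None, 'Another position'],
    'EmploymentStatus': ['Voluntarily Terminated'] * 4
                        + ['Active', 'Voluntarily Terminated'],
})

# Count reasons among employees who actually left.
left = df[~df['EmploymentStatus'].isin(['Active', 'Leave of Absence'])]
reason_counts = left['TermReason'].value_counts()
print(reason_counts)
```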
While reviewing the available termination values, it became apparent that there are some data consistency issues. Firstly, we can see above that there are two illustrations counting reasons for why employees leave. The one on the left is derived from counts that exclude currently 'Active' or 'Leave of Absence' employees. It's clear that this is different from the illustration on the right, which is derived when setting the 'Termd' column equal to 1. There must be a set reason for this inconsistency, but since we don't have someone to directly clarify this situation, we need to figure out which is 'right' to use in our analysis.
Furthermore, when looking at the dataframe to the left, we see employees marked as terminated (Termd=1) that nonetheless have no termination date, or whose employment status remains 'Active'.
Because of this ambiguity, when classifying later in this analysis, we derive the 'Churn' attribute using the logic noted above and to the left.
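The exact rule lives in the figures referenced above, but one plausible sketch of such a derivation is to require all three signals to agree before flagging churn. The column names and rule below are assumptions for illustration:

```python
import pandas as pd

# Illustrative rows showing the inconsistency: Termd=1 but still 'Active'.
df = pd.DataFrame({
    'Termd': [1, 1, 0, 1],
    'DateofTermination': ['2016-05-01', None, None, '2015-09-12'],
    'EmploymentStatus': ['Voluntarily Terminated', 'Active', 'Active',
                         'Terminated for Cause'],
})

# One plausible derivation (an assumption -- the notebook's exact rule is
# shown in its figures): flag churn only when all three signals agree.
df['Churn'] = (
    (df['Termd'] == 1)
    & df['DateofTermination'].notna()
    & (df['EmploymentStatus'] != 'Active')
).astype(int)

print(df[['Termd', 'EmploymentStatus', 'Churn']])
```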
Data types are important to know when running an analysis, since each data type needs to be handled differently. For instance, we cannot run statistics on text values, and we cannot perform time series analysis if our date values are in an unfamiliar format. Throughout this analysis we will run through the data conversions necessary for analysis.
Below, we can see the number of unique values per categorical type attribute.
After looking at the null values in certain columns, it is possible to replace them with a baseline value. This is carried out in the code below.
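A baseline imputation typically uses `fillna` with a sensible default per column. The column names and the chosen baselines here are assumptions for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'DaysLateLast30': [np.nan, 2, np.nan],
    'TermReason': [np.nan, 'more money', np.nan],
})

# Baseline imputation: 0 for a numeric count, a sentinel label for text.
df['DaysLateLast30'] = df['DaysLateLast30'].fillna(0)
df['TermReason'] = df['TermReason'].fillna('N/A - still employed')
print(df)
```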
In this section we are simply converting date and time attributes.
We can see that after setting the category data types that we now need to further convert to datetime data types.
We can also see that after converting the datetimes, there was an issue: some dates landed in the future. Since time travel has not yet been invented, this has been corrected.
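This kind of issue usually comes from two-digit years, which `pd.to_datetime` can roll into the 2000s (e.g. '68' becoming 2068). One common fix, sketched here on made-up dates, is to subtract a century from anything in the future:

```python
import pandas as pd

df = pd.DataFrame({'DOB': ['05/14/68', '11/02/85', '07/21/01']})

# Two-digit years can parse into the future (e.g. '68' -> 2068);
# since time travel is unavailable, roll those back a century.
dob = pd.to_datetime(df['DOB'], format='%m/%d/%y')
future = dob > pd.Timestamp.now()
df['DOB'] = dob.where(~future, dob - pd.DateOffset(years=100))
print(df['DOB'].dt.year.tolist())
```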
By looking at the initial data, we see two duplicated names. Upon reviewing the attributes associated with these duplicates, we can see they are not exact copies; for instance, the date-of-birth values differ. Because of this, we decide not to drop these rows.
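The check can be sketched with `duplicated`: flag repeated names, then confirm whether the full rows actually match. The column names below are assumptions:

```python
import pandas as pd

df = pd.DataFrame({
    'Employee_Name': ['J. Doe', 'J. Doe', 'K. Lee'],
    'DOB': ['1980-01-01', '1975-06-30', '1990-03-15'],
})

# Flag repeated names, then check whether the full rows really match.
name_dupes = df[df.duplicated(subset='Employee_Name', keep=False)]
exact_dupes = df.duplicated().sum()
print(len(name_dupes), exact_dupes)   # 2 name matches, 0 exact copies
```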
For this analysis I decided to create two additional columns, 'Churn' and 'Age'. This can be seen below.
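For the 'Age' column, a common derivation is whole years elapsed since date of birth. The reference date and column name below are assumptions (the notebook may compute age relative to 'now' instead):

```python
import pandas as pd

df = pd.DataFrame({'DOB': pd.to_datetime(['1980-06-15', '1955-01-02'])})

# Age in whole years as of a fixed reference date (an assumption).
ref = pd.Timestamp('2018-01-01')
df['Age'] = ((ref - df['DOB']).dt.days // 365.25).astype(int)
print(df['Age'].tolist())
```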
Several columns were dropped, including Zip, EmployeeName, and ID.
We also drop the few rows indicating that an employee had not yet started working with the company.
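Both drops can be sketched in a couple of lines; the column names and the "not yet started" signal (a missing hire date) are assumptions for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'Employee_Name': ['A', 'B', 'C'],
    'Zip': ['02134', None, '73301'],
    'EmpID': [1, 2, 3],
    'DateofHire': pd.to_datetime(['2015-01-05', '2016-03-10', None]),
    'PayRate': [23.0, 55.0, 31.0],
})

# Drop identifying / mostly-null columns (names are assumptions).
df = df.drop(columns=['Zip', 'Employee_Name', 'EmpID'])

# Drop rows for employees who never actually started.
df = df[df['DateofHire'].notna()].reset_index(drop=True)
print(df.shape)
```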
There are several categorical attributes in this dataset. In order to pass these into the model, we first need to encode the categorical values. There are several ways to do this, such as simple label encoding and one-hot encoding. For this analysis we use label encoding. The columns to be encoded are listed below.
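A lightweight way to label-encode in pandas is via categorical codes; sklearn's `LabelEncoder` would yield equivalent integer labels. The column names and values here are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    'Department': ['Production', 'IT/IS', 'Production', 'Sales'],
    'Position': ['Technician', 'DB Admin', 'Technician', 'Area Sales Manager'],
})

# Simple label encoding via pandas categorical codes.
for col in ['Department', 'Position']:
    df[col] = df[col].astype('category').cat.codes

print(df)
```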
The correlation plot below shows that attributes like payrate and project count are positively correlated. We also see a weak negative correlation between churn and payrate, and again between churn and special project count.
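The underlying correlation matrix behind such a plot is just `df.corr()`; the tiny frame below is made up, but its signs mirror the relationships described above:

```python
import pandas as pd

# Illustrative numbers chosen so the signs match the text above.
df = pd.DataFrame({
    'PayRate': [20, 35, 50, 65],
    'SpecialProjectsCount': [0, 1, 4, 6],
    'Churn': [1, 1, 0, 0],
})

corr = df.corr()
print(corr.round(2))
```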
Does it pay to perform better? ...below we can see the median payrate per performance score.
We can also see below that the median payrate for Sales exceeds that of Software Engineers, which in turn exceeds that of Executive Office position holders.
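Median pay per group is a one-line `groupby`; the scores and rates below are stand-ins:

```python
import pandas as pd

df = pd.DataFrame({
    'PerformanceScore': ['Exceeds', 'Fully Meets', 'Exceeds', 'Needs Improvement'],
    'PayRate': [55.0, 31.0, 60.0, 24.0],
})

# Median payrate per performance score, highest first.
median_pay = (df.groupby('PerformanceScore')['PayRate']
                .median()
                .sort_values(ascending=False))
print(median_pay)
```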
We see that the lowest rates are in technician positions, early accountants, and even data analysts; however, senior management and IT management positions carry the highest payrates. IT Manager - Support sits at the top, which could indicate a consultant.
Below are the summary statistics for the numeric data. We can see a mean age of 41 years, with the youngest at 28 and the oldest at 69. We can see that people generally like the company, with a mean employment satisfaction score of 3.8 out of 5. We see that the mean payrate is 31, with a median of 24, meaning that the payrate distribution is skewed and more employees are paid less than the average.
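These summary statistics come from `describe()`; the tiny frame below is made up, but it shows the same mean-above-median skew signal described above:

```python
import pandas as pd

df = pd.DataFrame({
    'Age': [28, 41, 69, 35],
    'PayRate': [14.0, 24.0, 24.0, 62.0],
})

stats = df.describe()
print(stats)

# Mean pay above median pay signals a right-skewed distribution.
skewed = stats.loc['mean', 'PayRate'] > stats.loc['50%', 'PayRate']
print(skewed)
```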
For data visualization there are several techniques available within the python environment; however, for this analysis plotly seemed the best fit.
Regarding the bivariate analysis, since our dataset has roughly a 70/30 imbalance between the negative and positive classes, we will run the bivariate analysis on the oversampled dataset. Directly below we can see this imbalance. The plot on the left shows the raw data 'as is', while the one on the right shows the oversampled data.
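A naive random oversampler can be sketched with `sample(replace=True)`; libraries like imbalanced-learn offer more principled samplers. The 70/30 toy frame below is made up to mirror the imbalance described:

```python
import pandas as pd

# Toy 70/30 imbalanced frame mirroring the churn split described above.
df = pd.DataFrame({'Churn': [0] * 7 + [1] * 3, 'PayRate': range(10)})

# Naive random oversampling of the minority class with replacement.
minority = df[df['Churn'] == 1]
majority = df[df['Churn'] == 0]
upsampled = minority.sample(len(majority), replace=True, random_state=42)
balanced = pd.concat([majority, upsampled]).reset_index(drop=True)
print(balanced['Churn'].value_counts())
```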
We can see that churn is not equally distributed between states. For instance, every employee from Pennsylvania has left the company, and most of those from Texas are churners.
Looking at churn by age group, we see that the proportions are fairly close. The 45-49 group has the most churn, while 50-54 has the least.
The same is not true for each department. Looking at churn by department, we see that 58% of employees in Production and 52% in Software Engineering are noted as leaving.
There is no churn in the Executive Office and a low 3% churn within the Admin Offices.
Below we see churn by position and churn by pay. Data architects, CIOs, and DB admins have the highest proportion of churners.
For pay, there is a large gap in the payrate distribution between the high 60s and 80; payrates of 40-44 and 25-29 have the highest proportion of churners, while 30-34 and 70-74 have the lowest.
It looks as though some managers have a higher associated percentage of employee churn than others. For instance, D. Houlihan, W. Butler, and Amy Dunn have the highest managed-employee churn rates, while B. LeBlanc, L. Daneault, and E. Dougall have no associated churn.
Before we can pass the data into the model, several preprocessing steps are necessary, like encoding, scaling, and the imputation/removal of null values. We will carry out these steps quickly before we move into the modeling.
Since this analysis is mainly concerned with an initial dataset review, we will only run two basic models: logistic regression for a baseline, followed by LightGBM.
In this section, I am simply reviewing the data types and the null value counts. We need to impute or remove null values and encode any categorical attributes.
We use pipeline objects to run both models and the scaling. While we don't take full advantage of the pipeline, if we need to expand this analysis later the essential structures are already in place. Additionally, the aim of this analysis is not to run predictions, so the classifiers are left mostly in their baseline state.
We use python to handle the train/test split. The train and test data lengths and the percentage split are visible below.
Below are the confusion matrices for each model. Given the considerable class imbalance, it was decided not to rely on precision, accuracy, or recall alone; the confusion matrix is included as a basic overview of performance, along with the AUC.
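The modeling steps above can be sketched end to end with sklearn; the features and target here are synthetic stand-ins (the real notebook fits the encoded HR data, with LightGBM swapped in as the pipeline's final step the same way):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in features; the real notebook uses the encoded HR data.
rng = np.random.default_rng(0)
X = pd.DataFrame({'PayRate': rng.normal(31, 10, 200),
                  'Age': rng.integers(28, 69, 200)})
y = (X['PayRate'] < 28).astype(int)  # toy target tied to pay

# 70/30 split, stratified to preserve the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# Scaling + classifier in one pipeline; a LightGBM classifier
# could replace the final step with no other changes.
logreg = Pipeline([('scale', StandardScaler()),
                   ('clf', LogisticRegression())])
logreg.fit(X_train, y_train)

preds = logreg.predict(X_test)
proba = logreg.predict_proba(X_test)[:, 1]
print(confusion_matrix(y_test, preds))
print(round(roc_auc_score(y_test, proba), 3))
```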
We can see that the LightGBM model is performing better, with a massive 25% margin in AUC over the baseline logistic regression model.
Looking at the correlations below, we can see that PayRate and SpecialProjectsCount are the most negatively correlated attributes. This means that as pay or project count decreases, there is a slight increase in churn. The opposite is true for Position and FromDiversityJobFair: as the representative numbers for these values increase, there is a slight increase in churn.
Looking at the LightGBM model, we see that Position, PayRate, and ManagerName are the top three attributes.
Looking at the logreg model, we see that PayRate, Position, and Age are the top three attributes.
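For a fitted logistic regression, attribute rankings like these can be read from the absolute coefficient sizes on scaled features (a tree model exposes `feature_importances_` instead). The data below is a toy stand-in, not the real HR features:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Toy data; a fitted LightGBM model would expose
# model.feature_importances_ for the same kind of ranking.
rng = np.random.default_rng(1)
X = pd.DataFrame({'PayRate': rng.normal(31, 10, 300),
                  'Age': rng.integers(28, 69, 300).astype(float),
                  'Position': rng.integers(0, 30, 300).astype(float)})
y = (X['PayRate'] < 28).astype(int)

clf = LogisticRegression().fit(StandardScaler().fit_transform(X), y)

# Rank attributes by absolute coefficient size on the scaled features.
importance = pd.Series(np.abs(clf.coef_[0]), index=X.columns)
print(importance.sort_values(ascending=False))
```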
Looking at the LightGBM model, the top indicators of people leaving are shown below.
Taking both models together, we see that Position, Pay, Manager, and Recruitment Source are the top attributes associated with churn for this company.
Knowing this, HR can further investigate positions where pay is not sufficient for certain employees. They can also investigate management issues. This can be done through surveys or during general reviews.