Exploration of Human Resources Data by Amir Rahbaran

Introduction

The data consists of 14,999 observations of 10 variables. The main goal is to better understand which variables might influence why some employees of this company choose to leave (recorded in the variable left).

(Initial) Data Structure

A quick overview first:

## 'data.frame':    14999 obs. of  10 variables:
##  $ satisfaction_level   : num  0.38 0.8 0.11 0.72 0.37 0.41 0.1 0.92 0.89 0.42 ...
##  $ last_evaluation      : num  0.53 0.86 0.88 0.87 0.52 0.5 0.77 0.85 1 0.53 ...
##  $ number_project       : int  2 5 7 5 2 2 6 5 5 2 ...
##  $ average_montly_hours : int  157 262 272 223 159 153 247 259 224 142 ...
##  $ time_spend_company   : int  3 6 4 5 3 3 4 5 5 3 ...
##  $ Work_accident        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ left                 : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ promotion_last_5years: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ sales                : Factor w/ 10 levels "accounting","hr",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ salary               : Factor w/ 3 levels "high","low","medium": 2 3 3 2 2 2 2 2 2 2 ...

##  satisfaction_level last_evaluation  number_project  average_montly_hours
##  Min.   :0.0900     Min.   :0.3600   Min.   :2.000   Min.   : 96.0       
##  1st Qu.:0.4400     1st Qu.:0.5600   1st Qu.:3.000   1st Qu.:156.0       
##  Median :0.6400     Median :0.7200   Median :4.000   Median :200.0       
##  Mean   :0.6128     Mean   :0.7161   Mean   :3.803   Mean   :201.1       
##  3rd Qu.:0.8200     3rd Qu.:0.8700   3rd Qu.:5.000   3rd Qu.:245.0       
##  Max.   :1.0000     Max.   :1.0000   Max.   :7.000   Max.   :310.0       
##                                                                          
##  time_spend_company Work_accident         left       
##  Min.   : 2.000     Min.   :0.0000   Min.   :0.0000  
##  1st Qu.: 3.000     1st Qu.:0.0000   1st Qu.:0.0000  
##  Median : 3.000     Median :0.0000   Median :0.0000  
##  Mean   : 3.498     Mean   :0.1446   Mean   :0.2381  
##  3rd Qu.: 4.000     3rd Qu.:0.0000   3rd Qu.:0.0000  
##  Max.   :10.000     Max.   :1.0000   Max.   :1.0000  
##                                                      
##  promotion_last_5years         sales         salary    
##  Min.   :0.00000       sales      :4140   high  :1237  
##  1st Qu.:0.00000       technical  :2720   low   :7316  
##  Median :0.00000       support    :2229   medium:6446  
##  Mean   :0.02127       IT         :1227                
##  3rd Qu.:0.00000       product_mng: 902                
##  Max.   :1.00000       marketing  : 858                
##                        (Other)    :2923

Only two of the 10 variables are of type Factor. The int variables “Work_accident”, “left” and “promotion_last_5years” only take the values 0 and 1. The variable sales is a bit awkwardly named, since its levels (accounting, hr, technical, ...) are departments rather than sales figures. We are going to rename it to occupation and fix the typo in the word “montly”.
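The renaming described above can be sketched in base R. The data frame `hr` below is a hypothetical two-row stand-in, and the new name `average_monthly_hours` is an assumption about how the typo was fixed; only the column names matter here.

```r
# Hypothetical two-row stand-in for the HR data frame (values copied from str() above).
hr <- data.frame(
  average_montly_hours = c(157L, 262L),
  sales = factor(c("sales", "technical"))
)

# Rename by name rather than by position, so the code survives column reordering.
names(hr)[names(hr) == "sales"] <- "occupation"
names(hr)[names(hr) == "average_montly_hours"] <- "average_monthly_hours"  # assumed new name

names(hr)
```

Matching on the old name keeps the step harmless if it is run twice: the second run simply finds nothing to rename.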

Out of the 14,999 employees, 3,571 have left, which is equivalent to nearly 24%:

## [1] 0.2380825
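Because left is coded as 0/1, its mean is the leaving rate. A minimal sketch that rebuilds the vector from the two counts quoted in the text (the name `leaving_rate` is illustrative, not taken from the original code):

```r
# Rebuild the 0/1 'left' vector from the counts reported in the text:
# 14999 employees in total, 3571 of whom left.
left <- rep(c(0L, 1L), times = c(14999L - 3571L, 3571L))

# The mean of a 0/1 indicator is the proportion of ones.
leaving_rate <- mean(left)
round(leaving_rate, 7)
```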

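Interestingly, “good” employees (defined here as those with a last_evaluation at or above the third quartile, 0.87) have an even higher leaving rate of about 32% (1,231 of 3,845). The single value printed after the table, 0.345, is the share of all leavers who belong to this group (1,231 of 3,571):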

## Source: local data frame [4 x 3]
## Groups: left [?]
## 
##    left `last_evaluation >= 0.87`     n
##   <int>                     <lgl> <int>
## 1     0                     FALSE  8814
## 2     0                      TRUE  2614
## 3     1                     FALSE  2340
## 4     1                      TRUE  1231

## [1] 0.3447214
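Two different proportions can be read off the count table, and they are easy to mix up. The sketch below recomputes both from the four counts shown above; the column name `top_rated` and the result names are illustrative assumptions.

```r
# Counts copied from the grouped table above.
counts <- data.frame(
  left = c(0L, 0L, 1L, 1L),
  top_rated = c(FALSE, TRUE, FALSE, TRUE),  # last_evaluation >= 0.87
  n = c(8814L, 2614L, 2340L, 1231L)
)

top_and_left <- counts$n[counts$left == 1 & counts$top_rated]

# Share of top-rated employees who left: P(left | top rated).
leave_rate_top <- top_and_left / sum(counts$n[counts$top_rated])

# Share of leavers who were top rated: P(top rated | left).
share_top_among_leavers <- top_and_left / sum(counts$n[counts$left == 1])

round(c(leave_rate_top, share_top_among_leavers), 4)
```

The first ratio is the one to compare with the overall leaving rate of about 24%.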

Univariate and Bivariate Plots Section

The univariate and bivariate sections are combined into one, as I prefer to start the analysis with some general plots built with corrplot and ggpairs. For the correlation plot, we choose all variables that are not of type factor.
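Selecting the non-factor columns and computing their correlation matrix might look like the sketch below. The data is randomly generated as a stand-in for the real file, so the correlations themselves are meaningless; the commented-out `corrplot` call requires the corrplot package and shows only the assumed shape of the plotting step.

```r
set.seed(1)

# Randomly generated stand-in data (NOT the real HR data).
hr <- data.frame(
  satisfaction_level = runif(50),
  number_project = sample(2:7, 50, replace = TRUE),
  left = sample(0:1, 50, replace = TRUE),
  salary = factor(sample(c("low", "medium", "high"), 50, replace = TRUE))
)

# Keep only numeric (num/int) columns; factors are dropped.
num_cols <- vapply(hr, is.numeric, logical(1))
corr_mat <- cor(hr[, num_cols])
round(corr_mat, 2)

# corrplot::corrplot(corr_mat)  # plotting step, needs the corrplot package
```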

Univariate and Bivariate Analysis

The correlation plot confirms the obvious: those who are dissatisfied tend to leave the company. Besides, “work accidents” and “promotion within the last five years” also seem to play a role in leaving the company. There’s also a negative correlation between satisfaction and the number of projects an employee is involved in.

To get a broader overview, we use ggpairs. However, some plots are not very informative due to the variable types and order. We change the order of salary to high, medium, low and convert the int variables Work_accident, left and promotion_last_5years to factor variables, as they can only take the values 0 and 1. Furthermore, as left is our main interest here, the charts are colour coded by this variable.
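A minimal base R sketch of these two conversions, using a hypothetical four-row data frame in place of the real data. The commented-out `ggpairs` call requires the GGally and ggplot2 packages and is only an assumption about how the colour coding was set up.

```r
# Hypothetical four-row stand-in for the HR data frame.
hr <- data.frame(
  salary = factor(c("low", "medium", "high", "low")),
  Work_accident = c(0L, 1L, 0L, 0L),
  left = c(1L, 0L, 0L, 1L),
  promotion_last_5years = c(0L, 0L, 1L, 0L)
)

# The default level order is alphabetical (high, low, medium); set it explicitly.
hr$salary <- factor(hr$salary, levels = c("high", "medium", "low"))

# Convert the 0/1 integer columns to factors in one step.
binary_cols <- c("Work_accident", "left", "promotion_last_5years")
hr[binary_cols] <- lapply(hr[binary_cols], factor)

str(hr)

# GGally::ggpairs(hr, mapping = ggplot2::aes(colour = left))  # assumed plotting call
```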

The result of ggpairs gives us some more insights. The 25th percentile of the satisfaction level drops as the salary decreases (cf. upper right graph). However, the median and 75th percentile are almost stable.

The share of people who have received a promotion in the last 5 years is marginal (8th row, 8th column).

If we look at the left column (fourth column from the right), we see a much larger IQR regarding “number of projects” and “average monthly hours” for those who have left the company.

Let’s have a deeper look at avg_monthly_hours:

The histogram of those who left resembles a bimodal distribution, whereas the histogram of the staying employees is mostly unimodal. Those who left seem to have worked either a lot or not enough. The leaving employees seem to lack a “golden mean”.

The boxplots reveal that the distribution for those who stay is fairly symmetric, i.e. the median sits in the center of the IQR and is very close to the mean. Besides, the whiskers have a similar length. In comparison, those who left have a broader IQR, the upper whisker is much longer, and the mean is clearly above the median, which is relatively close to the 25th percentile.
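The boxplot quantities discussed here (mean, median, IQR per group) can also be tabulated directly. The sketch below uses simulated hours, one unimodal group and one bimodal group, purely to illustrate the computation; none of the numbers come from the real data, and the column name `average_monthly_hours` is assumed.

```r
set.seed(42)

# Simulated hours (NOT the real data): stayers unimodal, leavers bimodal.
hours_stayed <- round(rnorm(300, mean = 200, sd = 30))
hours_left <- round(c(rnorm(150, mean = 145, sd = 10),
                      rnorm(150, mean = 260, sd = 20)))

hr <- data.frame(
  left = factor(rep(c(0, 1), times = c(300, 300))),
  average_monthly_hours = c(hours_stayed, hours_left)
)

# One row per group; the summary statistics end up in a matrix column.
group_stats <- aggregate(
  average_monthly_hours ~ left,
  data = hr,
  FUN = function(x) c(mean = mean(x), median = median(x), IQR = IQR(x))
)
group_stats
```

A bimodal group pushes its quartiles towards the two modes, which is why its IQR comes out much wider than that of a unimodal group with a similar center.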

A quick look considering the impact of job role might also be helpful. Therefore, we’re going to transform the data a bit:

The last two graphs do give some insights into the relevance of the job role. Employees in management leave less often, maybe because they’re more likely to receive a promotion? The poor people in product management don’t get any promotions at all.
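The per-role comparison boils down to a leaving rate and a promotion rate per occupation. A sketch of that aggregation, with eight made-up rows that do not reflect the real rates (the column names `leaving_rate` and `promotion_rate` are illustrative):

```r
# Made-up rows for two occupations (NOT the real data).
hr <- data.frame(
  occupation = rep(c("management", "product_mng"), each = 4),
  left = c(0, 0, 0, 1, 1, 1, 0, 0),
  promotion_last_5years = c(1, 0, 1, 0, 0, 0, 0, 0)
)

# The mean of a 0/1 column per group is the rate for that group.
rates <- aggregate(cbind(left, promotion_last_5years) ~ occupation,
                   data = hr, FUN = mean)
names(rates) <- c("occupation", "leaving_rate", "promotion_rate")
rates
```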

Although ggpairs has given some insights at a glance, some of the 100 plots shown are overplotted. So, let’s replot some of them:

Here we see some interesting patterns. There’s a “cohort” with employees working 240+ hours who seem to be very unhappy. In the next figure, we see very well evaluated employees (0.75+), with a similar satisfaction_level as those working 240+ hours.

Multivariate Plots and Analysis Section

It would be interesting to see whether many of these well-evaluated, hard-working employees have already left.

This first graph clearly shows that people in two specific cohorts tend to leave. The first cohort has an evaluation between roughly 0.425 and 0.575 and a satisfaction level between 0.35 and 0.45. This specific “purple island” in the middle of the figure raises some questions. Why aren’t employees with a satisfaction level below 0.35 prone to leave? Are they cognitively biased to feel dissatisfied no matter who their employer is? An analogous question concerns the last evaluation: why do employees with a score below 0.425 actually stay? Are they generally unsure about the quality of their skills and hence assume they won’t find a new job? Far more concerning is the second leaving cohort, as these employees have been evaluated very well by the employer (0.77 and above). This group is extremely dissatisfied (satisfaction level of 0.125 or less). Do they leave because they know their worth on the job market?

The second figure looks strikingly similar to the one above. It shows that the top-rated employees with low satisfaction also work a lot. Half of the cohort works more than 275 hours per month, which is rarely the case for others, except for a few individuals who have mostly left, too (look at the rightmost fifth of the chart to spot the purple dots). So long working hours appear to be strongly associated with leaving and, somewhat less strongly, with the employee’s satisfaction level.

The third graph summarizes some very relevant information for our question of why employees leave. Employees who leave either have a lot of projects or too few, and their satisfaction level is below roughly 0.45. The “sweet spot” seems to be three projects. People with three projects rarely leave and tend to be more satisfied with their work (around 0.48 and above). Three projects might be a good balance because employees are neither underwhelmed nor overwhelmed by the quantity of work. Three different projects also seem to offer enough variety to avoid boredom from repetitive work (which might be the case with only one or two projects), while the risk of getting bogged down is lower than with a larger number of projects.

Besides evaluation and “hours/project-load”, maybe promotions, work accidents and how much time an employee spent in the company could have an influence on leaving.

The fourth graph shows that people who don’t get promoted are more likely to leave than those who do.

The fifth graph does NOT show that people who have had an accident are prone to leave, which I would have expected.

The sixth graph indeed shows a “sweet spot” for when people tend to leave. Many employees leave after 2.5 to 4 years if they’re unhappy. Maybe that’s the time when they’re fed up, and those who get past this “fed-up phase” just accept their fate and stay with the company. Those who are more satisfied are likely to leave after four to six years. A hypothesis could be that after this time they feel it’s time to foster their own career if they don’t get promoted.

The last graph also highlights some issues. Marketing employees with high salaries work the least, but those with low salaries work a lot. Also, HR people with high salaries work more than hours.

Bonus: Machine Learning

Analyzing the data via statistical learning or machine learning was not part of this project’s requirements. However, to gain some practice with ML in R, out of curiosity, and for a better understanding of the data, a brief section is dedicated to the topic.

To keep the results interpretable, we begin with a simple decision tree. The tuning results printed below come from a gradient-boosted tree model (eXtreme Gradient Boosting).

## eXtreme Gradient Boosting 
## 
## 11249 samples
##     9 predictor
##     2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 11249, 11249, 11249, 11249, 11249, 11249, ... 
## Resampling results across tuning parameters:
## 
##   eta    max_depth  colsample_bytree  Accuracy   Kappa    
##   0.001  2          0.5               0.8521392  0.5093679
##   0.001  2          0.7               0.8498621  0.5010322
##   0.001  2          1.0               0.8498718  0.5010558
##   0.001  3          0.5               0.9169548  0.7466815
##   0.001  3          0.7               0.9524161  0.8710397
##   0.001  3          1.0               0.9525509  0.8714344
##   0.001  4          0.5               0.9338731  0.8220070
##   0.001  4          0.7               0.9671161  0.9089446
##   0.001  4          1.0               0.9670683  0.9088210
##   0.100  2          0.5               0.9098433  0.7284714
##   0.100  2          0.7               0.9027194  0.7109425
##   0.100  2          1.0               0.9027098  0.7109185
##   0.100  3          0.5               0.9211083  0.7613005
##   0.100  3          0.7               0.9634414  0.8992131
##   0.100  3          1.0               0.9629869  0.8980719
##   0.100  4          0.5               0.9483820  0.8534257
##   0.100  4          0.7               0.9711795  0.9197931
##   0.100  4          1.0               0.9709303  0.9191371
##   0.400  2          0.5               0.9182825  0.7578001
##   0.400  2          0.7               0.9462145  0.8465668
##   0.400  2          1.0               0.9415363  0.8325713
##   0.400  3          0.5               0.9469512  0.8483589
##   0.400  3          0.7               0.9704855  0.9179001
##   0.400  3          1.0               0.9705541  0.9179600
##   0.400  4          0.5               0.9537221  0.8699171
##   0.400  4          0.7               0.9737855  0.9268565
##   0.400  4          1.0               0.9743056  0.9282956
## 
## Tuning parameter 'nrounds' was held constant at a value of 10
## 
## Tuning parameter 'min_child_weight' was held constant at a value of
##  2
## Tuning parameter 'subsample' was held constant at a value of 0.75
## Accuracy was used to select the optimal model using  the largest value.
## The final values used for the model were nrounds = 10, max_depth = 4,
##  eta = 0.4, gamma = 0, colsample_bytree = 1, min_child_weight = 2
##  and subsample = 0.75.

##    nrounds max_depth eta gamma colsample_bytree min_child_weight subsample
## 27      10         4 0.4     0                1                2      0.75
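The 27 rows of tuning results correspond to a 3 x 3 x 3 grid over eta, max_depth and colsample_bytree, with the other parameters held constant. Such a grid can be built with `expand.grid`; the values below are read off the printed output, while the commented-out `caret::train` call and the name `train_set` are assumptions about how the model was fitted (it needs the caret and xgboost packages).

```r
# Tuning grid reconstructed from the printed results.
tune_grid <- expand.grid(
  nrounds = 10,
  max_depth = 2:4,
  eta = c(0.001, 0.1, 0.4),
  gamma = 0,
  colsample_bytree = c(0.5, 0.7, 1),
  min_child_weight = 2,
  subsample = 0.75
)

nrow(tune_grid)  # one row per parameter combination

# Assumed shape of the fitting call (train_set is a hypothetical name):
# fit <- caret::train(left ~ ., data = train_set,
#                     method = "xgbTree", tuneGrid = tune_grid)
```

With nrounds fixed at 10, a small eta such as 0.001 barely moves the model away from its starting point, which may explain why the lowest learning rate performs worst at shallow depths.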

## y_pred
## FALSE  TRUE 
##  2901   849

##    y_pred
##     FALSE TRUE
##   0  2828   29
##   1    73  820

## [1] 0.9728
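The accuracy of 0.9728 follows directly from the confusion matrix: correct predictions on the diagonal divided by all 3,750 test cases. The sketch below recomputes it from the four printed cells and, as an addition not present in the original output, also derives recall and precision for the leavers.

```r
# Confusion matrix copied from the output above (rows: actual, columns: predicted).
cm <- matrix(
  c(2828, 73, 29, 820),  # filled column by column
  nrow = 2,
  dimnames = list(actual = c("0", "1"), predicted = c("FALSE", "TRUE"))
)

accuracy <- sum(diag(cm)) / sum(cm)

# Additional metrics for the "left" class (not part of the original output).
recall_left <- cm["1", "TRUE"] / sum(cm["1", ])        # leavers that were caught
precision_left <- cm["1", "TRUE"] / sum(cm[, "TRUE"])  # predicted leavers that really left

round(c(accuracy = accuracy, recall = recall_left, precision = precision_left), 4)
```

Since only about 24% of employees left, recall and precision for that class are a useful complement to accuracy, which is dominated by the larger class of stayers.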

## [1]  train-rmse:0.528996 
## [2]  train-rmse:0.342812 
## [3]  train-rmse:0.240810 
## [4]  train-rmse:0.191193 
## [5]  train-rmse:0.165264 
## [6]  train-rmse:0.154979 
## [7]  train-rmse:0.149908 
## [8]  train-rmse:0.146582 
## [9]  train-rmse:0.144555 
## [10] train-rmse:0.143234

Except for the feature weighting, there are no huge surprises here. Our algorithm puts satisfaction_level first by far, which seems like common sense. However, it’s more intriguing that satisfaction_level alone is weighted higher than all other features combined (almost 55%). The smaller half is made up of features No. 2-4 alone (side note: the remaining features don’t play a role at all, hence I cut them off).

Final Plots and Summary

Plot One

Description One

This first final plot is equivalent to the second plot in the multivariate section. It illustrates that the two leaving cohorts at the lower end and in the middle left have different workloads. Plus, neither cohort is very happy with its situation, which might be why they leave.

Plot Two

Description Two

The second plot shows the importance of a promotion. Even employees with a satisfaction level of around 0.75 and above still have a high likelihood of leaving if they don’t receive a promotion.

Plot Three

Description Three

Also, management has the lowest leaving rate, probably in part due to its high promotion rate.

Reflection

To me, this project was very exciting, as it was also my introduction to the Kaggle site. There, I found this data set via the latest upvotes.

I had some issues with the data set, as there was no codebook. What exactly does the number of projects mean? Does the value refer to all the projects in an employee’s career with the company, the average number of projects per year, or the current status (i.e., simultaneous projects)? The salary variable is also a bit unsatisfactory. When is a salary defined as high or low? In comparison to other employees? If so, to all employees or only to peers (e.g. other accountants)? Or is it a comparison to some industry standard? Also confusing: what exactly is a work accident?

I really enjoyed the functions corrplot and ggpairs, as they give you a quick overview for univariate and bivariate analysis. I’m grateful that Udacity introduced us to the GGally library. Plotting 100 visualizations at once tremendously speeds up the EDA process. However, I found it difficult to carry insights from the uni- and bivariate analysis over to the multivariate analysis.

In future work, salary as a numeric variable would be interesting, as it would give a finer view of how salary might correlate with evaluation and working hours.

I added the machine learning section because I had started taking the machine learning class nearly simultaneously. It was a good exercise to “convert” concepts from the ML course, which were taught in Python, into R.

Nonetheless, it was a lot of fun, and I hope to learn more about predictions soon.

References

Udacity EDA Course
Udacity Review
Data and Acquaintance to Corrplot Library via Kaggle: https://www.kaggle.com/ludobenistant/hr-analytics
Stack Overflow
R Bloggers
Udemy Machine Learning: https://www.udemy.com/machinelearning/learn/v4