The data consists of 14,999 observations of 10 variables. The main goal is to better understand which variables might influence why some employees of this company choose to leave (recorded in the variable left).
A quick overview first:
## 'data.frame': 14999 obs. of 10 variables:
## $ satisfaction_level : num 0.38 0.8 0.11 0.72 0.37 0.41 0.1 0.92 0.89 0.42 ...
## $ last_evaluation : num 0.53 0.86 0.88 0.87 0.52 0.5 0.77 0.85 1 0.53 ...
## $ number_project : int 2 5 7 5 2 2 6 5 5 2 ...
## $ average_montly_hours : int 157 262 272 223 159 153 247 259 224 142 ...
## $ time_spend_company : int 3 6 4 5 3 3 4 5 5 3 ...
## $ Work_accident : int 0 0 0 0 0 0 0 0 0 0 ...
## $ left : int 1 1 1 1 1 1 1 1 1 1 ...
## $ promotion_last_5years: int 0 0 0 0 0 0 0 0 0 0 ...
## $ sales : Factor w/ 10 levels "accounting","hr",..: 8 8 8 8 8 8 8 8 8 8 ...
## $ salary : Factor w/ 3 levels "high","low","medium": 2 3 3 2 2 2 2 2 2 2 ...
## satisfaction_level last_evaluation number_project average_montly_hours
## Min. :0.0900 Min. :0.3600 Min. :2.000 Min. : 96.0
## 1st Qu.:0.4400 1st Qu.:0.5600 1st Qu.:3.000 1st Qu.:156.0
## Median :0.6400 Median :0.7200 Median :4.000 Median :200.0
## Mean :0.6128 Mean :0.7161 Mean :3.803 Mean :201.1
## 3rd Qu.:0.8200 3rd Qu.:0.8700 3rd Qu.:5.000 3rd Qu.:245.0
## Max. :1.0000 Max. :1.0000 Max. :7.000 Max. :310.0
##
## time_spend_company Work_accident left
## Min. : 2.000 Min. :0.0000 Min. :0.0000
## 1st Qu.: 3.000 1st Qu.:0.0000 1st Qu.:0.0000
## Median : 3.000 Median :0.0000 Median :0.0000
## Mean : 3.498 Mean :0.1446 Mean :0.2381
## 3rd Qu.: 4.000 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :10.000 Max. :1.0000 Max. :1.0000
##
## promotion_last_5years sales salary
## Min. :0.00000 sales :4140 high :1237
## 1st Qu.:0.00000 technical :2720 low :7316
## Median :0.00000 support :2229 medium:6446
## Mean :0.02127 IT :1227
## 3rd Qu.:0.00000 product_mng: 902
## Max. :1.00000 marketing : 858
## (Other) :2923
Only two of the 10 variables are of type Factor. The int variables “Work_accident”, “left” and “promotion_last_5years” only take the values 0 and 1. The variable sales is awkwardly named, as its levels are departments rather than sales figures. We are going to rename it to occupation and fix the typo in the word “montly”.
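The renaming can be sketched in base R as follows. This is a minimal illustration, not the original code: the data frame name hr is an assumption, a two-row toy data frame stands in for the real data, and the new column name avg_monthly_hours is taken from its later use in the text.

```r
# Toy stand-in for the real data set (hypothetical values, original column names)
hr <- data.frame(
  average_montly_hours = c(157L, 262L),
  sales = factor(c("sales", "technical"))
)

# Rename `sales` to `occupation` and fix the misspelled hours column
names(hr)[names(hr) == "sales"] <- "occupation"
names(hr)[names(hr) == "average_montly_hours"] <- "avg_monthly_hours"

names(hr)
```

Matching on the old name rather than on a column position keeps the rename safe if the column order ever changes.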
Out of the 14999 employees, 3571 have left, which is equivalent to nearly 24%:
## [1] 0.2380825
Interestingly, “good” employees (defined here as a last evaluation at or above the third quartile, 0.87) leave at an even higher rate, about 32% (1231 of 3845):
## Source: local data frame [4 x 3]
## Groups: left [?]
##
## left `last_evaluation >= 0.87` n
## <int> <lgl> <int>
## 1 0 FALSE 8814
## 2 0 TRUE 2614
## 3 1 FALSE 2340
## 4 1 TRUE 1231
## [1] 0.3447214
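The shares can be recomputed from the four counts in the table. The sketch below uses illustrative variable names that are not taken from the original code. It also separates two quantities that are easy to mix up: the leaving rate among well-rated employees (1231 of 3845) and the share of all leavers who were well rated (1231 of 3571, which is the value printed above).

```r
# Counts copied from the grouped table above
stayed_other <- 8814  # left = 0, last_evaluation <  0.87
stayed_good  <- 2614  # left = 0, last_evaluation >= 0.87
left_other   <- 2340  # left = 1, last_evaluation <  0.87
left_good    <- 1231  # left = 1, last_evaluation >= 0.87

total <- stayed_other + stayed_good + left_other + left_good

# Overall leaving rate
rate_all <- (left_other + left_good) / total

# Leaving rate among well-rated employees and among everyone else
rate_good  <- left_good / (left_good + stayed_good)
rate_other <- left_other / (left_other + stayed_other)

# Share of all leavers who were well rated
share_good_among_leavers <- left_good / (left_good + left_other)

round(c(rate_all = rate_all, rate_good = rate_good,
        rate_other = rate_other,
        share_good_among_leavers = share_good_among_leavers), 4)
```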
The univariate and bivariate sections are merged into one, as I prefer to start the analysis with some general plots built with corrplot and ggpairs. For the correlation plot, we choose all variables that are not of type factor.
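Selecting the non-factor columns and computing their correlation matrix might look like the sketch below. It is not the original code: the data frame hr is simulated (hypothetical values, only three columns), and the corrplot call is only made if that package happens to be installed.

```r
set.seed(1)
n <- 200

# Simulated stand-in data (hypothetical, not the real HR data)
satisfaction_level <- runif(n)
hr <- data.frame(
  satisfaction_level = satisfaction_level,
  left = rbinom(n, 1, prob = 0.6 - 0.5 * satisfaction_level),
  salary = factor(sample(c("low", "medium", "high"), n, replace = TRUE))
)

# Keep only the columns that are not factors
numeric_cols <- !vapply(hr, is.factor, logical(1))
cor_matrix <- cor(hr[, numeric_cols])

round(cor_matrix, 2)

# Optional: draw the matrix if the corrplot package is available
if (requireNamespace("corrplot", quietly = TRUE)) {
  corrplot::corrplot(cor_matrix, method = "circle")
}
```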
The correlation plot confirms the obvious: those who are dissatisfied leave the company. Besides, “work accidents” and “promotion within the last five years” also seem to play a role in leaving the company. There’s also a negative correlation between satisfaction and the number of projects an employee is involved in.
To get a broader overview, we made use of ggpairs. However, some plots are not very informative due to the variable types and their order. We change the order of salary to high, medium, low and convert the int variables Work_accident, left and promotion_last_5years to factor variables, as they can only take the values 0 and 1. Furthermore, as left is our main interest here, the charts are going to be colour coded by this variable.
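The reordering and the type conversion could be done in base R roughly as follows. This is a sketch under assumptions: the data frame name hr and the three toy rows are hypothetical, and only the column names come from the data set.

```r
# Toy stand-in for the real data (hypothetical rows, original column names)
hr <- data.frame(
  Work_accident = c(0L, 1L, 0L),
  left = c(1L, 0L, 0L),
  promotion_last_5years = c(0L, 0L, 1L),
  salary = factor(c("low", "medium", "high"))
)

# Reorder salary levels from the alphabetical default to high, medium, low
hr$salary <- factor(hr$salary, levels = c("high", "medium", "low"))

# Convert the 0/1 integer indicators to factors
binary_cols <- c("Work_accident", "left", "promotion_last_5years")
hr[binary_cols] <- lapply(hr[binary_cols], factor)

str(hr)
```

Setting the levels explicitly matters for the plots, because the level order determines the order of the boxes and facets.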
The result of ggpairs gives us some more insights. The 25th percentile of the satisfaction level drops as the salary decreases (cf. upper right graph). However, the median and 75th percentile remain almost stable.
The share of people who have received a promotion in the last 5 years is marginal (8th row, 8th column).
If we look at the left column (fourth column from the right), we see a much larger IQR regarding “number of projects” and “average monthly hours” for those who have left the company.
Let’s have a closer look at avg_monthly_hours:
The histogram of those who left resembles a bimodal distribution, whereas the histogram of the staying employees is mostly unimodal. Those who left seem to have worked either a lot or not enough. The leaving employees seem to lack a “golden mean”.
The boxplots reveal that the distribution for those who stay is fairly symmetric, i.e. the median sits in the center of the IQR and is very close to the mean. Besides, the whiskers have a similar length. In comparison, those who left have a broader IQR, the upper whisker is much longer, and the mean is clearly above the median, which is relatively close to the 25th percentile.
A quick look at the impact of the job role might also be helpful. Therefore, we’re going to transform the data a bit:
The last two graphs do give some insight into the relevance of the job role. Employees in management leave less often, maybe because they’re more likely to receive a promotion. The poor people in product management don’t get promoted at all.
Although ggpairs has given some insights at a glance, some of the 100 plots shown are overplotted. So, let’s replot some of them:
Here we see some interesting patterns. There’s a “cohort” of employees working 240+ hours who seem to be very unhappy. In the next figure, we see very well evaluated employees (0.75+) with a satisfaction_level similar to that of those working 240+ hours.
It would be interesting to see whether many of these well evaluated, hard-working employees have already left.
This first graph clearly shows that people in two specific cohorts tend to leave. The first cohort has an evaluation between roughly 0.425 and 0.575 and a satisfaction level between 0.35 and 0.45. This specific “purple island” in the middle of the figure raises some questions. Why aren’t employees whose satisfaction level is below 0.35 prone to leave? Are they cognitively biased to feel dissatisfied no matter who their employer is? An analogous question concerns the last evaluation: why do employees with a score below 0.425 actually stay? Are they generally unsure about the quality of their working skills and hence assume they won’t find a new job? Far more concerning is the state of the second leaving cohort, as they have been very well evaluated by the employer (i.e., 0.77+). This group is extremely dissatisfied (satisfaction level of 0.125 or less). Do they leave because they know what they are worth on the job market?
The second figure looks strikingly similar to the one above. It shows that the top rated employees with low satisfaction also work a lot. Half of the cohort works more than 275 hours per month, which is rarely the case for others, except for a few individuals who have mostly left, too (glance through the rightmost fifth of the chart above to spot the purple dots). So long working hours appear to be strongly associated with leaving and, somewhat more weakly, with the employee’s satisfaction level.
The third graph summarizes some very relevant information for our question of why employees leave. Those who leave either have a lot of projects or too few, and their satisfaction level is below roughly 0.45. The “sweet spot” for the number of projects seems to be three. People with three projects rarely leave and tend to be more satisfied with their work (around 0.48+). Three projects might be a good balance because employees are neither underwhelmed nor overwhelmed by the quantity of work. Plus, three different projects seem to be enough to avoid being bored by repetitive work (which might be the case with only 1-2 projects), while the risk of getting bogged down in too many projects is low.
Besides evaluation and “hours/project-load”, promotions, work accidents and the time an employee has spent in the company might also influence leaving.
The fourth graph shows that people who don’t get promoted are more likely to leave than those who do.
The fifth graph does NOT show that people who have had an accident are prone to leave, which I would have expected.
The sixth graph does indeed show that there’s a “sweet spot” when people tend to leave. Many employees leave after 2.5 to 4 years if they’re unhappy. Maybe that’s the time when they’re fed up, and those who get past this “fed-up phase” just accept their fate and stay with the company. Those who are more satisfied are likely to leave after four to six years. A hypothesis could be that after this time they want to foster their own career if they don’t get promoted.
The last graph also highlights some issues. Marketing employees with high salaries work the least, but those with low salaries work a lot. Also, HR employees with high salaries work longer hours.
Analyzing the data via statistical learning or machine learning was not part of this project’s requirements. However, to gain some practice with ML in R, out of curiosity, and to better understand the data, a brief section is dedicated to the topic.
To keep the model interpretable, we begin with a simple decision tree.
## eXtreme Gradient Boosting
##
## 11249 samples
## 9 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 11249, 11249, 11249, 11249, 11249, 11249, ...
## Resampling results across tuning parameters:
##
## eta max_depth colsample_bytree Accuracy Kappa
## 0.001 2 0.5 0.8521392 0.5093679
## 0.001 2 0.7 0.8498621 0.5010322
## 0.001 2 1.0 0.8498718 0.5010558
## 0.001 3 0.5 0.9169548 0.7466815
## 0.001 3 0.7 0.9524161 0.8710397
## 0.001 3 1.0 0.9525509 0.8714344
## 0.001 4 0.5 0.9338731 0.8220070
## 0.001 4 0.7 0.9671161 0.9089446
## 0.001 4 1.0 0.9670683 0.9088210
## 0.100 2 0.5 0.9098433 0.7284714
## 0.100 2 0.7 0.9027194 0.7109425
## 0.100 2 1.0 0.9027098 0.7109185
## 0.100 3 0.5 0.9211083 0.7613005
## 0.100 3 0.7 0.9634414 0.8992131
## 0.100 3 1.0 0.9629869 0.8980719
## 0.100 4 0.5 0.9483820 0.8534257
## 0.100 4 0.7 0.9711795 0.9197931
## 0.100 4 1.0 0.9709303 0.9191371
## 0.400 2 0.5 0.9182825 0.7578001
## 0.400 2 0.7 0.9462145 0.8465668
## 0.400 2 1.0 0.9415363 0.8325713
## 0.400 3 0.5 0.9469512 0.8483589
## 0.400 3 0.7 0.9704855 0.9179001
## 0.400 3 1.0 0.9705541 0.9179600
## 0.400 4 0.5 0.9537221 0.8699171
## 0.400 4 0.7 0.9737855 0.9268565
## 0.400 4 1.0 0.9743056 0.9282956
##
## Tuning parameter 'nrounds' was held constant at a value of 10
##
## Tuning parameter 'min_child_weight' was held constant at a value of
## 2
## Tuning parameter 'subsample' was held constant at a value of 0.75
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were nrounds = 10, max_depth = 4,
## eta = 0.4, gamma = 0, colsample_bytree = 1, min_child_weight = 2
## and subsample = 0.75.
## nrounds max_depth eta gamma colsample_bytree min_child_weight subsample
## 27 10 4 0.4 0 1 2 0.75
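The tuning grid behind the resampling table can be reconstructed from the parameter values it lists. The sketch below only rebuilds the grid in base R. The training call in the trailing comment is a hypothetical illustration that assumes the caret and xgboost packages and a training set named train_set; it is not the original code.

```r
# Tuning grid reconstructed from the values in the resampling output above
xgb_grid <- expand.grid(
  nrounds = 10,
  max_depth = c(2, 3, 4),
  eta = c(0.001, 0.1, 0.4),
  gamma = 0,
  colsample_bytree = c(0.5, 0.7, 1),
  min_child_weight = 2,
  subsample = 0.75
)

# 3 x 3 x 3 = 27 candidate settings, matching the 27 rows of the table
nrow(xgb_grid)

# Hypothetical usage (assumes caret, xgboost and a data frame `train_set`):
# fit <- caret::train(left ~ ., data = train_set,
#                     method = "xgbTree", tuneGrid = xgb_grid)
```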
## y_pred
## FALSE TRUE
## 2901 849
## y_pred
## FALSE TRUE
## 0 2828 29
## 1 73 820
## [1] 0.9728
## [1] train-rmse:0.528996
## [2] train-rmse:0.342812
## [3] train-rmse:0.240810
## [4] train-rmse:0.191193
## [5] train-rmse:0.165264
## [6] train-rmse:0.154979
## [7] train-rmse:0.149908
## [8] train-rmse:0.146582
## [9] train-rmse:0.144555
## [10] train-rmse:0.143234
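The accuracy of 0.9728 follows directly from the confusion matrix above, and the same four counts also show how well the leavers specifically are identified. In this sketch the rows are assumed to be the actual value of left and the columns the prediction; the names conf_mat, recall and precision are illustrative.

```r
# Confusion matrix copied from the output above.
# Assumption: rows = actual class (0 = stayed, 1 = left),
# columns = whether the model predicted "left"
conf_mat <- matrix(
  c(2828,  29,
      73, 820),
  nrow = 2, byrow = TRUE,
  dimnames = list(actual = c("0", "1"), y_pred = c("FALSE", "TRUE"))
)

# Share of all test cases that were classified correctly
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

# Share of actual leavers that the model flagged
recall <- conf_mat["1", "TRUE"] / sum(conf_mat["1", ])

# Share of flagged employees who actually left
precision <- conf_mat["1", "TRUE"] / sum(conf_mat[, "TRUE"])

round(c(accuracy = accuracy, recall = recall, precision = precision), 4)
```

Since only about a quarter of the employees left, looking at recall and precision for the leavers gives a more specific picture than accuracy alone.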
Except for the feature weighting, there are no huge surprises here. Our algorithm puts satisfaction_level first by far, which seems like common sense. However, it’s more intriguing to see that satisfaction_level alone is weighted more heavily than all other features combined (almost 55%). The smaller half is made up of features No. 2-4 alone (side note: the remaining features don’t play a role at all, hence I cut them off).
This first final plot is equivalent to the second plot in the multivariate section. The plot illustrates that the two leaving cohorts at the lower end and mid left have different workloads. Plus, neither cohort is very happy about its situation, which might be why they leave.
The second plot shows the importance of a promotion. Employees with a satisfaction level around 0.75 and above still have a high likelihood of leaving if they don’t receive a promotion.
Also, management has the lowest leaving rate, probably partly due to its high promotion rate.
To me, this project was very exciting as it was also an introduction to the Kaggle site. There, I found this data set via the latest upvotes.
I had some issues with the data set as there was no codebook. What exactly does “projects involved” mean? Does the value refer to all the projects in an employee’s career with the company, the average number of projects per year, or the current status (i.e., the number of simultaneous projects)? The salary variable is also a bit unsatisfactory. When is a salary defined as high or low? In comparison to other employees? If so, to all employees or only to peers (e.g. compared to other accountants)? Or is it a comparison to some industry standard? Also confusing: what exactly is a work accident?
I really enjoyed the functions corrplot and ggpairs as they give you a quick overview for univariate and bivariate analysis. I’m grateful that Udacity introduced us to the GGally library. Plotting 100 visualizations at once speeds up the EDA process tremendously. However, I found it difficult to gain additional insights when moving from the uni- and bivariate analysis to the multivariate analysis.
In future work, salary as a numeric variable would be interesting, as it would give a finer view of how salary correlates with evaluation and working hours.
I added the machine learning section because I had started taking the machine learning class almost simultaneously. It was a good exercise to “convert” concepts from the ML course, which were taught in Python, into R.
Nonetheless, it was a lot of fun and I hope to learn more about predictions soon.