Exploratory Data Analysis of IBM HR Attrition Dataset
Attrition: Attrition occurs when an employee leaves the company through resignation or retirement. Employees leave for personal and professional reasons such as retirement, lower growth potential, lower work satisfaction, lower pay, or a bad work environment. Attrition is part and parcel of any business; it becomes a cause for concern when it crosses a limit.
The attrition rate, also known as churn rate, can be defined as the rate at which employees leave an organization from a specific group over a particular period of time. [Source]
The dataset for the analysis is taken from Kaggle. To get insights about what factors contribute to employee attrition, we use Python and libraries like pandas, matplotlib, and seaborn. In this blog, we mostly talk about absolute and percentage values.
Importing Libraries
# importing necessary libraries
%matplotlib inline
from matplotlib import pyplot as plt
import seaborn as sns
import pandas as pd
Loading Data and displaying top 5 rows
data = pd.read_csv("WA_Fn-UseC_-HR-Employee-Attrition.csv")
data.head()
Output:
Moving “Attrition” column to last
data["attrition"] = data["Attrition"]
data.drop(columns = ["Attrition"], inplace = True)
data.head()
Output:
Datatype of every variable
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 35 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Age 1470 non-null int64
1 BusinessTravel 1470 non-null object
2 DailyRate 1470 non-null int64
3 Department 1470 non-null object
4 DistanceFromHome 1470 non-null int64
5 Education 1470 non-null int64
6 EducationField 1470 non-null object
7 EmployeeCount 1470 non-null int64
8 EmployeeNumber 1470 non-null int64
9 EnvironmentSatisfaction 1470 non-null int64
10 Gender 1470 non-null object
11 HourlyRate 1470 non-null int64
12 JobInvolvement 1470 non-null int64
13 JobLevel 1470 non-null int64
14 JobRole 1470 non-null object
15 JobSatisfaction 1470 non-null int64
16 MaritalStatus 1470 non-null object
17 MonthlyIncome 1470 non-null int64
18 MonthlyRate 1470 non-null int64
19 NumCompaniesWorked 1470 non-null int64
20 Over18 1470 non-null object
21 OverTime 1470 non-null object
22 PercentSalaryHike 1470 non-null int64
23 PerformanceRating 1470 non-null int64
24 RelationshipSatisfaction 1470 non-null int64
25 StandardHours 1470 non-null int64
26 StockOptionLevel 1470 non-null int64
27 TotalWorkingYears 1470 non-null int64
28 TrainingTimesLastYear 1470 non-null int64
29 WorkLifeBalance 1470 non-null int64
30 YearsAtCompany 1470 non-null int64
31 YearsInCurrentRole 1470 non-null int64
32 YearsSinceLastPromotion 1470 non-null int64
33 YearsWithCurrManager 1470 non-null int64
34 attrition 1470 non-null object
dtypes: int64(26), object(9)
memory usage: 402.1+ KB
Observation: There are 26 integer variables (some of them are categorical variables encoded as integers) and 9 variables of string (object) datatype.
Attrition percentage
# Attrition percentage
print("{:.2f}% of the employees resigned/retired".format(data["attrition"].value_counts()["Yes"] / len(data["attrition"]) * 100))
sns.displot(data = data, x = "attrition")
Output:
Observation: The attrition percentage in our dataset is 16.12%; this figure varies from company to company.
Gender
groupby_Gender = data.groupby("Gender")["attrition"]
print(groupby_Gender.value_counts())
sns.displot(data = data, x = "Gender", hue = "attrition")
Output:
Of the total employees resigned/retired, 36.71% are Female
Of the total employees resigned/retired, 63.29% are Male
Of the total Male employees, 17.01% resigned/retired
Of the total Female employees, 14.80% resigned/retired
Observation: In both absolute and percentage terms, Male attrition is higher than Female attrition.
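The two kinds of percentages quoted above (share of all leavers by gender, and attrition rate within each gender) can be computed with pd.crosstab. The snippet below is a minimal sketch on a small made-up frame, not the code used for the figures above; the frame name toy and its values are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for the HR data; the values are made up for illustration.
toy = pd.DataFrame({
    "Gender": ["Male"] * 4 + ["Female"] * 6,
    "attrition": ["Yes", "Yes", "No", "No", "Yes", "No", "No", "No", "No", "No"],
})

# Absolute counts per (Gender, attrition) pair.
counts = pd.crosstab(toy["Gender"], toy["attrition"])

# Share of each gender among leavers and stayers: every column sums to 100.
share_of_leavers = pd.crosstab(toy["Gender"], toy["attrition"], normalize="columns") * 100

# Attrition rate within each gender: every row sums to 100.
rate_within_gender = pd.crosstab(toy["Gender"], toy["attrition"], normalize="index") * 100

print(counts)
print(share_of_leavers.round(2))
print(rate_within_gender.round(2))
```

Reading the "Yes" column of share_of_leavers answers "of the employees who left, what percentage are Male?", while the "Yes" column of rate_within_gender answers "of the Male employees, what percentage left?".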
Business Travel
groupby_BusinessTravel = data.groupby("attrition")["BusinessTravel"]
print(groupby_BusinessTravel.value_counts())
sns.displot(data = data, x = "BusinessTravel", hue = "attrition")
Output:
Observation: In absolute terms, employees who travel rarely account for the most resignations/retirements (156 employees); in percentage terms, employees who travel frequently have the highest attrition (around 25%).
Age
Minimum and Maximum age of the workforce is 18 and 60 years respectively.
plt.figure()
sns.histplot(data = data, x = "Age", bins = 7)
Output:
Observation: From the above histogram, we can see that most of the employees (82%) are between 24 and 48 years old.
plt.figure()
sns.histplot(data = data, x = "Age", bins = 3, hue = "attrition")
plt.figure()
sns.histplot(data = data, x = "Age", bins = 7, hue = "attrition")
Output:
Observation: From the histplot (where bins = 3), we can infer that among employees younger than 32, more than 100 (around 33%) resigned. If this percentage is greater than the industry average, it should be a cause for concern, and top-level management should try to bring it down by retaining the workforce.
From the histplot (where bins = 7), we can infer that around 45% of the employees whose age is less than 24 resign. This number should also be a cause of worry for top-level management.
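Percentages per age bin, like the ones read off the histograms above, can also be computed directly instead of estimated visually. The following is a minimal sketch using pd.cut on a made-up frame; the frame toy, the helper columns age_bin and left, and all values are assumptions for illustration. Note that pd.cut builds right-closed intervals, so employees exactly on a bin edge may land in a different bin than in histplot.

```python
import pandas as pd

# Toy stand-in for the HR data; the values are made up for illustration.
toy = pd.DataFrame({
    "Age": [18, 22, 23, 29, 35, 41, 47, 52, 58, 60],
    "attrition": ["Yes", "Yes", "No", "No", "Yes", "No", "No", "No", "No", "No"],
})

# Three equal-width bins between the minimum and maximum age,
# similar in spirit to histplot(..., bins = 3).
toy["age_bin"] = pd.cut(toy["Age"], bins=3)

# Head count, number of leavers, and attrition rate (%) per age bin.
summary = (
    toy.assign(left=toy["attrition"].eq("Yes"))
       .groupby("age_bin", observed=False)["left"]
       .agg(employees="size", leavers="sum", rate="mean")
)
summary["rate"] = summary["rate"] * 100

print(summary)
```

Changing bins=3 to bins=7 gives the finer breakdown used in the second histogram.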
Education
"""
Education
1 'Below College'
2 'College'
3 'Bachelor'
4 'Master'
5 'Doctor'
"""
sns.displot(data = data, x = "Education", hue = "attrition")
Output:
Observation: Around 100 employees with Education = 3 (Bachelor) resigned/retired, which is around 42% of all the employees who resigned/retired.
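Because Education is stored as integer codes, plots and tables show 1 to 5 rather than the labels listed above. One way to make them readable is to map the codes to an ordered categorical. This is a minimal sketch on made-up values; the frame toy and the column name EducationLabel are hypothetical, while the code-to-label mapping is the one given above.

```python
import pandas as pd

# Code-to-label mapping taken from the Education legend above.
education_labels = {1: "Below College", 2: "College", 3: "Bachelor", 4: "Master", 5: "Doctor"}

# Toy stand-in for the HR data; the values are made up for illustration.
toy = pd.DataFrame({"Education": [3, 1, 4, 3, 5, 2]})

# An ordered categorical keeps the labels in their natural order
# (Below College < College < ... < Doctor) in plots and tables.
toy["EducationLabel"] = pd.Categorical(
    toy["Education"].map(education_labels),
    categories=list(education_labels.values()),
    ordered=True,
)

print(toy["EducationLabel"].value_counts(sort=False))
```

The same pattern would apply to the other integer-coded columns such as EnvironmentSatisfaction, JobInvolvement, JobSatisfaction, and WorkLifeBalance.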
sns.displot(data = data, x = "Education", hue = "attrition", col = "Gender")
Output:
When Education is combined with Gender, we get the above distribution plot.
Daily Rate
"""
minimum DailyRate of employee is 102
maximum DailyRate of employee is 1499
"""
plt.figure()
sns.histplot(data = data, x = "DailyRate", hue = "attrition", bins = 7)
Output:
Observation: With histogram bins = 7, the distribution looks almost uniform. By visual inspection of the above histogram, we can say that employees with a Daily rate in between 300 and 500 resign/retire more when compared to other Daily rates. Around 22% of the employees whose Daily rate is in between 300 and 500 resign/retire.
Department
sns.displot(data = data, x = "Department", hue = "attrition", height = 6.5)
Output:
Observation: In absolute terms, R&D has the most employees resigning/retiring (133), but percentage-wise attrition is higher in Sales (20.6%) and HR (19%).
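The absolute-versus-percentage comparison made here recurs for most categorical columns, so it can be wrapped in a small helper. The function name attrition_summary, the frame toy, and its values are all assumptions for illustration; this is a sketch, not the code behind the numbers above.

```python
import pandas as pd

def attrition_summary(df, column, target="attrition"):
    """Return head count, leavers, and attrition rate (%) per category of `column`."""
    left = df[target].eq("Yes")
    out = left.groupby(df[column]).agg(employees="size", leavers="sum", rate_pct="mean")
    out["rate_pct"] = (out["rate_pct"] * 100).round(1)
    # Highest attrition rate first.
    return out.sort_values("rate_pct", ascending=False)

# Toy stand-in for the HR data; the values are made up for illustration.
toy = pd.DataFrame({
    "Department": ["Research & Development"] * 8 + ["Sales"] * 4 + ["Human Resources"] * 3,
    "attrition": ["Yes"] * 3 + ["No"] * 5 + ["Yes"] * 2 + ["No"] * 2 + ["Yes"] * 2 + ["No"],
})

summary = attrition_summary(toy, "Department")
print(summary)
```

In this toy frame, as in the observation above, the largest department has the most leavers in absolute terms but not the highest rate. On the real data, the call would be attrition_summary(data, "Department"), and the same helper works for columns such as EducationField or JobRole.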
Education Field
sns.displot(data = data, x = "EducationField", hue = "attrition", height = 10)
Output:
Observation: When absolute numbers are compared, employees with a Life Sciences education field resign/retire most. But when percentages are compared, employees with an HR education field resign/retire most (around 26%).
Environment Satisfaction
"""
EnvironmentSatisfaction
1 'Low'
2 'Medium'
3 'High'
4 'Very High'
"""
sns.displot(data = data, x = "EnvironmentSatisfaction", hue = "attrition")
Output:
Observation: The above distribution plot clearly shows that employees with ‘Low’ Environment Satisfaction resign/retire more (both in absolute terms and percentage-wise).
Job Involvement
"""
JobInvolvement
1 'Low'
2 'Medium'
3 'High'
4 'Very High'
"""
sns.displot(data = data, x = "JobInvolvement", hue = "attrition")
Output:
Observation: There are fewer employees with Job Involvement = ‘Low’. But around 33% of employees with ‘Low’ Job Involvement resign/retire.
Job Satisfaction
"""
JobSatisfaction
1 'Low'
2 'Medium'
3 'High'
4 'Very High'
"""
sns.displot(data = data, x = "JobSatisfaction", hue = "attrition")
Output:
Observation: Of all the employees who resigned/retired, those with ‘High’ Job Satisfaction form the largest group, but when percentages are compared, attrition is highest among employees with ‘Low’ Job Satisfaction.
Work Life Balance
"""
WorkLifeBalance
1 'Bad'
2 'Good'
3 'Better'
4 'Best'
"""
sns.displot(data = data, x = "WorkLifeBalance", hue = "attrition")
Output:
Observation: Almost 1/3rd of the employees with ‘Bad’ work-life balance resign/retire.
Job Role
sns.displot(data = data, x = "JobRole", hue = "attrition", height = 20)
Output:
Observation: Around 40% of the employees working in the Sales Representative role resigned/retired, and of all the employees who resigned/retired, most belong to the Laboratory Technician job role.
Marital Status
sns.displot(data = data, x = "MaritalStatus", hue = "attrition")
Output:
Observation: The above distribution plot suggests that married employees tend to settle in their current company rather than shift companies. One possible explanation is that, before getting married, employees tend to be growth-oriented and shift companies for better pay, better opportunities, etc.
Over Time
sns.displot(data = data, x = "OverTime", hue = "attrition")
Output:
Observation: The distribution plot conveys the message clearly: almost 33% of employees working overtime resign/retire.
Percent Salary Hike
sns.displot(data = data, x = "PercentSalaryHike", hue = "attrition")
Output:
sns.displot(data = data, x = "PercentSalaryHike", hue = "attrition", col = "OverTime")
Output:
Observation: From the above distribution plot, we can infer that even after working overtime, if employees do get a good salary hike, they resign. But the thing to be noted is working overtime doesn’t mean working productively.
Performance Rating
sns.displot(data = data, x = "PerformanceRating", hue = "attrition", col = "OverTime")
Output:
Observation: The observation for Salary Hike holds true for the above distribution plot too.
Stock Option Level
sns.displot(data = data, x = "StockOptionLevel", hue = "attrition", col = "OverTime")
Output:
Observation: The interesting thing to observe from the above plot is that more than 1/3rd of the overtime working employees with stock option level = 3 resign/retire.
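When two factors are combined, as with OverTime and StockOptionLevel here, the attrition rate for each combination can be tabulated with pivot_table rather than read off the faceted plot. This is a minimal sketch on made-up values; the frame toy and the helper column left are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for the HR data; the values are made up for illustration.
toy = pd.DataFrame({
    "OverTime": ["Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No"],
    "StockOptionLevel": [0, 0, 3, 3, 0, 0, 3, 3],
    "attrition": ["Yes", "No", "Yes", "Yes", "No", "No", "Yes", "No"],
})

# The mean of a 0/1 indicator is the attrition rate; multiply by 100 for percent.
rate = (
    toy.assign(left=toy["attrition"].eq("Yes").astype(float))
       .pivot_table(index="StockOptionLevel", columns="OverTime", values="left", aggfunc="mean")
    * 100
)

print(rate)
```

Each cell is the percentage of employees in that OverTime and StockOptionLevel combination who left, which is the kind of figure quoted in the observation above.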
Correlation
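# numeric_only = True (pandas >= 1.5) skips the string columns, which newer pandas versions no longer drop silently
corr = data.corr(method = "pearson", numeric_only = True)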
plt.figure(figsize = (10,10))
sns.heatmap(corr, linewidth = 1)
Output:
plt.figure(figsize = (10,10))
sns.heatmap(abs(corr) > 0.75, linewidth = 1)
Output:
Observation: Variables like JobLevel, MonthlyIncome, TotalWorkingYears, YearsAtCompany, YearsInCurrentRole, and YearsWithCurrManager are highly correlated (absolute correlation above 0.75). These variables may lead to multicollinearity.
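The highly correlated pairs can also be listed explicitly instead of being read off the thresholded heatmap. Below is a minimal sketch on synthetic data; the frame toy, the random values, and the relationship between its columns are made up for illustration, and only the 0.75 threshold comes from the heatmap above.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: income is built to depend on working years,
# distance from home is independent noise.
rng = np.random.default_rng(0)
n = 200
years = rng.integers(0, 40, size=n)
toy = pd.DataFrame({
    "TotalWorkingYears": years,
    "MonthlyIncome": 1000 + 450 * years + rng.normal(0, 1500, size=n),
    "DistanceFromHome": rng.integers(1, 30, size=n),
})

corr = toy.corr(method="pearson")

# Keep only the upper triangle (k=1 skips the diagonal) so each pair appears once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Long format: one row per variable pair, filtered by the same 0.75 threshold.
pairs = upper.stack()
high = pairs[pairs.abs() > 0.75].sort_values(ascending=False)

print(high)
```

On the real data, replacing toy.corr(...) with the corr matrix computed above would give the list of variable pairs that may need attention before modelling.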
Conclusion: There are many factors that make an employee resign. Using the IBM dataset, some interesting insights were obtained. These insights can be used to build the model.
GitHub code can be found here.
Quora blog can be found here.
Link to LinkedIn post.