Exploratory Data Analysis of IBM HR Attrition Dataset

Feb 27, 2021

Attrition: Attrition occurs when an employee leaves the company through resignation or retirement. Employees leave for personal and professional reasons such as retirement, limited growth potential, low work satisfaction, low pay, or a bad work environment. Attrition is part and parcel of any business; it becomes a cause for concern when it crosses a limit.

The attrition rate, also known as churn rate, can be defined as the rate at which employees leave an organization from a specific group over a particular period of time. [Source]

The dataset for the analysis is taken from Kaggle. To get insights about what factors contribute to employee attrition, we use Python and libraries like pandas, matplotlib, and seaborn. In this blog, we mostly talk about absolute and percentage values.

Importing Libraries

# importing necessary libraries
%matplotlib inline
from matplotlib import pyplot as plt
import seaborn as sns
import pandas as pd

Loading Data and displaying top 5 rows

data = pd.read_csv("WA_Fn-UseC_-HR-Employee-Attrition.csv")
data.head()

Output:

Moving “Attrition” column to last

data["attrition"] = data["Attrition"]
data.drop(columns = ["Attrition"], inplace = True)
data.head()

Output:

Datatype of every variable

RangeIndex: 1470 entries, 0 to 1469
Data columns (total 35 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Age                       1470 non-null   int64 
 1   BusinessTravel            1470 non-null   object
 2   DailyRate                 1470 non-null   int64 
 3   Department                1470 non-null   object
 4   DistanceFromHome          1470 non-null   int64 
 5   Education                 1470 non-null   int64 
 6   EducationField            1470 non-null   object
 7   EmployeeCount             1470 non-null   int64 
 8   EmployeeNumber            1470 non-null   int64 
 9   EnvironmentSatisfaction   1470 non-null   int64 
 10  Gender                    1470 non-null   object
 11  HourlyRate                1470 non-null   int64 
 12  JobInvolvement            1470 non-null   int64 
 13  JobLevel                  1470 non-null   int64 
 14  JobRole                   1470 non-null   object
 15  JobSatisfaction           1470 non-null   int64 
 16  MaritalStatus             1470 non-null   object
 17  MonthlyIncome             1470 non-null   int64 
 18  MonthlyRate               1470 non-null   int64 
 19  NumCompaniesWorked        1470 non-null   int64 
 20  Over18                    1470 non-null   object
 21  OverTime                  1470 non-null   object
 22  PercentSalaryHike         1470 non-null   int64 
 23  PerformanceRating         1470 non-null   int64 
 24  RelationshipSatisfaction  1470 non-null   int64 
 25  StandardHours             1470 non-null   int64 
 26  StockOptionLevel          1470 non-null   int64 
 27  TotalWorkingYears         1470 non-null   int64 
 28  TrainingTimesLastYear     1470 non-null   int64 
 29  WorkLifeBalance           1470 non-null   int64 
 30  YearsAtCompany            1470 non-null   int64 
 31  YearsInCurrentRole        1470 non-null   int64 
 32  YearsSinceLastPromotion   1470 non-null   int64 
 33  YearsWithCurrManager      1470 non-null   int64 
 34  attrition                 1470 non-null   object
dtypes: int64(26), object(9)
memory usage: 402.1+ KB

Observation: 26 variables are integers (some categorical variables are represented in integer format) and 9 variables are of string datatype.
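Because some categorical variables hide behind an integer dtype, it can help to flag them before plotting. The snippet below is a minimal sketch: the helper name `low_cardinality_int_columns`, the cutoff of 5 unique values, and the toy frame are all assumptions made for illustration, not part of the original analysis.

```python
import pandas as pd

def low_cardinality_int_columns(df, max_unique=5):
    """Return integer columns with at most `max_unique` distinct values.

    Such columns are candidates for being category codes rather than
    true measurements. The cutoff is an assumption, not a rule.
    """
    return [
        col for col in df.columns
        if pd.api.types.is_integer_dtype(df[col]) and df[col].nunique() <= max_unique
    ]

# Toy frame with made-up values, standing in for the real dataset
toy = pd.DataFrame({
    "Age": [25, 31, 40, 52, 28, 36, 45],
    "Education": [1, 3, 3, 4, 2, 5, 3],
    "Gender": ["Male", "Female", "Male", "Male", "Female", "Female", "Male"],
})

candidates = low_cardinality_int_columns(toy)
print(candidates)
```

On the real dataset the same call would be `low_cardinality_int_columns(data)`; the result is only a list of candidates to review by hand.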

Attrition percentage

# Attrition percentage
print("{:.2f}% of the employees resigned/retired".format(data["attrition"].value_counts()["Yes"] / len(data["attrition"]) * 100))
sns.displot(data = data, x = "attrition")

Output:

Observation: The attrition percentage in our dataset is 16.12%; this figure varies from company to company.
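The same percentage can also be read directly from `value_counts(normalize = True)`, which returns fractions instead of counts. This is a small sketch on a toy series with made-up values, standing in for `data["attrition"]`.

```python
import pandas as pd

# Toy series with made-up values, standing in for data["attrition"]
attrition = pd.Series(["No", "No", "Yes", "No", "No", "Yes", "No", "No", "No", "No"])

# normalize=True returns the share of each value instead of its count
shares = attrition.value_counts(normalize=True) * 100
yes_pct = shares["Yes"]
print("{:.2f}% of the employees resigned/retired".format(yes_pct))
```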

Gender

groupby_Gender = data.groupby("Gender")["attrition"]
print(groupby_Gender.value_counts())
sns.displot(data = data, x = "Gender", hue = "attrition")

Output:

Of the total employees resigned/retired, 36.71% are Female
Of the total employees resigned/retired, 63.29% are Male

Of the total Male employees, 17.01% resigned/retired
Of the total Female employees, 14.80% resigned/retired

Observation: In both absolute and percentage terms, Male attrition is higher than Female attrition.
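The two kinds of percentages quoted above answer different questions: "what share of leavers is Male?" versus "what share of Male employees left?". One way to get both is `pd.crosstab` with different `normalize` settings. The sketch below uses a toy frame with made-up values, not the real dataset.

```python
import pandas as pd

# Toy frame with made-up values, standing in for the real dataset
toy = pd.DataFrame({
    "Gender": ["Male", "Male", "Male", "Male", "Female",
               "Female", "Female", "Female", "Female", "Male"],
    "attrition": ["Yes", "No", "No", "Yes", "No",
                  "No", "Yes", "No", "No", "No"],
})

# Share of each gender that left: each row (gender) sums to 100
within_gender = pd.crosstab(toy["Gender"], toy["attrition"], normalize="index") * 100

# Share of leavers belonging to each gender: each column sums to 100
among_leavers = pd.crosstab(toy["Gender"], toy["attrition"], normalize="columns") * 100

print(within_gender.round(2))
print(among_leavers.round(2))
```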

Business Travel

groupby_BusinessTravel = data.groupby("attrition")["BusinessTravel"]
print(groupby_BusinessTravel.value_counts())
sns.displot(data = data, x = "BusinessTravel", hue = "attrition")

Output:

Observation: In absolute terms, employees who travel rarely account for the most resignations/retirements (156 employees), but in percentage terms, employees who travel frequently resign/retire the most (around 25%).
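Since the absolute and percentage views can point at different categories, it may be convenient to put both in one table. The helper below is a sketch; the name `attrition_summary` and the toy values are hypothetical and chosen only to show the two rankings disagreeing.

```python
import pandas as pd

def attrition_summary(df, column, target="attrition"):
    """Per category of `column`: headcount, leavers, and attrition rate in %."""
    left = df[target].eq("Yes")
    summary = left.groupby(df[column]).agg(
        employees="count",  # number of employees in the category
        left="sum",         # number of those who left (True counts as 1)
        rate_pct="mean",    # fraction who left, scaled to percent below
    )
    summary["rate_pct"] *= 100
    return summary.sort_values("rate_pct", ascending=False)

# Toy frame with made-up values, standing in for the real dataset
toy = pd.DataFrame({
    "BusinessTravel": ["Travel_Rarely"] * 8 + ["Travel_Frequently"] * 2,
    "attrition": ["Yes", "Yes", "No", "No", "No", "No", "No", "No", "Yes", "No"],
})

summary = attrition_summary(toy, "BusinessTravel")
print(summary)
```

The same helper could be pointed at other categorical columns such as Department or JobRole.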

Age

Minimum and Maximum age of the workforce is 18 and 60 years respectively.

plt.figure()
sns.histplot(data = data, x = "Age", bins = 7)

Output:

Observation: From the above histogram, we can see that most of the employees (82%) fall in the working age between 24 and 48.
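A share like this can be checked numerically rather than read off the plot. The sketch below uses `Series.between`, which includes both endpoints by default, on a toy list of ages (made-up values, not the real column).

```python
import pandas as pd

# Toy ages with made-up values, standing in for data["Age"]
age = pd.Series([18, 22, 25, 30, 33, 38, 41, 45, 50, 55])

# between() returns a boolean mask; its mean is the fraction inside the range
share_pct = age.between(24, 48).mean() * 100
print("{:.0f}% of employees are aged 24 to 48".format(share_pct))
```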

plt.figure()
sns.histplot(data = data, x = "Age", bins = 3, hue = "attrition")
plt.figure()
sns.histplot(data = data, x = "Age", bins = 7, hue = "attrition")

Output:

Observation: From the histplot (where bins = 3), we can infer that employees younger than 32 resign at a high rate, both in absolute terms (more than 100 employees) and in percentage terms (around 33%). If this percentage is greater than the industry average, it should be a cause for concern, and top-level management should try to bring it down by retaining the workforce.
From the histplot (where bins = 7), we can infer that around 45% of the employees younger than 24 resign. This number should also be a cause for concern for top-level management.
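The per-bin percentages can be approximated in code with `pd.cut`, which splits a numeric column into equal-width bins much like the histogram does. This is a sketch on toy values; note that `pd.cut` closes bins on the right by default, so counts exactly on a bin edge may land differently than in the plot.

```python
import pandas as pd

# Toy frame with made-up values, standing in for the real dataset
toy = pd.DataFrame({
    "Age": [18, 22, 25, 30, 33, 38, 41, 45, 50, 55, 58, 60],
    "attrition": ["Yes", "Yes", "No", "Yes", "No", "No",
                  "Yes", "No", "No", "No", "No", "Yes"],
})

# Three equal-width bins over the observed age range (18 to 60 here)
age_bin = pd.cut(toy["Age"], bins=3)

# Mean of a boolean mask per bin = fraction of that bin that left
rate_by_bin = toy["attrition"].eq("Yes").groupby(age_bin, observed=False).mean() * 100
print(rate_by_bin.round(1))
```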

Education

"""
Education
1 'Below College'
2 'College'
3 'Bachelor'
4 'Master'
5 'Doctor'
"""
sns.displot(data = data, x = "Education", hue = "attrition")

Output:

Observation: Around 100 employees with Education = 3 (Bachelor) resigned/retired. They account for around 42% of all employees who resigned/retired.

sns.displot(data = data, x = "Education", hue = "attrition", col = "Gender")

Output:

When Education is combined with Gender, we get the above distribution plot.
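The integer codes listed in the legend can be turned into readable, ordered labels so that plots and tables show 'Bachelor' instead of 3. The sketch below assumes the mapping given in the legend above; the `EducationLabel` column name and the toy values are made up for illustration.

```python
import pandas as pd

# Mapping taken from the legend in this section
EDUCATION_LABELS = {
    1: "Below College",
    2: "College",
    3: "Bachelor",
    4: "Master",
    5: "Doctor",
}

# Toy frame with made-up values, standing in for the real dataset
toy = pd.DataFrame({"Education": [3, 1, 4, 3, 2, 5]})

# An ordered Categorical keeps the natural order of the levels
toy["EducationLabel"] = pd.Categorical(
    toy["Education"].map(EDUCATION_LABELS),
    categories=list(EDUCATION_LABELS.values()),
    ordered=True,
)
print(toy)
```

The same pattern would apply to the other coded columns (EnvironmentSatisfaction, JobInvolvement, JobSatisfaction, WorkLifeBalance) with their own legends.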

Daily Rate

"""
minimum DailyRate of employee is 102
maximum DailyRate of employee is 1499
"""
plt.figure()
sns.histplot(data = data, x = "DailyRate", hue = "attrition", bins = 7)

Output:

Observation: With histogram bins = 7, the distribution looks almost uniform. By visual inspection of the above histogram, employees with a Daily Rate between 300 and 500 resign/retire more than those at other Daily Rates: around 22% of them resign/retire.

Department

sns.displot(data = data, x = "Department", hue = "attrition", height = 6.5)

Output:

Observation: In absolute numbers, most resignations/retirements come from R&D (133 employees), but percentage-wise attrition is higher in Sales (20.6%) and HR (19%).

Education Field

sns.displot(data = data, x = "EducationField", hue = "attrition", height = 10)

Output:

Observation: When absolute numbers are compared, employees whose Education Field is Life Sciences resign/retire most. But when percentages are compared, employees whose Education Field is HR resign/retire most (around 26%).

Environment Satisfaction

"""
EnvironmentSatisfaction
1 'Low'
2 'Medium'
3 'High'
4 'Very High'
"""
sns.displot(data = data, x = "EnvironmentSatisfaction", hue = "attrition")

Output:

Observation: The above distribution plot clearly shows that employees with ‘Low’ Environment Satisfaction resign/retire more (both in absolute and percentage terms).

Job Involvement

"""
JobInvolvement
1 'Low'
2 'Medium'
3 'High'
4 'Very High'
"""
sns.displot(data = data, x = "JobInvolvement", hue = "attrition")

Output:

Observation: There are fewer employees with Job Involvement = ‘Low’. But around 33% of employees with ‘Low’ Job Involvement resign/retire.

Job Satisfaction

"""
JobSatisfaction
1 'Low'
2 'Medium'
3 'High'
4 'Very High'
"""
sns.displot(data = data, x = "JobSatisfaction", hue = "attrition")

Output:

Observation: Of all employees who resigned/retired, those with ‘High’ Job Satisfaction form the largest group, but when percentages are compared, attrition is highest among employees with ‘Low’ Job Satisfaction.

Work Life Balance

"""
WorkLifeBalance
1 'Bad'
2 'Good'
3 'Better'
4 'Best'
"""
sns.displot(data = data, x = "WorkLifeBalance", hue = "attrition")

Output:

Observation: Almost 1/3rd of the employees with ‘Bad’ work-life balance resign/retire.

Job Role

sns.displot(data = data, x = "JobRole", hue = "attrition", height = 20)

Output:

Observation: Around 40% of people working in the Sales Representative role resigned/retired, and of all employees who resigned/retired, most belong to the Laboratory Technician job role.

Marital Status

sns.displot(data = data, x = "MaritalStatus", hue = "attrition")

Output:

Observation: The above distribution plot suggests that most married employees settle in their current company and don’t shift companies. Before getting married, employees tend to be growth-oriented and shift companies for better pay, better opportunities, etc.

Over Time

sns.displot(data = data, x = "OverTime", hue = "attrition")

Output:

Observation: The distribution plot speaks for itself: almost 33% of employees working overtime resign/retire.

Percent Salary Hike

sns.displot(data = data, x = "PercentSalaryHike", hue = "attrition")

Output:

sns.displot(data = data, x = "PercentSalaryHike", hue = "attrition", col = "OverTime")

Output:

Observation: From the above distribution plot, we can infer that employees who work overtime resign even when they do get a good salary hike. Note, however, that working overtime doesn’t necessarily mean working productively.

Performance Rating

sns.displot(data = data, x = "PerformanceRating", hue = "attrition", col = "OverTime")

Output:

Observation: The observation for Salary Hike holds true for the above distribution plot too.

Stock Option Level

sns.displot(data = data, x = "StockOptionLevel", hue = "attrition", col = "OverTime")

Output:

Observation: The interesting thing to observe from the above plot is that more than 1/3rd of the overtime working employees with stock option level = 3 resign/retire.
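A two-factor rate like this can be tabulated instead of estimated from the facet plot. The sketch below uses `pivot_table` on a toy frame with made-up values; the helper column `left` is an assumption introduced only to turn Yes/No into 1/0.

```python
import pandas as pd

# Toy frame with made-up values, standing in for the real dataset
toy = pd.DataFrame({
    "OverTime": ["Yes", "Yes", "Yes", "No", "No", "No", "Yes", "No"],
    "StockOptionLevel": [3, 3, 3, 3, 3, 0, 0, 0],
    "attrition": ["Yes", "No", "No", "No", "No", "No", "Yes", "No"],
})

# 1 if the employee left, 0 otherwise, so the mean is the attrition rate
toy["left"] = toy["attrition"].eq("Yes").astype(int)

# Rows: stock option level, columns: overtime, cells: attrition rate in %
rate = toy.pivot_table(
    index="StockOptionLevel", columns="OverTime", values="left", aggfunc="mean"
) * 100
print(rate.round(1))
```

With few employees in a cell, the rate is noisy, so it is worth checking the cell counts alongside the percentages.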

Correlation

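# Pearson correlation is only defined for numeric columns, so select them first
corr = data.select_dtypes(include = "number").corr(method = "pearson")
plt.figure(figsize = (10,10))
sns.heatmap(corr, linewidths = 1)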

Output:

plt.figure(figsize = (10,10))
sns.heatmap(abs(corr) > 0.75, linewidths = 1)

Output:

Observation: Variables like JobLevel, MonthlyIncome, TotalWorkingYears, YearsAtCompany, YearsInCurrentRole, and YearsWithCurrManager are highly correlated. These variables may lead to multicollinearity if they are used together in a model.
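Reading the highly correlated pairs off a heatmap is error-prone, so they can also be listed explicitly. The helper below is a sketch: the name `high_correlation_pairs` and the toy columns are hypothetical, and the 0.75 threshold simply mirrors the one used for the second heatmap.

```python
import numpy as np
import pandas as pd

def high_correlation_pairs(df, threshold=0.75):
    """List column pairs whose absolute Pearson correlation exceeds `threshold`."""
    corr = df.select_dtypes(include="number").corr(method="pearson")
    # Keep the strict upper triangle: each pair once, no self-correlations
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    # NaN entries (masked cells) never pass the comparison, so they drop out here
    strong = pairs[pairs.abs() > threshold]
    return strong.sort_values(key=lambda s: s.abs(), ascending=False)

# Toy frame with made-up values; "label" shows that string columns are ignored
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6],
    "b": [2, 4, 6, 8, 10, 13],
    "c": [3, 1, 4, 1, 5, 2],
    "label": ["x", "y", "x", "y", "x", "y"],
})

pairs = high_correlation_pairs(toy)
print(pairs)
```

On the real dataset the call would be `high_correlation_pairs(data)`, which gives a short list to consider when deciding which variables to drop or combine.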

Conclusion: There are many factors that make an employee resign. Using the IBM dataset, some interesting insights were obtained. These insights can be used to build a model.

GitHub code can be found here.
Quora blog can be found here.
Link to LinkedIn post.

Written by Bingi Nagesh
