Employee Attrition & HR Analytics Project Using Pandas, NumPy, and Matplotlib
Step 1: Import Libraries
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
Step 2: Load Dataset
In [2]:
df = pd.read_csv("employee_data.csv")
Step 3: Preview Data
In [3]:
print("First 5 rows of dataset:")
print(df.head())
First 5 rows of dataset:
   Age Attrition     BusinessTravel  DailyRate              Department  \
0   41       Yes      Travel_Rarely       1102                   Sales
1   49        No  Travel_Frequently        279  Research & Development
2   37       Yes      Travel_Rarely       1373  Research & Development
3   33        No  Travel_Frequently       1392  Research & Development
4   27        No      Travel_Rarely        591  Research & Development

   DistanceFromHome  Education EducationField  EmployeeCount  EmployeeNumber  \
0                 1          2  Life Sciences              1               1
1                 8          1  Life Sciences              1               2
2                 2          2          Other              1               4
3                 3          4  Life Sciences              1               5
4                 2          1        Medical              1               7

   ...  RelationshipSatisfaction  StandardHours  StockOptionLevel  \
0  ...                         1             80                 0
1  ...                         4             80                 1
2  ...                         2             80                 0
3  ...                         3             80                 0
4  ...                         4             80                 1

   TotalWorkingYears  TrainingTimesLastYear  WorkLifeBalance  YearsAtCompany  \
0                  8                      0                1               6
1                 10                      3                3              10
2                  7                      3                3               0
3                  8                      3                3               8
4                  6                      3                3               2

   YearsInCurrentRole  YearsSinceLastPromotion  YearsWithCurrManager
0                   4                        0                     5
1                   7                        1                     7
2                   0                        0                     0
3                   7                        3                     0
4                   2                        2                     2

[5 rows x 35 columns]
Step 4: Data Cleaning
In [4]:
# Drop irrelevant columns (errors='ignore' skips any that are absent)
irrelevant_cols = ['EmployeeCount', 'EmployeeNumber', 'Over18', 'StandardHours']
df = df.drop(columns=irrelevant_cols, errors='ignore')
# Handle missing values by dropping incomplete rows
df = df.dropna()
# Encode the Attrition column as 1 (Yes) / 0 (No)
if 'Attrition' in df.columns:
    df['Attrition'] = df['Attrition'].map({'Yes': 1, 'No': 0})
print("\nCleaned dataset shape:", df.shape)
Cleaned dataset shape: (1470, 31)
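Before dropping columns and rows, it can help to see how many values are missing and which columns never vary. The following is a minimal sketch on a made-up frame: the name `toy` and its values are hypothetical and do not come from employee_data.csv, and treating single-valued columns as droppable is an assumption rather than the rule used above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
toy = pd.DataFrame({
    "Age": [41, 49, np.nan, 33],
    "Attrition": ["Yes", "No", "Yes", "No"],
    "StandardHours": [80, 80, 80, 80],
})

# Count missing values per column before deciding how to handle them
missing_per_column = toy.isna().sum()

# Columns with a single distinct value add nothing to a comparison (assumption)
constant_cols = [col for col in toy.columns if toy[col].nunique(dropna=True) == 1]

# Drop constant columns and incomplete rows, then encode Attrition as 1/0
cleaned = (
    toy.drop(columns=constant_cols)
       .dropna()
       .assign(Attrition=lambda d: d["Attrition"].map({"Yes": 1, "No": 0}))
)

print(missing_per_column)
print(constant_cols)
print(cleaned)
```

Checking `missing_per_column` first shows whether `dropna()` would discard only a few rows or a large share of the data.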
Step 5: Attrition Rate by Department, Age Group, Gender
In [5]:
# Create age group column
bins = [18, 25, 35, 45, 55, 65]
labels = ['18–25', '26–35', '36–45', '46–55', '56–65']
df['AgeGroup'] = pd.cut(df['Age'], bins=bins, labels=labels, include_lowest=True)
# Calculate attrition rates
attrition_dept = df.groupby('Department')['Attrition'].mean() * 100
# AgeGroup is categorical; observed=False is passed explicitly to keep the current behavior
attrition_age = df.groupby('AgeGroup', observed=False)['Attrition'].mean() * 100
attrition_gender = df.groupby('Gender')['Attrition'].mean() * 100
print("\nAttrition by Department:\n", attrition_dept)
print("\nAttrition by Age Group:\n", attrition_age)
print("\nAttrition by Gender:\n", attrition_gender)

Attrition by Department:
 Department
Human Resources           19.047619
Research & Development    13.839750
Sales                     20.627803
Name: Attrition, dtype: float64

Attrition by Age Group:
 AgeGroup
18–25    35.772358
26–35    19.141914
36–45     9.188034
46–55    11.504425
56–65    17.021277
Name: Attrition, dtype: float64

Attrition by Gender:
 Gender
Female    14.795918
Male      17.006803
Name: Attrition, dtype: float64
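Because Attrition is stored as 0/1, the mean of each group is the share of leavers, and multiplying by 100 gives a percentage. The sketch below replays that idea on a made-up frame (the `toy` data is hypothetical) and shows how `pd.cut` places boundary ages and how `observed=True` drops empty age groups.

```python
import pandas as pd

# Hypothetical ages chosen to hit the bin edges; Attrition is already 0/1
toy = pd.DataFrame({
    "Age": [18, 25, 26, 40, 60],
    "Attrition": [1, 0, 1, 0, 0],
})

bins = [18, 25, 35, 45, 55, 65]
labels = ['18–25', '26–35', '36–45', '46–55', '56–65']

# Intervals are right-closed, so 25 lands in the first bin;
# include_lowest=True makes the first bin also contain 18
toy["AgeGroup"] = pd.cut(toy["Age"], bins=bins, labels=labels, include_lowest=True)

# Mean of a 0/1 column is the fraction of ones; observed=True skips empty groups
rate = toy.groupby("AgeGroup", observed=True)["Attrition"].mean() * 100

print(toy)
print(rate)
```

With `observed=False` the empty 46–55 group would instead appear in the result with a NaN rate.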
Step 6: Correlation (Salary, Age, Attrition)
In [6]:
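if all(col in df.columns for col in ['MonthlyIncome', 'Age', 'Attrition']):
    # Pearson correlation between income and age (same value as the matrix entry below)
    correlation = np.corrcoef(df['MonthlyIncome'], df['Age'])[0, 1]
    # Pairwise Pearson correlations among income, age, and the 0/1 attrition flag
    attrition_corr = df[['MonthlyIncome', 'Age', 'Attrition']].corr()
    print("\nCorrelation Matrix:\n", attrition_corr)
else:
    print("\nColumns for correlation not found.")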
Correlation Matrix:
               MonthlyIncome       Age  Attrition
MonthlyIncome       1.000000  0.497855  -0.159840
Age                 0.497855  1.000000  -0.159205
Attrition          -0.159840 -0.159205   1.000000
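A negative correlation between a numeric column and the 0/1 Attrition flag means that leavers have a lower average on that column than stayers. The sketch below illustrates this on made-up numbers (the `toy` frame is hypothetical) and checks that `np.corrcoef` and `Series.corr` return the same Pearson coefficient.

```python
import numpy as np
import pandas as pd

# Hypothetical incomes; Attrition is 1 for leavers, 0 for stayers
toy = pd.DataFrame({
    "MonthlyIncome": [2000, 3000, 5000, 8000, 12000, 2500],
    "Attrition": [1, 1, 0, 0, 0, 1],
})

# Pearson correlation computed two ways
r_numpy = np.corrcoef(toy["MonthlyIncome"], toy["Attrition"])[0, 1]
r_pandas = toy["MonthlyIncome"].corr(toy["Attrition"])

# Mean income per group; the sign of r follows the sign of (leavers - stayers)
mean_income = toy.groupby("Attrition")["MonthlyIncome"].mean()

print(r_numpy, r_pandas)
print(mean_income)
```

Comparing the group means alongside the coefficient makes the direction of the relationship easier to read than the coefficient alone.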
Step 7: Visualizations
(a) Stacked Bar Chart – Attrition by Department
In [21]:
attrition_counts = df.groupby(['Department', 'Attrition']).size().unstack()
# Create the Axes first and pass it to pandas; calling plt.figure() beforehand
# leaves an empty extra figure because DataFrame.plot opens its own
fig, ax = plt.subplots(figsize=(8, 5))
attrition_counts.plot(kind='bar', stacked=True, color=['lightgreen', 'salmon'], ax=ax)
ax.set_title("Attrition by Department", fontsize=14)
ax.set_xlabel("Department")
ax.set_ylabel("Number of Employees")
ax.legend(["No Attrition", "Attrition"])
plt.show()
(b) Histogram – Age Distribution
In [18]:
plt.figure(figsize=(8,5))
plt.hist(df['Age'], bins=10, color='skyblue', edgecolor='black')
plt.title("Age Distribution of Employees", fontsize=14)
plt.xlabel("Age")
plt.ylabel("Number of Employees")
plt.show()
(c) Boxplot – Salary vs Performance Rating
In [19]:
fig, ax = plt.subplots(figsize=(8, 5))
# Pass ax so pandas draws on this figure instead of opening a second one
df.boxplot(column='MonthlyIncome', by='PerformanceRating', grid=False, patch_artist=True,
           boxprops=dict(facecolor='lavender'), ax=ax)
ax.set_title("Salary vs Performance Rating", fontsize=14)
fig.suptitle("")  # removes the automatic secondary title
ax.set_xlabel("Performance Rating")
ax.set_ylabel("Monthly Income")
plt.show()