Employee Attrition & HR Analytics Project Using Pandas, NumPy, and Matplotlib

Step 1: Import Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Step 2: Load Dataset

In [2]:
df = pd.read_csv("employee_data.csv")

Step 3: Preview Data

In [3]:
print("First 5 rows of dataset:")
print(df.head())
First 5 rows of dataset:
   Age Attrition     BusinessTravel  DailyRate              Department  \
0   41       Yes      Travel_Rarely       1102                   Sales   
1   49        No  Travel_Frequently        279  Research & Development   
2   37       Yes      Travel_Rarely       1373  Research & Development   
3   33        No  Travel_Frequently       1392  Research & Development   
4   27        No      Travel_Rarely        591  Research & Development   

   DistanceFromHome  Education EducationField  EmployeeCount  EmployeeNumber  \
0                 1          2  Life Sciences              1               1   
1                 8          1  Life Sciences              1               2   
2                 2          2          Other              1               4   
3                 3          4  Life Sciences              1               5   
4                 2          1        Medical              1               7   

   ...  RelationshipSatisfaction StandardHours  StockOptionLevel  \
0  ...                         1            80                 0   
1  ...                         4            80                 1   
2  ...                         2            80                 0   
3  ...                         3            80                 0   
4  ...                         4            80                 1   

   TotalWorkingYears  TrainingTimesLastYear WorkLifeBalance  YearsAtCompany  \
0                  8                      0               1               6   
1                 10                      3               3              10   
2                  7                      3               3               0   
3                  8                      3               3               8   
4                  6                      3               3               2   

  YearsInCurrentRole  YearsSinceLastPromotion  YearsWithCurrManager  
0                  4                        0                     5  
1                  7                        1                     7  
2                  0                        0                     0  
3                  7                        3                     0  
4                  2                        2                     2  

[5 rows x 35 columns]

Step 4: Data Cleaning

In [4]:
# Drop columns that carry no analytical signal; errors='ignore' skips any that are absent
irrelevant_cols = ['EmployeeCount', 'EmployeeNumber', 'Over18', 'StandardHours']
df = df.drop(columns=irrelevant_cols, errors='ignore')

# Drop rows that contain any missing value
df = df.dropna()

# Encode the Attrition target as 1 (Yes) / 0 (No)
if 'Attrition' in df.columns:
    df['Attrition'] = df['Attrition'].map({'Yes': 1, 'No': 0})

print("\nCleaned dataset shape:", df.shape)
Cleaned dataset shape: (1470, 31)
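
Before calling `dropna()`, it can help to see how many rows it would discard, and to check that the `Attrition` mapping leaves no unmapped labels (`Series.map` returns NaN for any value missing from the dictionary). The sketch below uses a small made-up frame, not the real `employee_data.csv`; the name `toy` and its values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the HR data; all values are invented.
toy = pd.DataFrame({
    "Age": [41, 49, np.nan, 33, 27],
    "Attrition": ["Yes", "No", "Yes", None, "yes"],  # note the lowercase "yes"
    "Department": ["Sales", "R&D", "R&D", "Sales", "R&D"],
})

# Missing values per column, and the number of rows dropna() would remove.
missing_per_column = toy.isna().sum()
rows_dropped = int(toy.isna().any(axis=1).sum())

cleaned = toy.dropna()

# Labels that are not keys of the mapping become NaN after map().
encoded = cleaned["Attrition"].map({"Yes": 1, "No": 0})
unmapped = int(encoded.isna().sum())

print(missing_per_column)
print("Rows removed by dropna():", rows_dropped)
print("Labels left unmapped:", unmapped)
```

Because the mapping runs after `dropna()` in the cell above, any unmapped label would reintroduce NaN values, so a check like `unmapped == 0` is a cheap safeguard.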

Step 5: Attrition Rate by Department, Age Group, and Gender

In [5]:
# Create age group column
bins = [18, 25, 35, 45, 55, 65]
labels = ['18–25', '26–35', '36–45', '46–55', '56–65']
df['AgeGroup'] = pd.cut(df['Age'], bins=bins, labels=labels, include_lowest=True)

# Calculate attrition rates
attrition_dept = df.groupby('Department')['Attrition'].mean() * 100
# AgeGroup is categorical; observed=False keeps the current behavior and silences the pandas FutureWarning
attrition_age = df.groupby('AgeGroup', observed=False)['Attrition'].mean() * 100
attrition_gender = df.groupby('Gender')['Attrition'].mean() * 100

print("\nAttrition by Department:\n", attrition_dept)
print("\nAttrition by Age Group:\n", attrition_age)
print("\nAttrition by Gender:\n", attrition_gender)
Attrition by Department:
 Department
Human Resources           19.047619
Research & Development    13.839750
Sales                     20.627803
Name: Attrition, dtype: float64

Attrition by Age Group:
 AgeGroup
18–25    35.772358
26–35    19.141914
36–45     9.188034
46–55    11.504425
56–65    17.021277
Name: Attrition, dtype: float64

Attrition by Gender:
 Gender
Female    14.795918
Male      17.006803
Name: Attrition, dtype: float64
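
The age groups depend on how `pd.cut` treats bin edges: by default each interval is closed on the right, and `include_lowest=True` additionally closes the first interval on the left so that age 18 is not lost. The sketch below applies the same `bins` and `labels` to a handful of hand-picked boundary ages (the `ages` series is an assumption for illustration, not data from the project).

```python
import pandas as pd

bins = [18, 25, 35, 45, 55, 65]
labels = ['18–25', '26–35', '36–45', '46–55', '56–65']

# Hand-picked boundary ages, plus two values outside the bin range.
ages = pd.Series([18, 25, 26, 35, 36, 60, 17, 66])
groups = pd.cut(ages, bins=bins, labels=labels, include_lowest=True)

# Ages outside [18, 65] fall in no bin and become NaN.
out_of_range = int(groups.isna().sum())

print(pd.DataFrame({"Age": ages, "AgeGroup": groups}))
print("Ages outside all bins:", out_of_range)
```

Ages 25 and 35 land in the lower group because the right edge is inclusive, which matches the label text.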

Step 6: Correlation (Salary, Age, Attrition)

In [6]:
if all(col in df.columns for col in ['MonthlyIncome', 'Age', 'Attrition']):
    # Pearson correlation between income and age, computed with NumPy
    correlation = np.corrcoef(df['MonthlyIncome'], df['Age'])[0,1]
    # Full pairwise Pearson correlation matrix, computed with pandas
    attrition_corr = df[['MonthlyIncome', 'Age', 'Attrition']].corr()
    print("\nCorrelation Matrix:\n", attrition_corr)
else:
    print("\nColumns for correlation not found.")
Correlation Matrix:
                MonthlyIncome       Age  Attrition
MonthlyIncome       1.000000  0.497855  -0.159840
Age                 0.497855  1.000000  -0.159205
Attrition          -0.159840 -0.159205   1.000000
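
Since `Attrition` is encoded as 0/1, its Pearson correlation with a numeric column is the point-biserial correlation: a negative value means leavers tend to have lower values of that column. As a rough illustration of what `DataFrame.corr()` and `np.corrcoef` compute, the sketch below reproduces the coefficient by hand on a tiny invented sample (the `toy` frame and its numbers are assumptions, not project data).

```python
import numpy as np
import pandas as pd

# Hypothetical sample: leavers (Attrition = 1) have the lower incomes.
toy = pd.DataFrame({
    "MonthlyIncome": [2000, 2500, 6000, 7000, 3000, 9000],
    "Attrition": [1, 1, 0, 0, 1, 0],
})

x = toy["MonthlyIncome"].to_numpy(dtype=float)
y = toy["Attrition"].to_numpy(dtype=float)

# Pearson r = cov(x, y) / (std(x) * std(y)), using population moments throughout.
r_manual = ((x - x.mean()) * (y - y.mean())).mean() / (x.std() * y.std())

# The same coefficient from pandas and from NumPy.
r_pandas = toy["MonthlyIncome"].corr(toy["Attrition"])
r_numpy = np.corrcoef(x, y)[0, 1]

print(r_manual, r_pandas, r_numpy)
```

The three values agree because the normalization factor cancels in the ratio, whichever degrees-of-freedom convention is used.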

Step 7: Visualizations

(a) Stacked Bar Chart – Attrition by Department

In [21]:
attrition_counts = df.groupby(['Department', 'Attrition']).size().unstack()

# Pass figsize to DataFrame.plot; a separate plt.figure() call would leave an empty figure behind
attrition_counts.plot(kind='bar', stacked=True, color=['lightgreen', 'salmon'], figsize=(8,5))
plt.title("Attrition by Department", fontsize=14)
plt.xlabel("Department")
plt.ylabel("Number of Employees")
plt.legend(["No Attrition", "Attrition"])
plt.show()
[Figure: stacked bar chart of employee counts by department, split into No Attrition and Attrition]
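
Raw counts make large departments dominate the chart. One possible variant is to stack row-wise percentages instead, so that every bar sums to 100 and attrition shares can be compared directly. The sketch below builds such a table with `pd.crosstab(..., normalize="index")` on invented data (the `toy` frame is an assumption for illustration).

```python
import pandas as pd

# Hypothetical sample: four employees per department, Attrition already encoded as 0/1.
toy = pd.DataFrame({
    "Department": ["Sales"] * 4 + ["R&D"] * 4,
    "Attrition": [1, 0, 0, 1, 0, 0, 0, 1],
})

# Share of each attrition outcome within each department, in percent.
shares = pd.crosstab(toy["Department"], toy["Attrition"], normalize="index") * 100

print(shares)

# The table can be drawn the same way as the counts, for example:
# shares.plot(kind="bar", stacked=True, color=["lightgreen", "salmon"], figsize=(8, 5))
```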

(b) Histogram – Age Distribution

In [18]:
plt.figure(figsize=(8,5))
plt.hist(df['Age'], bins=10, color='skyblue', edgecolor='black')
plt.title("Age Distribution of Employees", fontsize=14)
plt.xlabel("Age")
plt.ylabel("Number of Employees")
plt.show()
[Figure: histogram of employee ages, 10 bins]

(c) Boxplot – Salary vs Performance Rating

In [19]:
# Pass figsize to DataFrame.boxplot; a separate plt.figure() call would leave an empty figure behind
df.boxplot(column='MonthlyIncome', by='PerformanceRating', grid=False, patch_artist=True,
           boxprops=dict(facecolor='lavender'), figsize=(8,5))
plt.title("Salary vs Performance Rating", fontsize=14)
plt.suptitle("")  # removes the automatic secondary title
plt.xlabel("Performance Rating")
plt.ylabel("Monthly Income")
plt.show()
[Figure: boxplot of monthly income grouped by performance rating]
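
The numbers behind a boxplot can also be read off directly: each box spans the first to third quartile, with a line at the median. A minimal sketch, using invented incomes rather than the project data (the `toy` frame is an assumption), computes those quartiles per rating with `groupby(...).quantile(...)`.

```python
import pandas as pd

# Hypothetical sample: three employees rated 3 and two rated 4.
toy = pd.DataFrame({
    "PerformanceRating": [3, 3, 3, 4, 4],
    "MonthlyIncome": [2000, 3000, 4000, 5000, 7000],
})

# Quartiles of income within each rating; unstack() turns the quantile level into columns.
quartiles = (
    toy.groupby("PerformanceRating")["MonthlyIncome"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)

# Interquartile range, i.e. the height of each box.
iqr = quartiles[0.75] - quartiles[0.25]

print(quartiles)
print(iqr)
```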