k-means for Customer Segmentation
We will use k-means clustering for customer segmentation.
Imagine that you have a customer dataset, and you are interested in exploring the behavior of your customers using their historical data.
Customer segmentation is the practice of partitioning a customer base into groups of individuals that have similar characteristics. It is a valuable strategy because a business can then target these specific groups of customers and allocate marketing resources effectively.
The dataset contains records for 850 customers, including information about their income and debt. You can download the dataset here: https://cocl.us/customer_dataset
import random # library for random number generation
import numpy as np # library for vectorized computation
import pandas as pd # library to process data as dataframes
import matplotlib.pyplot as plt # plotting library
# backend for rendering plots within the browser
%matplotlib inline
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs  # samples_generator was removed from recent scikit-learn
print('Libraries imported.')
Read the data into a pandas dataframe.
customers_df = pd.read_csv('Cust_Segmentation.csv')
customers_df.head()
Address in this dataset is a categorical variable. The k-means algorithm isn't directly applicable to categorical variables because the Euclidean distance function isn't really meaningful for discrete variables.
So let's drop this feature and run clustering.
df = customers_df.drop('Address', axis=1)
df.head()
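To see why integer-coding a categorical column would mislead a distance-based method, consider the small sketch below. The category names and both encodings are hypothetical and chosen only for illustration; they are not values from the customer dataset.

```python
import numpy as np

# Hypothetical categories with an arbitrary integer encoding (illustration only).
encoding_a = {"north": 0, "south": 1, "east": 2}
# The same categories with a different, equally arbitrary encoding.
encoding_b = {"north": 2, "south": 0, "east": 1}

def distance(encoding, first, second):
    """Euclidean distance between two categories under a given integer encoding."""
    return float(np.abs(encoding[first] - encoding[second]))

# Under encoding_a, "north" appears closer to "south" than to "east".
dist_a_ns = distance(encoding_a, "north", "south")
dist_a_ne = distance(encoding_a, "north", "east")

# Under encoding_b the ordering flips, although the data itself is unchanged.
dist_b_ns = distance(encoding_b, "north", "south")
dist_b_ne = distance(encoding_b, "north", "east")

print("encoding_a:", dist_a_ns, dist_a_ne)
print("encoding_b:", dist_b_ns, dist_b_ne)
```

Because the distances depend entirely on which integers were picked, any clusters built on them would reflect the encoding rather than the customers.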
Normalize the dataset using StandardScaler(). Normalization is a statistical method that helps mathematical-based algorithms interpret features with different magnitudes and distributions equally. StandardScaler does this by rescaling each feature to zero mean and unit variance.
from sklearn.preprocessing import StandardScaler
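X = df.values[:, 1:]  # skip the first column (the customer identifier)
X = np.nan_to_num(X)  # replace missing values (NaN) with 0
cluster_dataset = StandardScaler().fit_transform(X)  # zero mean, unit variance per feature
cluster_dataset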
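As a quick check of what the scaling step does, the sketch below applies StandardScaler to a tiny made-up array with two features on very different scales and compares the result with the same computation done by hand. The numbers are invented for illustration and are not taken from the customer dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values for two features on very different scales (illustration only).
toy = np.array([[25.0, 20000.0],
                [35.0, 50000.0],
                [45.0, 80000.0],
                [55.0, 110000.0]])

scaled = StandardScaler().fit_transform(toy)

# After scaling, each column has mean 0 and (population) standard deviation 1.
col_means = scaled.mean(axis=0)
col_stds = scaled.std(axis=0)
print(col_means, col_stds)

# The same result by hand: subtract the column mean, divide by the column std.
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)
print(np.allclose(scaled, manual))
```

Without this step, the feature with the larger numeric range would dominate the Euclidean distances that k-means relies on.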
num_clusters = 3
k_means = KMeans(init='k-means++', n_clusters=num_clusters, n_init=12)
k_means.fit(cluster_dataset)
labels = k_means.labels_
print(labels)
df['Labels'] = labels
df.head()
Check the centroid values by averaging the features in each cluster.
df.groupby('Labels').mean()
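The group means above are expressed in the original units, whereas the fitted model stores its centroids in standardized space in `cluster_centers_`. The sketch below suggests how the two relate, using synthetic, well-separated data from `make_blobs` as a stand-in for the customer features; the column names `feature_a` and `feature_b` and the fixed `random_state` are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic, well-separated data standing in for the customer features (assumption).
raw, _ = make_blobs(n_samples=150, centers=[[0, 0], [10, 10], [-10, 10]],
                    cluster_std=1.0, random_state=0)
toy_df = pd.DataFrame(raw, columns=["feature_a", "feature_b"])

scaler = StandardScaler()
scaled = scaler.fit_transform(toy_df.values)

model = KMeans(init="k-means++", n_clusters=3, n_init=12, random_state=0)
toy_df["Labels"] = model.fit_predict(scaled)

# Per-cluster means in the original units, as in the groupby above.
group_means = toy_df.groupby("Labels").mean()

# Centroids live in standardized space; map them back to the original units.
centroids = scaler.inverse_transform(model.cluster_centers_)

print(group_means)
print(centroids)
```

On data like this, where the algorithm converges cleanly, the inverse-transformed centroids should match the per-cluster means row for row.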
Let's look at the distribution of customers based on their age and income.
area = np.pi * ( X[:, 1])**2
plt.scatter(X[:, 0], X[:, 3], s=area, c=labels.astype(float), alpha=0.5)  # np.float was removed from NumPy
plt.xlabel('Age', fontsize=16)
plt.ylabel('Income', fontsize=16)
plt.show()
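from mpl_toolkits.mplot3d import Axes3D  # registers the 3D projection on older Matplotlib
fig = plt.figure(1, figsize=(8, 6))
plt.clf()
# Recent Matplotlib no longer attaches Axes3D(fig, ...) to the figure automatically
ax = fig.add_subplot(111, projection='3d')
ax.view_init(elev=48, azim=134)
# Axis labels follow the column order used above: X[:, 0] is age and X[:, 3] is income;
# X[:, 1] is assumed to be the education column
ax.set_xlabel('Education')
ax.set_ylabel('Age')
ax.set_zlabel('Income')
ax.scatter(X[:, 1], X[:, 0], X[:, 3], c=labels.astype(float))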
k-means partitions the customers into three mutually exclusive groups because we asked the algorithm for three clusters. The customers within each cluster are similar to one another in terms of the features included in the dataset.
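For intuition about how such a partition arises, here is a minimal NumPy sketch of the basic assign-and-update loop (often called Lloyd's algorithm). It is not scikit-learn's implementation, which differs in details such as initialization and convergence checks; the function name `lloyd_kmeans`, the fixed iteration count, and the toy points are all assumptions made for illustration.

```python
import numpy as np

def lloyd_kmeans(points, initial_centroids, n_iter=10):
    """Plain Lloyd iterations: assign points to the nearest centroid, then recompute centroids."""
    centroids = np.asarray(initial_centroids, dtype=float).copy()
    for _ in range(n_iter):
        # Assignment step: squared Euclidean distance from every point to every centroid.
        dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points.
        for k in range(len(centroids)):
            members = points[labels == k]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[k] = members.mean(axis=0)
    return labels, centroids

# Made-up 2-D points forming three obvious groups (illustration only).
points = np.array([[1.0, 1.0], [1.2, 0.8], [0.8, 1.1],
                   [5.0, 5.0], [5.1, 4.9], [4.9, 5.2],
                   [9.0, 1.0], [9.2, 1.1], [8.9, 0.9]])

# Start from one point in each group; real implementations choose starts more carefully.
labels, centroids = lloyd_kmeans(points, initial_centroids=points[[0, 3, 6]])
print(labels)
print(centroids)
```

Every point ends up with exactly one label, which is why the resulting groups are mutually exclusive.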
We can create a profile for each group, considering the common characteristics of each cluster. For example, the three clusters can be:
- older, high income, and indebted
- middle-aged, middle income, and financially responsible
- young, low income, and indebted
You can devise your own profiles based on the means above and come up with labels that you think best describe each cluster.
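Once you have settled on descriptions, one way to attach them to the data is to map the integer labels to names. The sketch below uses a small hypothetical table with made-up numbers, not results from the customer dataset, and the profile names in `profile_names` are assumptions a person would choose after reading the per-cluster means.

```python
import pandas as pd

# Hypothetical labelled rows (made-up numbers, not results from the customer dataset).
toy = pd.DataFrame({
    "Age": [52, 48, 36, 38, 24, 26],
    "Income": [150, 130, 60, 55, 20, 25],
    "Labels": [0, 0, 1, 1, 2, 2],
})

# Names chosen by a person after inspecting the cluster means (assumption).
profile_names = {
    0: "older, high income",
    1: "middle-aged, middle income",
    2: "young, low income",
}

# Attach a readable profile to every row, then summarize by profile.
toy["Profile"] = toy["Labels"].map(profile_names)
summary = toy.groupby("Profile")[["Age", "Income"]].mean()
print(summary)
```

Keep in mind that the integer assigned to each cluster is arbitrary and can change between runs unless the random seed is fixed, so a mapping like this should be rebuilt after each new fit.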