Comparing Vehicles using Hierarchical Clustering
Imagine that an automobile manufacturer has developed prototypes for a new vehicle. Before introducing the new model into its range, the manufacturer wants to determine which existing vehicles on the market are most like the prototypes: how vehicles can be grouped, which group is most similar to the new model, and therefore which models it will be competing against.
In this blog post, we will use hierarchical clustering to find the most distinctive clusters of existing vehicles. This summarizes the market and will help manufacturers make decisions about the supply of new models.
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.cluster import AgglomerativeClustering
%matplotlib inline
filename = 'cars_clus.csv'
df = pd.read_csv(filename)
print('Shape of the dataframe: ', df.shape)
df.head()
The dataset includes price in thousands (price), engine size (engine_s), horsepower (horsepow), wheelbase (wheelbas), width (width), length (length), curb weight (curb_wgt), fuel capacity (fuel_cap), and fuel efficiency (mpg).
print ("Size of dataset before cleaning: ", df.size)
df[[ 'sales', 'resale', 'type', 'price', 'engine_s',
'horsepow', 'wheelbas', 'width', 'length', 'curb_wgt', 'fuel_cap',
'mpg', 'lnsales']] = df[['sales', 'resale', 'type', 'price', 'engine_s',
'horsepow', 'wheelbas', 'width', 'length', 'curb_wgt', 'fuel_cap',
'mpg', 'lnsales']].apply(pd.to_numeric, errors='coerce')
df = df.dropna()
df = df.reset_index(drop=True)
print ("Size of dataset after cleaning: ", df.size)
df.head(5)
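Before selecting features, a quick look at the cleaned numeric columns can confirm that the ranges look sensible. This is an optional check; the column names follow the dataset description above.
cols = ['engine_s', 'horsepow', 'wheelbas', 'width', 'length', 'curb_wgt', 'fuel_cap', 'mpg']
df[cols].describe()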
Let's select our feature set.
featureset = df[['engine_s', 'horsepow', 'wheelbas', 'width', 'length', 'curb_wgt', 'fuel_cap', 'mpg']]
Next, we normalize the feature set. MinMaxScaler transforms features by scaling each feature to a given range, which is (0, 1) by default. That is, the estimator scales and translates each feature individually so that it lies between 0 and 1.
from sklearn.preprocessing import MinMaxScaler
x = featureset.values #returns a numpy array
min_max_scaler = MinMaxScaler()
feature_mtx = min_max_scaler.fit_transform(x)
feature_mtx [0:5]
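As a quick optional check, every column of the scaled matrix should now lie in the [0, 1] range:
print('Scaled minima: ', feature_mtx.min(axis=0))
print('Scaled maxima: ', feature_mtx.max(axis=0))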
from scipy.spatial import distance

# Pairwise Euclidean distances between all vehicles in the scaled feature space
leng = feature_mtx.shape[0]
D = np.zeros([leng, leng])
for i in range(leng):
    for j in range(leng):
        D[i, j] = distance.euclidean(feature_mtx[i], feature_mtx[j])
In agglomerative clustering, at each iteration the algorithm must update the distance matrix to reflect the distance of the newly formed cluster from the remaining clusters in the forest. With complete linkage, which we use below, the distance between two clusters is the maximum distance between any pair of their members.
import pylab
from scipy.cluster import hierarchy
from scipy.spatial.distance import squareform

# linkage expects a condensed distance matrix, so convert the square matrix D first
Z = hierarchy.linkage(squareform(D), 'complete')
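Each row of the linkage matrix Z records one merge: the indices of the two clusters joined, the distance at which they were merged, and the number of original samples in the new cluster. A quick optional look at the first few merges:
print(Z[:5])  # columns: cluster i, cluster j, merge distance, size of the new cluster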
Essentially, hierarchical clustering does not require a pre-specified number of clusters. However, in some applications we might want a partition of disjoint clusters, just as in flat clustering, so we can cut the tree at a chosen distance.
from scipy.cluster.hierarchy import fcluster
max_d = 3
clusters = fcluster(Z, max_d, criterion='distance')
clusters
Alternatively, you can specify the number of clusters directly.
from scipy.cluster.hierarchy import fcluster
k = 5
clusters = fcluster(Z, k, criterion='maxclust')
clusters
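To see how the vehicles are distributed across the flat clusters, we can count the members of each (a small optional check; fcluster labels start at 1):
unique_labels, counts = np.unique(clusters, return_counts=True)
print(dict(zip(unique_labels, counts)))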
Plot the dendrogram.
fig = pylab.figure(figsize=(18, 50))

def llf(id):
    return '[%s %s %s]' % (df['manufact'][id], df['model'][id], int(float(df['type'][id])))

dendro = hierarchy.dendrogram(Z, leaf_label_func=llf, leaf_rotation=0, leaf_font_size=12, orientation='right')
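With well over a hundred leaves the full dendrogram is tall; if you only want the overall structure, scipy's dendrogram can be truncated to show just the last merges (an optional variation):
fig = pylab.figure(figsize=(12, 6))
dendro_small = hierarchy.dendrogram(Z, truncate_mode='lastp', p=12, leaf_rotation=90, leaf_font_size=12)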
from scipy.spatial import distance_matrix
dist_matrix = distance_matrix(feature_mtx, feature_mtx)
print(dist_matrix)
agglom = AgglomerativeClustering(n_clusters=6, linkage='complete')
agglom.fit(feature_mtx)
agglom.labels_
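Note that these sklearn labels need not match the scipy fcluster labels from before. If you are curious how the two partitions line up, a quick cross-tabulation works; this is optional, and since we asked for 5 clusters above and 6 here, a perfect match is not expected:
pd.crosstab(pd.Series(clusters, name='scipy_fcluster'), pd.Series(agglom.labels_, name='sklearn'))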
We can add a new field to our dataframe to show the cluster of each row.
df['cluster_'] = agglom.labels_
df.head()
import matplotlib.cm as cm
n_clusters = max(agglom.labels_)+1
colors = cm.rainbow(np.linspace(0, 1, n_clusters))
cluster_labels = list(range(0, n_clusters))
plt.figure(figsize=(12,10))
for color, label in zip(colors, cluster_labels):
    subset = df[df.cluster_ == label]
    for i in subset.index:
        plt.text(subset.horsepow[i], subset.mpg[i], str(subset['model'][i]), rotation=25)
    plt.scatter(subset.horsepow, subset.mpg, s=subset.price * 10, c=color, label='cluster' + str(label), alpha=0.5)
plt.legend()
plt.title('Clusters')
plt.xlabel('horsepow')
plt.ylabel('mpg')
As you can see, the scatter plot shows the distribution of each cluster, but it is not very clear where the centroid of each cluster is. Moreover, there are two types of vehicles in our dataset: "truck" (a value of 1 in the type column) and "car" (a value of 0 in the type column). We can use the type to distinguish the classes and summarize each cluster.
First we count the number of cases in each group.
df.groupby(['cluster_','type'])['cluster_'].count()
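The same counts are a little easier to read pivoted into a clusters-by-type table (optional):
df.groupby(['cluster_', 'type'])['cluster_'].count().unstack(fill_value=0)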
Now we can look at the characteristics of each cluster.
agg_cars = df.groupby(['cluster_', 'type'])[['horsepow', 'engine_s', 'mpg', 'price']].mean()
agg_cars
Cars:
- Cluster 1: low horsepower, high mileage, and low price
- Cluster 2: medium horsepower, medium mileage, and medium price
- Cluster 3: high horsepower, low mileage, and high price
- Cluster 4: very low horsepower, very high mileage, and very low price
- Cluster 5: very high horsepower, very low mileage, and very high price
Trucks:
- Cluster 0: high horsepower, low mileage, and high price
- Cluster 1: low horsepower, medium mileage, and low price
- Cluster 2: high horsepower, low mileage, and high price
- Cluster 3: low horsepower, low mileage, and medium price
plt.figure(figsize=(12,8))
for color, label in zip(colors, cluster_labels):
    subset = agg_cars.loc[(label,),]
    for i in subset.index:
        plt.text(subset.loc[i][0] + 5, subset.loc[i][2], 'type=' + str(int(i)) + ', price=' + str(int(subset.loc[i][3])) + 'k')
    plt.scatter(subset.horsepow, subset.mpg, s=subset.price * 20, c=color, label='cluster' + str(label))
plt.legend()
plt.title('Clusters')
plt.xlabel('horsepow')
plt.ylabel('mpg')
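Finally, coming back to the original question: given a prototype's specs, we can scale them with the same MinMaxScaler and assign the prototype to the cluster whose scaled centroid is closest. This is a minimal sketch; the prototype numbers below are made up for illustration, not values from the dataset.
# Hypothetical prototype: engine_s, horsepow, wheelbas, width, length, curb_wgt, fuel_cap, mpg
prototype = np.array([[3.0, 200.0, 107.0, 70.0, 190.0, 3.3, 17.0, 26.0]])
prototype_scaled = min_max_scaler.transform(prototype)

# Centroid of each cluster in the scaled feature space
centroids = np.array([feature_mtx[agglom.labels_ == c].mean(axis=0) for c in range(n_clusters)])

# The nearest centroid identifies the group of existing vehicles the prototype would compete against
nearest = np.argmin(np.linalg.norm(centroids - prototype_scaled, axis=1))
print('Closest cluster:', nearest)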