Using SVM to detect Cancerous Cells
We will use Support Vector Machines (SVM) to build and train a model using human cell records, and classify cells so as to realize whether the samples are benign or malignant.
In this blog post, we will use Support Vector Machines (SVM) to build and train a model using human cell records, and classify cells so as to realize whether the samples are benign or malignant.
The dataset (download here) we will be working with consists of several hundred human cell sample records, each of which contains the values of a set of cell characteristics. The fields in each record are:
- ID: Patient ID
- Clump - Clump thickness
- UnifSize - Uniformity of cell size
- UnifShape - Uniformity of cell shape
- MargAdh - Marginal adhesion
- SingEpiSize - Single epithetical cell size
- BareNuc - Bare nuclei
- BlandChrom - Bland chromatin
- NormNucl - Normal nucleoli
- Mit - Mitoses
- Class - Benign or malignant
SVM works by mapping data to a high-dimensional feature space so that data points can be categorized even when the data is not otherwise linearly separable. A separator between the categories is found, then the data is transformed in such a way that the separator could be drawn as a hyperplane. Following this, characteristics of new data can be used to predict the droup to which a new record should belong.
import pandas as pd
import pylab as pl
import numpy as np
import scipy.optimize as opt
from sklearn.model_selection import train_test_split
%matplotlib inline
import matplotlib.pyplot as plt
df = pd.read_csv('cell_samples.csv')
df.head()
df.shape
The ID field contains the patient identifiers. The characteristics of the cell samples from each patient are contained in fields Clump to Mit. The values are graded from 1 to 10, with 1 being the closest to benign.
The Class field contains the diagnosis, as confirmed by separate medical procedures, as to whether the samples are benign (value = 2) or malignant (value = 4)
Let's look at the distribution of the classes based on Clump thickness and Uniformity of cell size.
ax = df[df['Class'] == 4][0:50].plot(kind='scatter', x='Clump', y='UnifSize', color='Red', label='malignant')
df[df['Class'] == 2][0:50].plot(figsize=(8,6), fontsize=14, kind='scatter', x='Clump', y='UnifSize', color='DarkBlue', label='benign', ax=ax)
plt.ylabel('UnifSize', fontsize=14)
plt.xlabel('Clump', fontsize=14)
plt.show()
df.dtypes
Let's drop the rows that have non numerical values in the BareNuc column. Then we will convert the data type into int
df = df[pd.to_numeric(df['BareNuc'], errors='coerce').notnull()]
df['BareNuc'] = df['BareNuc'].astype('int')
df.dtypes
Create the feature set.
feature_df = df[['Clump', 'UnifSize', 'UnifShape', 'MargAdh', 'SingEpiSize', 'BareNuc', 'BlandChrom', 'NormNucl', 'Mit']]
X = np.asarray(feature_df)
X[0:5]
We need the model to predict the value of Class (i.e. benign=2 or malignant=4). As this target can have one of only two possible values, we need to change its measurement to reflect this.
df['Class'] = df['Class'].astype('int')
y = np.asarray(df['Class'])
y[0:5]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)
print('Train set: ', X_train.shape, y_train.shape)
print('Test set: ', X_test.shape, y_test.shape)
Modeling
The SVM algorithm offers a choice of kernel functions for performing its processing. Basically, mapping data into a higher dimensional spaceis called kerneling. The mathrmatical function used for the transformation is known as the kernel function, and can be of different types, such as:
- Linear
- Polynomial
- Radial Basis Function (RBF)
- Sigmoid Each of these functions has its characteristics, its pros and cons, and its equation, but as there's no easy way of knowing which function performs best with any given dataset, we usually choose different function in turn and compare the results. Let's use the Radial Basis Function for now.
from sklearn import svm
clf = svm.SVC(kernel='rbf')
clf.fit(X_train, y_train)
After being fitted, the model can then be used to predict new values.
yhat = clf.predict(X_test)
yhat[0:5]
from sklearn.metrics import classification_report, confusion_matrix
import itertools
def plot_confusion_matrix(cm, classes, normalize=False,
title='Confusion matrix', cmap=plt.cm.Blues):
'''
This function prints and plots the confusion matrix.
Normalization can be applied by setting `normalize=True`
'''
if normalize:
cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
print('Normalized confusion matrix')
else:
print('Confusion matrix, without normalization')
print(cm)
plt.imshow(cm, interpolation='nearest', cmap=cmap)
plt.title(title)
plt.colorbar()
tick_marks = np.arange(len(classes))
plt.xticks(tick_marks, classes, rotation=45)
plt.yticks(tick_marks, classes)
fmt = '.2f' if normalize else 'd'
thresh = cm.max() / 2.
for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
plt.text(j, i, format(cm[i, j], fmt),
horizontalalignment="center",
color="white" if cm[i, j] > thresh else "black")
plt.tight_layout()
plt.ylabel('True label', fontsize=14)
plt.xlabel('Predicted label', fontsize=14)
cnf_matrix = confusion_matrix(y_test, yhat, labels=[2,4])
np.set_printoptions(precision=2)
print(classification_report(y_test, yhat))
# plot non-normalized confusion martix
plt.figure(figsize=(8,6))
plot_confusion_matrix(cnf_matrix, classes=['Benign(2)','Malignant(4)'], normalize=False)
from sklearn.metrics import f1_score
f1_score(y_test, yhat, average='weighted')
from sklearn.metrics import jaccard_similarity_score
jaccard_similarity_score(y_test, yhat)
Let's try to rebuild the model with other kernel function and check the accuracy.
Accracy using the linear kernel.
clf2 = svm.SVC(kernel='linear')
clf2.fit(X_train, y_train)
yhat2 = clf2.predict(X_test)
print("Avg F1-score: %.4f" % f1_score(y_test, yhat2, average='weighted'))
print("Jaccard score: %.4f" % jaccard_similarity_score(y_test, yhat2))
Accracy using the polynomial kernel.
clf3 = svm.SVC(kernel='poly')
clf3.fit(X_train, y_train)
yhat3 = clf3.predict(X_test)
print("Avg F1-score: %.4f" % f1_score(y_test, yhat3, average='weighted'))
print("Jaccard score: %.4f" % jaccard_similarity_score(y_test, yhat3))
Accracy using the sigmoid kernel.
clf4 = svm.SVC(kernel='sigmoid')
clf4.fit(X_train, y_train)
yhat4 = clf4.predict(X_test)
print("Avg F1-score: %.4f" % f1_score(y_test, yhat4, average='weighted'))
print("Jaccard score: %.4f" % jaccard_similarity_score(y_test, yhat4))
As we can see, the polynomial function results in the most accurate model.
cnf_matrix = confusion_matrix(y_test, yhat3, labels=[2,4])
np.set_printoptions(precision=2)
print(classification_report(y_test, yhat3))
# plot non-normalized confusion martix
plt.figure(figsize=(8,6))
plot_confusion_matrix(cnf_matrix, classes=['Benign(2)','Malignant(4)'], normalize=False)
Result
The first row is for the cells that are benign in the test set. Out of 137 cells, 90 are benign, and the model correctly predicted 86 of them. Hence, this is an accurate model.
The second row is for the cells that are malignant. There are 47 malignant cells in the test set and the model predicted all of them accurately. That is a perfect result.