Predicting Wine Quality
In this blog post we will analyze the quality of red and white wines and identify which attributes affect wine quality the most.
There are two datasets, related to red and white Vinho Verde wine samples, from the north of Portugal. The datasets can be downloaded here. The goal is to model wine quality based on physicochemical tests. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).
Attribute Information:
Input variables (based on physicochemical tests):
- fixed acidity
- volatile acidity
- citric acid
- residual sugar
- chlorides
- free sulfur dioxide
- total sulfur dioxide
- density
- pH
- sulphates
- alcohol
Output variable (based on sensory data):
- quality (score between 0 and 10)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
# note: the original UCI file is semicolon-delimited; if you use it directly, pass sep=';'
df_red = pd.read_csv('winequality-red.csv')
df_red.head()
df_red.shape
There are 1,599 samples and 12 features, including our target feature - the wine quality.
df_red.info()
All features are numeric and there are no missing values.
df_red.describe()
To understand how much each attribute correlates with the wine quality, we can compute the standard correlation coefficient or Pearson's r between every pair of attributes.
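As a sanity check, Pearson's r can also be computed by hand from its definition: the covariance of the two variables divided by the product of their standard deviations. The arrays below are toy values, not taken from the dataset:

```python
import numpy as np

def pearson_r(x, y):
    # r = cov(x, y) / (std(x) * std(y)), written out explicitly
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xm, ym = x - x.mean(), y - y.mean()
    return (xm * ym).sum() / np.sqrt((xm ** 2).sum() * (ym ** 2).sum())

x = [9.4, 9.8, 10.5, 11.2, 12.0]   # toy "alcohol" values
y = [5, 5, 6, 6, 7]                # toy "quality" scores
print(pearson_r(x, y))             # matches np.corrcoef(x, y)[0, 1]
```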
corr_matrix = df_red.corr()
corr_matrix['quality'].sort_values(ascending=False)
We now know the features that most affect the wine quality.
Wine quality correlates positively with alcohol, sulphates, and citric acid.
Wine quality correlates negatively with volatile acidity, total sulfur dioxide, and density.
df_red.hist(bins=10, figsize=(20, 15))
plt.show()
df_red.plot(kind='density', subplots=True, figsize=(20,15),
layout=(4,3), sharex=False)
plt.show()
sns.distplot(df_red['quality'], hist=True, kde=True,
bins='auto', color = 'darkblue',
hist_kws={'edgecolor':'black'},
kde_kws={'linewidth': 4})
plt.show()
The distributions of the alcohol, citric acid, and sulfur dioxide attributes are positively skewed.
The density and pH attributes are approximately normally distributed.
The wine quality scores are concentrated at 5 and 6: there are far more wines of average quality than of notably good or bad quality.
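Skewness can be quantified rather than eyeballed: `scipy.stats.skew` returns a positive value for right-skewed data and a value near zero for symmetric data. This snippet uses synthetic data as a stand-in for the wine attributes:

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
right_skewed = rng.exponential(scale=1.0, size=10_000)  # long right tail
symmetric = rng.normal(size=10_000)                     # roughly zero skew

print(skew(right_skewed))  # clearly positive
print(skew(symmetric))     # near zero
```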
from pandas.plotting import scatter_matrix
sm = scatter_matrix(df_red, figsize=(12, 12), diagonal='kde')
#Change label rotation
[s.xaxis.label.set_rotation(40) for s in sm.reshape(-1)]
[s.yaxis.label.set_rotation(0) for s in sm.reshape(-1)]
#May need to offset label when rotating to prevent overlap of figure
[s.get_yaxis().set_label_coords(-0.6,0.5) for s in sm.reshape(-1)]
#Hide all ticks
[s.set_xticks(()) for s in sm.reshape(-1)]
[s.set_yticks(()) for s in sm.reshape(-1)]
plt.show()
Let's create a pivot table that describes the median value of each feature for each quality score.
column_names = ['fixed acidity', 'volatile acidity', 'citric acid',
'residual sugar', 'chlorides', 'free sulfur dioxide',
'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol']
df_red_pivot_table = df_red.pivot_table(column_names,
['quality'],
aggfunc='median')
df_red_pivot_table
We can see just how much effect the alcohol content and volatile acidity have on the wine quality.
We can plot a correlation matrix to see how each pair of variables relates, in both direction and magnitude.
column_names = ['fixed acidity', 'volatile acidity', 'citric acid',
'residual sugar', 'chlorides', 'free sulfur dioxide',
'total sulfur dioxide', 'density', 'pH', 'sulphates',
'alcohol', 'quality']
# plot correlation matrix
fig, ax = plt.subplots(figsize=(20, 20))
colormap = sns.diverging_palette(220, 10, as_cmap=True)
sns.heatmap(corr_matrix, cmap=colormap, annot=True,
fmt='.2f', annot_kws={'size': 20})
ax.set_xticklabels(column_names,
rotation=45,
horizontalalignment='right',
fontsize=20);
ax.set_yticklabels(column_names, fontsize=20);
plt.show()
Data Cleaning
In our dataset there aren't any missing values, outliers, or attributes that provide no useful information for the task. So we can conclude that our dataset is quite clean.
The wine preference scores range from 3 to 8, so it is straightforward to bin the wines into 'bad' and 'good' categories. We will assign the discrete values 0 and 1 to the corresponding categories.
bins = (2, 6, 8)
group_names = ['bad', 'good']
df_red['quality'] = pd.cut(df_red['quality'], bins = bins, labels = group_names)
from sklearn.preprocessing import LabelEncoder
# let's assign labels to our quality variable
label_quality = LabelEncoder()
# Bad becomes 0 and good becomes 1
df_red['quality'] = label_quality.fit_transform(df_red['quality'])
print(df_red['quality'].value_counts())
sns.countplot(df_red['quality'])
plt.show()
X = df_red.drop('quality', axis=1)
y = df_red['quality']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42)
We will standardize the features so that they are on a comparable scale, which helps distance-based and gradient-based models.
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)  # transform only: reuse the training-set statistics
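Note that the scaler should be fitted on the training set only and then applied unchanged to the test set; otherwise test-set statistics leak into preprocessing. A toy illustration:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[4.0]])

scaler = StandardScaler().fit(train)  # learn mean/std from the training data only
print(scaler.transform(test))         # scales 4.0 using the *training* statistics
# calling fit_transform(test) instead would re-fit on the test set and map 4.0 to 0
```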
Modeling
We will be evaluating 8 different algorithms.
- Support Vector Classifier
- Stochastic Gradient Descent Classifier
- Random Forest Classifier
- Decision Tree Classifier
- Gaussian Naive Bayes
- K-Neighbors Classifier
- Ada Boost Classifier
- Logistic Regression
The key to a fair comparison of machine learning algorithms is ensuring that each algorithm is evaluated in the same way on the same data. K-fold Cross Validation provides a solution to this problem by dividing the data into folds and ensuring that each fold is used as a testing set at some point.
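The fold mechanics can be seen directly with sklearn's `KFold` on a small dummy array: every sample lands in the test fold exactly once across the splits.

```python
import numpy as np
from sklearn.model_selection import KFold

X_dummy = np.arange(10).reshape(-1, 1)
kfold = KFold(n_splits=5, shuffle=True, random_state=7)

test_indices = []
for train_idx, test_idx in kfold.split(X_dummy):
    test_indices.extend(test_idx)  # collect each fold's held-out samples

print(sorted(test_indices))  # every index 0..9 appears exactly once
```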
from sklearn.svm import SVC
from sklearn.linear_model import SGDClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn import model_selection
from sklearn.model_selection import GridSearchCV, cross_val_score
from warnings import simplefilter
# ignore all future warnings
simplefilter(action='ignore', category=FutureWarning)
seed = 7
# prepare models
models = []
models.append(('SupportVectorClassifier', SVC()))
models.append(('StochasticGradientDescentC', SGDClassifier()))
models.append(('RandomForestClassifier', RandomForestClassifier()))
models.append(('DecisionTreeClassifier', DecisionTreeClassifier()))
models.append(('GaussianNB', GaussianNB()))
models.append(('KNeighborsClassifier', KNeighborsClassifier()))
models.append(('AdaBoostClassifier', AdaBoostClassifier()))
models.append(('LogisticRegression', LogisticRegression()))
# evaluate each model in turn
results = []
names = []
scoring = 'accuracy'
for name, model in models:
    kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
    cv_results = model_selection.cross_val_score(model, X_train, y_train,
                                                 cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
# boxplot algorithm comparison
fig = plt.figure(figsize=(40, 20))
fig.suptitle('Algorithm Comparison', fontsize=40)
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names, fontdict={'fontsize': 20})
plt.show()
The box plots of the accuracy distributions are quite symmetrical, with few outliers. For most models the quartiles sit close together, which corresponds to a tight spread of accuracy scores.
Hyperparameter Tuning
Several factors help determine which algorithm performs best: one is its performance on the cross-validation set, and another is the choice of hyperparameters.
SVC
Let's fine-tune our algorithms. The first algorithm that we trained and evaluated was the Support Vector Classifier and the mean value for model prediction was 0.889. We will use GridSearchCV for the hyperparameter tuning.
svc = SVC()
svc.fit(X_train, y_train)
pred_svc = svc.predict(X_test)
def svc_param_selection(X, y, nfolds):
    param = {
        'C': [0.1, 0.8, 0.9, 1, 1.1, 1.2, 1.3, 1.4],
        'kernel': ['linear', 'rbf'],
        'gamma': [0.1, 0.8, 0.9, 1, 1.1, 1.2, 1.3, 1.4]
    }
    grid_search = GridSearchCV(svc, param_grid=param,
                               scoring='accuracy', cv=nfolds)
    grid_search.fit(X, y)
    return grid_search.best_params_
svc_param_selection(X_train, y_train, 10)
Hence, the best parameters for the SVC algorithm are {'C': 1.2, 'gamma': 0.9, 'kernel': 'rbf'}.
Let's run our SVC algorithm again with the best parameters.
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, mean_absolute_error
svc2 = SVC(C= 1.2, gamma= 0.9, kernel= 'rbf')
svc2.fit(X_train, y_train)
pred_svc2 = svc2.predict(X_test)
print('Confusion matrix')
print(confusion_matrix(y_test, pred_svc2))
print('Classification report')
print(classification_report(y_test, pred_svc2))
print('Accuracy score',accuracy_score(y_test, pred_svc2))
The overall accuracy of the classifier is 89.69%, and the weighted-average f1-score is 0.88, which is very good.
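Accuracy can be read off the confusion matrix directly: it is the trace (the correct predictions on the diagonal) divided by the total count. A quick check with made-up labels, not the wine predictions:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score

y_true = [0, 0, 0, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 1, 0, 0, 1, 0]
cm = confusion_matrix(y_true, y_pred)

# diagonal = correct predictions; trace / total = accuracy
acc_from_cm = np.trace(cm) / cm.sum()
print(cm)
print(acc_from_cm)  # equals accuracy_score(y_true, y_pred)
```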
sgd = SGDClassifier(loss='hinge', penalty='l2', max_iter=60)
sgd.fit(X_train, y_train)
pred_sgd = sgd.predict(X_test)
rfc = RandomForestClassifier(n_estimators=200, max_depth=20,
random_state=0)
rfc.fit(X_train, y_train)
pred_rfc = rfc.predict(X_test)
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
pred_knn = knn.predict(X_test)
def knn_param_selection(X, y, nfolds):
    param = {
        'n_neighbors': [2, 3, 4, 5, 6],
        'weights': ['uniform', 'distance'],
        'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
        'p': [1, 2]
    }
    grid_search = GridSearchCV(knn, param_grid=param,
                               scoring='accuracy', cv=nfolds)
    grid_search.fit(X, y)
    return grid_search.best_params_
knn_param_selection(X_train, y_train, 10)
Hence, the best parameters for the KNeighborsClassifier algorithm are {'algorithm': 'auto', 'n_neighbors': 4, 'p': 2, 'weights': 'distance'}.
Let's run our knn algorithm again with the best parameters.
knn2 = KNeighborsClassifier(algorithm='auto',
                            n_neighbors=4, p=2,
                            weights='distance')
knn2.fit(X_train, y_train)
pred_knn2 = knn2.predict(X_test)
print('Confusion matrix')
print(confusion_matrix(y_test, pred_knn2))
print('Classification report')
print(classification_report(y_test, pred_knn2))
print('Accuracy score',accuracy_score(y_test, pred_knn2))
The overall accuracy of the classifier is 90.3%, and the weighted-average f1-score is 0.90, which is very good.
ada_classifier = AdaBoostClassifier(n_estimators=100)
ada_classifier.fit(X_train, y_train)
pred_ada = ada_classifier.predict(X_test)
scores = cross_val_score(ada_classifier,X_test,y_test, cv=5)
print('Accuracy score',scores.mean())
def evaluate(model, test_features, test_labels):
    predictions = model.predict(test_features)
    print('Model Performance')
    print('Average Error: {:0.4f}.'.format(
        mean_absolute_error(test_labels, predictions)))
    print('Accuracy = {:0.2f}%.'.format(accuracy_score(
        test_labels, predictions) * 100))
evaluate(svc,X_test,y_test)
evaluate(svc2,X_test,y_test)
evaluate(sgd,X_test,y_test)
evaluate(rfc,X_test,y_test)
evaluate(knn2, X_test, y_test)
evaluate(ada_classifier,X_test,y_test)
The KNeighborsClassifier model with hyperparameter tuning performs the best with an accuracy of 90.31%.
importance = rfc.feature_importances_
std = np.std([tree.feature_importances_ for tree in rfc.estimators_],
axis=0)
indices = np.argsort(importance)
# plot the feature importances of the forest
plt.figure()
plt.title("Feature importances")
plt.barh(range(X.shape[1]), importance[indices],
color="b", align="center")
plt.yticks(range(X.shape[1]), X.columns[indices])  # reorder labels to match the sorted bars
plt.ylim([0, X.shape[1]])
plt.show()
df_white = pd.read_csv('winequality-white.csv')
df_white.head()
df_white.shape
There are 4,898 samples and 12 features, including our target feature - the wine quality.
df_white.info()
df_white.describe()
corr_matrix2 = df_white.corr()
corr_matrix2['quality'].sort_values(ascending=False)
df_white.hist(bins=10, figsize=(20, 15))
plt.show()
column_names = ['fixed acidity', 'volatile acidity', 'citric acid',
'residual sugar', 'chlorides', 'free sulfur dioxide',
'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol']
df_white_pivot_table = df_white.pivot_table(column_names,
['quality'],
aggfunc='median')
df_white_pivot_table
df_white.plot(kind='density', subplots=True, figsize=(20,15),
layout=(4,3), sharex=False)
plt.show()
sns.distplot(df_white['quality'], hist=True, kde=True,
bins='auto', color = 'darkblue',
hist_kws={'edgecolor':'black'},
kde_kws={'linewidth': 4})
plt.show()
from pandas.plotting import scatter_matrix
sm = scatter_matrix(df_white, figsize=(12, 12), diagonal='kde')
#Change label rotation
[s.xaxis.label.set_rotation(40) for s in sm.reshape(-1)]
[s.yaxis.label.set_rotation(0) for s in sm.reshape(-1)]
#May need to offset label when rotating to prevent overlap of figure
[s.get_yaxis().set_label_coords(-0.6,0.5) for s in sm.reshape(-1)]
#Hide all ticks
[s.set_xticks(()) for s in sm.reshape(-1)]
[s.set_yticks(()) for s in sm.reshape(-1)]
plt.show()
column_names = ['fixed acidity', 'volatile acidity', 'citric acid',
'residual sugar', 'chlorides', 'free sulfur dioxide',
'total sulfur dioxide', 'density', 'pH', 'sulphates',
'alcohol', 'quality']
# plot correlation matrix
fig, ax = plt.subplots(figsize=(20, 20))
colormap = sns.diverging_palette(220, 10, as_cmap=True)
sns.heatmap(corr_matrix2, cmap=colormap, annot=True,
fmt='.2f', annot_kws={'size': 20})
ax.set_xticklabels(column_names,
rotation=45,
horizontalalignment='right',
fontsize=20);
ax.set_yticklabels(column_names, fontsize=20);
plt.show()
bins = (2, 6, 9)
group_names = ['bad', 'good']
df_white['quality'] = pd.cut(df_white['quality'], bins = bins, labels = group_names)
label_quality = LabelEncoder()
# Bad becomes 0 and good becomes 1
df_white['quality'] = label_quality.fit_transform(df_white['quality'])
print(df_white['quality'].value_counts())
sns.countplot(df_white['quality'])
plt.show()
There are 3,838 bad quality wines, and 1,060 good quality wines.
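With roughly 3.6 bad-quality wines for every good one, a plain random split can drift from these proportions. `train_test_split` accepts a `stratify` argument that preserves the class ratio in both sets; a sketch on synthetic labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

y_toy = np.array([0] * 80 + [1] * 20)  # 80/20 class imbalance
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42)

print(y_te.mean())  # minority share is preserved in the test set
```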
X = df_white.drop('quality', axis=1)
y = df_white['quality']
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42)
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)  # transform only: reuse the training-set statistics
seed = 7
# prepare models
models = []
models.append(('SupportVectorClassifier', SVC()))
models.append(('StochasticGradientDescentC', SGDClassifier()))
models.append(('RandomForestClassifier', RandomForestClassifier()))
models.append(('DecisionTreeClassifier', DecisionTreeClassifier()))
models.append(('GaussianNB', GaussianNB()))
models.append(('KNeighborsClassifier', KNeighborsClassifier()))
models.append(('AdaBoostClassifier', AdaBoostClassifier()))
models.append(('LogisticRegression', LogisticRegression()))
# evaluate each model in turn
results = []
names = []
scoring = 'accuracy'
for name, model in models:
    kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
    cv_results = model_selection.cross_val_score(model, X_train, y_train,
                                                 cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
# boxplot algorithm comparison
fig = plt.figure(figsize=(40, 20))
fig.suptitle('Algorithm Comparison', fontsize=40)
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names, fontdict={'fontsize': 20})
plt.show()
svc = SVC()
svc.fit(X_train, y_train)
pred_svc = svc.predict(X_test)
def svc_param_selection(X, y, nfolds):
    param = {
        'C': [0.1, 0.8, 0.9, 1, 1.1, 1.2, 1.3, 1.4],
        'kernel': ['linear', 'rbf'],
        'gamma': [0.1, 0.8, 0.9, 1, 1.1, 1.2, 1.3, 1.4]
    }
    grid_search = GridSearchCV(svc, param_grid=param,
                               scoring='accuracy', cv=nfolds)
    grid_search.fit(X, y)
    return grid_search.best_params_
svc_param_selection(X_train, y_train, 10)
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, mean_absolute_error
svc2 = SVC(C= 1.4, gamma= 1.1, kernel= 'rbf')
svc2.fit(X_train, y_train)
pred_svc2 = svc2.predict(X_test)
print('Confusion matrix')
print(confusion_matrix(y_test, pred_svc2))
print('Classification report')
print(classification_report(y_test, pred_svc2))
print('Accuracy score',accuracy_score(y_test, pred_svc2))
sgd = SGDClassifier(loss='hinge', penalty='l2', max_iter=60)
sgd.fit(X_train, y_train)
pred_sgd = sgd.predict(X_test)
rfc = RandomForestClassifier(n_estimators=200, max_depth=20,
random_state=0)
rfc.fit(X_train, y_train)
pred_rfc = rfc.predict(X_test)
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
pred_knn = knn.predict(X_test)
def knn_param_selection(X, y, nfolds):
    param = {
        'n_neighbors': [2, 3, 4, 5, 6],
        'weights': ['uniform', 'distance'],
        'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
        'p': [1, 2]
    }
    grid_search = GridSearchCV(knn, param_grid=param,
                               scoring='accuracy', cv=nfolds)
    grid_search.fit(X, y)
    return grid_search.best_params_
knn_param_selection(X_train, y_train, 10)
knn2 = KNeighborsClassifier(algorithm= 'auto',
n_neighbors= 5, p=2,
weights='distance')
knn2.fit(X_train, y_train)
pred_knn2 = knn2.predict(X_test)
print('Confusion matrix')
print(confusion_matrix(y_test, pred_knn2))
print('Classification report')
print(classification_report(y_test, pred_knn2))
print('Accuracy score',accuracy_score(y_test, pred_knn2))
ada_classifier = AdaBoostClassifier(n_estimators=100)
ada_classifier.fit(X_train, y_train)
pred_ada = ada_classifier.predict(X_test)
scores = cross_val_score(ada_classifier,X_test,y_test, cv=5)
print('Accuracy score',scores.mean())
def evaluate(model, test_features, test_labels):
    predictions = model.predict(test_features)
    print('Model Performance')
    print('Average Error: {:0.4f}.'.format(
        mean_absolute_error(test_labels, predictions)))
    print('Accuracy = {:0.2f}%.'.format(accuracy_score(
        test_labels, predictions) * 100))
evaluate(svc,X_test,y_test)
evaluate(svc2,X_test,y_test)
evaluate(sgd,X_test,y_test)
evaluate(rfc,X_test,y_test)
evaluate(knn2, X_test, y_test)
evaluate(ada_classifier,X_test,y_test)
importance = rfc.feature_importances_
std = np.std([tree.feature_importances_ for tree in rfc.estimators_],
axis=0)
indices = np.argsort(importance)
# plot the feature importances of the forest
plt.figure()
plt.title("Feature importances")
plt.barh(range(X.shape[1]), importance[indices],
color="b", align="center")
plt.yticks(range(X.shape[1]), X.columns[indices])  # reorder labels to match the sorted bars
plt.ylim([0, X.shape[1]])
plt.show()