In this blog post we will be analyzing the quality of red and white wines, and check which are the attributes that affect wine quality the most.

There are two datasets, related to red and white Vinho Verde wine samples, from the north of Portugal. The datasets can be downloaded here. The goal is to model wine quality based on physicochemical tests. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).

Attribute Information:

Input variables (based on physicochemical tests):

fixed acidity
volatile acidity
citric acid
residual sugar
chlorides
free sulfur dioxide
total sulfur dioxide
density
pH
sulphates
alcohol

Output variable (based on sensory data):

quality (score between 0 and 10)

Red Wine

Let's first consider the red wine dataset.

Data Analysis

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

df_red = pd.read_csv('winequality-red.csv')
df_red.head()

df_red.shape

(1599, 12)

There are 1,599 samples and 12 features, including our target feature - the wine quality.

df_red.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         1599 non-null   float64
 1   volatile acidity      1599 non-null   float64
 2   citric acid           1599 non-null   float64
 3   residual sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free sulfur dioxide   1599 non-null   float64
 6   total sulfur dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11  quality               1599 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 150.0 KB

All of our dataset is numeric and there are no missing values.

df_red.describe()

To understand how much each attribute correlates with the wine quality, we can compute the standard correlation coefficient or Pearson's r between every pair of attributes.

corr_matrix = df_red.corr()
corr_matrix['quality'].sort_values(ascending=False)

quality                 1.000000
alcohol                 0.476166
sulphates               0.251397
citric acid             0.226373
fixed acidity           0.124052
residual sugar          0.013732
free sulfur dioxide    -0.050656
pH                     -0.057731
chlorides              -0.128907
density                -0.174919
total sulfur dioxide   -0.185100
volatile acidity       -0.390558
Name: quality, dtype: float64

We now know the features that most affect the wine quality.
Wine quality is directly proportional to the amount of alcohol, sulphates, citric acid.
Wine quality is inversely proportional to the amount of volatile acidity, total sulfur dioxide, density.

Data Visualization

Let's visualize the data by creating histograms and density plots. We can understand the distribution for separate attributes.

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

df_red.hist(bins=10, figsize=(20, 15))
plt.show()

df_red.plot(kind='density', subplots=True, figsize=(20,15),
           layout=(4,3), sharex=False)
plt.show()

sns.distplot(df_red['quality'], hist=True, kde=True,
             bins='auto', color = 'darkblue',
             hist_kws={'edgecolor':'black'},
             kde_kws={'linewidth': 4})
plt.show()

The data distribution for the alcohol, citric acid and sulfur dioxide content atrributes is positively skewed.
The data distribution for the density and pH attributes is quite normally distributed.
The wine quality data distribution is a bimodal distribution and there are more wines with an average quality than wines with good or bad quality.

from pandas.plotting import scatter_matrix
sm = scatter_matrix(df_red, figsize=(12, 12), diagonal='kde')
#Change label rotation
[s.xaxis.label.set_rotation(40) for s in sm.reshape(-1)]
[s.yaxis.label.set_rotation(0) for s in sm.reshape(-1)]
#May need to offset label when rotating to prevent overlap of figure
[s.get_yaxis().set_label_coords(-0.6,0.5) for s in sm.reshape(-1)]
#Hide all ticks
[s.set_xticks(()) for s in sm.reshape(-1)]
[s.set_yticks(()) for s in sm.reshape(-1)]
plt.show()

Let's create a pivot table that describes the median value of each feature for each quality score.

column_names = ['fixed acidity', 'volatile acidity', 'citric acid', 
                'residual sugar', 'chlorides', 'free sulfur dioxide', 
                'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol']

df_red_pivot_table = df_red.pivot_table(column_names, 
                                    ['quality'],
                                    aggfunc='median')

df_red_pivot_table

We can see just how much effect does the alcohol content and volatile acidity have on the quality of the wine.

We can plot a correlation matrix to see how two variables interact, both in direction and magnitude.

column_names = ['fixed acidity', 'volatile acidity', 'citric acid', 
                'residual sugar', 'chlorides', 'free sulfur dioxide', 
                'total sulfur dioxide', 'density', 'pH', 'sulphates', 
                'alcohol', 'quality']

# plot correlation matrix
fig, ax = plt.subplots(figsize=(20, 20))
colormap = sns.diverging_palette(220, 10, as_cmap=True)

sns.heatmap(corr_matrix, cmap=colormap, annot=True,
           fmt='.2f', annot_kws={'size': 20})
ax.set_xticklabels(column_names, 
                   rotation=45, 
                   horizontalalignment='right',
                   fontsize=20);
ax.set_yticklabels(column_names, fontsize=20);
plt.show()

Data Cleaning

In our dataset, there aren't any missing values, outliers, or attributes that provide no useful information for the task. So, we could conclude than our dataset is quite clean.
The wine preference scores vary from 3 to 8, so it's straightforward to categorize them into 'bad' or 'good' quality of wines. We will assign discrete values of 0 and 1 for the corresponding categories.

bins = (2, 6, 8)
group_names = ['bad', 'good']
df_red['quality'] = pd.cut(df_red['quality'], bins = bins, labels = group_names)

from sklearn.preprocessing import LabelEncoder

# let's assign labels to our quality variable
label_quality = LabelEncoder()
# Bad becomes 0 and good becomes 1
df_red['quality'] = label_quality.fit_transform(df_red['quality'])
print(df_red['quality'].value_counts())

sns.countplot(df_red['quality'])
plt.show()

0    1382
1     217
Name: quality, dtype: int64

As we can see, there are far more bad quality red wines (1,382) than good quality ones (217).

Train/Test Split

Now we will split the dataset into a training set and a testing set.

X = df_red.drop('quality', axis=1)
y = df_red['quality']

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

Data Preprocessing

We will scale the features so as to get optimized results.

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.fit_transform(X_test)

Modeling

We will be evaluating 8 different algorithms.

Support Vector Classifier
Stochastic Gradient Decent Classifier
Random Forest Classifier
Decision Tree Classifier
Gaussian Naive Bayes
K-Neighbors Classifier
Ada Boost Classifier
Logistic Regression

The key to a fair comparison of machine learning algorithms is ensuring that each algorithm is evaluated in the same way on the same data. K-fold Cross Validation provides a solution to this problem by dividing the data into folds and ensuring that each fold is used as a testing set at some point.

from sklearn.svm import SVC
from sklearn.linear_model import SGDClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn import model_selection
from sklearn.model_selection import GridSearchCV, cross_val_score

from warnings import simplefilter
# ignore all future warnings
simplefilter(action='ignore', category=FutureWarning)

seed = 7
# prepare models
models = []
models.append(('SupportVectorClassifier', SVC()))
models.append(('StochasticGradientDecentC', SGDClassifier()))
models.append(('RandomForestClassifier', RandomForestClassifier()))
models.append(('DecisionTreeClassifier', DecisionTreeClassifier()))
models.append(('GaussianNB', GaussianNB()))
models.append(('KNeighborsClassifier', KNeighborsClassifier()))
models.append(('AdaBoostClassifier', AdaBoostClassifier()))
models.append(('LogisticRegression', LogisticRegression()))# evaluate each model in turn
results = []
names = []
scoring = 'accuracy'
for name, model in models:
   kfold = model_selection.KFold(n_splits=10, random_state=seed)
   cv_results = model_selection.cross_val_score(model, X_train, y_train, cv=kfold, scoring=scoring)
   results.append(cv_results)
   names.append(name)
   msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
   print(msg)
    
# boxplot algorithm comparison
fig = plt.figure(figsize=(40, 20))
fig.suptitle('Algorithm Comparison', fontsize=40)
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names, fontdict={'fontsize': 20})
plt.show()

SupportVectorClassifier: 0.889782 (0.023210)
StochasticGradientDecentC: 0.849151 (0.044240)
RandomForestClassifier: 0.912457 (0.029968)
DecisionTreeClassifier: 0.877264 (0.028120)
GaussianNB: 0.836559 (0.022781)
KNeighborsClassifier: 0.873364 (0.021081)
AdaBoostClassifier: 0.885876 (0.019715)
LogisticRegression: 0.883526 (0.031077)

The Box Plots of these algorithms' accuracy distribution is quite symmetrical, with negligible outliers. The adjacent box plot values are close together, which correspond to the high density of accuracy scores.

Hyperparameter Tuning

There are several factors that can help us determine which algorithm performs best. One such factor is the performance on the cross-validation set and another factor is the choice of parameters for an algorithm.

SVC

Let's fine-tune our algorithms. The first algorithm that we trained and evaluated was the Support Vector Classifier and the mean value for model prediction was 0.889. We will use GridSearchCV for the hyperparameter tuning.

svc = SVC()
svc.fit(X_train, y_train)
pred_svc = svc.predict(X_test)

def svc_param_selection(X, y, nfolds):
    param = {
        'C': [0.1, 0.8, 0.9, 1, 1.1, 1.2, 1.3, 1.4],
        'kernel': ['linear', 'rbf'],
        'gamma': [0.1, 0.8, 0.9, 1, 1.1, 1.2, 1.3, 1.4]
    }
    grid_search = GridSearchCV(svc, param_grid=param, 
                               scoring='accuracy', cv=nfolds)
    grid_search.fit(X, y)
    return grid_search.best_params_

svc_param_selection(X_train, y_train, 10)

{'C': 1.2, 'gamma': 0.9, 'kernel': 'rbf'}

Hence, the best parameters for the SVC algorithm are {C= 1.2, gamma= 0.9 , kernel= rbf}.
Let's run our SVC algorithm again with the best parameters.

from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, mean_absolute_error

svc2 = SVC(C= 1.2, gamma= 0.9, kernel= 'rbf')
svc2.fit(X_train, y_train)
pred_svc2 = svc2.predict(X_test)
print('Confusion matrix')
print(confusion_matrix(y_test, pred_svc2))
print('Classification report')
print(classification_report(y_test, pred_svc2))
print('Accuracy score',accuracy_score(y_test, pred_svc2))

Confusion matrix
[[271   2]
 [ 31  16]]
Classification report
              precision    recall  f1-score   support

           0       0.90      0.99      0.94       273
           1       0.89      0.34      0.49        47

    accuracy                           0.90       320
   macro avg       0.89      0.67      0.72       320
weighted avg       0.90      0.90      0.88       320

Accuracy score 0.896875

The overall accuracy of the classifier is 89.69%, and f1-score of the weighted avg is 0.88, which is very good.

Stochastic Gradient Descent Classifier

sgd = SGDClassifier(loss='hinge', penalty='l2', max_iter=60)
sgd.fit(X_train, y_train)
pred_sgd = sgd.predict(X_test)

Random Forest Classifier

rfc = RandomForestClassifier(n_estimators=200, max_depth=20,
                             random_state=0)
rfc.fit(X_train, y_train)
pred_rfc = rfc.predict(X_test)

KNeighbors Classifier

knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
pred_knn = knn.predict(X_test)

def knn_param_selection(X, y, nfolds):
    param = {
        'n_neighbors': [2, 3, 4, 5, 6],
        'weights': ['uniform', 'distance'],
        'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
        'p': [1, 2]
    }
    grid_search = GridSearchCV(knn, param_grid=param, 
                               scoring='accuracy', cv=nfolds)
    grid_search.fit(X, y)
    return grid_search.best_params_

knn_param_selection(X_train, y_train, 10)

{'algorithm': 'auto', 'n_neighbors': 4, 'p': 2, 'weights': 'distance'}

Hence, the best parameters for the KNeighborsClassifier algorithm are {algorithm= auto, n_neighbors= 4 , p= 2, weights= distance}.
Let's run our knn algorithm again with the best parameters.

knn2 = KNeighborsClassifier(algorithm= 'auto', 
                            n_neighbors= 5, p=2,
                           weights='distance')
knn2.fit(X_train, y_train)
pred_knn2 = knn2.predict(X_test)
print('Confusion matrix')
print(confusion_matrix(y_test, pred_knn2))
print('Classification report')
print(classification_report(y_test, pred_knn2))
print('Accuracy score',accuracy_score(y_test, pred_knn2))

Confusion matrix
[[261  12]
 [ 19  28]]
Classification report
              precision    recall  f1-score   support

           0       0.93      0.96      0.94       273
           1       0.70      0.60      0.64        47

    accuracy                           0.90       320
   macro avg       0.82      0.78      0.79       320
weighted avg       0.90      0.90      0.90       320

Accuracy score 0.903125

The overall accuracy of the classifier is 90.3%, and f1-score of the weighted avg is 0.90, which is very good.

AdaBoost Classifier

ada_classifier = AdaBoostClassifier(n_estimators=100)
ada_classifier.fit(X_train, y_train)
pred_ada = ada_classifier.predict(X_test)

scores = cross_val_score(ada_classifier,X_test,y_test, cv=5)
print('Accuracy score',scores.mean())

Accuracy score 0.84375

Model Evaluation

We can compare the models by calculating their mean absolute error and accuracy.

def evaluate(model, test_features, test_labels):
    predictions = model.predict(test_features)
    print('Model Performance')
    print('Average Error: {:0.4f} degrees.'.format(
        mean_absolute_error(test_labels, predictions)))
    print('Accuracy = {:0.2f}%.'.format(accuracy_score(
        test_labels, predictions)*100))

evaluate(svc,X_test,y_test)
evaluate(svc2,X_test,y_test)
evaluate(sgd,X_test,y_test)
evaluate(rfc,X_test,y_test)
evaluate(knn2, X_test, y_test)
evaluate(ada_classifier,X_test,y_test)

Model Performance
Average Error: 0.1250 degrees.
Accuracy = 87.50%.
Model Performance
Average Error: 0.1031 degrees.
Accuracy = 89.69%.
Model Performance
Average Error: 0.1688 degrees.
Accuracy = 83.12%.
Model Performance
Average Error: 0.1125 degrees.
Accuracy = 88.75%.
Model Performance
Average Error: 0.0969 degrees.
Accuracy = 90.31%.
Model Performance
Average Error: 0.1594 degrees.
Accuracy = 84.06%.

The KNeighborsClassifier model with hyperparameter tuning performs the best with an accuracy of 90.31%.

Feature Importance

We could also analyze the feature importance for an algorithm.

importance = rfc.feature_importances_

std = np.std([tree.feature_importances_ for tree in rfc.estimators_],
             axis=0)
indices = np.argsort(importance)

# plot the feature importances of the forest
plt.figure()
plt.title("Feature importances")
plt.barh(range(X.shape[1]), importance[indices],
       color="b",  align="center")

plt.yticks(range(X.shape[1]), column_names)
plt.ylim([0, X.shape[1]])
plt.show()

White Wine

Let us now consider the white wine dataset.

df_white = pd.read_csv('winequality-white.csv')
df_white.head()

df_white.shape

(4898, 12)

There are 4,898 samples and 12 features, including our target feature - the wine quality.

df_white.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4898 entries, 0 to 4897
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         4898 non-null   float64
 1   volatile acidity      4898 non-null   float64
 2   citric acid           4898 non-null   float64
 3   residual sugar        4898 non-null   float64
 4   chlorides             4898 non-null   float64
 5   free sulfur dioxide   4898 non-null   float64
 6   total sulfur dioxide  4898 non-null   float64
 7   density               4898 non-null   float64
 8   pH                    4898 non-null   float64
 9   sulphates             4898 non-null   float64
 10  alcohol               4898 non-null   float64
 11  quality               4898 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 459.3 KB

df_white.describe()

corr_matrix2 = df_white.corr()
corr_matrix2['quality'].sort_values(ascending=False)

quality                 1.000000
alcohol                 0.435575
pH                      0.099427
sulphates               0.053678
free sulfur dioxide     0.008158
citric acid            -0.009209
residual sugar         -0.097577
fixed acidity          -0.113663
total sulfur dioxide   -0.174737
volatile acidity       -0.194723
chlorides              -0.209934
density                -0.307123
Name: quality, dtype: float64

The features that have the biggest impact on wine quality are alcohol, pH, suplphates, chlorides, density, and volatile acidity.