Wine Machine Learning

Wine Type Breakdown

A look at the data shows that there are more white wines than red wines in our dataset. The Vinho Verde region is primarily known for light, fruity and slightly effervescent white wines so this is to be expected, however, we were somewhat concerned about an unblanced dataset.

Dataset Features

Luckily for us the dataset was primarily clean and filled with continous (numerical) data. The chemical makeup of white wines and red wines are essentially identical which makes this data ideal for binary classification using a logistic regression model.

Feature Name
fixed acidity
volatile acidity
citric acid
residual sugar
chlorides
free sulfur dioxide
total sulfur dioxide
density
pH
sulphates
alcohol

Results of the Model

Classification Report

Our model returned an accuracy of 88%.

	precision	recall	f1-score	support
0.0	0.79	0.73	0.76	499.00
1.0	0.91	0.94	0.92	1450.00
accuracy	0.88	0.88	0.88	0.88
macro avg	0.85	0.83	0.84	1949.00
weighted avg	0.88	0.88	0.88	1949.00

Confusion Matrix

Here you can see the model was good at classifying wines that were not white.

n=1949	Predicted YES	Predicted NO
Actual YES
Actual NO

Mean Absolute Error = 0.117

Mean Squared Error = 0.117

Root Mean Squared Error = 0.342

Tuning and Evaluating the Model

An 88% percent accuracy score is pretty good but we wanted to see whether we could improve that with some advanced techniques. Using Grid Search Cross Validation, we were able to increase our score from 88% to a whopping 93%!

Area Under the Curve

Area Under the Curve or AUC represents the probability that a model ranks a positive example more highly than a radnom negative one. AUC ranges from 0 to 1. Models that are 100% right would have an AUC of 1.0 and those that are 100% wrong would have an AUC of 0.0. As you can see, our model returned am AUC of .95, suggesting that this model is highly accurate.