Regression Model

Logistic Regression

Using data from UCI's wine dataset analyzing the chemical breakdown of wines from Portugal's Vinho Verde region, we set out to design a model that could predict wine type (white or red) based on chemical analyses.

Vinho Verde Map





Wine Type Breakdown

A look at the data shows that there are more white wines than red wines in our dataset. The Vinho Verde region is primarily known for light, fruity and slightly effervescent white wines so this is to be expected, however, we were somewhat concerned about an unblanced dataset.




Dataset Features

Luckily for us the dataset was primarily clean and filled with continous (numerical) data. The chemical makeup of white wines and red wines are essentially identical which makes this data ideal for binary classification using a logistic regression model.

Feature Name
fixed acidity
volatile acidity
citric acid
residual sugar
chlorides
free sulfur dioxide
total sulfur dioxide
density
pH
sulphates
alcohol



Results of the Model



Classification Report

Our model returned an accuracy of 88%.

precision recall f1-score support
0.0 0.79 0.73 0.76 499.00
1.0 0.91 0.94 0.92 1450.00
accuracy 0.88 0.88 0.88 0.88
macro avg 0.85 0.83 0.84 1949.00
weighted avg 0.88 0.88 0.88 1949.00


Confusion Matrix

Here you can see the model was good at classifying wines that were not white.

n=1949 Predicted YES Predicted NO
Actual YES
Actual NO

Mean Absolute Error = 0.117

Mean Squared Error = 0.117

Root Mean Squared Error = 0.342




Tuning and Evaluating the Model

An 88% percent accuracy score is pretty good but we wanted to see whether we could improve that with some advanced techniques. Using Grid Search Cross Validation, we were able to increase our score from 88% to a whopping 93%!

AUC Curve
Area Under the Curve

Area Under the Curve or AUC represents the probability that a model ranks a positive example more highly than a radnom negative one. AUC ranges from 0 to 1. Models that are 100% right would have an AUC of 1.0 and those that are 100% wrong would have an AUC of 0.0. As you can see, our model returned am AUC of .95, suggesting that this model is highly accurate.

AUC Curve