Red Wine Quality Classification

Pranav Kumar
6 min read · Feb 10, 2021

The more it ages, the better it gets


Introduction

Red wine is an alcoholic beverage made by fermenting the juice of dark-skinned grapes. Red wine differs from white wine in its base material and production process.

The first and most obvious characteristic of red wine is the color. Red wines range in hue from deep, opaque purple to pale ruby and everything in between. As red wine ages, its bright, youthful colors turn garnet and even brown.

Dataset

The two datasets are related to red and white variants of the Portuguese “Vinho Verde” wine; this post works with the red variant. For more details, consult the reference [Cortez et al., 2009]. Due to privacy and logistic issues, only physicochemical (input) and sensory (output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).

The dataset has 1,599 different wines as rows and 12 features as columns. Furthermore, there are no null values to deal with, and all values are numeric: the input values are floats and only the output value (quality) is an integer.

Red Wine Quality Classification with Machine Learning

First, we will import the required libraries. Then we will import the data from the winequality-red.csv file using the read_csv() function. The imported data is stored in the dataframe df. A dataframe is a two-dimensional data structure that stores data in tabular form, i.e. rows and columns. Now let’s have a look at the data: head() displays the first five rows of the dataframe by default.
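A minimal sketch of these first steps (assuming the CSV file sits in the working directory):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the red wine data into a dataframe
# (some copies of this file are semicolon-separated; pass sep=';' if needed)
df = pd.read_csv('winequality-red.csv')

# First five rows of the dataframe
df.head()
```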

shape tells us the number of rows and columns present in the dataset. describe() shows statistical details such as percentiles, mean, and standard deviation for each column of the dataframe.

The info() function is used to get a concise summary of the dataframe.
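In code, these three checks might look like:

```python
print(df.shape)   # (rows, columns); expected (1599, 12)
df.describe()     # count, mean, std, min, percentiles, max for each column
df.info()         # column names, non-null counts, and dtypes
```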

Scikit-learn’s algorithms generally cannot handle missing data, so we’ll look at the columns to see whether any attributes contain missing values. The method isnull().any() tells us whether any missing values are present.
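For example:

```python
# True for any column with at least one missing value; all False here
df.isnull().any()
```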

value_counts() returns a Series containing counts of unique values, and dtypes returns the data type of each attribute.
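On this dataset that looks like:

```python
# How many wines received each quality score
df['quality'].value_counts()

# Data type of each attribute: float64 inputs, int64 target
df.dtypes
```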

Now we will visualize the data to understand the relationships between the numerical features.
I have used both the Python matplotlib and seaborn libraries to visualize the data.

A countplot is kind of like a histogram or a bar graph for a categorical variable. It simply shows the number of occurrences of items in each category.
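For instance, a countplot of the quality column:

```python
sns.countplot(x='quality', data=df)
plt.show()
```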

Now, our target variable is quality. In this dataset the scores take six distinct values on a scale of 1–10, which is not convenient for consumers trying to tell which wine is good and which is not. So we will collapse the values into two final categories: good or bad.
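One common way to do this is pd.cut with a cutoff around 6.5 (the exact threshold here is an assumption, not necessarily the one used in the original notebook):

```python
# Split quality into two bins: 'bad' for scores in (2, 6.5],
# 'good' for scores in (6.5, 8]  (the 6.5 cutoff is an assumption)
bins = (2, 6.5, 8)
labels = ['bad', 'good']
df['quality'] = pd.cut(df['quality'], bins=bins, labels=labels)
df['quality'].value_counts()
```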

Now we will use LabelEncoder from the sklearn.preprocessing library. It involves converting each value in a column to a number.

Scikit-learn’s algorithms also generally cannot handle object or string values; everything has to be a float or an integer, which is exactly what LabelEncoder provides.
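A sketch of the encoding step (column and variable names follow the earlier snippets):

```python
from sklearn.preprocessing import LabelEncoder

# Map the string labels 'bad'/'good' to the integers 0/1
label_encoder = LabelEncoder()
df['quality'] = label_encoder.fit_transform(df['quality'])
```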

Now we will see how these features are correlated with each other using a correlation heatmap from the seaborn library.

It combines the visualization of a heatmap and the information of the correlation matrix in a visually appealing way.
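For example:

```python
plt.figure(figsize=(10, 8))
sns.heatmap(df.corr(), annot=True, fmt='.2f', cmap='coolwarm')
plt.show()
```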

Here, ‘X’ is my input variable, which contains the attributes required for training the model, whereas ‘y’ is my target variable, the value we want to predict.
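In code:

```python
# All physicochemical attributes as inputs; encoded quality as the target
X = df.drop('quality', axis=1)
y = df['quality']
```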

For training my model I have imported train_test_split from sklearn.model_selection and have used GaussianNB.

The test_size which I have given is 0.2 i.e. 20%. Generally the test_size varies from 20% to 30% and the rest 70% to 80% of the data is used for training the model.
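A sketch of the split and the first model (the random_state value is an arbitrary assumption):

```python
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Hold out 20% of the data for testing (random_state=42 is an arbitrary choice)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

gnb = GaussianNB()
gnb.fit(X_train, y_train)
print(accuracy_score(y_test, gnb.predict(X_test)))
```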

Using Gaussian Naive Bayes I got an accuracy of 84%. Then I tried training the model with the Decision Tree Classifier.

A decision tree is a simple representation for classifying examples. It is a supervised machine learning algorithm in which the data is continuously split according to a certain parameter.

A decision tree consists of:

  1. Nodes: test for the value of a certain attribute.
  2. Edges/branches: correspond to the outcome of a test and connect to the next node or leaf.
  3. Leaf nodes: terminal nodes that predict the outcome (they represent class labels or class distributions).

While building a decision tree, at each node we ask a different type of question, and based on that question we calculate the corresponding information gain.
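As a sketch, entropy-based information gain can be computed like this (note that scikit-learn’s DecisionTreeClassifier uses Gini impurity by default, so this formula is illustrative):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of an array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, children):
    """Entropy of the parent node minus the weighted entropy of its children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted
```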

Tree models where the target variable can take a discrete set of values are called classification trees.
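A minimal sketch of this step, reusing the split from before (the random_state value is an assumption):

```python
from sklearn.tree import DecisionTreeClassifier

dtc = DecisionTreeClassifier(random_state=42)
dtc.fit(X_train, y_train)
print(accuracy_score(y_test, dtc.predict(X_test)))
```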

Using the Decision Tree Classifier I got an accuracy of 87.8%. Then I tried training the model with the Random Forest Classifier.

Random forest is a supervised learning algorithm that can be used for both classification and regression. It is also among the most flexible and easy-to-use algorithms. A forest is comprised of trees, and it is said that the more trees it has, the more robust the forest is.

Random forest creates decision trees on randomly selected data samples, gets a prediction from each tree, and selects the best solution by means of voting. It also provides a pretty good indicator of feature importance.

max_depth represents the maximum depth of each tree in the forest. The deeper the tree, the more splits it has and the more information it captures about the data.

random_state, as the name suggests, initializes the internal random number generator: in train_test_split it decides how the data is split into train and test indices, and in the forest it controls the random sampling of rows and features.
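Putting it together, a sketch of the forest training (the parameter values here are illustrative, not necessarily those from the original notebook):

```python
from sklearn.ensemble import RandomForestClassifier

# n_estimators, max_depth, and random_state values are illustrative
rfc = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
rfc.fit(X_train, y_train)
print(accuracy_score(y_test, rfc.predict(X_test)))

# Feature importances, highest first
importances = pd.Series(rfc.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```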

Using the Random Forest Classifier I got an accuracy of 90%.

Thanks for visiting…

You can take a closer look at the Jupyter notebook in my GitHub repository.
