Iris Classification

Pranav Kumar
Feb 3, 2021


The Iris flower data is a multivariate data set introduced by the British statistician and biologist Ronald Fisher in his 1936 paper “The use of multiple measurements in taxonomic problems” as an example of linear discriminant analysis. It is sometimes called Anderson’s Iris data set because Edgar Anderson collected the data to quantify the morphologic variation of Iris flowers of three related species. Two of the three species were collected in the Gaspé Peninsula “all from the same pasture, and picked on the same day and measured at the same time by the same person with the same apparatus”.

The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimetres.

Iris Classification with Machine Learning

First, we will import the required libraries. Then we will read the data from the iris.csv file using the read_csv() function, supplying column names through its names parameter. The imported data is stored in the DataFrame df. A DataFrame is a two-dimensional data structure that stores data in tabular form, i.e. rows and columns. Now let's have a look at the data: head() displays the first five rows of a DataFrame by default.
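A minimal sketch of this loading step (the iris.csv path and the column names are my assumptions, since the raw file usually ships without a header row; the commented read_csv() line mirrors the workflow above, while the self-contained version builds the same DataFrame from scikit-learn's bundled copy of the data):

```python
import pandas as pd
from sklearn.datasets import load_iris

# Assumed column names for iris.csv (the file has no header row)
names = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]
# df = pd.read_csv("iris.csv", names=names)  # as described above

# Self-contained equivalent using scikit-learn's bundled copy of the data
iris = load_iris()
df = pd.DataFrame(iris.data, columns=names[:4])
df["species"] = iris.target_names[iris.target]

print(df.head())  # first five rows by default
```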

shape tells us the number of rows and columns in the dataset. describe() shows statistical details of a DataFrame, such as the count, mean, standard deviation and percentiles of each numeric column.
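In code, these two inspection calls look like this (the loading step from scikit-learn's copy of the data is repeated so the snippet runs on its own):

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
names = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
df = pd.DataFrame(iris.data, columns=names)
df["species"] = iris.target_names[iris.target]

print(df.shape)       # (rows, columns)
print(df.describe())  # count, mean, std, min, percentiles, max per numeric column
```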

columns returns the column labels. value_counts() returns a Series containing the counts of unique values.
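For example (again rebuilding the DataFrame from scikit-learn's copy so the snippet is self-contained):

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=["sepal_length", "sepal_width", "petal_length", "petal_width"])
df["species"] = iris.target_names[iris.target]

print(df.columns.tolist())          # the column labels
print(df["species"].value_counts()) # 50 samples of each species
```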

Now we will plot and visualize our data to understand the relationships between the numerical features.
I have used both the Python matplotlib and seaborn libraries to visualize the data.

A scatter plot is very useful when we are analyzing the relationship between two features, one on the x axis and one on the y axis.

sepal_length v/s sepal_width
petal_length v/s petal_width
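The two scatter plots above could be produced with matplotlib along these lines (a sketch; the figure size and colors per species are my choices, not from the article):

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=["sepal_length", "sepal_width", "petal_length", "petal_width"])
df["species"] = iris.target_names[iris.target]

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
# One scatter call per species so each gets its own color and legend entry
for species, group in df.groupby("species"):
    axes[0].scatter(group["sepal_length"], group["sepal_width"], label=species, alpha=0.7)
    axes[1].scatter(group["petal_length"], group["petal_width"], label=species, alpha=0.7)
axes[0].set(xlabel="sepal_length", ylabel="sepal_width", title="sepal_length v/s sepal_width")
axes[1].set(xlabel="petal_length", ylabel="petal_width", title="petal_length v/s petal_width")
axes[0].legend()
plt.show()
```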

Now, we will plot histograms for all four features.
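pandas can draw all four histograms in one call (a sketch; the bin count and figure size are assumptions):

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
features = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
df = pd.DataFrame(iris.data, columns=features)

# DataFrame.hist() draws one histogram per numeric column on a grid
axs = df[features].hist(figsize=(8, 6), bins=15)
plt.show()
```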

Now let’s visualize the data with violin plots of each input variable against the output variable, Species.

subplots() creates a figure and a grid of subplots with a single call, while providing reasonable control over how the individual plots are created.

Violin Plot is used to visualize the distribution of the data and its probability density. This chart is a combination of a Box Plot and a Density Plot that is rotated and placed on each side, to show the distribution shape of the data.

The thinner part denotes that there is less density whereas the fatter part conveys higher density.
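The violin plots described above can be sketched with seaborn on a 2×2 subplot grid (the grid shape and figure size are my assumptions):

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.datasets import load_iris

iris = load_iris()
features = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
df = pd.DataFrame(iris.data, columns=features)
df["species"] = iris.target_names[iris.target]

# One violin plot per input feature, split by species
fig, axes = plt.subplots(2, 2, figsize=(10, 8))
for ax, feature in zip(axes.flat, features):
    sns.violinplot(x="species", y=feature, data=df, ax=ax)
plt.tight_layout()
plt.show()
```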

Now we will see how these features are correlated with each other using a correlation heatmap from the seaborn library. We can see that the Sepal Length and Sepal Width features are only slightly correlated with each other.

It combines the visualization of a heatmap and the information of the correlation matrix in a visually appealing way.
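A sketch of the heatmap step (the colormap and annot=True are my choices; the correlation matrix is computed on the four numeric columns only, since corr() expects numeric data):

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.datasets import load_iris

iris = load_iris()
features = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
df = pd.DataFrame(iris.data, columns=features)

corr = df[features].corr()               # 4x4 Pearson correlation matrix
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.show()
```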

Now, we will use LabelEncoder from the sklearn.preprocessing module. It converts each value in a column to a number.

Scikit-learn’s algorithms generally cannot work with string values directly. The input has to be numeric, either float or integer, which is what LabelEncoder provides.
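The encoding step might look like this (a sketch; applying it in place to the species column matches the workflow described above):

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.preprocessing import LabelEncoder

iris = load_iris()
df = pd.DataFrame(iris.data, columns=["sepal_length", "sepal_width", "petal_length", "petal_width"])
df["species"] = iris.target_names[iris.target]

le = LabelEncoder()
df["species"] = le.fit_transform(df["species"])

print(le.classes_)             # original string labels, in encoded order
print(df["species"].unique())  # the integer codes that replaced them
```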

Here, ‘X’ holds the input features required for training the model, whereas ‘y’ is the target variable we want to predict.
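Splitting the DataFrame into features and target is a one-liner each (a sketch, reusing the label-encoded species column from the previous step):

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.preprocessing import LabelEncoder

iris = load_iris()
df = pd.DataFrame(iris.data, columns=["sepal_length", "sepal_width", "petal_length", "petal_width"])
df["species"] = LabelEncoder().fit_transform(iris.target_names[iris.target])

X = df.drop("species", axis=1)  # the four measurement columns
y = df["species"]               # the encoded species labels
```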

For training my model, I have imported train_test_split from sklearn.model_selection and have used Logistic Regression.

The test_size which I have given is 0.2, i.e. 20%. Generally the test size varies from 20% to 30%, and the remaining 70% to 80% of the data is used for training the model.
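Putting the whole training step together (a sketch: random_state=0 and max_iter=200 are my assumptions, added only to make the split reproducible and let the solver converge; the article does not state them):

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = load_iris()
X = pd.DataFrame(iris.data, columns=["sepal_length", "sepal_width", "petal_length", "petal_width"])
y = iris.target  # already label-encoded as 0, 1, 2

# 80% of the rows go to training, 20% to testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```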

Thanks for visiting…

You can take a closer look at the Jupyter notebook in my GitHub repository.
