Mushroom Classification

Pranav Kumar
4 min read · Feb 7, 2021

Choose your mushroom wisely

Image Source: Google Images

Introduction

  • A mushroom is the fleshy fruiting body of certain fungi, arising from a mass of mycelium buried in the substratum. Most mushrooms belong to the sub-division Basidiomycotina, and a few to Ascomycotina, of the kingdom Fungi.
  • It is reported that there are about 50,000 known species of fungi, of which about 10,000 are considered edible. Of these, about one hundred and eighty mushrooms can be tried for artificial cultivation, and seventy are widely accepted as food.

Dataset

This dataset includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota family, drawn from The Audubon Society Field Guide to North American Mushrooms (1981). Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended. The latter class was combined with the poisonous one.

Mushroom Classification with Machine Learning

First, we import the required libraries. Then we read the data from the mushrooms.csv file with the read_csv() function into a dataframe, df. A dataframe is a two-dimensional data structure that stores data in tabular form, i.e. rows and columns. Now we can have a look at the data: head() displays the first five rows of the dataframe by default.
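A minimal sketch of this loading step. Since the CSV is not included here, a tiny hand-made frame stands in for mushrooms.csv — the five rows below are made up for illustration, using the dataset's single-letter coding:

```python
import pandas as pd

# In the notebook: df = pd.read_csv("mushrooms.csv")
# Stand-in frame with the same style of single-letter codes
# (rows are illustrative, not real dataset rows):
df = pd.DataFrame({
    "class":     ["p", "e", "e", "p", "e"],   # p = poisonous, e = edible
    "cap-shape": ["x", "x", "b", "x", "x"],
    "odor":      ["p", "a", "l", "p", "n"],
})
print(df.head())   # first five rows of the dataframe
```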

shape tells us the number of rows and columns in the dataset. describe() shows some statistical details of the dataframe, such as percentiles, mean, and std for numeric columns; for categorical columns it reports the count, number of unique values, top value, and its frequency.

The info() function is used to get a concise summary of the dataframe.
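These three inspection calls can be sketched together; the frame below is the same made-up stand-in for mushrooms.csv:

```python
import pandas as pd

# Stand-in frame (illustrative rows in the dataset's single-letter coding):
df = pd.DataFrame({
    "class":     ["p", "e", "e", "p", "e"],
    "cap-shape": ["x", "x", "b", "x", "x"],
    "odor":      ["p", "a", "l", "p", "n"],
})

print(df.shape)       # (number of rows, number of columns)
print(df.describe())  # for object columns: count, unique, top, freq
df.info()             # column dtypes and non-null counts
```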

Scikit-learn’s algorithms generally cannot handle missing data, so we check the columns to see whether any attributes contain missing values.
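One common way to run that check (note that in the CSV as distributed by UCI, missing stalk-root values are coded as the string "?", which isnull() will not flag):

```python
import pandas as pd

# Stand-in frame (illustrative rows in the dataset's single-letter coding):
df = pd.DataFrame({
    "class":     ["p", "e", "e", "p", "e"],
    "cap-shape": ["x", "x", "b", "x", "x"],
    "odor":      ["p", "a", "l", "p", "n"],
})

# Count true NaN values per column
print(df.isnull().sum())
```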

value_counts() returns a Series containing the counts of unique values.
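For example, counting the class labels in the stand-in frame:

```python
import pandas as pd

df = pd.DataFrame({
    "class":     ["p", "e", "e", "p", "e"],
    "cap-shape": ["x", "x", "b", "x", "x"],
    "odor":      ["p", "a", "l", "p", "n"],
})

# Counts of each unique class label, sorted by frequency
vc = df["class"].value_counts()
print(vc)
```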

Now we will use LabelEncoder from the sklearn.preprocessing module. It converts each distinct value in a column to an integer.

Scikit-learn’s algorithms also generally cannot work with object (string) values; features have to be floats or integers, which is exactly what LabelEncoder provides.
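A sketch of that encoding step, applied column by column to the stand-in frame (LabelEncoder maps each column's sorted unique strings to 0..n-1):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "class":     ["p", "e", "e", "p", "e"],
    "cap-shape": ["x", "x", "b", "x", "x"],
    "odor":      ["p", "a", "l", "p", "n"],
})

le = LabelEncoder()
for col in df.columns:
    # Each column's strings become integer codes, e.g. "e" -> 0, "p" -> 1
    df[col] = le.fit_transform(df[col])
print(df.head())
```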

Now we will plot and visualize our data to understand the relationships between the features.
I have used both the Python matplotlib and seaborn libraries to visualize the data.

A countplot is like a histogram or bar graph for a categorical variable. It simply shows the number of occurrences of each value of a given category.
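A minimal sketch of a class-balance countplot, again on the made-up stand-in data (the Agg backend and the output filename are assumptions so it runs headlessly):

```python
import matplotlib
matplotlib.use("Agg")   # non-interactive backend, no display needed
import pandas as pd
import seaborn as sns

df = pd.DataFrame({"class": ["p", "e", "e", "p", "e"]})

# One bar per unique class value, with its occurrence count
ax = sns.countplot(x="class", data=df)
ax.figure.savefig("class_counts.png")
```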

Now we will see how these features are correlated with each other using a correlation heatmap from the seaborn library.

It combines the visualization of a heatmap and the information of the correlation matrix in a visually appealing way.
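A sketch of that heatmap on the label-encoded stand-in frame (encoding first, since corr() needs numeric columns; backend and filename are assumptions):

```python
import matplotlib
matplotlib.use("Agg")   # non-interactive backend
import pandas as pd
import seaborn as sns
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "class":     ["p", "e", "e", "p", "e"],
    "cap-shape": ["x", "x", "b", "x", "x"],
    "odor":      ["p", "a", "l", "p", "n"],
})
le = LabelEncoder()
for col in df.columns:
    df[col] = le.fit_transform(df[col])

corr = df.corr()                    # pairwise correlation matrix
ax = sns.heatmap(corr, annot=True)  # heatmap annotated with the values
ax.figure.savefig("correlation.png")
```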

Here, ‘X’ is my input variable, which contains the attributes required for training the model, whereas ‘y’ is my target (desired) variable.
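That split can be sketched as dropping the class column for the features and keeping it as the target:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "class":     ["p", "e", "e", "p", "e"],
    "cap-shape": ["x", "x", "b", "x", "x"],
    "odor":      ["p", "a", "l", "p", "n"],
})
le = LabelEncoder()
for col in df.columns:
    df[col] = le.fit_transform(df[col])

X = df.drop("class", axis=1)   # input attributes
y = df["class"]                # target: 0 = edible, 1 = poisonous
```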

For training my model I imported train_test_split from sklearn’s model_selection module and used GaussianNB.

The test_size I have given is 0.2, i.e. 20%. Generally the test_size varies from 20% to 30%, and the remaining 70% to 80% of the data is used for training the model.
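A minimal end-to-end sketch of this training step on the stand-in data (random_state is an assumption for reproducibility; with only five made-up rows the resulting accuracy is not meaningful):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

df = pd.DataFrame({
    "class":     ["p", "e", "e", "p", "e"],
    "cap-shape": ["x", "x", "b", "x", "x"],
    "odor":      ["p", "a", "l", "p", "n"],
})
le = LabelEncoder()
for col in df.columns:
    df[col] = le.fit_transform(df[col])

X = df.drop("class", axis=1)
y = df["class"]
# 80/20 train/test split, as in the article
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = GaussianNB()
model.fit(X_train, y_train)
acc = accuracy_score(y_test, model.predict(X_test))
print(f"test accuracy: {acc:.3f}")
```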

Using Gaussian Naive Bayes I got an accuracy of 91.8%. Then I tried training the model with a Decision Tree Classifier.

A Decision Tree is a simple representation for classifying examples. It is a supervised machine learning method in which the data is repeatedly split according to a certain parameter.

A Decision Tree consists of:

  1. Nodes: test for the value of a certain attribute.
  2. Edges/branches: correspond to the outcome of a test and connect to the next node or leaf.
  3. Leaf nodes: terminal nodes that predict the outcome (they represent class labels or class distributions).

While building the decision tree, at each node we ask a different type of question. Based on the question asked, we calculate the corresponding information gain.

Tree models where the target variable can take a discrete set of values are called classification trees.
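The information-gain computation at a node can be worked through as follows; the split counts below are made up purely for illustration:

```python
import math

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    total = len(labels)
    counts = {label: labels.count(label) for label in set(labels)}
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical node: 5 poisonous + 5 edible samples, and a question
# that splits them into two child nodes.
parent = ["p"] * 5 + ["e"] * 5
left   = ["p"] * 4 + ["e"]        # 4 poisonous, 1 edible
right  = ["p"] + ["e"] * 4        # 1 poisonous, 4 edible

# Information gain = parent entropy minus weighted child entropies
gain = entropy(parent) \
     - (len(left) / len(parent)) * entropy(left) \
     - (len(right) / len(parent)) * entropy(right)
print(round(gain, 3))   # → 0.278
```

A split that separated the classes perfectly would have a gain of 1.0 bit here; a useless split would have a gain of 0.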

Using the Decision Tree Classifier I got a perfect score, i.e. an accuracy of 100%.
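A sketch of this second experiment, swapping GaussianNB for a tree on the same stand-in data (criterion="entropy" is an assumption, chosen to match the information-gain discussion above; the tiny made-up sample says nothing about the real 100% result):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

df = pd.DataFrame({
    "class":     ["p", "e", "e", "p", "e"],
    "cap-shape": ["x", "x", "b", "x", "x"],
    "odor":      ["p", "a", "l", "p", "n"],
})
le = LabelEncoder()
for col in df.columns:
    df[col] = le.fit_transform(df[col])

X = df.drop("class", axis=1)
y = df["class"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# criterion="entropy" splits on information gain, as described above
tree = DecisionTreeClassifier(criterion="entropy", random_state=42)
tree.fit(X_train, y_train)
acc = accuracy_score(y_test, tree.predict(X_test))
print(f"test accuracy: {acc:.3f}")
```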

Thanks for visiting…

You can take a closer look at the Jupyter notebook in my GitHub repository.
