Iris dataset dissected

The Iris dataset is often considered the "hello world" of data science for newcomers. It is simple and one of the most widely studied datasets. The Iris dataset poses a multi-class classification problem: given the features of an iris flower, determine which species it belongs to. The features provided in the dataset are sepal length, sepal width, petal length, and petal width, while the three iris species considered are Setosa, Versicolor, and Virginica.
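As a sketch of getting started, the dataset can be loaded from scikit-learn's bundled copy (which has 150 samples; counts in a downloaded CSV may differ slightly):

```python
from sklearn.datasets import load_iris
import pandas as pd

# Load the bundled Iris dataset into a pandas DataFrame
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["class"] = [iris.target_names[t] for t in iris.target]

print(df.head())
```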

A snapshot of the dataset can be seen as follows:

Index  Sepal length  Sepal width  Petal length  Petal width  Class
0      5.1           3.5          1.4           0.2          Iris-setosa
1      4.9           3.0          1.4           0.2          Iris-setosa
2      6.1           2.8          4.7           1.2          Iris-versicolor
3      5.0           2.3          3.3           1.0          Iris-versicolor
4      6.0           3.0          4.8           1.8          Iris-virginica

From the dataset, it is clear that it is quite difficult to distinguish the iris species just by looking at the sepal and petal measurements. But this task is quite achievable with machine learning. Formally, we need a system that can identify the species of an iris flower when given its sepal and petal measurements.

The summary statistics of the features can be tabulated as follows:

       sepal length  sepal width  petal length  petal width
count    152.000000   152.000000    152.000000   152.000000
mean       5.830921     3.057895      3.728289     1.184868
std        0.829604     0.432998      1.772469     0.767533
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.500000     0.300000
50%        5.800000     3.000000      4.300000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000

The correlations between the features are:

              sepal length  sepal width  petal length  petal width
sepal length      1.000000    -0.118029      0.874073     0.821341
sepal width      -0.118029     1.000000     -0.425485    -0.363006
petal length      0.874073    -0.425485      1.000000     0.963589
petal width       0.821341    -0.363006      0.963589     1.000000
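The correlation matrix comes straight from pandas' `corr` method; a self-contained sketch:

```python
from sklearn.datasets import load_iris
import pandas as pd

# Pairwise Pearson correlations between the four features
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
print(df.corr())
```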

From these figures, it is clear that sepal length and sepal width are weakly inversely related. In fact, sepal width is negatively correlated with every other feature, while the remaining features are positively correlated with one another. This tells us how the features relate to each other, but not how they relate to the classes. The relationship of the features with the species can be seen from histograms of each feature, plotted per species and for all species together.
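The per-species histograms described above can be sketched with matplotlib (the exact plot layout the article used is not stated; a 3×4 grid of species × feature is one reasonable choice):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, renders to file
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["class"] = [iris.target_names[t] for t in iris.target]

# One row of histograms per species, one column per feature
fig, axes = plt.subplots(3, 4, figsize=(12, 8), sharex="col")
for row, species in enumerate(iris.target_names):
    subset = df[df["class"] == species]
    for col, feature in enumerate(iris.feature_names):
        axes[row, col].hist(subset[feature], bins=10)
        axes[row, col].set_title(f"{species}: {feature}", fontsize=8)
fig.tight_layout()
fig.savefig("iris_histograms.png")
```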

Thus it is clear that virginica and versicolor cluster quite close together in the dataset, while setosa can be identified directly from the graph. But a computer cannot classify from the graph alone: a machine needs some model of these correlations to classify the flowers, so we use supervised classification algorithms. K-fold cross-validation with 10 splits is used, with 20% of the data held out as a validation set. The algorithms considered for the study, along with their results, are:

Logistic Regression scoring: mean: 0.927273 (std deviation: 0.218182)
K Neighbors scoring: mean: 0.927273 (std deviation: 0.218182)
Decision Tree Classifier scoring: mean: 0.909091 (std deviation: 0.215130)
Gaussian Naive Bayes scoring: mean: 0.909091 (std deviation: 0.215130)
Support Vector Machine scoring: mean: 0.927273 (std deviation: 0.218182)
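The comparison above can be sketched as follows (the article's random seed is not stated, so the exact numbers will differ; `random_state=7` here is an arbitrary choice):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
# Hold out 20% of the data as a validation set
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=7)

models = {
    "Logistic Regression": LogisticRegression(max_iter=200),
    "K Neighbors": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(),
    "Gaussian Naive Bayes": GaussianNB(),
    "Support Vector Machine": SVC(),
}

# 10-fold cross-validation on the training portion
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
results = {}
for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=kfold,
                             scoring="accuracy")
    results[name] = scores
    print(f"{name}: mean: {scores.mean():.6f} (std: {scores.std():.6f})")
```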

Neural networks are not used here, since the dataset is too small to train one effectively.

From the above results, we choose SVM for further analysis. Of the many parameters the SVM algorithm accepts, we tune two:

  1. C: regularization parameter. The values considered are [0.001, 0.01, 0.1, 1, 10]
  2. Gamma: kernel coefficient. The values considered are [0.001, 0.01, 0.1, 1]

The best parameters are chosen using the GridSearchCV function of scikit-learn. It returned best values for C and Gamma of 1 and 0.1 respectively, with an accuracy of 0.9736842105263158.
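The grid search can be sketched like this (the article's exact train/test split and CV settings are not stated; the 25% hold-out and `random_state=0` below are assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Grid over the two tuned parameters from the article
param_grid = {
    "C": [0.001, 0.01, 0.1, 1, 10],
    "gamma": [0.001, 0.01, 0.1, 1],
}
grid = GridSearchCV(SVC(), param_grid, cv=5)
grid.fit(X_train, y_train)

print("Best parameters:", grid.best_params_)
print("Validation accuracy:", grid.score(X_val, y_val))
```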

Thus we now have a best SVM model with an accuracy of about 97%.


You can find my code here.

About the author: sagarjain2030
