## Introduction

In this article, I'll demonstrate a simple framework for working on machine learning projects. As you may know, machine learning is about extracting knowledge from data; therefore, most machine learning projects depend on a collection of data - called a dataset - from a specific domain in which we are investigating a certain problem, with the goal of building a predictive model suitable for it. Building such a model follows a certain set of steps. In the following sections, I will give a practical, simplified walkthrough of the main steps for performing statistical learning, i.e., building a machine learning model.

## Background

The example project is implemented in the Python programming language inside a Jupyter Notebook (IPython), and relies on the NumPy, Pandas, and Scikit-Learn packages.

## Problem Statement

To build better models, you should clearly define the problem that you are trying to solve, including the strategy you will use to achieve the desired solution. I’ve chosen a simple application, Iris species classification, in which we will create a simple machine learning model that can distinguish the species of iris flowers from a few measurements associated with each flower: the petals’ length and width as well as the sepals’ length and width, all measured in centimetres. We will rely on a dataset of measurements previously identified by experts, in which the flowers have been classified into the species *setosa*, *versicolor*, or *virginica*. Our mission is to build a model that can learn from these measurements, so that we can predict the species of a new iris.

## Algorithm Selection

Depending on the nature and characteristics of the problem under investigation, we need to select algorithms and techniques suitable for solving it. Since we have measurements for which we know the correct species of iris, this is a **Supervised Learning problem**. In this problem, we want to predict one of several options (the species of iris). This is an example of a **classification problem**. The possible outputs (different species of irises) are called classes. Every iris in the data set belongs to one of three classes, so this problem is a three-class classification problem. The desired output for a single data point (an iris) is the species of this flower. For a particular data point, the species it belongs to is called its label or class.

## Project Preparation

To begin working with our project’s data, we’ll first import the functionality we need, such as the required Python libraries, and set up our environment so that we can accomplish our mission and load our dataset successfully:

```
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Pretty display of plots inside Jupyter notebooks
%matplotlib inline
# Allow the use of display() for pandas DataFrames
from IPython.display import display
# Ignore warnings from loaded modules during runtime execution, if needed
import warnings
warnings.filterwarnings("ignore")
```

Then, we start loading our dataset into a pandas `DataFrame`. The data that will be used here is the `Iris` dataset, a classical dataset in machine learning and statistics. It is included in `scikit-learn` within the `datasets` module.

```
from sklearn.datasets import load_iris
iris = load_iris()
# Printing feature names
print('features: %s' % iris['feature_names'])
# Printing the species of iris
print('target categories: %s' % iris['target_names'])
# Iris data shape
print("Shape of data: {}".format(iris['data'].shape))
# Converting to a pandas DataFrame
full_data = pd.DataFrame(iris['data'], columns=iris['feature_names'])
full_data['Classification'] = iris['target']
full_data['Classification'] = full_data['Classification'].apply(lambda x: iris['target_names'][x])
# Note: Here we are using the built-in dataset that comes with the scikit-learn library,
# but in practice the dataset often comes as a CSV file, so we could use code
# that looks like the following commented lines:
# myfile = 'MyFileName.csv'
# full_data = pd.read_csv(myfile, thousands=',', delimiter=';', encoding='latin1', na_values="n/a")
```

The output will be as follows:

**features**: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
**target categories**: ['setosa' 'versicolor' 'virginica']
**Shape of data**: (150, 4)

The iris data contains the numeric measurements of sepal length, sepal width, petal length, and petal width, and is stored as a NumPy array, so we have converted it to a pandas `DataFrame`.



## Data Exploration

Before building a machine learning model, we must first analyze the dataset we are using for the problem, and we should always assess it for common issues that may require preprocessing. So data exploration is a necessary step before proceeding with our model implementation. This is done by showing a quick sample of the data, describing the type of data, knowing its shape or dimensions, and, if needed, computing basic statistics and information about the loaded dataset, as well as exploring the input features and any abnormalities or interesting qualities of the data that may need to be addressed. Data exploration provides you with a deeper understanding of your datasets, including dataset schemas, value distributions, missing values, and the cardinality of categorical features.

To start exploring our dataset, which we loaded from the `iris` object returned by `load_iris()` and stored in `full_data`, we could display the first few entries for examination using the pandas DataFrame `head()` function.

`display(full_data.head())`

To display more information about the structure of the data:

`full_data.info()` or `full_data.dtypes`

To check if there are any `null` values present in the dataset:

`full_data.isnull().sum()`

Or use the following function to get more details about these `null` values:

```
def NullValues(theData):
    """ Returns, per column, the number and percentage of null values and the percentage of zeros """
    null_data = pd.DataFrame(
        {'columns': theData.columns,
         'Sum': theData.isnull().sum(),
         'Percentage': theData.isnull().sum() * 100 / len(theData),
         'zeros Percentage': theData.isin([0]).sum() * 100 / len(theData)
         }
    )
    return null_data
```
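For example, we could run it on our `DataFrame` as a quick check (for this built-in dataset, we would expect no missing values):

```
# Display null counts, null percentages and zero percentages for every column
display(NullValues(full_data))
```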

From the above sample of our dataset, we can see that the dataset contains measurements for 150 different flowers. Individual items are called **examples**, instances, or samples in machine learning, and their properties (the five columns) are called **features** (here, four features plus one column, `Classification`, holding the target or class of each instance). The shape of the data array is the number of samples times the number of features. This is a convention in `scikit-learn`, and our data will always be assumed to be in this shape.

Given below is a detailed explanation of the features contained in our dataset (the class of each flower is held in the `Classification` column):

- **sepal length**: the length of the sepal of the specified iris flower, in centimetres
- **sepal width**: the width of the sepal of the specified iris flower, in centimetres
- **petal length**: the length of the petal of the specified iris flower, in centimetres
- **petal width**: the width of the petal of the specified iris flower, in centimetres

For displaying statistics about the dataset, we could use:

`full_data.describe()`

You could compare the min and max values to see whether the scale differences are large and there is a need for scaling of the numerical features.
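As a minimal sketch of how such scaling might be applied if it were needed (the iris measurements are all in centimetres and on similar scales, so we do not apply it in the rest of this article), we could use scikit-learn's `MinMaxScaler`:

```
from sklearn.preprocessing import MinMaxScaler

# Illustrative scaling step: rescale each numeric feature to the [0, 1] range
scaler = MinMaxScaler()
scaled_data = pd.DataFrame(scaler.fit_transform(full_data[iris['feature_names']]),
                           columns=iris['feature_names'])
display(scaled_data.describe())
```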

Since we’re interested in the classification of iris flowers, i.e., we want to predict the class or label of each flower based on the given measurements or features, we can remove the `Classification` column from this dataset and store it in its own separate variable `Labels`. We will use these labels as our prediction targets. The code below removes `Classification` from the dataset and stores it in `Labels`.

```
Labels = full_data['Classification']
mydata = full_data.drop('Classification', axis=1)
display(mydata.head())
display(Labels.head())
```

The `Classification` feature is removed from the `DataFrame`. Note that `mydata` (the iris flowers data) and `Labels` (the labels or classifications of the iris flowers) are now paired. That means that for any iris flower `mydata.loc[i]`, its classification or label is `Labels[i]`.

To filter the input data by removing elements that do not match a provided condition, the following function takes a data list as input and returns a filtered list, as shown in the code below:

```
def filter_data(data, conditionField, conditionOperation, conditionValue):
    """
    Remove elements that do not match the condition provided.
    Takes a data list as input and returns a filtered list.
    Conditions are passed as separate parameters for the field, operation and value.
    """
    # Example conditions: ["field == 'value'", 'field < 18']
    field, op, value = conditionField, conditionOperation, conditionValue
    # Convert value into a number, or strip excess quotes if it is a string
    try:
        value = float(value)
    except:
        value = value.strip("\'\"")
    # Get booleans for filtering
    if op == ">":
        matches = data[field] > value
    elif op == "<":
        matches = data[field] < value
    elif op == ">=":
        matches = data[field] >= value
    elif op == "<=":
        matches = data[field] <= value
    elif op == "==":
        matches = data[field] == value
    elif op == "!=":
        matches = data[field] != value
    else:  # catch invalid operation codes
        raise Exception("Invalid comparison operator. Only >, <, >=, <=, ==, != allowed.")
    # Filter the data and reset the index
    data = data[matches].reset_index(drop=True)
    return data
```

As an example, we could filter our data to obtain all flowers whose `sepal width (cm)` attribute is less than `3`, as follows:

```
filtered_data=filter_data(mydata, 'sepal width (cm)','<','3')
display(filtered_data.head())
```

## Data Visualization

Before building a machine learning model, it is a good idea to look at our data to better understand the relationships between the various components constituting it. Inspecting our data is a good way to find abnormalities and peculiarities. Maybe some of your irises were measured using inches and not centimetres, for example. In the real world, inconsistencies in the data and unexpected measurements are very common. One of the best ways to inspect data is to visualize it. We could use the Python **matplotlib** library or the **seaborn** library for plotting and visualizing the data using different types of plots, such as *bar plots*, *box plots*, *scatter plots*, etc.

The following `BarGraph` function displays a bar graph for any single attribute of our data:

```
def BarGraph(theData, target, attributeName):
    xValues = np.unique(target)
    yValues = []
    for label in xValues:
        # For each class label, take the index of the maximum value of the chosen attribute
        yValues.append(theData.loc[target == label, attributeName].idxmax())
    plt.bar(xValues, yValues)
    plt.xticks(xValues, xValues)
    plt.title(attributeName)
    plt.show()
```

The code below tests the above function by displaying a bar graph for `sepal length (cm)` as an example:

`BarGraph(mydata,Labels,'sepal length (cm)')`

Also, we could display multiple bar graphs together at the same time using the following function:

```
def BarGraphs(theData, target, attributeNamesList, graphsTitle='Attributes Classifications'):
    xValues = np.unique(target)
    fig, ax = plt.subplots(nrows=int(len(attributeNamesList)/2), ncols=2, figsize=(16, 8))
    k = 0
    for row in ax:
        for col in row:
            yValues = []
            for label in xValues:
                # As above, take the index of the maximum value of attribute k for each class
                yValues.append(theData.loc[target == label, attributeNamesList[k]].idxmax())
            col.set_title(attributeNamesList[k])
            # col.set(xlabel="x Label", ylabel="y Label")
            col.bar(xValues, yValues)
            k = k + 1
    fig.suptitle(graphsTitle)
    plt.show()
```

Example:

```
BarGraphs(mydata,Labels,['sepal length (cm)','sepal width (cm)',
'petal length (cm)','petal width (cm)'])
```

Visualizing value distributions across our dataset gives a better understanding of how attributes are distributed, and lets us check whether a distribution is, for example, normal or uniform. This could be done as follows:

```
def Distribution(theData, datacolumn, type='value'):
    if type == 'value':
        print("Distribution for {} ".format(datacolumn))
        theData[datacolumn].value_counts().plot(kind='bar')
    elif type == 'normal':
        attr_mean = theData[datacolumn].mean()     # Mean of the attribute values
        attr_std_dev = theData[datacolumn].std()   # Standard deviation of the attribute values
        ndist = np.random.normal(attr_mean, attr_std_dev, 100)
        norm_ax = sns.distplot(ndist, kde=False)
        plt.show()
        plt.close()
    elif type == 'uniform':
        udist = np.random.uniform(-1, 0, 1000)
        uniform_ax = sns.distplot(udist, kde=False)
        plt.show()
    elif type == 'hist':
        theData[datacolumn].hist()
```

To visualize the value distribution of the `Classification` attribute, we could write:

`Distribution(full_data, 'Classification')`

Another visualization example is to show the histogram for a certain attribute. We use the following:

`Distribution(full_data, 'sepal length (cm)', type='hist')`

`sns.distplot(full_data['sepal length (cm)'])`

It appears that the attribute `sepal length (cm)` has a roughly normal distribution.

To explore how these features are correlated to each other, we could use the **heatmap** function in the *seaborn* library. We can see that the sepal length and sepal width features are slightly correlated with each other.

```
plt.figure(figsize=(10,11))
sns.heatmap(mydata.corr(), annot=True)
plt.plot()
```

To observe relationships between features in our dataset, we could use a **scatter plot**. A *scatter plot* of the data puts one feature along the x-axis and another along the y-axis, and draws a dot for each data point. To produce a scatter plot of how our data is distributed based on the sepal length and sepal width features, we could use the code below:

```
sns.FacetGrid(full_data, hue="Classification").map(plt.scatter, "sepal length (cm)",
              "sepal width (cm)").add_legend()
plt.show()
```

To plot datasets with more than three features, we could use a **pair plot**, which looks at all possible pairs of features. If you have a small number of features, such as the four we have here, this is quite reasonable. You should keep in mind, however, that a pair plot does not show the interaction of all features at once, so some interesting aspects of the data may not be revealed when visualizing it this way. We could use the **pairplot** function in the *seaborn* library as follows:

`sns.pairplot(mydata)`

Another way to do scatter plotting is to use the `scatter_matrix` function from the `plotting` module that comes with the `pandas` library. The following code creates a scatter matrix from the DataFrame, coloring the points by class (label):

```
from pandas.plotting import scatter_matrix
colors = list()
palette = {0: "red", 1: "green", 2: "blue"}
# Map each numeric target value to a color
for c in np.nditer(iris.target):
    colors.append(palette[int(c)])
grr = scatter_matrix(mydata, alpha=0.3, figsize=(10, 10),
                     diagonal='hist', color=colors, marker='o', grid=True)
```

From the plots, we can see that the three classes seem to be relatively well separated using the sepal and petal measurements. This means that a machine learning model will likely be able to learn to separate them.

To show the density of the length and width measurements for each species, we could use a **violin plot** of all the input variables against the output variable, which is the species.

```
plt.figure(figsize=(12,10))
plt.subplot(2,2,1)
sns.violinplot(x="Classification", y="sepal length (cm)", data=full_data)
plt.subplot(2,2,2)
sns.violinplot(x="Classification", y="sepal width (cm)", data=full_data)
plt.subplot(2,2,3)
sns.violinplot(x="Classification", y="petal length (cm)", data=full_data)
plt.subplot(2,2,4)
sns.violinplot(x="Classification", y="petal width (cm)", data=full_data)
```

The **thinner** part denotes that there is **less density**, whereas the **fatter** part conveys a **higher density**.

Similarly, we may use a **box plot** to see how the *categorical feature* `Classification` is distributed against all the other input variables, and also to check for outliers:

```
plt.figure(figsize=(12,10))
plt.subplot(2,2,1)
sns.boxplot(x="Classification", y="sepal length (cm)", data=full_data)
plt.subplot(2,2,2)
sns.boxplot(x="Classification", y="sepal width (cm)", data=full_data)
plt.subplot(2,2,3)
sns.boxplot(x="Classification", y="petal length (cm)", data=full_data)
plt.subplot(2,2,4)
sns.boxplot(x="Classification", y="petal width (cm)", data=full_data)
```

To check the cardinality of categorical features:

```
def count_unique_values(theData, categorical_columns_list):
    cats = theData[categorical_columns_list]
    rValue = pd.DataFrame({'cardinality': cats.nunique()})
    return rValue
```
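As an illustrative call, applying it to the single categorical column in our full dataset reports a cardinality of 3 for `Classification` (the three iris species):

```
# 'Classification' is the only categorical column in full_data
display(count_unique_values(full_data, ['Classification']))
```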

## Splitting Dataset Into Training and Testing Data

We cannot use the same data we used to build the model to evaluate it. This is because our model can always simply remember the whole training set, and will therefore always predict the correct label for any point in the training set. This **remembering** does not indicate whether our model will generalize well, i.e., whether it will also perform well on new data. So, before using a machine learning model to predict from unseen data, we should have some way of knowing whether it actually works. Hence, we need to split the labeled data into two parts. One part of the data is used to build our machine learning model, and is called the **training data** or **training set**. The rest of the data will be used to measure how well the model works; this is called the **test data** or **test set**.

`scikit-learn` contains a function that *shuffles* the dataset and *splits* it for you: the **train_test_split** function. This function extracts **75%** of the rows in the data as the **training set**, together with the corresponding labels for this data. The remaining **25%** of the data, together with the remaining labels, is declared as the **test set**.

In `scikit-learn`, data is usually denoted with a capital X, while labels are denoted by a lowercase y. This is inspired by the standard formulation f(x)=y in mathematics, where x is the input to a function and y is the output. Following more conventions from mathematics, we use a capital X because the data is a two-dimensional array (a matrix) and a lowercase y because the target is a one-dimensional array (a vector). Let’s call `train_test_split` on our data and assign the outputs using the following code:

```
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(mydata, Labels, random_state=0)
print("X_train shape: {}".format(X_train.shape))
print("y_train shape: {}".format(y_train.shape))
print("X_test shape: {}".format(X_test.shape))
print("y_test shape: {}".format(y_test.shape))
```

Before making the split, the `train_test_split` function shuffles the dataset using a pseudorandom number generator. If we just took the last 25% of the data as a test set, all the data points would have the label 2, as the data points are sorted by label (see the output for `iris['target']` shown earlier). Using a test set containing only one of the three classes would not tell us much about how well our model generalizes, so we shuffle our data to make sure the test data contains data from all classes. To make sure that we will get the same output if we run the same function several times, we provide the pseudorandom number generator with a fixed seed using the `random_state` parameter. This makes the outcome deterministic, so this line will always have the same outcome. The outputs of the `train_test_split` function are `X_train`, `X_test`, `y_train`, and `y_test`, which mirror the types of the inputs (NumPy arrays for array input; here, pandas DataFrames and Series, since that is what we passed in). `X_train` contains 75% of the rows of the dataset, and `X_test` contains the remaining 25%.

## Build the Model

Now we can start building the actual machine learning model. There are many classification algorithms in `scikit-learn` that we could use. Here, we will use a **k-nearest neighbors classifier**, which is easy to understand. Building this model only consists of storing the training set. To make a prediction for a new data point, the algorithm finds the point in the training set that is closest to the new point. Then it assigns the label of this training point to the new data point.

The `k` in `k-nearest neighbors` signifies that instead of using only the closest neighbor to the new data point, we can consider any fixed number k of neighbors in the training set (for example, the closest three or five neighbors). Then, we can make a prediction using the majority class among these neighbors. For simplicity, we’ll use only a single neighbor.

All machine learning models in `scikit-learn` are implemented in their own classes, which are called `Estimator` classes. The **k-nearest neighbors classification algorithm** is implemented in the **KNeighborsClassifier** class in the `neighbors` module. Before we can use the model, we need to instantiate the class into an object. This is when we set any parameters of the model. The most important parameter of `KNeighborsClassifier` is the number of neighbors, which we will set to `1`:

```
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=1)
```

The `knn` object encapsulates the algorithm that will be used to build the model from the training data, as well as the algorithm to make predictions on new data points. It will also hold the information that the algorithm has extracted from the training data. In the case of **KNeighborsClassifier**, it will just store the training set. To build the model on the training set, we call the `fit` method of the `knn` object, which takes as arguments `X_train`, containing the training data, and `y_train`, containing the corresponding training labels:

`knn.fit(X_train, y_train)`

The **fit** method returns the `knn` object itself (and modifies it in place), so we get a string representation of our classifier. The representation shows us which parameters were used in creating the model. Nearly all of them are the default values, but you can also find `n_neighbors=1`, which is the parameter that we passed. Most models in `scikit-learn` have many parameters, but the majority of them are either speed optimizations or for very special use cases. You don’t have to worry about the other parameters shown in this representation. Printing a `scikit-learn` model can yield very long strings, but don’t be intimidated by these. We will not show the output of `fit` here because it doesn’t contain any new information.

We can now make predictions using this model on new data for which we might not know the correct labels. Imagine we found an iris in the wild with a sepal length of 5 cm, a sepal width of 2.9 cm, a petal length of 1 cm, and a petal width of 0.2 cm. What species of iris would this be? We can put this data into a NumPy array, again by calculating the shape — that is, the number of samples (1) multiplied by the number of features (4):

```
X_new = np.array([[5, 2.9, 1, 0.2]])
print("X_new.shape: {}".format(X_new.shape))
```

Note that we made the measurements of this single flower into a row in a two-dimensional NumPy array, as scikit-learn always expects two-dimensional arrays for the data.

To make a prediction, we call the **predict** method of the `knn` object:

```
prediction = knn.predict(X_new)
print("Prediction: {}".format(prediction))
```

Our model predicts that this new iris belongs to the class (label or species) `setosa`. But how do we know whether we can trust our model? We don’t know the correct species of this sample, which is the whole point of building the model!

## Model Evaluation

For any project, we need to clearly define the metrics or calculations we will use to measure the performance of our model or the results of our project. In other words, to measure the performance of our predictions, we need a metric to score our predictions against the true classifications of the given examples. These calculations and metrics should be justified based on the characteristics of the problem and the problem domain. In machine learning classification models, we usually use a variety of performance metrics. Among the measures commonly applied to classification problems, we could mention **Accuracy** (A), **Precision** (P), **Recall** (R), **F1-Score**, etc.

*Classification performance metrics*

## Accuracy

**Accuracy**, or the classification rate, measures how often the classifier is correct; it is defined as the fraction of predictions our model got right. The accuracy score is computed from the following formula:
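*Accuracy = (number of correct predictions) / (total number of predictions)*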

To know how *accurate* our predictions are, we will calculate the proportion of flowers where our prediction of the species is correct. The code below creates our **GetAccuracy()** function, which calculates the *accuracy score*:

```
def GetAccuracy(datasetClasses, PredictedClasses):
    """ Returns the accuracy score for the input dataset classes/labels and predicted classes/labels """
    # Ensure that the number of predictions matches the number of classes/labels
    if len(datasetClasses) == len(PredictedClasses):
        # Calculate and return the accuracy as a percentage
        return "Predictions have an accuracy of {:.2f}%.".format(
            (datasetClasses == PredictedClasses).mean() * 100)
    else:
        return "Number of predictions does not match number of Labels/Classes!"
```

**Example:** For the first five flowers, if our predictions are the list `predictions = ['setosa','versicolor','versicolor','setosa','setosa']`, then we would expect the accuracy of our predictions to be as follows:

```
# Test the 'GetAccuracy' function
predictions = ['setosa','versicolor','versicolor','setosa','setosa']
print(GetAccuracy(Labels[:5], predictions))
```

Predictions have an accuracy of 60.00%.

Also, as a second example, we could assume a prediction that all flowers in our dataset are `setosa`. The code below will always predict that every flower in our dataset is `setosa`.

```
def predictions_example(data):
    predictions = []
    for flower in data:
        # Predict that this flower is 'setosa'
        predictions.append('setosa')
    # Return our predictions
    return pd.Series(predictions)

# Make the predictions
predictions = predictions_example(Labels)
```

How accurate would a prediction be that all of the flowers are of the species `setosa`? The code below shows the accuracy of this prediction:

```
print(GetAccuracy(Labels, predictions))
```

Predictions have an accuracy of 33.33%.

## Confusion Matrix

In the case of statistical **classification**, we can use the so-called **confusion matrix**, also known as an **error matrix**. A confusion matrix is a summary of prediction results on a classification problem. The numbers of correct and incorrect predictions are summarized with count values and broken down by each class. This is the key to the confusion matrix: it shows the ways in which your classification model is confused when it makes predictions. It gives us insight not only into the errors being made by a classifier but, more importantly, into the types of errors that are being made.

A **confusion matrix** is constructed as a table that is often used to describe the performance of a **classification** model (or “classifier”) on a set of test data for which the true/actual values are known. Its entries are defined in terms of the **positive class**, i.e., the observations that are **positive** (for example, a chest X-ray image indicating the presence of pneumonia), and the **negative class**, i.e., the observations that are **not positive** (for example, a chest X-ray image indicating the absence of pneumonia, i.e., normal). To compute performance metrics using the confusion matrix, we should count the following four values:

- **True Positives** (**TP**): Observations that are `positive`, and are predicted to be `positive`.
- **False Positives** (**FP**): Observations that are `negative`, but are predicted `positive` (**Type 1 Error**).
- **True Negatives** (**TN**): Observations that are `negative`, and are predicted to be `negative`.
- **False Negatives** (**FN**): Observations that are `positive`, but are predicted `negative` (**Type 2 Error**).

**True** here means cases that are correctly classified as either positive or negative, while **False** indicates cases that are incorrectly classified as positive or negative.

The confusion matrix is typically used in supervised learning algorithms (in unsupervised learning, it is usually called a matching matrix), and most performance measures are computed from it. For example, the **accuracy** score would be calculated as (TP + TN) / total using the confusion matrix, which is laid out as follows:

A problem with the accuracy measure is that it assumes equal costs for both kinds of errors. A 99% accuracy can be excellent, good, mediocre, poor, or terrible depending upon the problem. It can be a reasonable initial measure if the classes in our dataset are all of similar sizes.

## Precision

**Precision** reflects the fraction of predicted positive cases/classes that actually are positive (i.e., the proportion of positive identifications that was actually correct). It is calculated by dividing the total number of correctly classified positive examples by the total number of predicted positive examples (TP / Predicted Yes), as follows:
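*Precision = TP / (TP + FP)*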

Precision is used when the model predicts Yes, and measures how often that prediction is correct. High precision indicates that an example labelled as positive is indeed positive (a small number of FP).

## Recall

We can define **Recall** as the fraction of true positive classes that are found by the classifier (TP / Actual Yes), i.e., the proportion of actual positives that was identified correctly. It is sometimes called **Sensitivity**, and it is the ratio of the total number of correctly classified positive examples to the total number of positive examples, so it is calculated as follows:
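*Recall = TP / (TP + FN)*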

Recall is used when the actual value is Yes, and measures how often the classifier predicts Yes. It is also known as **Sensitivity** or the **True Positive Rate (TPR)**. High recall indicates that the class is correctly recognized (a small number of FN). Sensitivity tells us, among ALL the positive cases in the dataset, how many of them are successfully identified by the algorithm, i.e., the true positives. In other words, it measures the proportion of accurately identified positive cases. You can think of highly sensitive tests as being good for ruling out a condition: if someone has a negative result on a highly sensitive algorithm, it is extremely likely that they do not have the condition, since a highly sensitive algorithm has few false negatives.

Note that:

- High recall, low precision: most of the positive examples are correctly recognized (low FN), but there are a lot of false positives.
- Low recall, high precision: we miss a lot of positive examples (high FN), but those we predict as positive are indeed positive (low FP).

## F1-Score

Since we have two measures (Precision and Recall), it helps to have a single measurement that represents both of them. We calculate an F-measure, which uses the **Harmonic Mean** in place of the arithmetic mean, as it punishes extreme values more. The F-measure will always be nearer to the smaller value of Precision or Recall. The **F1-Score** is a measure of a test's accuracy expressed in terms of Precision and Recall (the harmonic mean of precision and recall); it can be considered a measure that punishes false negatives and false positives equally, weighted by their inverse fractional contribution to the full set to account for large class hierarchies. It is computed as follows:
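*F1-Score = 2 × (Precision × Recall) / (Precision + Recall)*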

## Fall-out (False Positive Rate)

The fall-out is used when the actual value is No, and measures how often the classifier predicts Yes. It is computed as FP / Actual No:
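*Fall-out (FPR) = FP / (FP + TN)*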

## Specificity

Specificity measures, among ALL the negative cases in the dataset, how many of them are successfully identified by the algorithm, i.e., the true negatives. In other words, it measures the proportion of accurately identified negative cases. You can think of highly specific tests as being good for ruling in a condition: if someone has a positive result on a highly specific test, it is extremely likely that they actually are positive, since a highly specific algorithm has few false positives. It is also known as the **True Negative Rate**; it is used when the actual value is No, and measures how often the classifier predicts No.
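*Specificity = TN / (TN + FP)*

To make the above metric definitions concrete, here is a minimal sketch that computes them from the four confusion-matrix counts. The counts used below (TP=13, FP=1, TN=20, FN=4) are purely hypothetical values chosen only to illustrate the formulas:

```
# Hypothetical confusion-matrix counts for a binary classifier (illustration only)
TP, FP, TN, FN = 13, 1, 20, 4

accuracy    = (TP + TN) / (TP + TN + FP + FN)
precision   = TP / (TP + FP)
recall      = TP / (TP + FN)            # sensitivity / true positive rate
f1_score    = 2 * precision * recall / (precision + recall)
fall_out    = FP / (FP + TN)            # false positive rate
specificity = TN / (TN + FP)            # true negative rate

print("Accuracy: {:.2f}, Precision: {:.2f}, Recall: {:.2f}".format(accuracy, precision, recall))
print("F1-Score: {:.2f}, Fall-out: {:.2f}, Specificity: {:.2f}".format(f1_score, fall_out, specificity))
```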

To try calculating our previously mentioned evaluation metrics, this is where the test set that we created earlier comes in. This data was not used to build the model, but we do know what the correct species is for each iris in the test set. Therefore, we can make a prediction for each iris in the test data and compare it against its label (the known species). Use the following code:

```
y_pred = knn.predict(X_test)
print("Test set predictions:\n {}".format(y_pred))
```

Test set predictions:
['virginica' 'versicolor' 'setosa' 'virginica' 'setosa' 'virginica'
'setosa' 'versicolor' 'versicolor' 'versicolor' 'virginica' 'versicolor'
'versicolor' 'versicolor' 'versicolor' 'setosa' 'versicolor' 'versicolor'
'setosa' 'setosa' 'virginica' 'versicolor' 'setosa' 'setosa' 'virginica'
'setosa' 'setosa' 'versicolor' 'versicolor' 'setosa' 'virginica'
'versicolor' 'setosa' 'virginica' 'virginica' 'versicolor' 'setosa'
'virginica']

We can measure how well the model works by computing the **accuracy**, which is the fraction of flowers for which the right species was predicted:

`print(GetAccuracy(y_test, y_pred))`

Predictions have an accuracy of 97.37%.

We can also use the `score` method of the `knn` object, which will compute the test set **accuracy** for us:

```
print("Test set score: {:.2f}".format(knn.score(X_test, y_test)))
```

Test set score: 0.97

For this model, the test set **accuracy** is about 0.97, which means we made the right prediction for 97% of the irises in the test set. Under some mathematical assumptions, this means that we can expect our model to be correct 97% of the time for new irises. For our application, this high level of accuracy means that our model may be trustworthy enough to use. In most cases, the initial model is then fine-tuned to improve its performance and re-evaluated repeatedly until we arrive at a final, accepted model for the application.
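As a minimal sketch of what such fine-tuning might look like for our classifier (the loop and the candidate values of `k` below are illustrative assumptions, not part of the workflow above), we could compare the test set accuracy for several values of `n_neighbors`:

```
from sklearn.neighbors import KNeighborsClassifier

# Illustrative tuning loop: try several values of k and compare test set accuracy
for k in [1, 3, 5, 7, 9]:
    knn_k = KNeighborsClassifier(n_neighbors=k)
    knn_k.fit(X_train, y_train)
    print("n_neighbors={}: test set score = {:.2f}".format(k, knn_k.score(X_test, y_test)))
```

In practice, such tuning is better done with a separate validation set or with cross-validation rather than with the test set, so that the test set remains an unbiased estimate of how the final model will perform on new data.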

Now, let’s use `scikit-learn` to create our **confusion matrix** and calculate the performance evaluation metrics:

```
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
import itertools

CM = confusion_matrix(y_test, y_pred)
print('Confusion Matrix :')
print(CM)
print('Accuracy Score :', accuracy_score(y_test, y_pred))
print('Report : ')
print(classification_report(y_test, y_pred))
```

The output will be as follows:

**Confusion Matrix**:
[[13 0 0]
[ 0 15 1]
[ 0 0 9]]

**Accuracy Score**: 0.9736842105263158

**Report**:

| | precision | recall | f1-score | support |
|---|---|---|---|---|
| setosa | 1.00 | 1.00 | 1.00 | 13 |
| versicolor | 1.00 | 0.94 | 0.97 | 16 |
| virginica | 0.90 | 1.00 | 0.95 | 9 |
| accuracy | | | 0.97 | 38 |
| macro avg | 0.97 | 0.98 | 0.97 | 38 |
| weighted avg | 0.98 | 0.97 | 0.97 | 38 |

To plot the confusion matrix:

```
def plot_confusion_matrix(cm, classes, normalize=False, title='Confusion matrix', cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')
    print(cm)
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.tight_layout()

# Plot the confusion matrix
plt.figure()
plot_confusion_matrix(CM, classes=iris['target_names'],
                      title='Confusion matrix, without normalization', normalize=True)
plt.show()
```

## History

- 14th May, 2020: Initial version