Step-by-Step Guide To Implement Machine Learning I - KNN
Easy to implement machine learning
Introduction
K-nearest neighbors (KNN) is a simple machine learning algorithm. Its principle is to calculate the distance between the test object and every object in the training set, and then select the training objects with the smallest distances until a predefined number of nearest neighbors has been collected into the KNN set $N_k(x)$. The predicted label $y$ of the test object $x$ is then given by the majority vote

$$y = \arg\max_{c_j} \sum_{x_i \in N_k(x)} I(y_i = c_j), \quad j = 1, 2, \ldots, m$$

where $N_k(x)$ is the KNN set and $I(\cdot)$ is the indicator function, which equals 1 when $y_i = c_j$ and 0 otherwise.
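To make the formula concrete, here is a minimal sketch of the majority vote with numpy; the neighbor labels are made-up values, and the full distance-based implementation follows later in the article:

import numpy as np

# labels of the K nearest neighbors (toy values, for illustration only)
neighbor_labels = np.array([1, 0, 1, 1, 2])

# y = argmax over classes of the number of neighbors carrying that class label
classes, votes = np.unique(neighbor_labels, return_counts=True)
y = classes[np.argmax(votes)]
print(y)  # prints 1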
KNN Model
The KNN model consists of three parts: distance calculation, selection of K, and classification.
Distance Calculation
Distance is used to measure the similarity between two objects in the feature space. There are many ways to calculate the distance between two objects, such as Euclidean distance, cosine distance, edit distance, and Manhattan distance. The simplest is Euclidean distance, which is calculated as

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

where $n$ is the feature dimension.
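As a small illustration (with made-up vectors), the Euclidean and Manhattan distances mentioned above can be computed with numpy as follows; only the Euclidean version is used in the rest of the article:

import numpy as np

a = np.array([1.0, 2.0, 3.0])  # toy feature vectors, for illustration only
b = np.array([2.0, 0.0, 4.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))  # square root of the sum of squared differences
manhattan = np.sum(np.abs(a - b))          # sum of absolute differences

print(euclidean, manhattan)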
Because features have different scales, features with larger values have a greater effect on the distance. Thus, all the features need to be normalized. There are two common normalization methods. One is min-max normalization, which is given by

$$x' = \frac{x - \min(x)}{\max(x) - \min(x)}$$
def Normalization(self, data):
    # get the max and min value of each column
    minValue = data.min(axis=0)
    maxValue = data.max(axis=0)
    diff = maxValue - minValue
    # min-max normalization: (x - min) / (max - min), column by column
    mindata = np.tile(minValue, (data.shape[0], 1))
    normdata = (data - mindata) / np.tile(diff, (data.shape[0], 1))
    return normdata
The other is z-score normalization, which is given by

$$x' = \frac{x - \mu}{\sigma}$$

where $\mu$ is the mean of $x$ and $\sigma$ is the standard deviation of $x$.
def Standardization(self, data):
    # get the mean and the standard deviation of each column
    meanValue = data.mean(axis=0)
    varValue = data.std(axis=0)
    # z-score normalization: (x - mean) / std, column by column
    standarddata = (data - np.tile(meanValue, (data.shape[0], 1))) / np.tile(varValue, (data.shape[0], 1))
    return standarddata
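To see what the two transforms do, here is a minimal standalone sketch on a toy matrix; it uses numpy broadcasting instead of np.tile, which produces the same result, and the class wrapper is omitted for brevity:

import numpy as np

# toy data: two features on very different scales (values made up for the example)
data = np.array([[1.0, 200.0],
                 [2.0, 300.0],
                 [3.0, 400.0]])

# min-max normalization: every column is mapped into [0, 1]
min_max = (data - data.min(axis=0)) / (data.max(axis=0) - data.min(axis=0))

# z-score normalization: every column gets zero mean and unit standard deviation
z_score = (data - data.mean(axis=0)) / data.std(axis=0)

print(min_max)
print(z_score)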
After normalization, the code for the Euclidean distance is shown below:

train_num = train_data.shape[0]
# calculate the Euclidean distance between the test sample and every training sample
distances = np.tile(input_sample, (train_num, 1)) - train_data
distances = distances**2
distances = distances.sum(axis=1)
distances = distances**0.5
Select K
If we select a small K, the model is learned from a small neighborhood, which leads to a small approximation error but a large estimation error. In short, a small K makes the model more complex and prone to overfitting. On the contrary, if we select a large K, the model is learned from a large neighborhood, which leads to a large approximation error but a small estimation error. In short, a large K makes the model simpler, but also increases the amount of computation.
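In practice, a common way to pick K is by validation rather than by hand. The following is a minimal sketch of selecting K with cross-validation; scikit-learn's KNeighborsClassifier and the iris dataset are used purely as stand-ins here, not the implementation or data of this article:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# evaluate several candidate values of K and keep the one with the best mean accuracy
scores = {}
for k in range(1, 16):
    knn = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(knn, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])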
Classify
After determining K, we can use voting for classification, which means the minority is subordinate to the majority. The code is shown below:
disIndex = distances.argsort()
labelCount = {}
for i in range(k):
    label = train_label[disIndex[i]]
    labelCount[label] = labelCount.get(label, 0) + 1
# sort by vote count; op refers to the operator module (import operator as op)
prediction = sorted(labelCount.items(), key=op.itemgetter(1), reverse=True)
label = prediction[0][0]
In the above code, we first use argsort()
to get the indices of the training samples ordered by distance, then count each kind of label among the first K samples, and finally sort labelCount
to get the label with the most votes, which is the prediction for the test object. The whole prediction function is shown below:
# module-level imports used by this method
import numpy as np
import operator as op

def calcuateDistance(self, input_sample, train_data, train_label, k):
    train_num = train_data.shape[0]
    # calculate the Euclidean distance between the test sample and every training sample
    distances = np.tile(input_sample, (train_num, 1)) - train_data
    distances = distances**2
    distances = distances.sum(axis=1)
    distances = distances**0.5
    # get the labels of the first k distances
    disIndex = distances.argsort()
    labelCount = {}
    for i in range(k):
        label = train_label[disIndex[i]]
        labelCount[label] = labelCount.get(label, 0) + 1
    # vote: the label with the most occurrences among the k nearest neighbors wins
    prediction = sorted(labelCount.items(), key=op.itemgetter(1), reverse=True)
    label = prediction[0][0]
    return label
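As a quick sanity check, here is a minimal usage sketch with made-up data; since the method above never touches self, None can stand in for it in this quick test:

import numpy as np

# toy training data: two small clusters (values made up for the example)
train_data = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
train_label = np.array([0, 0, 1, 1])

test_sample = np.array([0.1, 0.2])
# the method does not use self, so we pass None in its place for this standalone test
print(calcuateDistance(None, test_sample, train_data, train_label, 3))  # expected: 1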
Conclusion and Analysis
In this article, KNN is implemented by a linear traversal of the training set. However, there are more efficient methods for KNN, such as the k-d tree. Moreover, cross-validation can be applied to find a more suitable K. Finally, let's compare our KNN with the KNN in sklearn; the detection performance is displayed below.
From the figure, we can see that the KNN in this article is better than the sklearn KNN in terms of accuracy, while the runtime is nearly the same.
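For readers who want to reproduce such a comparison on their own data, the following is a minimal sketch using scikit-learn's KNeighborsClassifier; the iris dataset and K = 3 are stand-ins, not the dataset or settings used for the figure above:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# fit the sklearn KNN and measure its accuracy on the held-out split
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))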
The related code and dataset in this article can be found in MachineLearning.