Step-by-Step Guide to Implement Machine Learning X - KMeans
Easy to implement machine learning
Introduction
KMeans is a simple clustering algorithm that assigns each sample to its nearest centroid to form clusters. K is the number of clusters given by the user. Initially, the K centroids are chosen randomly, and they are adjusted at each iteration to improve the result. Because each centroid is the mean of the samples in its corresponding cluster, the algorithm is called K-"Means".
KMeans Model
KMeans
The normal KMeans algorithm is really simple. Denote the K clusters as $c_1, c_2, ..., c_K$, and the number of samples in each cluster as $N_1, N_2, ..., N_K$. The loss function of KMeans is:

$$J(\mu_1, ..., \mu_K) = \sum_{k=1}^{K}\sum_{x_i \in c_k} \|x_i - \mu_k\|^2$$

where $\mu_k$ is the centroid of the k-th cluster. Calculate the derivative of the loss function with respect to $\mu_k$:

$$\frac{\partial J}{\partial \mu_k} = -2\sum_{x_i \in c_k}(x_i - \mu_k)$$

Then, setting the derivative equal to 0, we obtain:

$$\mu_k = \frac{1}{N_k}\sum_{x_i \in c_k} x_i$$

namely, the centroid is the mean of the samples in its corresponding cluster. The code of KMeans is shown below:
def kmeans(self, train_data, k):
    sample_num = len(train_data)
    distances = np.zeros([sample_num, 2])    # (index, distance)
    centers = self.createCenter(train_data, k)   # k initial centers
    # refine the centers until the assignments stop changing
    centers, distances = self.adjustCluster(centers, distances, train_data, k)
    return centers, distances
where adjustCluster() adjusts the centroids after the initial centroids are determined, aiming to minimize the loss function. The code of adjustCluster() is:
def adjustCluster(self, centers, distances, train_data, k):
    sample_num = len(train_data)
    flag = True  # If True, update cluster centers again
    while flag:
        flag = False
        d = np.zeros([sample_num, len(centers)])
        for i in range(len(centers)):
            # calculate the distance between each sample and each cluster center
            d[:, i] = self.calculateDistance(train_data, centers[i])
        # assign each sample to its nearest cluster center
        old_label = distances[:, 0].copy()
        distances[:, 0] = np.argmin(d, axis=1)
        distances[:, 1] = np.min(d, axis=1)
        if np.any(old_label != distances[:, 0]):  # any assignment changed, keep iterating
            flag = True
            # update each cluster center by calculating the mean of its cluster
            for j in range(k):
                current_cluster = train_data[distances[:, 0] == j]  # samples belonging to the j-th cluster
                if len(current_cluster) != 0:
                    centers[j, :] = np.mean(current_cluster, axis=0)
    return centers, distances
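The code above also relies on two helper methods, createCenter() and calculateDistance(), which are not listed in this article (they live in the accompanying repository). A minimal sketch of what they might look like is given below, assuming random initialization within the bounding box of the data and Euclidean distance; the actual implementations in the repository may differ.

def createCenter(self, train_data, k):
    # assumed sketch: randomly place k centers inside the bounding box of the data
    feature_dim = train_data.shape[1]
    centers = np.zeros([k, feature_dim])
    for i in range(feature_dim):
        min_i = np.min(train_data[:, i])
        range_i = np.max(train_data[:, i]) - min_i
        centers[:, i] = min_i + range_i * np.random.rand(k)
    return centers

def calculateDistance(self, train_data, center):
    # assumed sketch: Euclidean distance from every sample to a single center
    return np.sqrt(np.sum(np.power(train_data - center, 2), axis=1))

Passing k explicitly to createCenter() keeps kmeans() reusable for the 2-way splits requested by the bisecting variant below.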
Bisecting KMeans
Because KMeans may converge to a local optimum, we introduce a variant called bisecting KMeans. In bisecting KMeans, all the samples are initially regarded as one cluster, and that cluster is divided into two parts. Then, one of the existing clusters is chosen and bisected again, and this is repeated until the number of clusters is K. The cluster to bisect is chosen according to the minimum Sum of Squared Error (SSE). Denote the current n clusters as:

$$C = \{c_1, c_2, ..., c_n\}, \quad n < K$$

where the SSE of a cluster $c$ with centroid $\mu_c$ is $SSE(c) = \sum_{x \in c} \|x - \mu_c\|^2$. We choose a cluster $c_i$ in $C$ and bisect it into two parts $c_{i1}$ and $c_{i2}$ using normal KMeans. The total SSE after this split is:

$$SSE_i = SSE(c_{i1}) + SSE(c_{i2}) + \sum_{j \neq i} SSE(c_j)$$

We choose the $c_i$ which gives the minimum total SSE as the cluster to be bisected, namely:

$$i^* = \arg\min_i SSE_i$$

Repeat the above process until the number of clusters is K. The code of bisecting KMeans is:
def biKmeans(self, train_data):
    sample_num = len(train_data)
    distances = np.zeros([sample_num, 2])          # (index, distance)
    initial_center = np.mean(train_data, axis=0)   # initial cluster center, shape (feature_dim,)
    centers = [initial_center]                     # cluster center list
    # clustering with the initial cluster center
    distances[:, 1] = self.calculateDistance(train_data, initial_center)
    # generate cluster centers
    while len(centers) < self.k:
        min_SSE = np.inf
        best_index = None
        best_centers = None
        best_distances = None
        # find the best split
        for j in range(len(centers)):
            centerj_data = train_data[distances[:, 0] == j]    # samples belonging to the j-th center
            split_centers, split_distances = self.kmeans(centerj_data, 2)
            split_SSE = np.sum(split_distances[:, 1] ** 2)     # SSE of the j-th cluster after splitting
            other_distances = distances[distances[:, 0] != j]  # samples not belonging to the j-th center
            other_SSE = np.sum(other_distances[:, 1] ** 2)     # SSE of the remaining clusters
            # save the best split result
            if (split_SSE + other_SSE) < min_SSE:
                best_index = j
                best_centers = split_centers
                best_distances = split_distances
                min_SSE = split_SSE + other_SSE
        # save the split result: relabel the two new sub-clusters
        best_distances[best_distances[:, 0] == 1, 0] = len(centers)
        best_distances[best_distances[:, 0] == 0, 0] = best_index
        centers[best_index] = best_centers[0, :]
        centers.append(best_centers[1, :])
        distances[distances[:, 0] == best_index, :] = best_distances
    centers = np.array(centers)  # transform from list to array
    return centers, distances
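To make the split criterion concrete, the short sketch below evaluates the SSE of a hypothetical 2-way split of one cluster plus the SSE of the untouched samples, mirroring the split_SSE + other_SSE comparison above. The data, labels, and the split itself are made up purely for illustration.

import numpy as np

# toy 2-D data: cluster 0 holds four samples, cluster 1 holds two
data = np.array([[0.0, 0.0], [0.2, 0.1], [4.0, 4.0], [4.1, 3.9],
                 [10.0, 10.0], [10.2, 9.8]])
labels = np.array([0, 0, 0, 0, 1, 1])

def sse(samples, center):
    # sum of squared Euclidean distances to a center
    return np.sum(np.sum((samples - center) ** 2, axis=1))

# candidate: bisect cluster 0 into two sub-clusters (split chosen by hand here)
left, right = data[labels == 0][:2], data[labels == 0][2:]
split_SSE = sse(left, left.mean(axis=0)) + sse(right, right.mean(axis=0))

# SSE of the samples that stay in the other cluster(s)
others = data[labels != 0]
other_SSE = sse(others, others.mean(axis=0))

print(split_SSE + other_SSE)  # total SSE if cluster 0 is the one bisected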
KMeans++
Because the initial centroids have a great effect on the performance of KMeans, we introduce another variant called KMeans++. Denote the current n centroids as:

$$C = \{c_1, c_2, ..., c_n\}, \quad n < K$$

When we choose the (n+1)-th centroid, the farther a sample is from the existing centroids, the more likely it is to be chosen as the new centroid. First, we calculate the minimum distance between each sample $x$ and the existing centroids:

$$D(x) = \min_{i = 1, ..., n} \|x - c_i\|$$

Then, we calculate the probability of each sample being chosen as the next centroid:

$$P(x) = \frac{D(x)^2}{\sum_{x' \in X} D(x')^2}$$

Then, we use roulette wheel selection to pick the next centroid. After the K initial centroids are determined, run adjustCluster() to refine the result. The code of KMeans++ is:
def kmeansplusplus(self, train_data):
    sample_num = len(train_data)
    distances = np.zeros([sample_num, 2])    # (index, distance)
    # randomly select a sample as the initial cluster center
    initial_center = train_data[np.random.randint(0, sample_num)]  # upper bound is exclusive
    centers = [initial_center]
    while len(centers) < self.k:
        d = np.zeros([sample_num, len(centers)])
        for i in range(len(centers)):
            # calculate the distance between each sample and each cluster center
            d[:, i] = self.calculateDistance(train_data, centers[i])
        # find the minimum distance between each sample and the existing centers
        distances[:, 0] = np.argmin(d, axis=1)
        distances[:, 1] = np.min(d, axis=1)
        # Roulette Wheel Selection: P(x) is proportional to D(x)^2
        prob = np.power(distances[:, 1], 2) / np.sum(np.power(distances[:, 1], 2))
        index = self.rouletteWheelSelection(prob, sample_num)
        new_center = train_data[index, :]
        centers.append(new_center)
    # adjust clusters with the K initial centers
    centers = np.array(centers)  # transform from list to array
    centers, distances = self.adjustCluster(centers, distances, train_data, self.k)
    return centers, distances
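The rouletteWheelSelection() helper is not listed in this article either. A minimal sketch is given below, assuming it draws a single sample index according to the probability vector prob (its second argument being the number of samples); the version in the repository may be implemented differently.

def rouletteWheelSelection(self, prob, sample_num):
    # draw a random number in [0, 1) and walk the cumulative distribution
    r = np.random.rand()
    cumulative = 0.0
    for i in range(sample_num):
        cumulative += prob[i]
        if r < cumulative:
            return i
    return sample_num - 1  # guard against floating-point round-off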
Conclusion and Analysis
Indeed, it is necessary to adjust the clusters after the initial centroids are determined, and there also exist algorithms for choosing the parameter 'K' automatically. Finally, let's compare the performances of the three clustering algorithms.
It can be found that KMeans++ has the best performance.
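As a usage sketch, the snippet below shows how the three methods could be driven on a small synthetic dataset, assuming the methods above are gathered in a class named KMeans whose constructor stores k as self.k. The class name, constructor, and test data are assumptions for illustration, not the article's actual interface or dataset.

import numpy as np

np.random.seed(0)
# synthetic data: three Gaussian blobs in 2-D (hypothetical test data)
train_data = np.vstack([
    np.random.randn(50, 2) + [0, 0],
    np.random.randn(50, 2) + [5, 5],
    np.random.randn(50, 2) + [0, 5],
])

model = KMeans(k=3)                                   # hypothetical wrapper class
centers1, dist1 = model.kmeans(train_data, 3)         # normal KMeans
centers2, dist2 = model.biKmeans(train_data)          # bisecting KMeans
centers3, dist3 = model.kmeansplusplus(train_data)    # KMeans++

for name, dist in [("KMeans", dist1), ("biKmeans", dist2), ("KMeans++", dist3)]:
    print(name, "SSE:", np.sum(dist[:, 1] ** 2))      # compare total SSE of each result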
The related code and dataset used in this article can be found in MachineLearning.
History
- 3rd June, 2019: Initial version