Introduction

In this article I explain an implementation of the K-Means algorithm, which is used in Machine Learning. In the figures above, the left image shows unclustered data, whereas the right image shows the same data grouped into 10 clusters. For this I have created two files:
1) pyDataCluster.py
2) clusterSimulation.py
File 1 contains a class that implements K-Means, and File 2 is a simulation written with PyGame (a game library for Python). The pyDataCluster class returns the clustered data, so the result can also be viewed in the console.
Background
Machine Learning is an advanced step in AI. Instead of creating complex algorithms, simple algorithms are used with large amounts of previous data to get optimized results. This process is the basis of a learning algorithm. Clustering is the process of grouping data into classes. Different parameters can be used to group the data, depending on the situation. In the K-Means algorithm we cluster the data into groups using the mean value of each cluster. These means are computed by taking the raw data and processing it repeatedly until the means become stable.
Basic Work Flow
The basic work flow is as follows:
1) Get the data
2) Set the number of clusters you want
3) Create an empty 2D array to store the clustered data
4) For each cluster, pick a random point value to serve as its initial mean
5) For each point, calculate its distance to each mean
6) Put the point in the cluster whose mean is at the minimum distance
7) Recalculate the mean of every cluster and update the means
8) Go back to step 5 with the updated means, and repeat until the means from two consecutive repetitions are equal
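To make steps 5 to 7 concrete, here is a minimal sketch of a single pass over four hand-picked points. It is not the pyDataCluster class: the function name assign_and_update and the sample points are made up for illustration, and the distance is the sum of absolute coordinate differences, the same measure the class code later in this article uses.

```python
def assign_and_update(points, means):
    """Run steps 5-7 once: group points by nearest mean, then recompute means."""
    clusters = [[] for _ in means]  # step 3: one empty list per cluster
    for x, y in points:
        # step 5: distance from this point to every current mean
        dists = [abs(x - mx) + abs(y - my) for mx, my in means]
        # step 6: the point joins the cluster with the smallest distance
        clusters[dists.index(min(dists))].append((x, y))
    # step 7: new mean of each cluster (this sketch assumes no cluster is empty)
    new_means = [
        (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
        for c in clusters
    ]
    return clusters, new_means


points = [(1, 1), (2, 1), (9, 9), (10, 8)]
clusters, means = assign_and_update(points, means=[(0, 0), (10, 10)])
print(clusters)
print(means)
```

With the starting means (0, 0) and (10, 10), the first two points fall into the first cluster and the last two into the second, giving new means of (1.5, 1.0) and (9.5, 8.5). Step 8 would feed these means back into the same function.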
Using the code
Let's look at the code.
First, the clustering class.
To use this class in your code, do this:
import random
from pyDataCluster import *
data = []
groups = 10
for i in range(5000):
    data.append([random.randint(1, 500), random.randint(1, 500)])
cluster = pyDataCluster(groups, data)
This fills the data array with 5000 random points and creates an object named "cluster" with 10 groups and that data array.
finalCluster = cluster.finalCluster()  # returns the final clusters
clus = cluster.createCluster()  # runs a single pass; the result is not necessarily final
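Because the class returns plain nested lists, the result can be inspected without PyGame. The helper below is a small sketch of one way to do that. The name summarize_clusters is hypothetical, and it assumes the returned value is a list of clusters, each of which is a list of [x, y] points, as the code in this article suggests.

```python
def summarize_clusters(clusters):
    """Return one text line per cluster: index, point count, first point (if any)."""
    lines = []
    for index, members in enumerate(clusters):
        sample = members[0] if members else None
        lines.append(
            "cluster %d: %d points, first point %s" % (index, len(members), sample)
        )
    return lines


# Hand-written stand-in for the value returned by finalCluster().
example = [[[1, 2], [3, 4]], [], [[400, 450]]]
for line in summarize_clusters(example):
    print(line)
```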
Initialization:
The class constructor initializes the class variables.
def __init__(self,numberOfCluster,Data,initialPoints=[]):
    '''
    Constructor
    '''
    self.Kgroups=numberOfCluster      # number of clusters (K)
    self.Data=Data                    # list of [x, y] points
    self.Cluster=[]                   # filled by createCluster()
    self.Kmeans=initialPoints         # current mean of each cluster
    self.initialMeanPositions()
    self.terminat=True                # stays True until the means stop changing
Either pass the initial points or leave them out, in which case initialMeanPositions() will initialize them for you.
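One thing to watch in this signature is the default value initialPoints=[]. Python creates that list once, when the function is defined, so every object built without initial points shares the same list. If initialMeanPositions() fills it in place (its body is not shown here, so that is an assumption), a second object would start with the first object's means. The sketch below shows the usual None idiom. MeansHolder and its attribute names are hypothetical, and picking random data points as starting means is only one reading of step 4 of the work flow, not the author's actual code.

```python
import random


class MeansHolder:
    """Hypothetical stand-in for the constructor logic; not the author's class."""

    def __init__(self, number_of_clusters, data, initial_points=None):
        self.data = data
        # Copy caller-supplied points so later updates never touch the caller's list.
        self.means = list(initial_points) if initial_points else []
        if not self.means:
            # Assumed behaviour: use distinct data points as the starting means.
            self.means = [list(p) for p in random.sample(data, number_of_clusters)]


data = [[1, 1], [2, 2], [8, 8], [9, 9]]
first = MeansHolder(2, data)
second = MeansHolder(2, data)
given = MeansHolder(2, data, initial_points=[[0, 0], [5, 5]])
print(first.means is second.means)
```

Each object gets its own list here, so the comparison prints False.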
Create Cluster:
def createCluster(self):
    self.clusterSpace()                     # start from empty clusters
    for i in self.Data:
        point=[i[0],i[1]]
        group=self.getClusterGroup(point)   # index of the nearest mean
        self.Cluster[group].append(i)
    self.setMeans()                         # update the means for the next pass
    return(self.Cluster)
This function is the workhorse of the class. It creates the clusters of data around the current mean points. Calling this function repeatedly on the given data results in better clusters.
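One edge case in this step: a group may receive no points at all. createCluster ends by calling setMeans (shown later), which divides by the size of each group, so an empty group would raise ZeroDivisionError. A minimal sketch of one possible guard follows. The helper name cluster_mean is hypothetical, and keeping the previous mean for an empty group is just one common choice, not something the original class does.

```python
def cluster_mean(members, previous_mean):
    """Mean of a cluster's points, or the previous mean when the cluster is empty."""
    if not members:
        return list(previous_mean)
    n = len(members)
    return [sum(p[0] for p in members) / n, sum(p[1] for p in members) / n]


print(cluster_mean([[1, 2], [3, 6]], previous_mean=[0, 0]))
print(cluster_mean([], previous_mean=[3, 4]))
```

Another option would be to re-seed the empty group with a fresh random point; which one is better depends on the data.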
Final Cluster:
To get the final cluster, this function does the job:
def finalCluster(self):
    while self.terminat:
        clus=self.createCluster()
    return(clus)
This function simply loops until the termination signal is given by the "setMeans" function.
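The loop above has no upper bound on the number of passes, so if the means ever kept alternating between two states it would not return. A hedged sketch of a capped loop is shown below. run_until_stable and its step argument are hypothetical names: step stands for any callable that performs one pass and returns True once the means have stopped changing.

```python
def run_until_stable(step, max_iterations=100):
    """Call step() until it reports stability or the cap is reached.

    Returns the number of calls that were made.
    """
    for iteration in range(1, max_iterations + 1):
        if step():
            return iteration
    return max_iterations


# Fake step for demonstration: reports stability on its third call.
calls = {"n": 0}


def fake_step():
    calls["n"] += 1
    return calls["n"] >= 3


print(run_until_stable(fake_step))  # prints 3
```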
setMeans:
To set the means, this function does the job described in the basic work flow:
def setMeans(self):
    means=[]
    x=0
    y=0
    for i in self.Cluster:
        for j in i:                 # sum the coordinates of every point in the cluster
            x=x+j[0]
            y=y+j[1]
        means.append([math.floor(x/len(i)),math.floor(y/len(i))])
        x=0                         # reset the sums for the next cluster
        y=0
    if(self.Kmeans==means):         # means unchanged since the last pass: stop
        self.terminat=False
    self.Kmeans=[]
    self.Kmeans=means
Assigning the Cluster Group:
This function returns the index of the group to which a given point belongs:
def getClusterGroup(self,point):
    dist=[]
    for i in self.Kmeans:
        # sum of absolute coordinate differences between the point and this mean
        dist.append(math.fabs(point[0]-i[0])+math.fabs(point[1]-i[1]))
    minIndex = dist.index(min(dist))
    return minIndex
Empty Cluster:
Every run needs an empty cluster. This function flushes the old values, if any, and creates an empty one:
def clusterSpace(self):
    self.Cluster=[]
    for i in range(self.Kgroups):
        self.Cluster.append([])
Up to this point the clustering is complete; now for the simulation part.
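Before moving on, note that getClusterGroup above measures distance as the sum of absolute coordinate differences (Manhattan distance), not the straight-line (Euclidean) distance with which K-Means is usually described. The two can disagree about which mean is nearest. The sketch below illustrates this; nearest_mean and its metric parameter are hypothetical and are not part of pyDataCluster.

```python
import math


def nearest_mean(point, means, metric="manhattan"):
    """Return the index of the mean closest to point under the chosen metric."""
    if metric == "manhattan":
        dists = [abs(point[0] - m[0]) + abs(point[1] - m[1]) for m in means]
    elif metric == "euclidean":
        dists = [math.hypot(point[0] - m[0], point[1] - m[1]) for m in means]
    else:
        raise ValueError("unknown metric: %r" % metric)
    return dists.index(min(dists))


means = [(3, 3), (0, 5)]
print(nearest_mean((0, 0), means, "manhattan"))  # distances 6 and 5
print(nearest_mean((0, 0), means, "euclidean"))  # distances about 4.24 and 5
```

For the point (0, 0), the Manhattan version picks the second mean while the Euclidean version picks the first, so the choice of metric can change the resulting clusters.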
clusterSimulation:
This requires the PyGame library, which can be downloaded from its site.
import pygame, sys, time
import random
from pygame.locals import *
from pyDataCluster import *
data=[]
groups=10
for i in range(5000):
    data.append([random.randint(1,500),random.randint(1,500)])
cluster = pyDataCluster(groups,data)
# pick one distinct random color per group
Color=[]
for i in range(groups):
    while True:
        cl=((random.randint(0,255)),(random.randint(0,255)),(random.randint(0,255)))
        if cl not in Color:
            Color.append(cl)
            break
pygame.init()
WINDOWWIDTH = 500
WINDOWHEIGHT = 500
BASICFONT = pygame.font.Font('freesansbold.ttf',50)
windowSurface = pygame.display.set_mode((WINDOWWIDTH, WINDOWHEIGHT), 0, 32)
pygame.display.set_caption('Cluster Simulation')
BLACK = (0, 0, 0)
RED = (255, 0, 0)
GREEN = (0, 255, 0)
BLUE = (0, 0, 255)
WHITE=(255,255,255)
# run one clustering pass per frame until the means stop changing
while cluster.terminat:
    points=[]
    clus=cluster.createCluster()
    a=0
    for i in clus:
        for j in i:
            points.append({'rect':pygame.Rect(j[0],j[1],4,4),'color':Color[a]})
        a=a+1    # next cluster, next color
    for p in points:
        pygame.draw.rect(windowSurface, p['color'], p['rect'])
    pygame.display.update()
    #time.sleep(0.05)
# keep the window open until the user closes it
while True:
    # check for the QUIT event
    for event in pygame.event.get():
        if event.type == QUIT:
            pygame.quit()
            sys.exit()
Try changing the amount of data and the number of groups to see the effects.
Started Software and Web Development in 2010 at CEME NUST Pakistan. Interested in Artificial Intelligence, Web Technologies and Software development using popular platforms and languages.