Click here to Skip to main content
Click here to Skip to main content

K-Means Clustering Used in Intention Based Scoring Projects

By , 3 Jan 2009
 

Introduction

Intention based scoring (IBS) is a scoring system, its goal is to provide a score that more accurately assesses a student’s ability to solve a composition problem, and therefore assess direct effects of the student's programming ability. This process is dubbed “intention based” because, firstly, the programs are inspected by inferring what goals the student is trying to achieve, and secondly, what plan component is being attempted when a program statement is not syntactically correct.

The three steps necessary to perform IBS were introduced by [1]: inspection of an on-line protocol, bug identification, and scoring.

The final goal of this project is to create an automated Intention Based Score for each programming assignment, and use K-Mean Clustering to identify which students belong to which group.

K-Means algorithm

The K-Means algorithm is an algorithm to cluster n objects based on attributes into k partitions, k < n. It is similar to the expectation-maximization algorithm for mixtures of Gaussians in that they both attempt to find the centers of natural clusters in the data. It assumes that the object attributes form a vector space. The objective it tries to achieve is to minimize total intra-cluster variance, or, the squared error function:

formula.png

where there are k clusters Si, i = 1, 2, ..., k, and µi is the centroid or mean point of all the points xj in Si. The K-Means clustering was invented in 1956. The most common form of the algorithm uses an iterative refinement heuristic known as Lloyd's algorithm. Lloyd's algorithm starts by partitioning the input points into k initial sets, either at random, or using some heuristic data. It then calculates the mean point, or centroid, of each set. It constructs a new partition by associating each point with the closest centroid. Then, the centroids are recalculated for the new clusters, and algorithm repeated by alternate application of these two steps until convergence, which is obtained when the points no longer switch clusters (or alternatively, the centroids are no longer changed).

Lloyd's algorithm and K-means are often used synonymously, but in reality, Lloyd's algorithm is a heuristic for solving the K-Means problem, as with certain combinations of starting points and centroids, Lloyd's algorithm can, in fact, converge to the wrong answer (i.e., a different and optimal answer to the minimization function above exists). Other variations exist, but Lloyd's algorithm has remained popular because it converges extremely quickly in practice. In fact, many have observed that the number of iterations is typically much less than the number of points. Recently, however, David Arthur and Sergei Vassilvitskii showed that there exist certain point sets on which K-Means takes super-polynomial time: 2Ω(√n) to converge approximate K-Means algorithms have been designed that make use of core sets: small subsets of the original data.

In terms of performance, the algorithm is not guaranteed to return a global optimum. The quality of the final solution depends largely on the initial set of clusters, and may, in practice, be much poorer than the global optimum. Since the algorithm is extremely fast, a common method is to run the algorithm several times and return the best clustering found. A drawback of the K-Means algorithm is that the number of clusters k is an input parameter. An inappropriate choice of k may yield poor results. The algorithm also assumes that the variance is an appropriate measure of cluster scatter.

algo.gif

The K-Means clustering algorithm was developed by J. MacQueen (1967) and then by J. A. Hartigan and M. A. Wong around 1975. Simply speaking, K-Means clustering is an algorithm to classify or to group your objects based on attributes/features, into K number of groups. K is a positive integer number. The grouping is done by minimizing the sum of squares of distances between data and the corresponding cluster centroid. Thus, the purpose of K-means clustering is to classify the data.

Pseudo code

Initialize the algorithm either heuristically or randomly. Iterate the following steps until convergence (stopping criteria met):

  1. For each data point x, compute its membership in clusters by choosing the nearest centroid.
  2. For each centroid, recompute its location according to members:
    var m = initialCentroids(x, K);
    var N = x.length;
    while (!stoppingCriteria) {
        var w = [][];
        // calculate membership in clusters
        for (var n = 1; n <= N; n++) {
        v = arg min (v0) dist(m[v0], x[n]);
        w[v].push(n);
    }
    // recompute the centroids
    for (var k = 1; k <= K; k++) {
        m[k] = avg(x in w[k]);
    }
}
return m;

For example, please see K-Means clustering tutorials.

Program

To cluster students, we set certain attributes inside an array for each student. For the purposes of this article, I have only used three variables for each student.

double[] Student = new double[] { Compiles, Errors, Time(Minutes), ETC…. };

In my sample, I used the following:

double[] StudentA = new double[] { 15, 32, 35.6 };
double[] StudentB = new double[] { 19, 54, 65.1 };
double[] StudentC = new double[] { 22, 95, 45.6 };

The number of attributes should all be equal; so, if a student does not have an assignment, the value will be zero. To add a collection of clusters, just add your data as:

double[,] data = { StudentA, StudentB, StudentC};
double[,] data = { { 15, 32, 35.6 }, { 19, 54, 65.1 }, {22,95,45.6 } };

To get the clusters for these three students, you set a K for the number of cluster you would want to partition them in to. For example, K=2:

clusters = KMeans.ClusterDataSet( K, data);

ClusterCollection clusters;
clusters = KMeans.ClusterDataSet(2, data);

The output of the ClusterDataSet function will be of type CollectionBase. This data is serialized so it can be converted to XML and so on as you wish. I have placed this in a DataGrid for viewing.

Other functions that can be used are:

//Test the Euclidean Distance calculation between two data points
double distance = KMeans.EuclideanDistance(StudentA, StudentB);
Response.Write("<br/>Euclidean Distance A and B:" + distance);

Euclidean distance or simply 'distance' examines the root of square differences between coordinates of a pair of objects.

//Test the Manhattan Distance calculation between two data points
double distances = KMeans.ManhattanDistance(StudentA, StudentB);
Response.Write("<br/>Manhattan Distance A and B:" + distances);

It is also known as Manhattan distance, boxcar distance, or absolute value distance. It represents the distance between points in a city road grid. It examines the absolute differences between coordinates of a pair of objects.

//Test the Cluster Mean calculation between two data points
double[,] cluster = { { 15, 32, 35.6 }, { 19, 54, 65.1 } };
double[] centroid = KMeans.ClusterMean(cluster);
Response.Write("<br/>Cluster mean Calc: "+ centroid.ToString());

Sources

  1. H. Chad Lane, Kurt VanLehn; Intention Based Scoring: An Approach to Measuring Success at Solving the Composition Problem; Technical Symposium on Computer Science Education. Proceedings of the 36th SIGCSE technical symposium on Computer Science education, 2005, St. Louis, Missouri, USA.
  2. K-Means clustering tutorials
  3. Wikipedia

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

saharkiz
Web Developer
Philippines Philippines
Member
My name : Aresh Saharkhiz.
Origin: Unknown
 
Education : BS Computer Science
MS Computer Science
Interests : Genetic Programming
Neural Networks
Game AI
Programming: (language == statementEnd(semicolon)
 

http://sites.google.com/site/docaresh
 
Skill:
Flash
Carrara 3D
PHP,ASP,ASP.NET
J2SE

Sign Up to vote   Poor Excellent
Add a reason or comment to your vote: x
Votes of 3 or less require a comment

Comments and Discussions

 
You must Sign In to use this message board.
Search this forum  
    Spacing  Noise  Layout  Per page   
GeneralMy vote of 4memberhiphopper012353mins ago 
thanx for the code
but how to show clustered data
QuestionI just want a 2-d array of all the points.memberrohit5766 Jul '12 - 23:56 
Hello,
 
I just wanted a 2-d array of all the points from the k-means. I can see the results when I hover on the clusters object, but can't find a way to access them in code. How do I do this?
Questioneuclidean distancememberMember 384042825 Aug '11 - 5:31 
for most cases a SSD should do, hence the final square root is not necessary. Also, a multiplication by self should be more efficient than calling "pow", and the ABS is not required in any case, therefore:
 
for (int i = 0; i < count; i++)
{
    sum = sum + Math.Pow(Math.Abs(X[i] - Y[i]),2);
}
distance = Math.Sqrt(sum);
 
becomes:
 
for (int i = 0; i &lt; count; i++)
{
    d = X[i] - Y[i];
    distance = distance + d * d;
}

General[My vote of 2] Plagiarismmemberjuggler19 Apr '11 - 1:17 
This is lifted wholesale from the excellent site at http://people.revoledu.com/kardi/tutorial/kMean/download.htm[^] without accrediting it.
GeneralRe: [My vote of 2] Plagiarismmembersaharkiz19 Jun '11 - 2:13 
have your eye sight checked
GeneralBugsmemberWernight29 Mar '11 - 2:28 
if (EuclidianDistance(newClusters[clusterIndex].ClusterMean, clusters[clusterIndex].ClusterMean) == 0))

 
Given that we're working with numerical floating points values, this should be:
 
if (Math.Abs(EuclidianDistance(newClusters[clusterIndex].ClusterMean, clusters[clusterIndex].ClusterMean)) < double.Epsilon)

 
And since the ClusterMean is only valid for non-empty clusters, this is still invalid. But instead of adding a check for that, it would be faster to check that points didn't move from one cluster to another. Since to avoid almost doubling the memory usage a different data structure would is required, once can still prefer adding that check.
 
                    if (newClusters[clusterIndex].Count == clusters[clusterIndex].Count
                        && (newClusters[clusterIndex].Count == 0 // Meaning it remains an empty cluster.
                            || Math.Abs(distance(newClusters[clusterIndex].ClusterMean, clusters[clusterIndex].ClusterMean)) < double.Epsilon))
 
This fix has to be applied also on clusterMean = clusters[cluster].ClusterMean; (left to the reader).
 

The block that serves to initially create random clusters is bugged and could also be greatly improved. First random.Next(min, max) returns a value in [min, max) (excluding max). That's the first bug. When the number of desired cluster 'clusterCount' is larger than the number of 'data' points it's another bug.
 
random.Next(0, rowCount - 1);
becomes
random.Next(0, rowCount);
 
and add a guard:
 
if (clusterCount > rowCount)
    throw new ApplicationException("There be no more clusters than data points.");
B.A.N.Z.A.I. ! !! !!!

Questionhow to use it to solve my problemmemberUnruled Boy26 Jan '11 - 21:29 
http://www.codeproject.com/Questions/150817/pattern-recognition-for-data-partition.aspx[^]
Regards,
unruledboy_at_gmail_dot_com
http://www.xnlab.com

Generalk-means clustering for large amount of data setsmemberGayathri Asokan5 Oct '10 - 5:26 
Dear Sir,
 
I saw the article and codings provided by you for the project , "K-Means Clustering Used in Intention Based Scoring Projects". Its very nice. I need one help from you. Now i am doing one project in the area of data mining.In my project K-means clustering algorithm is one of the module. So i need some codings for k-means algorithm , in that the clustering technique is used for clustering large amount of data sets . How can i do it in C language.If you have some idea about it means , please help me,... what i need is "implementation of k-means algorithm for clustering large amount of data points using C".
 
"Awaiting for your reply"
 

 

Regards
Gayathri
GeneralVery good article on K-means.groupTrevor Misfeldt18 Oct '09 - 18:24 
Have you thought about contrasting it more with other forms of clustering? K-means requires a lot less memory than hierarchical cluster analysis, for example. Non-negative matrix factorization (NMF) is getting some good traction in the biosciences fields.
 
--
CenterSpace Software
www.centerspace.net

Questioncan this cluster academic paper?memberkv400012 May '09 - 17:06 
3Q for ur share!
Now i want to cluster into different research direction(such as data mining, network,computer vision ,etc),is this feasible?
 
btw: the number of research direction is not clear,but the concrete direction is included in the academic paper!
 
3Q for ur reply in advance!
 
http://www.ibaima.com(Online Knowledge Question/Answer Platform)

QuestionStudent & Clustermemberseoulaja20 Apr '09 - 18:57 
How I know which student in which cluster?
for example :
 
which cluster is student A?
 
when i try to write:
 
clusters1(0).Item(0)
 
the output is System.Double[]
 
and i want to know the name of cluster, example 'clusters1(0)' must be 'Cluster 1', etc
 
thx
Questionthis can be applied to nXn datasetmemberhellohowur11 Apr '09 - 18:44 
Can this algo be applied to NxN dataset
Like for following dataset of voting
 
Class handicapped | infants | adoption | religious-groups
republican | n | y | y | y
democrat | y | n | n | y
GeneralVery Nice Article if you are into Unsupervised ClusteringmemberMember 46013655 Jan '09 - 23:49 
This article is very Good, but only if you want to Group lets say K Entities with something in common, like unsupervised Classification.
 
One of the more interesting features of the C-Means and the more precise and probablistic based clustering Fuzzy C-Means, is to classify Datasets.
 
Imagine in your case (is easy) you could classify students as [Bad, Average, Good, Very Good].
 
Very Shortly (maybe later i will post something Based on C# ive worked in this area mostly with Matlab).
 
The purpose of a Classifier would be DataUnclassified -> |Classifier| -> Classified Data.
 
This has a lot of work to be done until youve found a good classifier, especially if your dealling with Data with N-Dimensons, like Cancer Data etc.
 
Mostly you use part of an already classified Data to train your classifier, the best way you evaluate your classifier is using something like Cross-Validation etc...
 
Once your classifier is Prepared for a certain type of Data you could use it to determine, for example:

Imagine you have all Data related to customers on a supermarket, with a very good classifier you could determine the number of diferent types of customers, on your population of customers.
 
Using this value data to approach to each of your customers in diferent ways.
 
This was just to give a simple idea, you could check on the web fo Data Mining, Machine Learning one of their applications is for image recognition etc..
 
Cheers.
GeneralIntention based scoring and clusteringmemberjconwell5 Jan '09 - 9:05 
You started this article talking about IBS, and then switched to clustering. I'm curious as to how you are going to use a kmeans clustering with Intention based scoring. What problem will it help solve? What does the data look like.
 
More info on your project would be very interesting, and give context as to how people could use clustering to help solve other problems.
 
John Conwell

GeneralRe: Intention based scoring and clusteringmemberaresguya5 Jan '09 - 14:28 
K-Means Clustering is a very small portion of this project. the big part of this is to automate the scoring of the IBS. the problem that im going to tackle will be human and computer behavior, more exact is novice programmers. so the idea is suppose to go like this :
i grab a snapshot of each compilation of a novice programmer in attempting to solve a composition problem. using a program (in the process of creation) i will perform the 3 IBS tasks and im hoping the final output will be a score for the entire program; as i mentioned an intention based score. now in a course of a semester a student will have several programming tasks or assignments to accomplish, these scores taken from an IBS will be used to cluster students to my desired need.so the end result will be a clustering of students based on an IBS. using that data and psychology or some theory, there may or may NOT be a relationship to determining a fast learning student or a slow learning student. more or less words said, i hope its some what clear on what my final dissertation is about! Smile | :)
GeneralRe: Intention based scoring and clusteringmemberGayathri Asokan5 Oct '10 - 16:41 
Dear Sir,
 
I saw the article and codings provided by you for the project , "K-Means Clustering Used in Intention Based Scoring Projects". Its very nice. I need one help from you. Now i am doing one project in the area of data mining.In my project K-means clustering algorithm is one of the module. So i need some codings for k-means algorithm , in that the clustering technique is used for clustering large amount of data sets . How can i do it in C language.If you have some idea about it means , please help me,... what i need is "implementation of k-means algorithm for clustering large amount of data points using C".
 
"Awaiting for your reply"
 

 

Regards
Gayathri
GeneralNice Article.memberRazanPaul4 Jan '09 - 3:00 
Can you say the best practices to set the value of K?
GeneralRe: Nice Article.memberjconwell5 Jan '09 - 4:38 
It really depends on what kind of data your clustering algorithm is running against, how much data you have, and what you are trying to figure out by running clustering. If you have 200 data points or 2 million data points, you'll have drastically different values for K.
 
You could also do a pre-clustering step to dynamically figure out your value for k based on the data your gona cluster.
 
John Conwell

GeneralRe: Nice Article.memberjconwell5 Jan '09 - 9:07 
Also, you talked about kmeans clustering and performance. One thing to keep in mind is that most clustering algorithms are, what is called, embarrassingly parallelable. On a 8 core machine you could really boost your cluster stability time.
 
John Conwell

GeneralRe: Nice Article.memberaresguya5 Jan '09 - 12:55 
Thank you for your comments. however from my understanding the value of K is not based on the number of data you have, its just based on your own opinion of the data, in my example i picked K = 2, that is because my goal is to cluster students in to 2 categories such as "good" and "bad".
 
i would really like to know your opinions on the correctness of the implementation.

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Rant Rant    Admin Admin   

Permalink | Advertise | Privacy | Mobile
Web03 | 2.6.130516.1 | Last Updated 3 Jan 2009
Article Copyright 2009 by saharkiz
Everything else Copyright © CodeProject, 1999-2013
Terms of Use
Layout: fixed | fluid