Intermediate Data Clustering with k-means

Posted 28 Aug 2017, updated 8 Nov 2017. License: Public Domain
Adds features to k-means for missing data, mixed data, and choosing the number of clusters.

Summary

The simple k-means data clustering algorithm is extended to support missing data, mixed data, and to choose the number of clusters. A modified distance calculation finds solutions with a lower total error. A version of Principal Component Analysis with missing data support is included. Original algorithms based on non-Euclidean geometry are introduced. Full source code with no library dependencies is provided in FORTRAN with a full translation to C. Several example problems are discussed.

Contents

  • Background: Clustering in General
  • Package CLUELA
  • k-means History
  • Using the Code
  • Example Problems
  • Notes

Background

Introduction. K-means is the most commonly used method of data clustering, because it is simple, fast, and usually gives good results. The algorithm alternates between two main steps:

  1. Assign each point to its nearest center
  2. Let each center be the average of its points

One of the attractions of the method is how readily it may be generalized. A simple implementation [*1] defines a "point" as a vector of real numbers, a "center" as the arithmetic mean, and "nearest" by the distance given by the Pythagorean Theorem. More generally, the "points" can be various kinds of data, the "centers" can be various kinds of averages, and "nearest" can be generalized to a dissimilarity function, of which there is a bewildering variety to choose from.
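To make the two steps concrete, here is a minimal sketch in C for real-valued vectors and squared Euclidean distance. It only illustrates the alternation described above; the function name and array layout are invented, and none of the refinements discussed later (initialization, empty clusters, missing data) are included.

#include <float.h>
#include <stdlib.h>
#include <string.h>

/* Minimal two-step k-means sketch.  x[i*p+j] is variable j of point i,
   c[h*p+j] is variable j of center h, z[i] receives the assignment of
   point i.  Centers must be initialized before the call. */
static void kmeans_sketch(const float *x, int n, int p,
                          float *c, int k, int *z, int iters)
{
    int *count = malloc((size_t)k * sizeof *count);
    for (int it = 0; it < iters; ++it) {
        /* Step 1: assign each point to its nearest center. */
        for (int i = 0; i < n; ++i) {
            float best = FLT_MAX;
            for (int h = 0; h < k; ++h) {
                float d = 0.0f;
                for (int j = 0; j < p; ++j) {
                    float t = x[i*p + j] - c[h*p + j];
                    d += t * t;
                }
                if (d < best) { best = d; z[i] = h; }
            }
        }
        /* Step 2: let each center be the average of its points. */
        memset(c, 0, (size_t)k * (size_t)p * sizeof *c);
        memset(count, 0, (size_t)k * sizeof *count);
        for (int i = 0; i < n; ++i) {
            ++count[z[i]];
            for (int j = 0; j < p; ++j) c[z[i]*p + j] += x[i*p + j];
        }
        for (int h = 0; h < k; ++h)
            if (count[h] > 0)             /* an empty cluster keeps a zero center in this sketch */
                for (int j = 0; j < p; ++j) c[h*p + j] /= (float)count[h];
    }
    free(count);
}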

Data clustering is a fundamental operation: it joins together things that are alike, and splits apart things that are different. With the proper modifications, k-means is a powerful technique applicable to a wide variety of problems.

Kinds of data

A much more thorough discussion can be found in the chapter "Types of Data and How to Handle Them" in [Kaufman, 1990].

Vectors of real numbers. Suppose you have a weather monitoring station that records the temperature, barometric pressure, and humidity. Each observation can be represented by a vector with three components [T, p, h]. Each of these measurements is a continuous variable. They do not need to have the same unit of measure.

Categories. Suppose instead that the data has only certain specific values it can take on. Instead of measuring temperature, you might record the weather as either sun, rain, or snow. A heart murmur may be either present or absent. A question may be answered yes or no. An animal may be male or female.

Ordinal. Of runners in a race, who came in first, second, and third might be known, but their finish times might be unknown. In such cases, runner #1 is faster than runner #2, and runner #4 is faster than runner #5, but it is not generally the case that the difference #1-#2 is equal to the difference #4-#5. If the statistical distribution of running speeds is known, then it is possible to estimate the speed of a runner from how he places. Less rigorously, an ordinal value may be replaced by its quantile, that is, the fraction of the data that it is greater than. This program handles ordinal data by replacing it with its quantile.
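For illustration, a brute-force version of this quantile replacement might look like the sketch below; the package's ORDNAL routine may treat ties and scaling differently, and the function name is invented.

/* Replace each ordinal value by its quantile: the fraction of the data
   that it is greater than.  O(n^2) for clarity; a sort-based ranking
   would be faster.  (Illustrative only; not the package's ORDNAL.) */
static void quantile_recode(const float *v, int n, float *q)
{
    for (int i = 0; i < n; ++i) {
        int below = 0;
        for (int j = 0; j < n; ++j)
            if (v[j] < v[i]) ++below;
        q[i] = (float)below / (float)n;
    }
}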

Time series. Suppose your weather monitoring station only measures temperature, but you take a measurement every hour for a year. Then the entire record for the year could be considered an observation, and it could be compared to the yearly records of other weather stations. (This program doesn't do time series.)

Bag-of-words. A document may be represented as a list of words that it contains, along with a count of how many times each word occurs. Because in any particular document most words in the language do not occur, the bag-of-words representation is more concise than a vector representation, and it lends itself to different dissimilarity functions. (This program doesn't do bag-of-words.)

Histograms. Perhaps instead of a single measurement, the object is a whole collection of measurements forming a statistical distribution. Special dissimilarities, such as the Earth Mover's distance, are appropriate to compare histograms. (This program doesn't do histograms.)

Multimedia. Perhaps your data is audio, images, or video. The problem becomes difficult, but measures have been invented to compare their dissimilarity. [Deza, 2012] (This program doesn't do multimedia.)

Kinds of Dissimilarities

The dissimilarity function is the heart of data clustering. It is how difference is quantified. A wide variety of dissimilarity functions have been devised over the years. Some have a firm theoretical foundation, some do not. [Goodall, 1966] [Kaufman, 1990] [Podani, 2000] [Sung‐Hyuk, 2007] [Deza, 2012]

[Figure: an ellipsoidal data cloud]

Foremost is the Euclidean distance, the kind taught in geometry, given by the Pythagorean theorem:

$d(X,Y) = \sqrt{\sum(X_i - Y_i)^2}$

The reason for its importance is that many real-world probability distributions are ellipsoidally contoured. The Euclidean distance is proportional to the negative logarithm of probability of a spherically contoured distribution. Thus, if the input data is suitably scaled so that the variables have similar variance, and if the input variables are mostly independent, then Euclidean distance approximates the unlikelihood that an object belongs to a group.

Manhattan dissimilarity. Some authors [Kaufman, 1990] recommend that instead of the Pythagorean Theorem, distance should be calculated according to the L1 metric (sometimes called Taxicab, City-Block, Manhattan):

$d(X,Y) = \sum\left|X_i - Y_i\right|$

The Manhattan metric is less sensitive to extreme (possibly wrong) values because the differences are not squared. However, it corresponds to a parallelogrammatically contoured distribution, which may not exist in nature. [*2]

[Figure: what a parallelogrammatical data cloud would look like]

Correlation. Suppose that the data are time series of the discharge of some rivers for several years. Computing the standard correlation would show if there is a relationship between the flow in one river and another, independent of the average size of the river.

$ r = \frac{\sum xy}{\sqrt{(\sum x^2)(\sum y^2)}}$ where x has been centered by subtracting the mean value of X: $x_i = X_i - \bar{X}$ and: $y_i = Y_i - \bar{Y}$

The correlation is a similarity measure. It may be converted to a dissimilarity measure by taking the arccosine. [*3]

$d(X,Y) = \arccos(r)$

Cross-entropy. Suppose that the data is some observations of a discrete variable. The usual example is that there are some jars of beans, with different proportions of red, black, and white beans in each jar. Then the cross-entropy between the sample and a particular jar:

$ H(S,M) = -\sum P_s \log(P_m) $

is the negative log-probability that the sample was drawn from that jar. Here, S stands for sample, and M stands for model. Notice that this dissimilarity is not symmetric:

$ H(S,M) \neq H(M,S) $

A Dissimilarity Function for Missing Data. The simplest way to handle missing data is to compensate for the missing values by a simple linear extrapolation from the non-missing data. John Gower proposed a similarity coefficient in 1971 that was modified into the dissimilarity function [Kaufman, 1990]:

$ d(X,Y) = \frac{\sum (\delta(x,y) f(x,y))}{\sum \delta(x,y)} $

where δ(x, y) is an indicator function that has the value 1 when a given variable is present for both objects X and Y, and the value 0 otherwise. The function f(x, y) depends on the kind of data. For continuous data, f(x, y) is the absolute value of the difference x - y. For categorical data, f(x, y) is 0 if the values match and 1 if they do not.
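A sketch of this dissimilarity in C may make the indicator function clearer. The arrays nx and ny play the role of δ (1 when a value is present), iscat flags categorical variables, and all names are illustrative; the case where no variable is shared by both objects is arbitrarily given dissimilarity zero here.

#include <math.h>

/* Gower-style dissimilarity with missing values.  nx[j], ny[j] are 1 when
   variable j is present for the respective object and 0 when missing;
   iscat[j] is nonzero for categorical variables.  Illustrative sketch. */
static float gower_dissim(const float *x, const int *nx,
                          const float *y, const int *ny,
                          const int *iscat, int p)
{
    float num = 0.0f;
    int   avail = 0;
    for (int j = 0; j < p; ++j) {
        if (nx[j] && ny[j]) {                      /* delta = 1: both present */
            float f = iscat[j] ? (x[j] != y[j] ? 1.0f : 0.0f)  /* mismatch */
                               : fabsf(x[j] - y[j]);           /* |x - y|  */
            num += f;
            ++avail;
        }
    }
    return (avail > 0) ? num / (float)avail : 0.0f;  /* no shared variables */
}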

Minimizing Sum-of-Squared Distance. The goal of clustering is to partition the data so that objects that are near each other are in the same group. To assess how well this goal is attained, compute the residual error, defined as the sum of the squared distances from each point to the center to which it is assigned. It is frequently and incorrectly stated that the simple k-means method finds a local minimum of this error. [*4] The simple method does not take into account the effect that changing the assignment of a point has on the position of the center. To find a local minimum of the squared error, the dissimilarity function must be the squared Euclidean distance with a correction factor of $\frac{N}{N-1}$ applied if the point belongs to the center, and a correction factor of $\frac{N}{N+1}$ applied if the point does not belong to the center. [Hartigan, 1979]

Determining the Number of Clusters

A significant weakness of k-means is that it requires the number of clusters, k, to be specified beforehand. The usual approach is to run k-means for several values of k, and then inspect the results. The residual error may be measured as the total distance of each point to its center. As k is increased, this residual decreases. The intuition is that there will be a significant decrease in the residual when the right value of k is reached, and that additional clusters beyond that point will not lead to significant improvements. This general approach is sometimes called the Elbow Method. Many variations of this idea have been proposed over the years. A survey of 30 methods for choosing the number of clusters was done by Milligan & Cooper in 1985. A software package for the GNU "R" statistical language was released [Charrad, 2014] implementing 30 methods, some of them the same. There does not seem to be any consensus at present on the best variant of the elbow method.

Another approach is to guess an initial value for k, then run a variation of k-means that provides for changing the number of clusters. [Ball, 1965] [Pelleg, 2000] [Kulis, 2012] A disadvantage of this approach is that more parameters must be specified to decide when a cluster should be split, and when clusters should be merged.

It has been proposed that k-means may be run several times with different values of k, and a value of k should be chosen that yields the most consistent results. This approach can fail badly, as is shown in [Ben-David, 2005].

Package CLUELA

[Figure: simplified flowchart for the improved k-means algorithm]

CLUELA is CLUstering with ELAborations. It extends the basic k-means algorithm to tackle the problems of missing data, mixed data, and choosing the number of clusters. It brings together original and legacy code in plain, honest FORTRAN, with a complete C translation. It is intended to be efficient for use in applications, and clearly written enough that it can be modified by other programmers.

Missing data support

We need some way to represent whether a datum is present or missing. It is possible to set aside some special value (a "sentinel value") to represent missing data, but this leads inevitably to collisions. To indicate adequately what is present requires another array, similar to the implementation in the C Clustering Library. [de Hoon, 2010] Each element of the integer matrix NX should be set to 1 if the corresponding element of the data matrix X is present, and 0 if it is missing.

The dissimilarity function is that of Gower, as modified by Kaufman & Rousseeuw, and extended to support importance weights for each variable [*5] and generalized to support the Manhattan and Euclidean metrics:

$ d(X,Y) = \frac{\sum(w_j)}{\sum (\delta(x_j,y_j) w_j)} \sum \delta(x_j,y_j) w_j \left|x_j - y_j\right|^L $

where δ(x,y) is an indicator function that has the value 1 if data is present for both X and Y, and is 0 otherwise.
L is 1 or 2, for the Manhattan or Euclidean metric, respectively.

If option ROBUST is set, then the Manhattan metric is used, otherwise the Euclidean metric is used. Option ROBUST also computes centers as medians instead of means, so that CLUELA implements the k-medians algorithm.

Mixed data

The dissimilarity function given above handles only continuous data. To handle categorical data, subroutine RSCODE is included to recode a categorical variable as several continuous variables. The most obvious and most common way to do this is with dummy coding: a categorical variable with c possible values is rewritten as c continuous variables. The continuous variable that corresponds to the category of the object is set to 1, and the rest of the continuous variables are set to 0. Dummy coding contains redundant information, because if c - 1 of the dummy-coded values are known, the last value is fully determined. This is slightly inefficient, and can cause numerical problems if we wish to compute the covariance matrix (for example, to run PCA or LDA). It is preferable therefore to represent a categorical variable that takes on c possible values by c - 1 continuous variables.

A regular simplex is an N-dimensional generalization of a regular polyhedron. A regular simplex of c vertices lies in a space of c - 1 dimensions. For example, the regular simplex with four vertices is a tetrahedron, an ordinary 3-dimensional object. [*6] If option ROBUST is set, then the data is embedded in a Manhattan simplex, such that the vertices are at distance 1 according to the Manhattan metric; otherwise the data is embedded in ordinary Euclidean space. Since the recoding increases the number of variables, the importance weights must be scaled down, and the recoded data must be scaled up accordingly to keep the dissimilarities the same as if simple matching of the categorical variables had been used.
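For contrast with the simplex coding that RSCODE performs, plain dummy coding looks like the sketch below; the redundancy mentioned above shows up in the fact that the c outputs always sum to 1. The function is invented for illustration and is not part of the package.

/* Plain one-of-c dummy coding of a single categorical value cat in 1..c.
   The c outputs always sum to 1, which is the redundancy noted above.
   (Illustration only; RSCODE instead produces c-1 simplex coordinates.) */
static void dummy_code(int cat, int c, float *out)
{
    for (int j = 0; j < c; ++j)
        out[j] = 0.0f;
    out[cat - 1] = 1.0f;
}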

Subroutine RSCODE requires input in the form of REAL approximations of small consecutive integers, that is, 1.0, 2.0, 3.0, ... The raw data to be studied will usually be in a much less convenient format. Two subroutines are provided for this conversion: JSCODE to recode arbitrary integers as consecutive small integers, and IREID to recode CHARACTER data as consecutive small integers. These work by sorting an array and comparing adjacent elements. They use passive sorting, because often we do not wish to change the order of the data, just to know what order it would be in. The C standard library includes qsort, but it does not include a passive sort. Legacy subroutines QSORTI, QSORTR, and HPSORT are included (and translated) to provide passive sorting.

Sometimes we may wish to sort an array, then put it back in its original order. This may be done with two permutation vectors. If JND is the permutation vector which was used to order the array, then to put the array back how it was before, use:

/* restore original order */
	for (i = 1; i <= n; ++i) ind[jnd[i]] = i;
	rorder (&x[1], &ind[1], n);
!               restore original order
       DO I = 1, N
         IND(JND(I)) = I
       END DO
       CALL RORDER (X, IND, N)

Choosing the number of clusters

The user must specify KMIN, the fewest acceptable clusters; KMAX, the most acceptable clusters; and K, the default number of clusters. The subroutine runs k-means for k from KMIN-1 to KMAX and notes the improvement in the residual at each step for k from KMIN to KMAX. A peculiar statistic is computed according to the formula of [Pham, 2005]. The value of k which yields the minimum of this statistic is chosen, unless a critical value is not reached, in which case the default number of clusters is accepted.
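For reference, here is a sketch of the f(k) formula from [Pham, 2005]: s[k] is the clustering residual with k clusters and nd is the number of dimensions, with smaller f suggesting a better k. This is an illustrative rendering of the published formula, not the package's code, which differs in details such as the critical-value test.

/* Sketch of the f(k) statistic of [Pham, 2005].  s[k] is the residual for
   k clusters (k = 1..kmax), nd is the number of dimensions, and f[k] is
   filled in.  Arrays are treated as 1-based here. */
static void pham_f(const double *s, int kmax, int nd, double *f)
{
    double alpha = 0.0;
    f[1] = 1.0;
    for (int k = 2; k <= kmax; ++k) {
        if (k == 2)
            alpha = 1.0 - 3.0 / (4.0 * (double)nd);   /* weight factor for k = 2 */
        else
            alpha = alpha + (1.0 - alpha) / 6.0;      /* recursion for k > 2     */
        f[k] = (s[k-1] > 0.0) ? s[k] / (alpha * s[k-1]) : 1.0;
    }
}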

Other Features

The initial centers are chosen by the usual k-means++ method. For each value of k, k-means is run ten times with different initial centers. The result with the least residual is kept.

Setting option SOSDO enables the correction factor of $\frac{N}{N-1}$ or $\frac{N}{N+1}$ that finds a local minimum of the sum of squared distance objective. This option conflicts with option ROBUST. It also conflicts with building a binary tree (see below), and so is not the default.

External Validation. Sometimes you may want to compare a partition generated with k-means against some other partition. For example, you might wish to test the clustering algorithm on some problem with a known answer. Two subroutines are supplied for this purpose. ARIND computes the Adjusted Rand Index (ARI), which is a measure of accuracy that has a maximum of 1. It is 0 when the agreement between the clustering and the external standard is what would be expected by chance. It can take negative values, if the agreement is even worse than would be expected by chance. VINFO computes the Variation of Information. [Meilă, 2002] It measures distance between the clustering and the external standard, so lower values mean better agreement. Its minimum value is 0; its maximum depends on the size of the problem.
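For reference, the Adjusted Rand Index can be computed from a contingency table of the two partitions. The sketch below follows the standard formula; it assumes 0-based cluster labels and at least two objects, and it is not the package's ARIND routine.

#include <stdlib.h>

static double comb2(double m) { return m * (m - 1.0) / 2.0; }   /* "m choose 2" */

/* Adjusted Rand Index of partitions z1 (labels 0..k1-1) and z2 (labels
   0..k2-1) over n objects.  Illustrative sketch only. */
static double ari_sketch(const int *z1, int k1, const int *z2, int k2, int n)
{
    double *tab = calloc((size_t)k1 * (size_t)k2, sizeof *tab);
    double *a   = calloc((size_t)k1, sizeof *a);
    double *b   = calloc((size_t)k2, sizeof *b);
    double idx = 0.0, suma = 0.0, sumb = 0.0;

    for (int i = 0; i < n; ++i) {        /* contingency table and marginals */
        tab[z1[i] * k2 + z2[i]] += 1.0;
        a[z1[i]] += 1.0;
        b[z2[i]] += 1.0;
    }
    for (int i = 0; i < k1 * k2; ++i) idx  += comb2(tab[i]);
    for (int i = 0; i < k1; ++i)      suma += comb2(a[i]);
    for (int j = 0; j < k2; ++j)      sumb += comb2(b[j]);
    free(tab); free(a); free(b);

    double expected = suma * sumb / comb2((double)n);
    double maximum  = 0.5 * (suma + sumb);
    return (maximum > expected) ? (idx - expected) / (maximum - expected) : 0.0;
}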

Principal Component Analysis: Bosses love graphs

If you have to communicate the results of your analysis, you will need some visual aids so that the audience will have something to look at. To visualize multivariate data, it must first be projected down to two dimensions. That can be done by Multi-Dimensional Scaling (MDS) [Laub, 2004], Singular Value Decomposition, or Principal Component Analysis. Subroutine PCA is included for projection with missing data support.

SVD and PCA are theoretically equivalent, and SVD is usually preferred, as it is said to be more numerically stable. SVD factors the data matrix:

$ X = U S V^T $

But it doesn't provide a way to deal with missing data. For PCA, we must first compute the covariance matrix:

$ \sigma_{jk} = \frac{\sum\left((x_{ji} - \bar{x_j})(x_{ki} - \bar{x_k})\right)}{N - 1} $

If some data is missing, it is omitted and N is adjusted. Find the eigenvalues and eigenvectors by factoring the covariance matrix as $V \Lambda V^T$. The projection is $R = X^T V$, using only the columns of V corresponding to the M largest eigenvalues. Notice that R is transposed relative to X.

For fun, and to allow you to cook the data to suit your tastes, four kinds of PCA are implemented, based on the standard covariance matrix and three alternatives. The user may set the option LEVEL to -1, 0, 1, or 2. The options are numbered in order of increasing robustness.

LEVEL = 0 Computes regular PCA from the covariance matrix.

LEVEL = 2 Computes robust PCA from the comedian matrix. [Falk, 1997] The comedian

$ \mathrm{com}(X,Y) = \mathrm{med}\{(X - \mathrm{med}(X))(Y - \mathrm{med}(Y))\} $

is a generalization of the median absolute deviation. It is a highly robust alternative to the covariance that is easy to compute.
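Computing the comedian from its definition takes three median evaluations. The sketch below uses qsort for the median purely for illustration; it is not the package's comedian code, and the package's MEDIAN routine by Alan Miller works differently.

#include <stdlib.h>
#include <string.h>

static int cmpf(const void *a, const void *b)
{
    float x = *(const float *)a, y = *(const float *)b;
    return (x > y) - (x < y);
}

/* Median of v[0..n-1] via a sorted copy (illustration only). */
static float median_of(const float *v, int n)
{
    float *w = malloc((size_t)n * sizeof *w);
    memcpy(w, v, (size_t)n * sizeof *w);
    qsort(w, (size_t)n, sizeof *w, cmpf);
    float m = (n % 2) ? w[n/2] : 0.5f * (w[n/2 - 1] + w[n/2]);
    free(w);
    return m;
}

/* Comedian com(X,Y) = med{(X - med X)(Y - med Y)}  [Falk, 1997] */
static float comedian(const float *x, const float *y, int n)
{
    float mx = median_of(x, n), my = median_of(y, n);
    float *prod = malloc((size_t)n * sizeof *prod);
    for (int i = 0; i < n; ++i) prod[i] = (x[i] - mx) * (y[i] - my);
    float c = median_of(prod, n);
    free(prod);
    return c;
}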

LEVEL = -1 and LEVEL = 1 require some explanation.

A criterion for the orthogonality of two vectors u and v is that the distance from u to v should equal the distance from u to -v. Applying the Euclidean metric yields:

$ \sum(u - v)^2 = \sum(u - (-v))^2 \\ \sum uv = 0 $

Which is just the formula for the covariance. Applying the L1 metric yields:

$ \sum \left|u - v\right| = \sum\left|u - (-v)\right| \\ \frac{1}{2} \sum(\left|u + v\right| - \left|u - v\right|) = 0 $

This is a new measure of the relation of two variables. Let it be called the codeviation. Applying the L∞ metric yields:

$ \max\left|u - v\right| = \max\left|u - (-v)\right| \\ \frac{1}{2} (\max\left|u + v\right| - \max\left|u - v\right|) = 0 $

The codeviation resembles the construction of alternative measures of dispersion in [Gnanadesikan, 1972], but is distinct. The Gnanadesikan-Kettenring construction is a robust estimate of covariance. The codeviation is an analogue of covariance in a non-Euclidean geometry, which is not necessarily robust. The L∞ codeviation is determined by the extreme values, and could be called a "flimsy statistic." Set LEVEL to 1 or -1 to select the L1 codeviation or L∞ codeviation, respectively. If standard PCA doesn't show what you need it to show, perhaps one of these alternatives will.
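Written as code, the L1 and L∞ codeviations of two already-centered variables u and v follow directly from the formulas above. The names are illustrative; this is a sketch, not the package's implementation.

#include <math.h>

/* L1 codeviation: (1/2) sum(|u+v| - |u-v|), for centered u and v. */
static float codev_l1(const float *u, const float *v, int n)
{
    float s = 0.0f;
    for (int i = 0; i < n; ++i)
        s += fabsf(u[i] + v[i]) - fabsf(u[i] - v[i]);
    return 0.5f * s;
}

/* L-infinity codeviation: (1/2)(max|u+v| - max|u-v|), for centered u and v. */
static float codev_linf(const float *u, const float *v, int n)
{
    float mp = 0.0f, mm = 0.0f;
    for (int i = 0; i < n; ++i) {
        float p = fabsf(u[i] + v[i]), m = fabsf(u[i] - v[i]);
        if (p > mp) mp = p;
        if (m > mm) mm = m;
    }
    return 0.5f * (mp - mm);
}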

Since PCA supports missing data, it can be used to fill in missing values as a preprocessing step for routines that do not have missing data support, such as KMNS and NEUGAS included in this distribution.

The codeviation and other applications of non-Euclidean orthogonality are an original contribution.

Efficiency Improvements

k-means is usually described as fast, or at least faster than some other clustering algorithms. The computational cost of basic k-means is NPKi operations, where N is the number of objects, P is the number of variables, K is the number of clusters, and i is the number of iterations required for convergence. It is possible to modify the algorithm to be faster than this, and these efficiency improvements involve no loss of accuracy.

The bottleneck step of k-means is the computation of the distance from each point to each center. It is possible to find the nearest center to a point without measuring the distance to every center, by putting the centers into a binary tree data structure. [Chávez, 1999] CLUELA uses a generalized hyperplane tree. Any data set which has a dissimilarity defined between each pair of data may be split by choosing two objects, and assigning the remaining objects to a partition according to whether they are nearer the first or the second. If the data and dissimilarity function satisfy the requirements of a metric space, then the triangle inequality may be used to search the tree efficiently for nearest neighbors. (See Appendix.) There are four requirements of a metric space:

  1. Dissimilarities are not negative
  2. Dissimilarity is zero if and only if the objects are identical
  3. Dissimilarity between objects is symmetric
  4. The triangle inequality: The sum of dissimilarity of A to B plus the dissimilarity of B to C must be greater than or equal to the dissimilarity of A to C.

The Manhattan dissimilarity and the Euclidean dissimilarity are metrics. Some other dissimilarity functions of present interest are not metrics. In particular, squared Euclidean distance is not a metric. The Gower dissimilarity is not a metric when there is missing data. The dissimilarity function that minimizes the sum of squared distance is not a metric. For examples, see the Appendix.

The generalized hyperplane tree is most valuable when K is large and P is small. CLUELA builds a generalized hyperplane tree when the requirements of a metric space are satisfied. Otherwise, it exhaustively compares each point to each center.

In a metric space, if the distance to a point from its nearest center is less than or equal to half the distance of that center to any other center, the point can belong in no other cluster. In that case, no distances need to be computed. [Elkan, 2003] (See Appendix)

Hartigan & Wong introduced the concept of "live sets" of points and of centers to eliminate needless distance calculations. [Hartigan, 1979] The logic to my understanding is this: A center is in the live set if any points were added or removed from it on the last iteration. A point is in the live set if the distance to its nearest center has increased since the last iteration. For each point, live centers may have become nearer, and so each point must be compared to all live centers. For a live point, its present center is farther, and so the point must be compared to all centers dead and alive. If neither the point nor the center is in its respective live set, then no change is possible and the computation may be skipped. This reasoning does not depend on the properties of a metric space. CLUELA always makes use of the live sets.

It is possible with some kinds of data and dissimilarity functions for k-means to get stuck in a loop and never converge. The assignment array Z is hashed, and the hash is stored in a ring buffer. On each iteration, the present hash is compared against the previous hash values to detect if the algorithm is stuck in a loop. If it is stuck, it exits without an error. (It may not be possible to get stuck with the scalar data in CLUELA. The loop detection feature has been retained for the sake of programmers who may modify the subroutine to handle less nicely behaved data.)

To compute the median of each cluster, the data must be repacked into a work array. To avoid making K passes over the data matrix to fill the work array, passive sorting is used on the assignment array. Then it is possible to access the points for each cluster in order in a single pass over the data matrix. Since the assignment array is small integers, it can be sorted especially efficiently with the unstable radix sort, at a computational cost of N log(K).
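As an illustration of this passive sort of the assignment array, the sketch below uses a single counting pass, which works because the keys are already small integers; the package uses a radix sort, and all names here are invented.

#include <stdlib.h>

/* Passive grouping of points by cluster: ord[] receives the point indices
   grouped by cluster id, without moving the data matrix.  z[i] is assumed
   to be in 0..k-1 for this sketch. */
static void group_by_cluster(const int *z, int n, int k, int *ord)
{
    int *start = calloc((size_t)k + 1, sizeof *start);
    for (int i = 0; i < n; ++i) ++start[z[i] + 1];        /* counts per cluster      */
    for (int h = 1; h <= k; ++h) start[h] += start[h-1];  /* prefix sums -> offsets  */
    for (int i = 0; i < n; ++i) ord[start[z[i]]++] = i;   /* scatter point indices   */
    free(start);
}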

Compensated addition is used in several places. [Klein, 2005] It is important when adding long arrays of small numbers, for example, if you wanted the partial sum of the first billion terms of the harmonic series. This is not an efficiency improvement, but I had the code already, so what the heck.
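For reference, the classic Kahan compensated sum looks like the sketch below; the [Klein, 2005] variant cited above builds on the same idea. This is an illustration, not the package's ADD subroutine.

/* Kahan compensated summation: a running compensation term recovers the
   low-order bits that would otherwise be lost in each addition. */
static float kahan_sum(const float *x, int n)
{
    float sum = 0.0f, c = 0.0f;
    for (int i = 0; i < n; ++i) {
        float y = x[i] - c;     /* apply the compensation                  */
        float t = sum + y;      /* low-order bits of y are lost here       */
        c = (t - sum) - y;      /* recover them as the new compensation    */
        sum = t;
    }
    return sum;
}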

About the translation

The original code was all in "legacy" FORTRAN -- that is, whatever f2c will accept. That's standard FORTRAN 77 with a few extensions. f2c translated it, and then it was edited to remove dependencies on libf2c and to make the matrices doubly subscripted. Double dereference may or may not run faster, but it's a lot easier to read. There is no random.c or machine.c because machine constants and random numbers are part of standard C99. Likewise, there is no array.f because there is nothing special about declaring a matrix in FORTRAN. The argument lists are often different: some arguments in FORTRAN are needless in C and vice versa.

History

Much of this information comes from the review article by [Bock, 2008].

1950 Dalenius proposes a method to partition a 1-dimensional data set.
1953 Thorndike informally describes a clustering method using average distances. [Thorndike, 1953]
1956 Steinhaus proposes a k-means algorithm for the continuous case.
1957 Lloyd proposes a 1-dimensional k-means algorithm; his paper is not widely published until 1982.
1965 Forgy presents the basic k-means algorithm in a lecture and an abstract.
1965 Ball & Hall publish a more complex algorithm with provisions to merge and split clusters. [Ball, 1965]
1967 MacQueen proposes the formula to immediately update a center as soon as a point is moved.

Moral: Technology progresses in incremental steps with contributions from many people.

Using the Code

The solution to most clustering problems will require a pre-processing phase, the clustering itself, and then some additional post-processing.

Subroutine CLUELA accepts as input only vectors of real numbers. Data in the form of text labels or integers can be converted to small consecutive integers with calls to IREID or JSCODE, respectively. Ordinal data may be rescaled with ORDNAL. If you decide to standardize some or all of the variables, the mean and standard deviation may be computed with AVEDEV, or the median and median absolute deviation may be computed with MEDMAD.

The calling sequence for the main subroutine looks like this:

int cluela (float **x, int **nx, int n, int p, float *w, float **c, int **nc,
      int kmin, int *k, int kmax, int *z, int *pop, float *f, bool robust,
      bool sosdo, int *iwork, float *rwork, int **tree, float *u)
 SUBROUTINE CLUELA (X, NX, PLIM, P, N, W, C, NC, KMIN, K, KMAX,
$  Z, POP, F, ROBUST, SOSDO, IWORK, RWORK, U, IFAULT)

The user must specify N, the number of objects; P, the number of dimensions; X, a P by N single-precision data matrix; NX, a P by N integer matrix each component of which is 1 if the corresponding component of X is available and is 0 if the corresponding component of X is missing; W, a single precision vector of length P giving the importance of each variable; KMIN, the fewest acceptable clusters; K, the default number of clusters; KMAX, the most acceptable clusters; ROBUST, which selects the k-medians algorithm instead of k-means; and SOSDO, which selects minimization of the sum-of-squares criterion. K is both input and output, and so must not be declared a constant. Options ROBUST and SOSDO are mutually exclusive.

The classical k-means implementation KMNS and the alternative Neural Gas clustering algorithm NEUGAS are included in the distribution for the sake of comparison. Subroutine CLUSTE is a variation of CLUELA that skips repeats and uses a quick initialization for faster theoretical efficiency.

In the post-processing phase, PCA may be called to project the data to a lower dimension so that it can be graphed. (PCA may also be used in the preprocessing phase to fill in missing data so that KMNS or NEUGAS may be used, as they lack missing data support.) Write the 2-D results to a spreadsheet with WCSVSP.

Subroutines of interest to the applications programmer are:

SEIGV: compute eigenvalues and eigenvectors of a symmetric matrix, by Martin et al.
QSORTI: sort integers, by Richard Singleton et al.
QSORTR: sort single precision real numbers, by Richard Singleton et al.
HPSORT: sort strings, by Richard Singleton et al.
IORDER: apply a permutation to an array of integers, by Robert Renka
RORDER: apply a permutation to an array of single precision real numbers, by Robert Renka
HPPERM: apply a permutation to an array of strings, by McClain and Rhoads
KMNS: k-means clustering, by Hartigan and Wong
ADD: compensated addition according to Kahan
NEUGAS: the Neural Gas data clustering algorithm
MEDIAN: the median of a vector of single precision real numbers, by Alan Miller
JSCODE: recode an array of arbitrary integers to an array of small consecutive integers such that the smallest value in the input becomes 1, the second smallest becomes 2, and so on.
IREID: recode an array of strings to an array of small consecutive integers, so that the first string to occur in the input becomes 1, the second string to appear in the input becomes 2, and so on.
ORDNAL: recode a REAL array so that each entry is replaced by its quantile score.
RSCODE: recode a matrix containing categorical variables to real coordinates from the regular simplex, so that the dissimilarities are preserved.
AVEDEV: compute the mean and standard deviation of a data set.
MEDMAD: compute the median and median absolute deviation of a data set.
SAFDIV: check whether division will overflow
ALMEQ: check whether two real numbers are almost equal
SHUFFL: make a random permutation of integers 1...N
DINIT: choose initial cluster centers according to the k-means++ method
KCLUST: k-means or k-medians clustering, with missing data support and the possibility to insert or delete clusters
HCLUST: choose initial point assignments by hyperplane partitioning
CLUSTE: a fast interface to KCLUST that calls HCLUST and omits restarts
WCSVSP: write an array of (x,y) coordinates to a .csv spreadsheet file
PCA: Compute the principal components of a data set
DIVEV: Change the results of PCA into the matrix U obtained from SVD
ARIND: Compute the Adjusted Rand Index, measuring the agreement of clustering result with a given partition.
VINFO: Compute variation of Information, a metric of the difference between a clustering result and a given partition.

 

For calling sequences of these subroutines, please refer to the source code. All the important subroutines have prologues that explain how to use them.

 

Example problems

Most of these problems come from the University of California at Irvine Machine Learning Repository. [Lichman, 1987] They were originally posed as classification problems. The known classes can be used as an external validation measure of the clustering result. Training sets and test sets have been combined as soybean.both and horse-colic.both. This is permissible, because clustering does no training. All the prior assumptions are built into the dissimilarity function and the importance weights.

Importance weights for the Wine problem came from optimizing a spectral clustering residual, and did not involve use of the class labels. Weights for the credit approval problem came from standardizing the input, and did not involve use of the class labels. The round problem, the horse-colic problem, and the House votes '84 problem use equal weights. The Maple Leaf Rag data set is mine and I can abuse it any way I like. As for the soybean data, the weights were obtained by some kind of linear transform, and I can't remember whether it involved looking at the class labels, so please don't take them too seriously.

All the data sets are included in the distribution.

The Soybean problem

Some ill plants are suffering from 19 diseases, manifested by 35 symptoms, which are a mixture of continuous, categorical, and ordinal variables. The symptoms and the diseases have been identified by human experts, which should make us suspicious, since this is not a double-blind experiment. It is possible that the agronomist's diagnoses have influenced his description of the symptoms. Let's see what we can make of this mess anyhow.

Combining training and test sets, the number of objects N is 683. The number of classes G is 19. Consider possible numbers of clusters from KMIN=1 to KMAX=50. The number of original variables O is 35. Many of these will have to be recoded. The hard-coded array CAT(O) gives the number of categories for each input variable; for continuous and ordinal variables, write zero. Set CATMAX to the largest value in CAT. Each categorical variable will have to be recoded with CAT(H)-1 continuous variables. The data matrix that is input to the clustering routine is X(P,N). To find P, start with O and add CAT(H)-2 for each categorical variable. The subroutine does not do this automatically; it must be hard-coded in the main program. That is because it is important for the analyst to examine the data and think about the problem. The first three attributes are:

!___Attribute_________Type__________#___Values_______________________
!   date              Continuous        april,may,june,july,august,
!                                         september,october,?
!   plant-stand       Categorical   2   normal,lt-normal,?
!   precip            Ordinal           lt-norm,norm,gt-norm,?

Studying the input, we find that P=54. The attribute values in soybean.both are integers 0, 1, 2... for available data and "?" for missing data. For available data, X is set to 1.0 + the input, because RSCODE requires categories to be numbered starting with 1.0; and NX is set to 1. For missing data, X is set to XBOGUS, and NX is set to 0.

Next, subroutine ORDNAL is called to rescale the ordinal variables. The first O rows of X are filled already. Call RSCODE to recode the categorical variables with vertices of the regular simplex. This takes up more memory. RSCODE starts at the last row and works backwards, moving the data to the end of the array, returning X(P,N) of all continuous data. The class labels were given as text strings; they must be converted to integers 1, 2, 3... with a call to IREID. Set a default K = 20 and we are finally ready to call the main subroutine. On return, the results are compared to the class labels, the data is projected into 2-dimensional array R, and written to soy_clusters.csv and soy_classes.csv. The F statistic of [Pham, 2005] is returned in array F() for each possible number of clusters KMIN...KMAX. Small values of F are supposed to indicate a good number of clusters.

Running the program with option ROBUST = .FALSE. prints to the terminal:

 cluela returns # 0  made  2  clusters
 Returned residual:  15121.53
 Adjusted Rand Index:  0.09886136
 Variation of Information: 3.9897144
STOP program complete

This value of the Adjusted Rand Index (ARI) indicates that the clustering is less than 10% better than guessing. Set ROBUST = .TRUE. and see if it helps any:

 cluela returns # 0  made  2  clusters
 Returned residual:  17791.316
 Adjusted Rand Index:  0.14419138
 Variation of Information: 3.1488326
STOP program complete

Examining the graph of the clustering results, we see that CLUELA has produced a reasonable-looking partition of two clusters.

[Figure: two clusters of soybean diseases]

The classes are distributed in a much more complex way:

A plot of the F values shows a strong tendency for two clusters, and nothing at 19.

This is grounds for us to either be suspicious of our approach to the Soybean data, or of the F statistic. We will see in the next problem that the F statistic is helpful sometimes but not always.

The Wine problem

The Wine data set is all continuous variables. N=178, P=13, K=3. Let us focus on trying to find the correct number of clusters. In [Pham, 2005] it was reported that their proposed method chose the correct number of clusters.

As a first attempt, running CLUELA on the raw data returns pretty lousy accuracy. Reading the journal article a little more carefully, we notice that we need to standardize the input. The plot of the F statistic now shows a minimum at 2, not 3.

Did I make a mistake or did Pham et. al. make a mistake? Reading the journal article considerably more carefully, Pham et. al. were running their own peculiar variant of k-means that we do not care to attempt to replicate. Instead, we will use special (very special) importance weights. These were obtained by optimizing the residual of the Bach-Jordan spectral clustering algorithm [Bach, 2003] with subroutine PRAXIS by Richard Brent. This is not cheating, because it doesn't use the class labels. On the other hand, it's a lot of strain on the CPU. The F values now show a minimum at 3 that is slightly less than the value at 2.

Moral: a good statistician can prove anything. And an evil one can too.

Using the special weights, the classes are now mostly separable:

The Credit approval problem

This is a two-class classification problem, of 15 attributes which have been deliberately obscured. The data is read and recoded much as before. N=690, O=15, G=2, P=38. The weights have been chosen to scale down the large continuous variables so that the attributes will have roughly equal importance. The accuracy of the clustering result in terms of classification is terrible. Clustering is not classification, so it doesn't really matter. Let's instead compare the projections we get with different robustness levels of PCA.

With LEVEL=-1 everything is on a line.

With LEVEL=0 the objects are dispersed more.

With LEVEL=1 the objects are spread out, but there is hardly any separation between the classes.

With LEVEL=2 there is still no class separation.

The clusters found by CLUELA are this:

Moral: Use whatever graph looks the prettiest. None of them convey any useful information. They affect the viewer with the illusion of understanding.

Problem "round" - the unit circle revisited

The t_round program generates 30,000 points uniformly at random on the unit circle. We saw this problem in "Beginning Data Clustering" [*1] and solved it with the simple k-means algorithm presented there. Now that we have CLUELA we can compare those results with the results from using the ROBUST and SOSDO options.

The ROBUST option invokes the Manhattan metric. This is a non-Euclidean geometry, and we need to be prepared to accept that the plot of the results in this approximately Euclidean region of space-time on your computer monitor are going to look kind of weird.

Pham, et al's method of choosing K assumes Euclidean space. [Pham, 2005] It is no surprise that it shows no clustering tendency in Manhattan space.

Setting the SOSDO option instead returns the familiar result.

The F values show a tendency to form three clusters, which replicates the result of Pham, et. al.

The Horse Colic Problem

This data set is somewhat difficult. First there are some possible mistakes. The young/adult field was supposed to be coded {1,2} but it is instead {1,9}. That's easy to fix, but still strange. The field "capillary refill" is supposed to be coded {1,2}, but it contains a value of 3. This is probably a mistake, but RSCODE will fail if we don't set CAT=3 for this variable. Some of the ordinal variables are in the wrong order. For example,

!       extremities temp   Ordinal             1 = Normal
!                                              2 = Warm
!                                              3 = Cool
!                                              4 = Cold

So we need to swap 1 and 2 so that the sequence is Warm, Normal, Cool, Cold.

Three attributes in horse-colic.both are diagnosis codes. These are not categorical variables, because diagnoses are not mutually exclusive. A horse can have more than one illness. To represent this for RSCODE, the diagnoses need to be converted to categorical variables. First, we have to count the distinct diagnoses. There are 62 of them. Each possible diagnosis is treated as a binary variable. So we use the first 24 attributes, convert the next three attributes into 62 variables, and then ignore attribute 28. So N=368, O=24+62=86.

The attribute hospital number takes on 346 different values. This is an extravagant number to have to recode into dummy variables. We can see that recoding is not feasible when the number of categories becomes very large. Some analysts choose to combine categories in these situations. It would be reasonable to omit this variable entirely. It is also possible to write a dissimilarity function that handles categorical data directly, and thus have no need to recode it. We shall suffer the consequences of recoding, and get P=440.

Next, we recode the data and call the clustering subroutine much as before. CLUELA searches for a good number of clusters from KMIN=1 to KMAX=20, so this takes some CPU time. There appears to be a preference for two clusters.

Now let's expose some advanced functionality. We call the internal subroutine DINIT to apply the k-means++ initialization, then call KCLUST directly with KMIN=1, K=3, and KMAX=20, and the value of THRESH found by DINIT. When we use the CLUELA interface, KCLUST is constrained not to split or merge any clusters, and it is repeated ten times for each trial value of k. Calling KCLUST directly allows clusters to be inserted or deleted, and skips the repeats. This gives us an alternate estimate for k. This graph shows results for a run that produced K=5.

Often 3 or 4 is returned. The output is a matter of chance!

Now project the data down to a lower dimension M=28 to lighten the computational burden and fill in missing values, so that we can compare CLUELA to the classical KMNS and to the rival Neural Gas algorithm. It is not necessary to call PCA again for M=2. The first two projected dimensions are the same, regardless of how many more dimensions we request.

We then initialize centers and call KMNS. K is fixed in KMNS, so it is arbitrarily chosen here as K=18. Next call CLUELA with option SOSDO and K fixed at 18, for comparison.

Interface CLUSTE was written for CLUstering with Superior Theoretical Efficiency. By using the binary tree, the computational expense of the assignment step is reduced from NPK to perhaps as low as NP log(K), depending on the effectiveness of the tree. By sorting the data, the cost of computing the centers is NP, the size of the data matrix; or N log(K), the cost of sorting. That leaves the k-means++ initialization as the bottleneck step. An alternative initialization is to use hyperplane partitioning. This is much like putting the data into a Generalized Hyperplane Tree, except forgetting all the details of the tree structure. This should have a cost of NP log(K), and so the whole process is less than O(NPK).

The Neural Gas algorithm [Martinetz, 1991] is equivalent to data clustering. It was developed with a focus on learning a representation of the data space, rather than partitioning a data set. We will accept it on its own terms, and not compute any cluster assignments. The authors used K=200 in their examples. The same is done here.

The output to the terminal is:

Original problem:
cluela returns #           0  in  47.6960030       seconds made    2  clusters with residual    2957.10156
kclust returns #           0  in   0.216000006     seconds made    3  clusters with residual    2658.35571

Projected problem:
kmns returns in    1.99999996E-02 seconds. made           18  clusters with residual    53876.5430
cluela seeking min. sum-of-squares returns # 0 in 0.175999999 seconds. made 18  clusters with residual 53599.6836
cluste returns # 0 in 0.172000006 seconds. made 18  clusters with residual 68906.8984
neugas returns in 1.44400001 seconds

These results show that CLUELA is as accurate as KMNS when using the SOSDO option, but that KMNS is faster by a factor of 9. CLUSTE has none of the performance we had hoped for, and Neural Gas is computationally expensive.

House votes '84

The House Votes data set is a record of the votes cast on some bills by the U.S. House of Representatives during its 1984 session. The goal is to classify each representative as a Democrat or a Republican. N=435, P=16, G=2. The votes are 'y', 'n', and '?'. We first attempt to treat the question marks as missing data. CLUELA returns with IFAULT=10, because there is a null object in the data. Somebody in the Congress didn't cast a vote on any of these bills. The k-means algorithm has no way to deal with null objects. [*7]

In this case there is an easy solution. Instead of treating the votes as categorical data, treat them as continuous data, and let n ⇒ 0.0, ? ⇒ 0.5, and y ⇒ 1.0.
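In code, this recoding is a one-line mapping per vote; the helper below is illustrative and not part of the package.

/* Recode a vote character as described above: 'n' -> 0.0, '?' -> 0.5, 'y' -> 1.0 */
static float vote_to_real(char v)
{
    switch (v) {
    case 'y': return 1.0f;
    case 'n': return 0.0f;
    default:  return 0.5f;   /* '?' treated as halfway between yes and no */
    }
}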

The clustering results split nicely.

And they agree tolerably well with the class labels.

The F values show a strong tendency to form two clusters, as they should.

Maple Leaf Rag

This data set was extracted from MIDI files by the present writer from five performances of the "Maple Leaf Rag" by Scott Joplin, Bill Flynt, Gary Rogers, Alessandro Simonetto, and Mario Verehrer. Each note in the sheet music was assigned a class label. Each performed note was either matched to a note in the score, or else marked as stray.

The attributes are:

!___Attribute________Type_________________#___Values____________________
!   index fraction   Continuous               0...1
!   onset time       Continuous               [seconds]
!   tactus phase     Cyclic                   [beats]
!   channel          Categorical          1   1
!   source file      Contra-categorical   5   1, 2, 3, 4, 5
!   pitch            Continuous               0...127
!   velocity         Continuous               0...127
!   duration         Continuous               [seconds]

Here we encounter the strange data types "Cyclic" and "Contra-Categorical."

Phase is a cyclic variable, because the phase difference from 0.95 beats to 0.05 beats is 0.1, not 0.9. No provision has been made in package CLUELA to deal with cyclic data. [*8] Categorical variable channel does not vary. Categorical variables with only one category are not allowed. It could be coded as continuous, but it makes more sense to omit it. Attribute "source file" is contra-categorical. That is, separate events that come from the same performance should be less likely to correspond to the same ideal note. The effects of including such a variable will be studied. After recoding the data, we have N=4382, P=10, and G=975.

The first run of the program was done without rescaling time. This led to the strange result: It looks as if the notes drift away from each other into their separate source files as time progresses. Let us rescale time so that all performances have a duration of 50 seconds.

The importance weights were set with simulated annealing to reduce the variation of information. This is cheating, of course, but obtaining these weights was my whole purpose in collecting the data. Leaving the other weights as they are, set the weight for attribute "File ID" to 0.

With weight for File ID set to 600., the source files are separating from each other.

With weight for File ID set to 700., the source files are in four (?!) distinct bands.

Running the program with some File ID weights from -5 to 100 shows that accuracy degrades steadily with increasing weight on the contra-categorical variable. There is no prohibition in CLUELA against setting negative weights, but this can cause convergence problems. It also violates the triangle inequality, so that we cannot use the generalized hyperplane tree anymore.

As before with the Horse Colic problem, we call CLUELA, CLUSTE, KMNS, and NEUGAS and compare the results. Let CLUELA search for K over the range KMIN=974 to KMAX=978. The output to the terminal is this:

cluela returns # 0  in 98.2480011 seconds, made 975 clusters
Returned residual: 8458.90039  Adjusted Rand Index: 0.709671259  Variation of Information: 0.883388221
cluste returns # 0 in 0.492000014 seconds, made 952  clusters
Returned residual: 11198.1299  Adjusted Rand Index: 0.760961771  Variation of Information: 0.805582583
kmns returns # 0  in 4.49599981 seconds, made 975 clusters
Returned residual: 8612.56738  Adjusted Rand Index: 0.700010896  Variation of Information: 0.908325672
neugas returns in 1.16799998 seconds.

These results show that CLUELA is as accurate as KMNS, but it is slow. CLUSTE performs much better here, running in its preferred conditions of small P and large K. On this problem, it is faster than KMNS by a factor of 9. Although the residual of the solution found by CLUSTE is worse than that of KMNS, its accuracy is better on both the Adjusted Rand Index and the Variation of Information.

The plot of the F values shows no tendency to form clusters at any of the values tested.

The Neural Gas algorithm does what it is meant to do, spreading its centers across the data cloud:

If you don't have to prove anything to anybody, you can train on the test set. All sets are training sets. All sets are test sets. It all depends on what you have to prove to whom. Also, a lower residual is not always better.

Regrets

The perfect is the enemy of the good, and so I am pushing out this software with a couple of known shortcomings.

The routines in prep.f and prep.c search for duplicates by sorting and comparing adjacent elements. To be really efficient, this should be done with hash tables.

The Generalized Hyperplane Tree uses a formula to decide which branches of the tree do not need to be searched. A related concept, the Bisector Tree, stores in each node the distance to the farthest point it contains, and uses this to decide which branches do not need to be searched. These criteria could both be used to prune the tree.

History

1.2  8 November 2017  Fixed mistake in USORTU that returned an invalid index when N equals 1.

1.1   7 October 2017   Changed loop in KCLUST for deleting clusters to negative step.  This should fix mistakes that could occur when more than one cluster is to be deleted.

1.0   29 August 2017   Initial release.

What's Next

The path straight ahead in data clustering leads to Expectation Maximization. The notion of dissimilarity is given a precise meaning as conditional probability.

Alternatively, you might be more interested in classification, embedding, or robust statistics.

Appendix

[Figure: point Q is within a critical distance of center O]

Half-distance Lemma: [Elkan, 2003] If the distance to a point from its nearest center is less than or equal to half the distance of that center to any other center, the point can belong in no other cluster.

Proof: Let Q be the point. Let O be its current center. Let P be the nearest center to O.

Given: $\begin{aligned} & \left|OQ\right| \leq \frac{1}{2} \left|OP\right| \\ & 2 \left|OQ\right| \leq \left|OP\right| \end{aligned}$ By the triangle inequality, $\begin{aligned} & \left|OP\right| \leq \left|OQ\right| + \left|PQ\right| \\ & 2 \left|OQ\right| \leq \left|OP\right| \leq \left|OQ\right| + \left|PQ\right| \\ & \left|OQ\right| \leq \left|PQ\right| \end{aligned}$

[Figure: a criterion for climbing a GHT]

Generalized Hyperplane Tree formula: [Chávez, 1999] QS ≥ ½ |PQ - OQ|

Proof: The pivots are O and P, and a query point is Q. S is a hypothetical point on the boundary surface. The distance |QS| must be bounded in order to compare it to r, the radius of the ball containing the known neighbors of Q.

Construct triangle PQS, then by the triangle inequality: $ \left|QS\right| \geq \left|PQ\right| - \left|PS\right|$ Construct triangle OQS, then by the triangle inequality: $ \left|QS\right| \geq \left|OS\right| - \left|OQ\right|$ Add $ 2 \left|QS\right| \geq (\left|PQ\right| - \left|OQ\right|) + (\left|OS\right| - \left|PS\right|)$ |OS| = |PS| by definition, so $ \left|QS\right| \geq \frac{1}{2} (\left|PQ\right| - \left|OQ\right|)$

Missing Data is Non-Metric

[Figure: a triangle inequality violation]

Proof by counterexample: Consider the Manhattan dissimilarity function for missing data:

$ d(X,Y) = \frac{\sum W_{total}}{\sum W_{avail}} \sum \left|X_i - Y_i\right| $

Consider the points O = (0,0), P = (3,4), and Q = (1,?), and let the weights be equal. The dissimilarities are:
d(O,P) = (2/2) (|3-0| + |4-0|) = 7
d(O,Q) = (2/1) (|1-0| + ∅) = 2
d(P,Q) = (2/1) (|3-1| + ∅) = 4
2 + 4 < 7 and the triangle inequality is violated.

Decrease and Increase of Sum of Squared Distance: [Hartigan, 1979] Removing a point X from a cluster C of N points reduces the sum of squared distances by

$ d^2(X,C) \frac{N}{N-1} $

Placing a point in a cluster increases the sum of squared distances by

$ d^2(X,C) \frac{N}{N+1} $

Proof for removing a point: Let X' denote the point transferred. Let C' denote the new position of the cluster center. Let Xi, 1 ≤ i ≤ N be the initial points in the cluster. Then the change in squared distance U is:

$ \Delta U_{del} = \sum(X_i-C')^2 - \sum(X_i-C)^2 - (X'-C')^2 $ Expand: $ -2(\sum X_i)(C'-C) + \sum(C'^2-C^2) - (X'-C')^2 $ Doing some algebra shows that: $\begin{aligned} & \sum X_i = NC \\ & C'-C = -\frac{X'-C}{N-1} \\ & C'^2-C^2 = \frac{(X'+C-2NC)(X'-C)}{(N-1)^2} \\ & X'-C' = \frac{N}{N-1}(X'-C) \end{aligned}$ Substitute: $ \Delta U_{del} = -2NC\frac{-(X'-C)}{N-1} + N\frac{(X'+C-2NC)(X'-C)}{(N-1)^2} - (\frac{N}{N-1})^2 (X'-C)^2 $ Simplify to: $ \Delta U_{del} = -(X'-C)^2 \frac{N}{N-1} $

Derivation of the formula for inserting a point is left to the interested reader.

The Sum-Of-Squares Dissimilarity is Non-Metric

Proof by counterexample: Suppose that there is a data point X, and two identical centers C1 and C2
Suppose that X is an element of C1 and not of C2. Then

$d(X,C_1) > d(X,C_2)$ but $d(C_1,C_2) = 0 \\ d(X,C_1) > d(X,C_2) + d(C_1,C_2) $

and the triangle inequality is violated.

The matrices D and Q^T for the Iris data are:

D = 
2081.015539   0.000000  -0.000000   0.000000
  -0.000000 329.986937   0.000000  -0.000000
  -0.000000   0.000000  60.225401  -0.000000
  -0.000000  -0.000000   0.000000  28.561721

Q^T = 
   0.407838   0.199425   0.294615   0.098122
   0.193311   0.293968  -0.339500  -0.173221
  -0.224983   0.508836  -0.072564   0.193616
   0.151865  -0.114316  -0.135671   0.598148

The skeptical reader may solve for P, and verify for himself that these matrices have L1 orthonormal columns.

Notes

  1. For an introduction to k-means, please consult Wikipedia, or the earlier article "Beginning Data Clustering" on Codeproject, which gives a simple implementation.
  2. A univariate Laplace distribution does have the L1 geometry. What is called a multivariate Laplace distribution [Kotz, 2001] is elliptically contoured. In one dimension the Euclidean and Manhattan metrics coincide anyway, so there may not be a good theoretical justification for the Manhattan metric.
  3. There are transformations from similarity to dissimilarity which are quicker to compute, but the arccosine is the arc length between two points on the unit hypersphere. This is a metric dissimilarity, which is important for some applications.
  4. For instance, Wikipedia said as of 26 August 2017, "The standard algorithm aims at minimizing the WCSS objective, and thus assigns by 'least sum of squares', which is exactly equivalent to assigning by the smallest Euclidean distance." This mistake stood for over four years.
  5. If all the data is continuous, then the importance of each variable can be represented by multiplying the data by a scalar. Since this cannot be done with categorical variables, it is necessary to have a separate array to show the importance of each variable.
  6. It has been suggested to achieve dummy coding with c - 1 variables by omitting the last continuous dummy variable. [Bowles, 2015] This introduces a bias, since the last category is less different from the other categories than the other categories are from each other.
  7. Graph clustering algorithms can handle null objects by assigning them an arbitrary dissimilarity.
  8. It is possible to cluster cyclic variables, but it would make the whole package more complicated, so support was deleted.

References

  1. Robert L. Thorndike, "Who Belongs in the Family?" Psychometrika, v.18 n.4, December 1953
  2. Geoffrey H. Ball and David J. Hall, Isodata, a Novel Method of Data Analysis and Pattern Classification Stanford Research Institute Technical Report, Project 5533, April 1965.
  3. David W. Goodall, "A New Similarity Index Based on Probability", Biometrics v.22 n.4 pp.882-907, December 1966.
  4. J. MacQueen, "Some Methods for Classification and Analysis of Multivariate Observations" Berkeley Symposium on Mathematical Statistics and Probability, 1967
  5. R. Gnanadesikan and J. R. Kettenring, "Robust estimates, residuals, and outlier detection with multiresponse data" Biometrics v.28 pp.81-124, 1972.
  6. John A. Hartigan and Manchek A. Wong, "Algorithm AS 136: A K-Means Clustering Algorithm" Applied Statistics v.28 n.1 pp.100-108, 1979.
  7. R.S. Michalski and R.L. Chilausky, "Learning by Being Told and Learning from Examples: An Experimental Comparison of the Two Methods of Knowledge Acquisition in the Context of Developing an Expert System for Soybean Disease Diagnosis", International Journal of Policy Analysis and Information Systems, Vol. 4, No. 2, 1980.
  8. I. Kalantari and G. McDonald, "A data structure and an algorithm for the nearest point problem" IEEE Transactions on Software Engineering, v.9 p.5, 1983
  9. Congressional Quarterly Almanac, 98th Congress, 2nd session 1984, Volume XL: Congressional Quarterly Inc. Washington, D.C., 1985.
  10. Glenn W. Milligan and Martha C. Cooper, "An Examination of Procedures for Determining the Number of Clusters in a Data Set" Psychometrika v.50 n.2 pp.159-179, June 1985.
  11. M. Lichman (Admin.), UCI Machine Learning Repository http://archive.ics.uci.edu/ml, Irvine, CA: University of California, School of Information and Computer Science, 1987-present
  12. Ross Quinlan, "Simplifying decision trees" Int J Man-Machine Studies v.27, pp. 221-234, December 1987
  13. Mary McLeish & Matt Cecile, "Horse Colic Data Set", Department of Computer Science, University of Guelph, Guelph, Ontario, Canada N1G 2W1, [c.1989]
  14. Leonard Kaufman and Peter J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, Wiley, 1990, reissued 2005.
  15. M. Forina, et al, PARVUS - An Extendable Package for Data Exploration, Classification and Correlation. Institute of Pharmaceutical and Food Analysis and Technologies, Via Brigata Salerno, 16147 Genoa, Italy.[c.1991]
  16. Jeffrey K. Uhlmann, "Satisfying General Proximity/Similarity Queries with Metric Trees" Information Processing Letters, v.40 pp.175-179, November 1991
  17. Thomas Martinetz and Klaus Schulten, 'A "Neural-Gas" Network Learns Topologies' in Artificial Neural Networks, Ed. T. Kohonen, K. Makisara, O. Simula, and J. Kangas. Holland: Elsevier, 1991.
  18. Michael Falk, "On MAD and Comedians" Ann. Inst. Statist. Math. v.49 n.4 pp.615-644, 1997.
  19. Edgar Chávez, Gonzalo Navarro, and José Luis Marroquín, "Searching in Metric Spaces" Association for Computing Machinery, 1999
  20. Dan Pelleg and Andrew Moore, "X-means: Extending K-means with Efficient Estimation of the Number of Clusters." Draft report, 2000.
  21. János Podani, Introduction to the Exploration of Multivariate Biological Data Leiden: Backhuys Publishers, 2000. online:
  22. Samuel Kotz, Tomasz J. Kozubowski, and Krzysztof Podgórski, The Laplace Distribution and Generalizations: A Revisit with New Applications Internet draft copy, 24 January 2001
  23. Marina Meilă, "Comparing Clusterings" University of Washington, technical report, 2002.
  24. Charles Elkan, "Using the Triangle Inequality to Accelerate k-Means" Proceedings of the Twentieth International Conference on Machine Learning, 2003.
  25. Francis R. Bach and Michael I. Jordan, "Learning Spectral Clustering" University of California, draft report, 2003.
  26. Julian Laub, Non-Metric Pairwise Proximity Data Dipl. Ing. Phys. Thesis, Technical University of Berlin, 10 December 2004.
  27. Andreas Klein, "A Generalized Kahan-Babuška-Summation-Algorithm", 21 April 2005
  28. D T Pham, S S Dimov, and C D Nguyen, "Selection of K in K-means Clustering" J. of Mechanical Engineering Science, 2005.
  29. Shai Ben-David, Ulrike von Luxburg, and Dávid Pál, "A Sober Look at Clustering Stability", [c.2005]
  30. Rafail Ostrovsky, Yuval Rabani, Leonard J. Schulman, and Chaitanya Swamy, "The Effectiveness of Lloyd-Type Methods for the k-means Problem", Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science, 2006.
  31. David Arthur and Sergei Vassilvitskii, "k-means++: the advantages of careful seeding", Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, 2007.
  32. Sung-Hyuk Cha, "Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions" International Journal of Mathematical Models and Methods in Applied Sciences v.1 i.4, 2007.
  33. Hans-Hermann Bock, "Origins and extensions of the k-means algorithm in cluster analysis" Electronic Journ@l for History of Probability and Statistics v.4 n.2 December 2008 www.jehps.net
  34. Michiel de Hoon, Seiya Imoto, and Satoru Miyano. Users' manual for: The C Clustering Library for cDNA Microarray Data, University of Tokyo, Institute of Medical Science, Human Genome Center, 27 December 2010.
  35. Michel Deza and Elena Deza, Distances in PR, Audio and Image Séminaire Brillouin, IRCAM, 31 May 2012.
  36. Brian Kulis and Michael I. Jordan, "Revisiting K-means: New Algorithms via Bayesian Nonparametrics" Proceedings of the 29th International Conference on Machine Learning, 2012.
  37. Malika Charrad, Nadia Ghazzali, Veronique Boiteau, and Adam Niknafs, "NbClust: An R Package for Determining the Relevant Number of Clusters in a Data Set" available from CRAN Journal of Statistical Software v.61 i.6 pp.1-36, 2014.
  38. Michael Bowles, Machine Learning in Python: Essential Techniques for Predictive Analysis Wiley, 2015.

License

This article, along with any associated source code and files, is licensed under A Public Domain dedication
