### Summary

The simple k-means data clustering algorithm is extended to support missing data and mixed data, and to choose the number of clusters. A modified distance calculation finds solutions with a lower total error. A version of Principal Component Analysis with missing data support is included. Original algorithms based on non-Euclidean geometry are introduced. Full source code with no library dependencies is provided in `FORTRAN`, with a full translation to `C`. Several example problems are discussed.

### Contents

- Background: Clustering in General
- Package CLUELA
- k-means History
- Using the Code
- Example Problems
- Notes

## Background

**Introduction.** K-means is the most commonly used method of data clustering, because it is simple, fast, and usually gives good results. The algorithm alternates between two main steps:

- Assign each point to its nearest center
- Let each center be the average of its points

One of the attractions of the method is how readily it may be generalized. A simple implementation [*1] defines a "point" as a vector of real numbers, a "center" as the arithmetic mean, and "nearest" by distance according to the Pythagorean Theorem. More generally, the "points" can be various kinds of data, the "centers" can be various kinds of averages, and "nearest" can be generalized to a dissimilarity function, of which there are a bewildering variety to choose from.

Data clustering is a fundamental operation: it joins together things that are alike, and splits apart things that are different. With the proper modifications, k-means is a powerful technique applicable to a wide variety of problems.

### Kinds of data

*A much more thorough discussion can be found in the chapter "Types of Data and How to Handle Them" in [Kaufman, 1990].*

**Vectors of real numbers.** Suppose you have a weather monitoring station that records the temperature, barometric pressure, and humidity. Each observation can be represented by a vector with three components [T, p, h]. Each of these measurements is a continuous variable. They do not need to have the same unit of measure.

**Categories.** Suppose instead that the data has only certain specific values it can take on. Instead of measuring temperature, you might record the weather as either sun, rain, or snow. A heart murmur may be either present or absent. A question may be answered yes or no. An animal may be male or female.

**Ordinal.** Of runners in a race, who came in first, second, and third might be known, but their finish times might be unknown. In such cases, runner #1 is faster than runner #2, and runner #4 is faster than runner #5, but it is not generally the case that the difference #1-#2 is equal to the difference #4-#5. If the statistical distribution of running speeds is known, then it is possible to estimate the speed of a runner from how he places. Less rigorously, an ordinal value may be replaced by its quantile, that is, the fraction of the data that it is greater than. This program handles ordinal data by replacing it with its quantile.

**Time series.** Suppose your weather monitoring station only measures temperature, but you take a measurement every hour for a year. Then the entire record for the year could be considered an observation, and it could be compared to the yearly records of other weather stations. (This program doesn't do time series.)

**Bag-of-words.** A document may be represented as a list of words that it contains, along with a count of how many times each word occurs. Because in any particular document most words in the language do not occur, the bag-of-words representation is more concise than a vector representation, and it lends itself to different dissimilarity functions. (This program doesn't do bag-of-words.)

**Histograms.** Perhaps instead of a single measurement, the object is a whole collection of measurements forming a statistical distribution. Special dissimilarities, such as the Earth Mover's distance, are appropriate for comparing histograms. (This program doesn't do histograms.)

**Multimedia.** Perhaps your data is audio, images, or video. The problem becomes difficult, but measures have been invented to compare their dissimilarity. [Deza, 2012] (This program doesn't do multimedia.)

### Kinds of Dissimilarities

The dissimilarity function is the heart of data clustering. It is how difference is quantified. A wide variety of dissimilarity functions have been devised over the years. Some have a firm theoretical foundation, some do not. [Goodall, 1966] [Kaufman, 1990] [Podani, 2000] [Sung‐Hyuk, 2007] [Deza, 2012]

Foremost is the **Euclidean distance**, the kind taught in geometry, given by the Pythagorean theorem:

$d(X,Y) = \sqrt{\sum(X_i - Y_i)^2}$

The reason for its importance is that many real-world probability distributions are ellipsoidally contoured. The Euclidean distance is proportional to the negative logarithm of probability of a *spherically* contoured distribution. Thus, if the input data is suitably scaled so that the variables have similar variance, and if the input variables are mostly independent, then Euclidean distance approximates the unlikelihood that an object belongs to a group.

**Manhattan dissimilarity.** Some authors [Kaufman, 1990] recommend that instead of the Pythagorean Theorem, distance should be calculated according to the L1 metric (sometimes called Taxicab, City-Block, Manhattan):

$d(X,Y) = \sum\left|X_i - Y_i\right|$

The Manhattan metric is less sensitive to extreme (possibly wrong) values because the differences are not squared. However, it corresponds to a parallelogrammatically contoured distribution, which may not exist in nature. [*2]

**Correlation.** Suppose that the data are time series of the discharge of some rivers for several years. Computing the standard correlation would show if there is a relationship between the flow in one river and another, independent of the average size of the river.

$ r = \frac{\sum xy}{\sqrt{(\sum x^2)(\sum y^2)}}$ where x has been centered by subtracting the mean value of X: $x_i = X_i - \bar{X}$ and: $y_i = Y_i - \bar{Y}$

The correlation is a similarity measure. It may be converted to a dissimilarity measure by taking the arccosine. [*3]

$d(X,Y) = \arccos(r)$

**Cross-entropy** Suppose that the data is some observations of a discrete variable. The usual example is that there are some jars of beans, with different proportions of red, black, and white beans in each jar. Then the cross-entropy between the sample and a particular jar:

$ H(S,M) = -\sum P_s \log(P_m) $

is the negative log-probability that the sample was drawn from that jar. Here, S stands for sample, and M stands for model. Notice that this dissimilarity is not symmetric:

$ H(S,M) \neq H(M,S) $

**A Dissimilarity Function for Missing Data** The simplest way to handle missing data is to compensate for missing values by a simple linear extrapolation from the non-missing data. John Gower proposed a similarity coefficient in 1971 that was modified into the dissimilarity function [Kaufman, 1990]:

$ d(X,Y) = \frac{\sum (\delta(x,y) f(x,y))}{\sum \delta(x,y)} $

where δ(x, y) is an indicator function that has the value 1 if data is present for some variable for both objects X and Y, and has the value of 0 otherwise. The function f(x, y) depends on the kind of data. For continuous data, f(x, y), is the absolute value of the difference x-y. For categorical data, f(x, y) is 0 if the values match and 1 if they do not.

**Minimizing Sum-of-Squared Distance.** The goal of clustering is to partition the data so that objects that are near each other are in the same group. To assess how well this goal is attained, compute the residual error, defined as the sum of the squared distances from each point to the center to which it is assigned. It is frequently, and incorrectly, stated that the simple k-means method finds a local minimum of this error. [*4] The simple method does not take account of the effect that changing the assignment of a point will have on the center. To find a local minimum of squared error, the dissimilarity function must be the squared Euclidean distance with a correction factor of \(\frac{N}{N-1}\) applied if the point belongs to the center, and a correction factor of \(\frac{N}{N+1}\) applied if it does not. [Hartigan, 1979]

### Determining the Number of Clusters

A significant weakness of k-means is that it requires the number of clusters, *k*, to be specified beforehand. The usual approach is to run k-means for several values of *k*, and then inspect the results. The residual error may be measured as the total distance of each point to its center. As *k* is increased, this residual decreases. The intuition is that there will be a significant decrease in the residual when the right value of *k* is reached, and that additional clusters beyond that will not lead to significant improvements. This general approach is sometimes called the Elbow Method. Many variations of this idea have been proposed over the years. A survey of 30 methods for choosing the number of clusters was done by Milligan & Cooper in 1985. A software package for the GNU "R" statistical language was released [Charrad, 2014] implementing 30 methods, some of them the same. There does not seem to be any consensus at present on the best variant of the elbow method.

Another approach is to guess an initial value for *k*, then run a variation of k-means that provides for changing the number of clusters. [Ball, 1965] [Pelleg, 2000] [Kulis, 2012] A disadvantage of this approach is that more parameters must be specified to decide when a cluster should be split, and when clusters should be merged.

It has been proposed that k-means may be run several times with different values of *k*, and a value of *k* should be chosen that yields the most consistent results. This approach can fail badly, as is shown in [Ben-David, 2005].

## Package CLUELA

`CLUELA` is CLUstering with ELAborations. It extends the basic k-means algorithm to tackle the problems of missing data, mixed data, and choosing the number of clusters. It brings together original and legacy code in plain, honest `FORTRAN`, with a complete `C` translation. It is intended to be efficient for use in applications, and clearly written enough that it can be modified by other programmers.

### Missing data support

We need some way to represent whether a datum is present or missing. It is possible to set aside some special value ("sentinel value") to represent missing data, but this leads inevitably to collisions. To adequately indicate what is present requires another array, similar to the implementation in the C Clustering Library. [de Hoon, 2010] Each element of the integer matrix `NX` should be set to 1 if the corresponding element in data matrix `X` is present, and 0 if it is missing.

The dissimilarity function is that of Gower, as modified by Kaufman & Rousseeuw, and extended to support importance weights for each variable [*5] and generalized to support the Manhattan and Euclidean metrics:

$ d(X,Y) = \frac{\sum(w_j)}{\sum (\delta(x_j,y_j) w_j)} \sum \delta(x_j,y_j) w_j \left|x_j - y_j\right|^L $

where δ(x,y) is an indicator function that has the value 1 if data is present for both X and Y, and is 0 otherwise.

L is 1 or 2, for the Manhattan or Euclidean metric, respectively.

If option `ROBUST` is set, then the Manhattan metric is used; otherwise the Euclidean metric is used. Option `ROBUST` also computes centers as medians instead of means, so that `CLUELA` implements the **k-medians** algorithm.

### Mixed data

The dissimilarity function given above handles only continuous data. To handle categorical data, subroutine `RSCODE` is included to recode a categorical variable as several continuous variables. The most obvious and most common way to do this is with **dummy coding**. A categorical variable with *c* possible values is rewritten as *c* continuous variables. The continuous variable that corresponds to the category of the object is set to 1, and the rest are set to 0. Dummy coding contains redundant information, because if *c* - 1 of the dummy-coded values are known, the last value is fully determined. This is slightly inefficient, and can cause numerical problems if we wish to compute the covariance matrix (for example, to run PCA or LDA). It is preferable therefore to represent a categorical variable that takes on *c* possible values by *c* - 1 continuous variables. A **regular simplex** is an N-dimensional generalization of a regular polyhedron. A regular simplex of *c* vertices lies in a space of *c* - 1 dimensions. For example, the regular simplex with four vertices is a tetrahedron, an ordinary 3-dimensional object. [*6] If option `ROBUST` is set, then the data is embedded in a Manhattan simplex, such that the vertices are at distance 1 according to the Manhattan metric; otherwise the data is embedded in ordinary Euclidean space. Since the recoding increases the number of variables, the importance weights must be scaled down. The recoded data then must be accordingly scaled up to keep the dissimilarities the same as if simple matching of the categorical variables had been used.

Subroutine `RSCODE` requires input in the form of `REAL` approximations of small consecutive integers, that is, 1.0, 2.0, 3.0, ... The raw data to be studied will be in a much less convenient format. Two subroutines are provided for this conversion: `JSCODE` to recode arbitrary integers to consecutive small integers, and `IREID` to recode `CHARACTER` data as consecutive small integers. These work by sorting an array and comparing adjacent elements. They use **passive sorting**, because often we do not wish to change the order of the data, just to know what order it would be in. The `C` standard library includes `qsort`, but it does not include a passive sort. Legacy subroutines `QSORTI`, `QSORTR`, and `HPSORT` are included (and translated) to provide passive sorting.

Sometimes we may wish to sort an array, then put it back in its original order. This may be done with two permutation vectors. If `JND` is the permutation vector which was used to order the array, then to put the array back how it was before, use:

```c
for (i = 1; i <= n; ++i) ind[jnd[i]] = i;
rorder (&x[1], &ind[1], n);
```

```fortran
      DO I = 1, N
         IND(JND(I)) = I
      END DO
      CALL RORDER (X, IND, N)
```

### Choosing the number of clusters

The user must specify `KMIN`, the fewest acceptable clusters; `KMAX`, the most acceptable clusters; and `K`, the default number of clusters. The subroutine runs k-means for *k* from `KMIN-1` to `KMAX` and notes the improvement in the residual at each step for *k* from `KMIN` to `KMAX`. A peculiar statistic is computed according to the formula of [Pham, 2005]. The value of *k* which yields the minimum of this statistic is chosen, unless a critical value is not reached, in which case the default number of clusters is accepted.

### Other Features

The initial centers are chosen by the usual k-means++ method. For each value of *k*, k-means is run ten times with different initial centers. The result with the least residual is kept.

Setting option `SOSDO` enables the correction factor of \(\frac{N}{N-1}\) or \(\frac{N}{N+1}\) that finds a local minimum of the sum of squared distance objective. This option conflicts with option `ROBUST`. It also conflicts with building a binary tree (see below), and so is not the default.

**External Validation**. Sometimes you may want to compare a partition generated with k-means against some other partition. For example, you might wish to test the clustering algorithm on some problem with a known answer. Two subroutines are supplied for this purpose. `ARIND` computes the Adjusted Rand Index (ARI), a measure of accuracy with a maximum of 1. It is 0 when the agreement between the clustering and the external standard is what would be expected by chance, and it can take negative values if the agreement is even worse than chance. `VINFO` computes the Variation of Information. [Meilă, 2002] It measures the distance between the clustering and the external standard, so lower values mean better agreement. Its minimum value is 0; its maximum depends on the size of the problem.

### Principal Component Analysis: Bosses love graphs

If you have to communicate the results of your analysis, you will need some visual aids so that the audience will have something to look at. To visualize multivariate data, it must first be projected down to two dimensions. That can be done by Multi-Dimensional Scaling (MDS) [Laub, 2004], Singular Value Decomposition, or Principal Component Analysis. Subroutine `PCA` is included for projection with missing data support.

SVD and PCA are theoretically equivalent, and SVD is usually preferred, as it is said to be more numerically stable. SVD factors the data matrix:

$ X = U S V^T $

But it doesn't provide a way to deal with missing data. For PCA, we must first compute the covariance matrix:

$ \sigma_{jk} = \frac{\Sigma((x_{ji} - \bar{x_j})(x_{ki} - \bar{x_k}))}{N - 1} $

If some data is missing, it is omitted and N is adjusted. Find the eigenvalues and eigenvectors, factoring the covariance matrix \(V \Lambda V^T\). The projection is \(R = X^T V\), using only the columns of V corresponding to the M largest eigenvalues. Notice that R is transposed relative to X.

For fun, and to allow you to cook the data to suit your tastes, four kinds of PCA are implemented, based on the standard covariance matrix and three alternatives. The user may set the option `LEVEL` to -1, 0, 1, or 2. The options are numbered in order of increasing robustness.

`LEVEL = 0` computes regular PCA from the covariance matrix.

`LEVEL = 2` computes robust PCA from the comedian matrix. [Falk, 1997] The comedian

$ \mathrm{com}(X,Y) = \mathrm{med}\{(X - \mathrm{med}(X))(Y - \mathrm{med}(Y))\} $

is a generalization of the median absolute deviation. It is a highly robust alternative to the covariance that is easy to compute.

`LEVEL = -1` and `LEVEL = 1` require some explanation.

A criterion for the orthogonality of two vectors u and v is that the distance from u to v should equal the distance from u to -v. Applying the Euclidean metric yields:

$ \sum(u - v)^2 = \sum(u - (-v))^2 \\ \sum uv = 0 $

Which is just the formula for the covariance. Applying the L1 metric yields:

$ \sum \left|u - v\right| = \sum\left|u - (-v)\right| \\ \frac{1}{2} \sum(\left|u + v\right| - \left|u - v\right|) = 0 $

This is a new measure of the relation of two variables. Let it be called the **codeviation**. Applying the L∞ metric yields:

$ \max\left|u - v\right| = \max\left|u - (-v)\right| \\ \frac{1}{2} (\max\left|u + v\right| - \max\left|u - v\right|) = 0 $

The codeviation resembles the construction of alternative measures of dispersion in [Gnanadesikan, 1972], but is distinct. The Gnanadesikan-Kettenring construction is a robust estimate of covariance. The codeviation is an analogue of covariance in a non-Euclidean geometry, which is not necessarily robust. The L∞ codeviation is determined by the extreme values, and could be called a "flimsy statistic." Set `LEVEL` to 1 or -1 to select the L1 codeviation or L∞ codeviation, respectively. If standard PCA doesn't show what you need it to show, perhaps one of these alternatives will.

Since `PCA` supports missing data, it can be used to fill in missing values as a preprocessing step for routines that do not have missing data support, such as `KMNS` and `NEUGAS`, included in this distribution.

The codeviation and other applications of non-Euclidean orthogonality are an original contribution.

### Efficiency Improvements

k-means is usually described as fast, or at least faster than some other clustering algorithms. The computational cost of basic k-means is NPKi operations, where N is the number of objects, P is the number of variables, K is the number of clusters, and i is the number of iterations required for convergence. It is possible to modify the algorithm to be faster than this, and these efficiency improvements involve no loss of accuracy.

The bottleneck step of k-means is the computation of the distance from each point to each center. It is possible to find the nearest center to a point without measuring the distance to every center, by putting the centers into a binary tree data structure. [Chávez, 1999] `CLUELA` uses a generalized hyperplane tree. Any data set with a dissimilarity defined between each pair of objects may be split by choosing two objects and assigning each remaining object to a partition according to whether it is nearer the first or the second. If the data and dissimilarity function satisfy the requirements of a metric space, then the triangle inequality may be used to efficiently search the tree for nearest neighbors. (See Appendix.) There are four requirements of a metric space:

- Dissimilarities are not negative
- Dissimilarity is zero if and only if the objects are identical
- Dissimilarity between objects is symmetric
- The triangle inequality: The sum of dissimilarity of A to B plus the dissimilarity of B to C must be greater than or equal to the dissimilarity of A to C.

The Manhattan dissimilarity and the Euclidean dissimilarity are metrics. Some other dissimilarity functions of present interest are not metrics. In particular, *squared* Euclidean distance is *not* a metric. The Gower dissimilarity is not a metric when there is missing data. The dissimilarity function that minimizes the sum of squared distance is not a metric. For examples, see the Appendix.

The generalized hyperplane tree is most valuable when K is large and P is small. `CLUELA` builds a generalized hyperplane tree when the requirements of a metric space are satisfied. Otherwise, it exhaustively compares each point to each center.

In a metric space, if the distance to a point from its nearest center is less than or equal to half the distance of that center to any other center, the point can belong in no other cluster. In that case, no distances need to be computed. [Elkan, 2003] (See Appendix)

Hartigan & Wong introduced the concept of "live sets" of points and of centers to eliminate needless distance calculations. [Hartigan, 1979] The logic to my understanding is this: A center is in the live set if any points were added to or removed from it on the last iteration. A point is in the live set if the distance to its nearest center has increased since the last iteration. For each point, live centers may have become nearer, and so each point must be compared to all live centers. For a live point, its present center is farther than it was, and so the point must be compared to all centers, dead and alive. If neither the point nor the center is in its respective live set, then no change is possible and the computation may be skipped. This reasoning does not depend on the properties of a metric space. `CLUELA` always makes use of the live sets.

It is possible with some kinds of data and dissimilarity functions for k-means to get stuck in a loop and never converge. The assignment array `Z` is hashed, and the hash is stored in a ring buffer. On each iteration, the present hash is compared against the previous hash values to detect whether the algorithm is stuck in a loop. If it is stuck, it exits without an error. (It may not be possible to get stuck with the scalar data in `CLUELA`. The loop detection feature has been retained for the sake of programmers who may modify the subroutine to handle less nicely behaved data.)

To compute the median of each cluster, the data must be repacked into a work array. To avoid making K passes over the data matrix to fill the work array, passive sorting is used on the assignment array. Then it is possible to access the points of each cluster in order in a single pass over the data matrix. Since the assignment array contains small integers, it can be sorted especially efficiently with the **unstable radix sort**, at a computational cost of N log(K).

**Compensated addition** is used in several places. [Klein, 2005] It is important when adding long arrays of small numbers, for example, if you wanted the partial sum of the first billion terms of the harmonic series. This is not an efficiency improvement, but I had the code already, so what the heck.

### About the translation

The original code was all in "Legacy" `FORTRAN` -- that is, whatever f2c will accept. That's standard `FORTRAN 77` with a few extensions. f2c translated it, and then it was edited to remove dependencies on `libf2c` and make the matrices doubly subscripted. Double dereference may or may not run faster, but it's a lot easier to read. There is no random.c or machine.c because machine constants and random numbers are part of standard `C99`. Likewise, there is no array.f because there is nothing special about declaring a matrix in `FORTRAN`. The argument lists are often different. Some arguments in `FORTRAN` are needless in `C`, and vice versa.

## History

*Much of this information comes from the review article by [Bock, 2008]*.

1950 Dalenius proposes a method to partition a 1-dimensional data set.

1953 Thorndike informally describes a clustering method using average distances. [Thorndike, 1953]

1956 Steinhaus proposes a k-means algorithm for the continuous case.

1957 Lloyd proposes a 1-dimensional k-means algorithm; his paper is not widely published until 1982.

1965 Forgy presents the basic k-means algorithm in a lecture and an abstract.

1965 Ball & Hall publish a more complex algorithm with provisions to merge and split clusters. [Ball, 1965]

1967 MacQueen proposes the formula to immediately update a center as soon as a point is moved.

Moral: Technology progresses in incremental steps with contributions from many people.

## Using the Code

The solution to most clustering problems will require a pre-processing phase, the clustering itself, and then some additional post-processing.

Subroutine `CLUELA` accepts as input only vectors of real numbers. Data in the form of text labels or integers can be converted to small consecutive integers with calls to `IREID` or `JSCODE`, respectively. Ordinal data may be rescaled with `ORDNAL`. If you decide to standardize some or all of the variables, the mean and standard deviation may be computed with `AVEDEV`, or the median and median absolute deviation may be computed with `MEDMAD`.

The calling sequence for the main subroutine looks like this:

```c
int cluela (float **x, int **nx, int n, int p, float *w, float **c, int **nc,
            int kmin, int *k, int kmax, int *z, int *pop, float *f, bool robust,
            bool sosdo, int *iwork, float *rwork, int **tree, float *u)
```

```fortran
      SUBROUTINE CLUELA (X, NX, PLIM, P, N, W, C, NC, KMIN, K, KMAX,
     $   Z, POP, F, ROBUST, SOSDO, IWORK, RWORK, U, IFAULT)
```

The user must specify `N`, the number of objects; `P`, the number of dimensions; `X`, a `P` by `N` single-precision data matrix; `NX`, a `P` by `N` integer matrix, each component of which is 1 if the corresponding component of `X` is available and 0 if it is missing; `W`, a single precision vector of length `P` giving the importance of each variable; `KMIN`, the fewest acceptable clusters; `K`, the default number of clusters; `KMAX`, the most acceptable clusters; `ROBUST`, which selects the k-medians algorithm instead of k-means; and `SOSDO`, which selects minimization of the sum-of-squares criterion. `K` is both input and output, and so must not be declared a constant. Options `ROBUST` and `SOSDO` are mutually exclusive.

The classical k-means implementation `KMNS` and the alternative Neural Gas clustering algorithm `NEUGAS` are included in the distribution for the sake of comparison. Subroutine `CLUSTE` is a variation of `CLUELA` that skips repeats and uses a quick initialization for faster theoretical efficiency.

In the post-processing phase, `PCA` may be called to project the data to a lower dimension so that it can be graphed. (`PCA` may also be used in the preprocessing phase to fill in missing data so that `KMNS` or `NEUGAS` may be used, as they lack missing data support.) Write the 2-D results to a spreadsheet with `WCSVSP`.

Subroutines of interest to the applications programmer are:

- `SEIGV`: compute eigenvalues and eigenvectors of a symmetric matrix, by Martin et al.
- `QSORTI`: sort integers, by Richard Singleton et al.
- `QSORTR`: sort single precision real numbers, by Richard Singleton et al.
- `HPSORT`: sort strings, by Richard Singleton et al.
- `IORDER`: apply a permutation to an array of integers, by Robert Renka
- `RORDER`: apply a permutation to an array of single precision real numbers, by Robert Renka
- `HPPERM`: apply a permutation to an array of strings, by McClain and Rhoads
- `KMNS`: k-means clustering, by Hartigan and Wong
- `ADD`: compensated addition according to Kahan
- `NEUGAS`: the Neural Gas data clustering algorithm
- `MEDIAN`: the median of a vector of single precision real numbers, by Alan Miller
- `JSCODE`: recode an array of arbitrary integers to an array of small consecutive integers, such that the smallest value in the input becomes 1, the second smallest becomes 2, and so on
- `IREID`: recode an array of strings to an array of small consecutive integers, so that the first string to occur in the input becomes 1, the second string to appear becomes 2, and so on
- `ORDNAL`: recode a `REAL` array so that each entry is replaced by its quantile score
- `RSCODE`: recode a matrix containing categorical variables to real coordinates from the regular simplex, so that the dissimilarities are preserved
- `AVEDEV`: compute the mean and standard deviation of a data set
- `MEDMAD`: compute the median and median absolute deviation of a data set
- `SAFDIV`: check whether a division will overflow
- `ALMEQ`: check whether two real numbers are almost equal
- `SHUFFL`: make a random permutation of the integers 1...N
- `DINIT`: choose initial cluster centers according to the k-means++ method
- `KCLUST`: k-means or k-medians clustering, with missing data support and the possibility to insert or delete clusters
- `HCLUST`: choose initial point assignments by hyperplane partitioning
- `CLUSTE`: a fast interface to `KCLUST` that calls `HCLUST` and omits restarts
- `WCSVSP`: write an array of (x,y) coordinates to a .csv spreadsheet file
- `PCA`: compute the principal components of a data set
- `DIVEV`: change the results of PCA into the matrix U obtained from SVD
- `ARIND`: compute the Adjusted Rand Index, measuring the agreement of a clustering result with a given partition
- `VINFO`: compute the Variation of Information, a metric of the difference between a clustering result and a given partition

For the calling sequences of these subroutines, please refer to the source code. All the important subroutines have prologues that explain how to use them.

## Example problems

Most of these problems come from the University of California at Irvine Machine Learning Repository. [Lichman, 1987] They were originally posed as classification problems. The known classes can be used as an external validation measure of the clustering result. Training sets and test sets have been combined as soybean.both and horse-colic.both. This is permissible, because clustering does no training. All the prior assumptions are built into the dissimilarity function and the importance weights.

Importance weights for the Wine problem came from optimizing a *spectral clustering* residual, and did not involve use of the class labels. Weights for the credit approval problem came from standardizing the input, and did not involve use of the class labels. The round problem, the horse-colic problem, and the House Votes '84 problem use equal weights. The Maple Leaf Rag data set is mine and I can abuse it any way I like. As for the soybean data, the weights were obtained by some kind of linear transform, and I can't remember if it involves looking at the class labels, so please don't take them too seriously.

All the data sets are included in the distribution.

### The Soybean problem

Some ill plants are suffering from 19 diseases, manifested by 35 symptoms, which are a mixture of continuous, categorical, and ordinal variables. The symptoms and the diseases have been identified by human experts, which should make us suspicious, since this is not a double-blind experiment. It is possible that the agronomist's diagnoses have influenced his description of the symptoms. Let's see what we can make of this mess anyhow.

Combining training and test sets, the number of objects N is 683. The number of classes G is 19. Consider possible numbers of clusters from KMIN=1 to KMAX=50. The number of original variables O is 35. Many of these will have to be recoded. The hard-coded array `CAT(O)` gives the number of categories for each input variable. For continuous and ordinal variables, write zero. Set `CATMAX` to the largest value in `CAT`. Each categorical variable will have to be recoded with `CAT(H)-1` continuous variables. The data matrix that is input to the clustering routine is `X(P,N)`. To find `P`, start with `O` and add `CAT(H)-2` for each categorical variable. The subroutine does *not* do this automatically; it must be hard-coded in the main program. That is because it is important for the analyst to examine the data and think about the problem. The first three attributes are:

Studying the input, we find that P=54. The attribute values in soybean.both are integers 0, 1, 2... for available data and "?" for missing data. For available data, `X` is set to 1.0 + the input, because `RSCODE` requires categories to be numbered starting with 1.0, and `NX` is set to 1. For missing data, `X` is set to `XBOGUS`, and `NX` is set to 0.

Next, subroutine `ORDNAL` is called to rescale the ordinal variables. The first `O` rows of `X` are filled already. Call `RSCODE` to recode the categorical variables with vertices of the regular simplex. This takes up more memory. `RSCODE` starts at the last row and works backwards, moving the data to the end of the array, returning `X(P,N)` of all continuous data. The class labels were given as text strings; they must be converted to integers 1, 2, 3... with a call to `IREID`. Set a default `K = 20` and we are finally ready to call the main subroutine. On return, the results are compared to the class labels, the data is projected into the 2-dimensional array `R`, and written to soy_clusters.csv and soy_classes.csv. The F statistic of [Pham, 2005] is returned in array `F()` for each possible number of clusters `KMIN`...`KMAX`. Small values of F are supposed to indicate a good number of clusters.

Running the program with option `ROBUST = .FALSE.` prints to the terminal:

    cluela returns # 0 made 2 clusters
    Returned residual: 15121.53
    Adjusted Rand Index: 0.09886136
    Variation of Information: 3.9897144
    STOP program complete

This value of the Adjusted Rand Index (ARI) indicates that the clustering is less than 10% better than guessing. Set `ROBUST = .TRUE.` and see if it helps any:

    cluela returns # 0 made 2 clusters
    Returned residual: 17791.316
    Adjusted Rand Index: 0.14419138
    Variation of Information: 3.1488326
    STOP program complete

Examining the graph of the clustering results, we see that `CLUELA` has produced a reasonable-looking partition of two clusters.

The classes are distributed in a much more complex way:

A plot of the F values shows a strong tendency for two clusters, and nothing at 19.

This is grounds for us to either be suspicious of our approach to the Soybean data, or of the F statistic. We will see in the next problem that the F statistic is helpful sometimes but not always.

### The Wine problem

The Wine data set is all continuous variables. N=178, P=13, K=3. Let us focus on trying to find the correct number of clusters. In [Pham, 2005] it was reported that their proposed method chose the correct number of clusters.

As a first attempt, running `CLUELA` on the raw data returns pretty lousy accuracy. Reading the journal article a little more carefully, we notice that we need to standardize the input. The plot of the F statistic now shows a minimum at 2, not 3.

Did I make a mistake or did Pham et al. make a mistake? Reading the journal article considerably more carefully, Pham et al. were running their own peculiar variant of k-means that we do not care to attempt to replicate. Instead, we will use special (*very* special) importance weights. These were obtained by optimizing the residual of the Bach-Jordan spectral clustering algorithm [Bach, 2003] with subroutine PRAXIS by Richard Brent. This is not cheating, because it doesn't use the class labels. On the other hand, it's a lot of strain on the CPU. The F values now show a minimum at 3 that is slightly less than the value at 2.

Moral: a good statistician can prove anything. And an evil one can too.

Using the special weights, the classes are now mostly separable:

### The Credit approval problem

This is a two-class classification problem, of 15 attributes which have been deliberately obscured. The data is read and recoded much as before. N=690, O=15, G=2, P=38. The weights have been chosen to scale down the large continuous variables so that the attributes will have roughly equal importance. The accuracy of the clustering result in terms of classification is terrible. Clustering is not classification, so it doesn't really matter. Let's instead compare the projections we get with different robustness levels of PCA.

With `LEVEL=-1` everything is on a line.

With `LEVEL=0` the objects are dispersed more.

With `LEVEL=1` the objects are spread out, but there is hardly any separation between the classes.

With `LEVEL=2` there is still no class separation.

The clusters found by `CLUELA` look like this:

Moral: Use whatever graph looks the prettiest. None of them convey any useful information. They affect the viewer with the illusion of understanding.

### Problem "round" - the unit circle revisited

The `t_round` program generates 30,000 points uniformly at random on the unit circle. We saw this problem in "Beginning Data Clustering" [*1] and solved it with the simple k-means algorithm presented there. Now that we have `CLUELA` we can compare those results with the results from using the `ROBUST` and `SOSDO` options.

The `ROBUST` option invokes the Manhattan metric. This is a non-Euclidean geometry, and we need to be prepared to accept that the plot of the results in this approximately Euclidean region of space-time on your computer monitor is going to look kind of weird.

Pham et al.'s method of choosing K assumes Euclidean space. [Pham, 2005] It is no surprise that it shows no clustering tendency in Manhattan space.

Setting the `SOSDO` option instead returns the familiar result.

The F values show a tendency to form three clusters, which replicates the result of Pham et al.

### The Horse Colic Problem

This data set is somewhat difficult. First there are some possible mistakes. The young/adult field was supposed to be coded {1,2} but it is instead {1,9}. That's easy to fix, but still strange. The field "capillary refill" is supposed to be coded {1,2}, but it contains a value of 3. This is probably a mistake, but `RSCODE` will fail if we don't set `CAT=3` for this variable. Some of the ordinal variables are in the wrong order. For example,

So we need to swap 1 and 2 so that the sequence is Warm, Normal, Cool, Cold.

Three attributes in horse-colic.both are diagnosis codes. These are not categorical variables, because diagnoses are not mutually exclusive. A horse can have more than one illness. To represent this for `RSCODE`, the information needs to be converted to categorical form. First, we have to count the distinct diagnoses. There are 62 of them. Each possible diagnosis is treated as a binary variable. So we use the first 24 attributes, convert the next three attributes into 62 variables, and then ignore attribute 28. So N=368, O=24+62=86.

The attribute hospital number takes on 346 different values. This is an extravagant number to have to recode into dummy variables. We can see that recoding is not feasible when the number of categories becomes very large. Some analysts choose to combine categories in these situations. It would be reasonable to omit this variable entirely. It is also possible to write a dissimilarity function that handles categorical data directly, and thus have no need to recode it. We shall suffer the consequences of recoding, and get P=440.

Next, we recode the data and call the clustering subroutine much as before. `CLUELA` searches for a good number of clusters from `KMIN=1` to `KMAX=20`, so this takes some CPU time. There appears to be a preference for two clusters.

Now let's expose some advanced functionality. We call the internal subroutine `DINIT` to apply the k-means++ initialization, then call `KCLUST` directly with `KMIN=1`, `K=3`, `KMAX=20`, and the value of `THRESH` found by `DINIT`. When we use the `CLUELA` interface, `KCLUST` is constrained to not split or merge any clusters, and it is repeated ten times for each trial value of *k*. Calling `KCLUST` directly allows for clusters to be inserted or deleted, and skips the repeats. This gives us an alternate estimate for *k*. This graph shows results for a run that produced `K=5`.

Often 3 or 4 is returned. The output is a matter of chance!

Now project the data down to a lower dimension `M=28` to lighten the computational burden and fill in missing values, so that we can compare `CLUELA` to the classical `KMNS` and to the rival Neural Gas algorithm. It is not necessary to call `PCA` again for `M=2`. The first two projected dimensions are the same, regardless of how many more dimensions we request.

We then initialize centers and call `KMNS`. K is fixed in `KMNS`, so it is arbitrarily chosen here as `K=18`. Next call `CLUELA` with option `SOSDO` and K fixed at 18, for comparison.

Interface `CLUSTE` was written for CLUstering with Superior Theoretical Efficiency. By using the binary tree, the computational expense of the assignment step is reduced from NPK to perhaps as low as NP log(K), depending on the effectiveness of the tree. By sorting the data, the cost of computing the centers is NP, the size of the data matrix; or N log(K), the cost of sorting. That leaves the k-means++ initialization as the bottleneck step. An alternative initialization is to use hyperplane partitioning. This is much like putting the data into a Generalized Hyperplane Tree, except forgetting all the details of the tree structure. This should have a cost of NP log(K), and so the whole process is less than O(NPK).

The **Neural Gas** algorithm [Martinetz, 1991] is equivalent to data clustering. It was developed with a focus on learning a representation of the data space, rather than partitioning a data set. We will accept it on its own terms, and not compute any cluster assignments. The authors used `K=200` in their examples. The same is done here.

The output to the terminal is:

    Original problem:
    cluela returns # 0 in 47.6960030 seconds made 2 clusters with residual 2957.10156
    kclust returns # 0 in 0.216000006 seconds made 3 clusters with residual 2658.35571
    Projected problem:
    kmns returns in 1.99999996E-02 seconds. made 18 clusters with residual 53876.5430
    cluela seeking min. sum-of-squares returns # 0 in 0.175999999 seconds. made 18 clusters with residual 53599.6836
    cluste returns # 0 in 0.172000006 seconds. made 18 clusters with residual 68906.8984
    neugas returns in 1.44400001 seconds

These results show that `CLUELA` is as accurate as `KMNS` when using the `SOSDO` option, but that `KMNS` is faster by a factor of 9. `CLUSTE` has none of the performance we had hoped for, and Neural Gas is computationally expensive.

### House votes '84

The House Votes data set is a record of the votes cast on some bills by the U.S. House of Representatives during its 1984 session. The goal is to classify each representative as a Democrat or a Republican. N=435, P=16, G=2. The votes are 'y', 'n', and '?'. We first attempt to treat the question marks as missing data. `CLUELA` returns with `IFAULT=10`, because there is a null object in the data. Somebody in the Congress didn't cast a vote on any of these bills. The k-means algorithm has no way to deal with null objects. [*7]

In this case there is an easy solution. Instead of treating the votes as categorical data, treat them as continuous data, and let n ⇒ 0.0, ? ⇒ 0.5, and y ⇒ 1.0.

The clustering results split nicely.

And agree tolerably well with the class labels.

The F values show a strong tendency to form two clusters, as they should.

### Maple Leaf Rag

This data set was extracted from MIDI files by the present writer from five performances of the "Maple Leaf Rag" by Scott Joplin, Bill Flynt, Gary Rogers, Alessandro Simonetto, and Mario Verehrer. Each note in the sheet music was assigned a class label. Each performed note was either matched to a note in the score, or else marked as stray.

The attributes are:

Here we encounter the strange data types "Cyclic" and "Contra-Categorical."

Phase is a cyclic variable, because the phase difference from 0.95 beats to 0.05 beats is 0.1, not 0.9. No provision has been made in package CLUELA to deal with cyclic data. [*8] Categorical variable channel does not vary. Categorical variables with only one category are not allowed. It could be coded as continuous, but it makes more sense to omit it. Attribute "source file" is contra-categorical. That is, separate events that come from the same performance should be *less* likely to correspond to the same ideal note. The effects of including such a variable will be studied. After recoding the data, we have N=4382, P=10, and G=975.

The first run of the program was done without rescaling time. This led to a strange result: it looks as if the notes drift away from each other into their separate source files as time progresses. Let us rescale time so that all performances have a duration of 50 seconds.

The importance weights were set with simulated annealing to reduce the variation of information. This is cheating, of course, but obtaining these weights was my whole purpose in collecting the data. Leaving the other weights as they are, set the weight for attribute "File ID" to 0.

With weight for File ID set to 600., the source files are separating from each other.

With weight for File ID set to 700., the source files are in four (?!) distinct bands.

Running the program with some File ID weights from -5 to 100 shows that accuracy degrades steadily with increasing weight on the contra-categorical variable. There is no prohibition in `CLUELA` against setting negative weights, but this can cause convergence problems. It also violates the triangle inequality, so that we cannot use the generalized hyperplane tree anymore.

As before with the Horse Colic problem, we call `CLUELA`, `CLUSTE`, `KMNS`, and `NEUGAS` and compare the results. Let `CLUELA` search for *K* over the range `KMIN=974` to `KMAX=978`. The output to the terminal is this:

    cluela returns # 0 in 98.2480011 seconds, made 975 clusters
    Returned residual: 8458.90039 Adjusted Rand Index: 0.709671259 Variation of Information: 0.883388221
    cluste returns # 0 in 0.492000014 seconds, made 952 clusters
    Returned residual: 11198.1299 Adjusted Rand Index: 0.760961771 Variation of Information: 0.805582583
    kmns returns # 0 in 4.49599981 seconds, made 975 clusters
    Returned residual: 8612.56738 Adjusted Rand Index: 0.700010896 Variation of Information: 0.908325672
    neugas returns in 1.16799998 seconds.

These results show that `CLUELA` is as accurate as `KMNS`, but it is slow. `CLUSTE` performs much better here, running in its preferred conditions of small *P* and large *K*. On this problem, it is faster than `KMNS` by a factor of 9. Although the residual of the solution found by `CLUSTE` is worse than that of `KMNS`, its accuracy is better on both Adjusted Rand Index and Variation of Information.

The plot of the F values shows no tendency to form clusters at any of the values tested.

The Neural Gas algorithm does what it is meant to do, spreading its centers across the data cloud:

If you don't have to prove anything to anybody, you can train on the test set. All sets are training sets. All sets are test sets. It all depends on what you have to prove to whom. Also, a lower residual is not always better.

### Regrets

The perfect is the enemy of the good, and so I am pushing out this software with a couple of known shortcomings.

The routines in prep.f and prep.c search for duplicates by sorting and comparing adjacent elements. To be really efficient, this should be done with hash tables.

The Generalized Hyperplane Tree uses a formula to decide which branches of the tree do not need to be searched. A related concept, the Bisector Tree, stores in each node the distance to the farthest point it contains, and uses this to decide which branches do not need to be searched. These criteria could both be used to prune the tree.

### History

1.2 8 November 2017 Fixed mistake in USORTU that returned an invalid index when N equals 1.

1.1 7 October 2017 Changed loop in KCLUST for deleting clusters to negative step. This should fix mistakes that could occur when more than one cluster is to be deleted.

1.0 29 August 2017 Initial release.

### What's Next

The path straight ahead in data clustering leads to Expectation Maximization. The notion of dissimilarity is given a precise meaning as conditional probability.

Alternatively, you might be more interested in classification, embedding, or robust statistics.

### Appendix

**Half-distance Lemma:** [Elkan, 2003] If the distance to a point from its nearest center is less than or equal to half the distance of that center to any other center, the point can belong in no other cluster.

*Proof:* Let Q be the point. Let O be its current center. Let P be the nearest center to O.

Given: $\begin{aligned} & \left|OQ\right| \leq \frac{1}{2} \left|OP\right| \\ & 2 \left|OQ\right| \leq \left|OP\right| \end{aligned}$ By the triangle inequality, $\begin{aligned} & \left|OP\right| \leq \left|OQ\right| + \left|PQ\right| \\ & 2 \left|OQ\right| \leq \left|OP\right| \leq \left|OQ\right| + \left|PQ\right| \\ & \left|OQ\right| \leq \left|PQ\right| \end{aligned}$

**Generalized Hyperplane tree formula:** [Chávez, 1999] QS ≥ ½ |PQ - OQ|

*Proof:* The pivots are O and P, and a query point is Q. S is a hypothetical point on the boundary surface. The distance |QS| must be bounded to compare it to r, the radius of the ball containing the known neighbors of Q.

Construct triangle PQS, then by the triangle inequality: $ \left|QS\right| \geq \left|PQ\right| - \left|PS\right|$ Construct triangle OQS, then by the triangle inequality: $ \left|QS\right| \geq \left|OS\right| - \left|OQ\right|$ Add $ 2 \left|QS\right| \geq (\left|PQ\right| - \left|OQ\right|) + (\left|OS\right| - \left|PS\right|)$ |OS| = |PS| by definition, so $ \left|QS\right| \geq \frac{1}{2} (\left|PQ\right| - \left|OQ\right|)$

**Missing data is Non-Metric**

*Proof by counterexample:* Consider the Manhattan dissimilarity function for missing data:

$ d(X,Y) = \frac{\sum W_{total}}{\sum W_{avail}} \sum \left|X_i - Y_i\right| $

Consider the points O = (0,0), P = (3,4), and Q = (1,?), and let the weights be equal. The dissimilarities are:

d(O,P) = (2/2) (|3-0| + |4-0|) = 7

d(O,Q) = (2/1) (|1-0| + ∅) = 2

d(P,Q) = (2/1) (|3-1| + ∅) = 4

2 + 4 < 7 and the triangle inequality is violated.

**Decrease and Increase of Sum of Squared Distance:** [Hartigan, 1979] Removing a point X from a cluster C of N points reduces the sum of squared distances by

$ d^2(X,C) \frac{N}{N-1} $

Placing a point in a cluster increases the sum of squared distances by

$ d^2(X,C) \frac{N}{N+1} $

*Proof for removing a point:* Let X' denote the point transferred. Let C' denote the new position of the cluster center. Let X_{i}, 1 ≤ i ≤ N be the initial points in the cluster. Then the change in squared distance U is:

$ \Delta U_{del} = \sum(X_i-C')^2 - \sum(X_i-C)^2 - (X'-C')^2 $ Expand: $ -2(\sum X_i)(C'-C) + \sum(C'^2-C^2) - (X'-C')^2 $ Doing some algebra shows that: $\begin{aligned} & \sum X_i = NC \\ & C'-C = -\frac{X'-C}{N-1} \\ & C'^2-C^2 = \frac{(X'+C-2NC)(X'-C)}{(N-1)^2} \\ & X'-C' = \frac{N}{N-1}(X'-C) \end{aligned}$ Substitute: $ \Delta U_{del} = -2NC\frac{-(X'-C)}{N-1} + N\frac{(X'+C-2NC)(X'-C)}{(N-1)^2} - (\frac{N}{N-1})^2 (X'-C)^2 $ Simplify to: $ \Delta U_{del} = -(X'-C)^2 \frac{N}{N-1} $

Derivation of the formula for inserting a point is left to the interested reader.

**The Sum-Of-Squares Dissimilarity is Non-Metric**

*Proof by counterexample:* Suppose that there is a data point X, and two identical centers C_{1} and C_{2}

Suppose that X is an element of C_{1} and not of C_{2}. Then

$d(X,C_1) > d(X,C_2)$ but $d(C_1,C_2) = 0 \\ d(X,C_1) > d(X,C_2) + d(C_1,C_2) $

and the triangle inequality is violated.

**The matrices D and Q^T for the Iris data are:**

    D =
      2081.015539     0.000000    -0.000000     0.000000
        -0.000000   329.986937     0.000000    -0.000000
        -0.000000     0.000000    60.225401    -0.000000
        -0.000000    -0.000000     0.000000    28.561721

    Q^T =
         0.407838     0.199425     0.294615     0.098122
         0.193311     0.293968    -0.339500    -0.173221
        -0.224983     0.508836    -0.072564     0.193616
         0.151865    -0.114316    -0.135671     0.598148

The skeptical reader may solve for P, and verify for himself that these matrices have L1 orthonormal columns.

## Notes

- For an introduction to k-means, please consult Wikipedia, or the earlier article "Beginning Data Clustering" on Codeproject, which gives a simple implementation.
- A univariate Laplace distribution does have the L1 geometry. What is called a multivariate Laplace distribution [Kotz, 2001] is elliptically contoured. In one dimension the Euclidean and Manhattan metrics coincide anyway, so there may not be a good theoretical justification for the Manhattan metric.
- There are transformations from similarity to dissimilarity which are quicker to compute, but the arccosine is the arc length between two points on the unit hypersphere. This is a metric dissimilarity, which is important for some applications.
- For instance, Wikipedia said as of 26 August 2017, "The standard algorithm aims at minimizing the WCSS objective, and thus assigns by 'least sum of squares', which is exactly equivalent to assigning by the smallest Euclidean distance." This mistake stood for over four years.
- If all the data is continuous, then the importance of each variable can be represented by multiplying the data by a scalar. Since this cannot be done with categorical variables, it is necessary to have a separate array to show the importance of each variable.
- It has been suggested to achieve dummy coding with *c* - 1 variables by omitting the last continuous dummy variable. [Bowles, 2015] This introduces a bias, since the last category is less different from the other categories than the other categories are from each other.
- Graph clustering algorithms can handle null objects by assigning them an arbitrary dissimilarity.
- It is possible to cluster cyclic variables, but it would make the whole package more complicated, so support was deleted.

### References

- Robert L. Thorndike, "Who Belongs in the Family?" Psychometrika, v.18 n.4, December 1953
- Geoffrey H. Ball and David J. Hall, Isodata, a Novel Method of Data Analysis and Pattern Classification Stanford Research Institute Technical Report, Project 5533, April 1965.
- David W. Goodall, "A New Similarity Index Based on Probability", Biometrics v.22 n.4 pp.882-907, December 1966.
- J. MacQueen, "Some Methods for Classification and Analysis of Multivariate Observations" Berkeley Symposium on Mathematical Statistics and Probability, 1967
- R. Gnanadesikan and J. R. Kettenring, "Robust estimates, residuals, and outlier detection with multiresponse data" Biometrics v.28 pp.81-124, 1972.
- John A. Hartigan and Manchek A. Wong, "Algorithm AS 136: A K-Means Clustering Algorithm" Applied Statistics v.28 n.1 pp.100-108, 1979.
- R.S. Michalski and R.L. Chilausky, "Learning by Being Told and Learning from Examples: An Experimental Comparison of the Two Methods of Knowledge Acquisition in the Context of Developing an Expert System for Soybean Disease Diagnosis", International Journal of Policy Analysis and Information Systems, Vol. 4, No. 2, 1980.
- I. Kalantari and G. McDonald, "A data structure and an algorithm for the nearest point problem" IEEE Transactions on Software Engineering, v.9 p.5, 1983
- Congressional Quarterly Almanac, 98th Congress, 2nd session 1984, Volume XL: Congressional Quarterly Inc. Washington, D.C., 1985.
- Glenn W. Milligan and Martha C. Cooper, "An Examination of Procedures for Determining the Number of Clusters in a Data Set" Psychometrika v.50 n.2 pp.159-179, June 1985.
- M. Lichman (Admin.), UCI Machine Learning Repository http://archive.ics.uci.edu/ml, Irvine, CA: University of California, School of Information and Computer Science, 1987-present
- Ross Quinlan, "Simplifying decision trees" Int J Man-Machine Studies v.27, pp. 221-234, December 1987
- Mary McLeish & Matt Cecile, "Horse Colic Data Set", Department of Computer Science, University of Guelph, Guelph, Ontario, Canada N1G 2W1, [c.1989]
- Leonard Kaufman and Peter J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, Wiley, 1990, reissued 2005.
- M. Forina, et al, PARVUS - An Extendable Package for Data Exploration, Classification and Correlation. Institute of Pharmaceutical and Food Analysis and Technologies, Via Brigata Salerno, 16147 Genoa, Italy.[c.1991]
- Jeffrey K. Uhlmann, "Satisfying General Proximity/Similarity Queries with Metric Trees" Information Processing Letters, v.40 pp.175-179, November 1991
- Thomas Martinetz and Klaus Schulten, 'A "Neural-Gas" Network Learns Topologies' in Artificial Neural Networks, Ed. T. Kohonen, K. Makisara, O. Simula, and J. Kangas. Holland: Elsevier, 1991.
- Michael Falk, "On MAD and Comedians" Ann. Inst. Statist. Math. v.49 n.4 pp.615-644, 1997.
- Edgar Chávez, Gonzalo Navarro, and José Luis Marroquín, "Searching in Metric Spaces" Association for Computing Machinery, 1999
- Dan Pelleg and Andrew Moore, "X-means: Extending K-means with Efficient Estimation of the Number of Clusters." Draft report, 2000.
- János Podani, Introduction to the Exploration of Multivariate Biological Data Leiden: Backhuys Publishers, 2000. online:
- Samuel Kotz, Tomasz J. Kozubowski, and Krzysztof Podgórski, The Laplace Distribution and Generalizations: A Revisit with New Applications Internet draft copy, 24 January 2001
- Marina Meilă, "Comparing Clusterings" University of Washington, technical report, 2002.
- Charles Elkan, "Using the Triangle Inequality to Accelerate k-Means" Proceedings of the Twentieth International Conference on Machine Learning, 2003.
- Francis R. Bach and Michael I. Jordan, "Learning Spectral Clustering" University of California, draft report, 2003.
- Julian Laub, Non-Metric Pairwise Proximity Data Dipl. Ing. Phys. Thesis, Technical University of Berlin, 10 December 2004.
- Andreas Klein, "A Generalized Kahan-Babuška-Summation-Algorithm", 21 April 2005
- D T Pham, S S Dimov, and C D Nguyen, "Selection of K in K-means Clustering" J. of Mechanical Engineering Science, 2005.
- Shai Ben-David, Ulrike von Luxburg, and Dávid Pál, "A Sober Look at Clustering Stability", [c.2005]
- Rafail Ostrovsky, Yuval Rabani, Leonard J. Schulman, and Chaitanya Swamy, "The Effectiveness of Lloyd-Type Methods for the k-means Problem", Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science, 2006.
- David Arthur and Sergei Vassilvitskii, "k-means++: the advantages of careful seeding", Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, 2007.
- Sung-Hyuk Cha, "Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions" International Journal of Mathematical Models and Methods in Applied Sciences v.1 i.4, 2007.
- Hans-Hermann Bock, "Origins and extensions of the k-means algorithm in cluster analysis" Electronic Journ@l for History of Probability and Statistics v.4 n.2 December 2008 www.jehps.net
- Michiel de Hoon, Seiya Imoto, and Satoru Miyano. Users' manual for: The C Clustering Library for cDNA Microarray Data, University of Tokyo, Institute of Medical Science, Human Genome Center, 27 December 2010.
- Michel Deza and Elena Deza, Distances in PR, Audio and Image Séminaire Brillouin, IRCAM, 31 May 2012.
- Brian Kulis and Michael I. Jordan, "Revisiting K-means: New Algorithms via Bayesian Nonparametrics" Proceedings of the 29th International Conference on Machine Learning, 2012.
- Malika Charrad, Nadia Ghazzali, Veronique Boiteau, and Adam Niknafs, "NbClust: An R Package for Determining the Relevant Number of Clusters in a Data Set" available from CRAN Journal of Statistical Software v.61 i.6 pp.1-36, 2014.
- Michael Bowles, Machine Learning in Python: Essential Techniques for Predictive Analysis Wiley, 2015.