For the latest version of the code, which may contain the latest enhancements and corrections, please download the latest version of the Accord.NET Machine Learning and Image Processing Framework.
For a new and much improved version of a handwritten digit recognizer, see the article Handwriting Recognition Revisited: Kernel Support Vector Machines[^].
Linear discriminant analysis (LDA) is a method used in statistics and machine learning to find a linear combination of features which best characterizes or separates two or more classes of objects or events. The resulting combination may be used as a linear classifier, or, more commonly, for dimensionality reduction before later classification.
Linear Discriminant Analysis is closely related to principal component analysis (PCA) in the sense that both look for linear combinations of variables which best explains the data. LDA explicitly attempts to model the difference between the classes of data, maximizing the following objective function:
are the Between-Class Scatter Matrix and Within-Class Scatter Matrix, respectively. The optimal solution can be found by computing the Eigen values of SB-1SW and taking the Eigen vectors corresponding to the largest Eigen values to form a new basis for the data.
A detailed explanation for the full source code for Linear Discriminant Analysis is beyond the scope of this article, but can be found here.
KDA is an extension of LDA to non-linear distributions, just as KPCA is to PCA. The objective of KDA is also to find a transformation maximizing the between-class variance and minimizing the within-class variance. It can be shown that, with Kernels, the original objective function can be expressed as:
where Kc is the Kernel matrix for class c, uc is the column means vector for Kc, I is the identity matrix, lc is the number of samples in class c, and 1lc is a lc x lc matrix with all entries 1/lc (i.e., I - 1/lc is the centering matrix of size lc).
Again, the complete explanation for KDA is beyond the scope of this article. But it is available on another article, entitled Non-linear Kernel Discriminant Analysis in C#, which can be found here.
The Kernel trick is a very interesting and powerful tool. It is powerful because it provides a bridge from linearity to non-linearity to any algorithm that solely depends on the dot product between two vectors. It comes from the fact that, if we first map our input data into a infinite-dimensional space, a linear algorithm operating in this space will behave non-linearly in the original input space.
Now, the Kernel trick is really interesting because that mapping does not need to be ever computed. If our algorithm can be expressed only in terms of an inner product between two vectors, all we need is to replace this inner product with the inner product from some other suitable space. That is where resides the "trick": wherever a dot product is used, it is replaced with a Kernel function. The Kernel function denotes an inner product in feature space, and is usually denoted as:
K(x,y) = <φ(x),φ(y)>
Using the Kernel function, the algorithm can then be carried into a higher-dimension space without explicitly mapping the input points into this space. To see why this is highly desirable, you may check the introductory text in Kernel Principal Component Analysis in C# and Kernel Support Vector Machines for Classification and Regression in C# which includes a video demonstration of the Kernel trick.
Some common Kernel functions include the Linear kernel, the Polynomial kernel, and the Gaussian kernel. Below is a simple list with their most interesting characteristics.
|Linear Kernel ||The Linear kernel is the simplest Kernel function. It is given by the common inner product <x,y> plus an optional constant c. Kernel algorithms using a linear kernel are often equivalent to their non-kernel counterparts, i.e., KPCA with a Linear kernel is equivalent to standard PCA. ||
|Polynomial Kernel ||The Polynomial kernel is a non-stationary kernel. It is well suited for problems where all data is normalized. ||
|Gaussian Kernel ||
The Gaussian kernel is by far one of the most versatile kernels. It is a radial basis function kernel, and is the preferred kernel when we don't know much about the structure of the data we are trying to model.
For more Kernel functions, check Kernel functions for Machine Learning Applications.
The KDA code presented here is part of a bigger framework I've been building over the years. It aggregates various topics on statistics and machine learning I needed through my past researches. At the time this article was written, it supported PCA, KPCA, LDA, KDA, PLS, SVMs, HMMs, LM-ANN, HCRFs and a lot of other acronyms; though as of now, it has become one of the most complete frameworks for scientific computing in .NET. This framework is Accord.NET, originally an extension framework for AForge.NET, another popular framework for computer vision and machine learning. The project can be found in GitHub at https://github.com/accord-net/framework/ and has been used on a wide range of scientific and consumer applications[^].
The source code available in this article contains only a subset of the framework necessary for KDA. The actually needed classes are shown in the picture below.
KernelDiscriminantAnalysis class inherits from
LinearDiscriminantAnalysis, extending its core algorithm with kernels. The source originally included with this article more than 20 kernels to choose from, though nowadays the framework offers around 35 kernel functions. In most applications, the Gaussian kernel will be one of the most suitable kernel functions to be chosen, specially when there is not enough information about the classification problem we are trying to solve.
The UCI Machine Learning Repository is a collection of databases, domain theories, and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms. One of the available datasets is the Optical Recognition of the Handwritten Digits Data Set (a.k.a. the Optdigits dataset).
In the raw Optdigits data, digits are represented as 32x32 matrices. They are also available in a pre-processed form in which digits have been divided into non-overlapping blocks of 4x4 and the number of on pixels have been counted in each block. This generated 8x8 input matrices where each element is an integer in the range 0..16.
Dimensionality reduction is a very essential step if we are going to use classifiers which suffer from the Curse of Dimensionality. Kernel methods in general, however, have no problems processing large dimensionality problems because they do not suffer from such limitations.
Sample digits extracted from the raw Optdigits dataset.
Kernel methods are appealing because they can be applied directly to problems which would require significant thought on data pre-processing and extensive knowledge about the structure of the data being modeled. Even if we know little about the data, a blind application of Kernel methods often finds interesting results. Achieving optimality using Kernel methods can become, however, a very difficult task because we have an infinite choice of Kernel functions to choose from - and for each kernel, an infinite parameter space to tweak.
The following code shows how a Kernel Discriminant Analysis is instantiated. Note how the inputs are given as full vectors of 1024 positions. This would be impractical if we were going to use Neural Networks, for example.
int samples = 500;
double[,] input = new double[samples, 1024];
int output = new int[samples];
IKernel kernel = new Gaussian((double)numSigma.Value);
kda = new KernelDiscriminantAnalysis(input, output, kernel);
kda.Threshold = (double)numThreshold.Value;
kda.Regularization = (double)numRegularization.Value;
dgvPrincipalComponents.DataSource = kda.Discriminants;
dgvFeatureVectors.DataSource = new ArrayDataView(kda.DiscriminantMatrix);
dgvClasses.DataSource = kda.Classes;
As mentioned previously, Kernel Discriminant Analysis does not suffer from the curse of dimensionality. This means it can be applied to the raw 32x32 digit dataset without any kind of preprocessing. The only processing required will be the transposition of the 32x32 matrices into vectors of 1024 length.
However, KDA is not a very efficient method, nor the most suitable to be used in a large dataset. Despite the increase in dimensions having little to no effect in the analysis running time, KDA scales as O(n³) to the number of training samples. Besides, its solutions are not sparse as is the case with Support Vector Machines (SVMs). This means a significant amount of memory space will be used to accommodate the full Kernel matrix during classification.
To avoid long processing times in this demo, we will use just a subset of the training dataset from the Optdigits' set. The first 500 entries will be used for training, while the other 500 will be used for testing.
Classification using KDA is often performed considering the minimal distance between a data point projected into the feature space and the class mean in the feature space. We can see the motivation behind this approach in the following example.
Example Yin Yang classification problem. Left image shows the original problem in the input space, right image shows the projection resultant of a Kernel Discriminant Analysis performed with a Gaussian kernel with sigma set as 1.0.
The red dot in the right image marks the projection of the red dot of the left image. Notice how the dot is close to the blue class in both images. In the analysis' feature space, however, the classes have been unfurled into more linear representations. In this scenario, the Euclidean distance (or, equivalently, the Malahanobis distance) to the class mean in feature space becomes a very suitable measure of class closeness.
The following code demonstrates the classification of new instances using an already computed Kernel Discriminant Analysis.
double input = canvas.GetDigit();
int num = kda.Classify(input);
lbClassification.Text = num.ToString();
The test application accompanying the source code can perform the recognition of handwritten digits using Kernel Discriminant Analysis. To do so, open the application (preferably outside Visual Studio, for better performance). Click on the menu File and select Open. This will load some entries from the Optdigits dataset into the application.
Left: Optdigits data loaded into the application. Right: Detailed information about the distinct classes of handwritten digits.
To perform the analysis, click the Run Analysis button. Please be aware that it may take some time. After the analysis is complete, the other tabs in the sample application will be populated with the analysis' information. The level of importance of each factor found during the discriminant analysis is plotted in a pie graph for easy visual inspection.
Factors found in the Discriminant Analysis and their relative importance. Notice how, from the initial input space of 1024 dimensions, only 9 have been selected as important to the classification problem.
Once the analysis is complete, we can test its classification ability in the testing data set. The green rows have been correctly identified by the discriminant space Euclidean distance classifier. We can see that it correctly identifies 92% of the testing data. Remember that the testing and training data set are disjoint and independent.
Using the default values in the application, it achieves 92% accuracy.
After the analysis has been completed and validated, we can use it to classify the new digits drawn directly in the application. The bars on the right show the relative response for each of the discriminant functions. Each class has a discriminant function that outputs a closeness measure for the input point. The classification is based on which function produces the maximum output.
We can see the analysis also performs rather well on completely new and previously unseen data.
In this article, we detailed and explored how Kernel Discriminant Analysis can be used in the problem of handwritten digit recognition with satisfying results. KDA, however, suffers from some limitations. Unlike Support Vector Machines (SVMs), its solutions are not sparse. It scales poorly to the input sample size as O(n³) despite having no problems dealing with high-dimensionality input vectors. It has one advantage, though: it generalizes naturally to the multi-class case. This is not the case with SVM, where different approaches can be used for multiclass classification.
Another problem that raises from Kernel methods is the proper choice of the Kernel function (and the tuning of its parameters). This problem is often tractable with grid search and cross-validation, which are by themselves very expensive operations, both in terms of processing power and training data available.
On the other hand, most Kernel-based methods have a very pleasant advantage over more traditional methods such as Neural Networks: they do not suffer from local minima. This effectively means that any solution found will also be the best solution possible for the given values of the learning parameters. This is not the case with Neural Networks, for example, in which one must try many different starting points and multiple training instances to ensure consistent results.
- Sebastian Mika, G. Rätsch, J. Weston, B. Schölkopf, K.-R. Müller; Fisher discriminant analysis with kernels (1999). Available in: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.35.9904.
- Yongmin Li, Shaogang Gong and Heather Liddell; Kernel Discriminant Analysis. Available in: http://homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/LI1/kda/index.html.
- Welling, Max; Fisher Linear Discriminant Analysis. Availabe in: http://www.ics.uci.edu/~welling/classnotes/papers_class/Fisher-LDA.pdf.
- Gutierrez-Osuna, Ricardo; Introduction to Pattern Analysis. Available in: http://research.cs.tamu.edu/prism/lectures/pr/pr_l10.pdf.
- Asuncion, A. & Newman, D.J.; UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science (2007).