Kernel principal component analysis (kernel PCA) is an extension of principal component analysis (PCA) using techniques of kernel methods. Using a kernel, the originally linear operations of PCA are done in a reproducing kernel Hilbert space with a non-linear mapping.

KPCA is an extension of PCA to non-linear distributions. Instead of directly doing a PCA, the original data points are mapped into a higher-dimensional (possibly infinite-dimensional) feature space defined by a (usually nonlinear) function φ through a mathematical process called the “kernel trick”.

The Kernel Trick

The kernel trick transforms any algorithm that solely depends on the dot product between two vectors. Wherever a dot product is used, it is replaced with the kernel function. Thus, a linear algorithm can easily be transformed into a non-linear algorithm. This non-linear algorithm is equivalent to the linear algorithm operating in the range space of φ. However, because kernels are used, the φ function is never explicitly computed. This is desirable, because the high-dimensional space may be infinite-dimensional (as is the case when the kernel is a Gaussian).

Interesting video by Udi Aharoni demonstrating how points that are not linearly separable in a 2D space can almost always be linearly separated in a higher dimensional (3D) space.

Principal Components in Kernel Space

Like in PCA, the overall idea is to perform a transformation that will maximize the variance of the captured variables while minimizing the overall covariance between those variables. Using the kernel trick, the covariance matrix is substituted by the Kernel matrix and the analysis is carried analogously in feature space. An Eigen value decomposition is performed and the eigenvectors are sorted in ascending order of Eigen values, so those vectors may form a basis in feature space that explain most of the variance in the data on its first dimensions.

However, because the principal components are in feature space, we will not be directly performing dimensionality reduction. Suppose that the number of observations m exceeds the input dimensionality n. In linear PCA, we can find at most n nonzero Eigen values. On the other hand, using Kernel PCA, we can find up to m nonzero Eigen values because we will be operating on a m x m kernel matrix.

A more detailed discussion can be seen in the works of Mika et al. (Kernel PCA and De-Noising in Feature Spaces) and Scholkopf et al. (Nonlinear Component Analysis as a Kernel Eigenvalue Problem). A more accessible explanation of the topic is also available in the work from Ian Fasel.

The Pre-image Problem

In standard PCA, it was possible to revert the transformation, retrieving the original data points and enabling the use of PCA for compression* and denoising. However, as the results provided by kernel PCA live in some high dimensional feature space, they need not have pre-images in input space. In fact, even if it exists, it also need be not unique.

Thus, in KPCA, it is only possible to obtain an approximate pre-image for a given set of patterns projected on the feature space. This problem can (and has) been tackled by many approaches, some of them of iterative nature, but some closed-form solutions also exists. Typically, those solutions make use of the fact that distances in feature space are related to distances in input space. Thus, those solutions try to achieve an optimal transformation that can embed those feature points in input space respecting those distances relationships.

One of the first solutions of this form were proposed by Kwok and Tsang in their paper “The Pre-Image Problem in Kernel Methods". A better approach is also described by Bakir on his thesis “Extensions to Kernel Dependency Estimation”, alongside with a nice comprehensive description on various other methods for handling the pre-image problem.

Note: Actually, the use of KPCA for compression is a bit limited, since one needs to store the original data set to compute transformations and its approximate reversion.

Code Overview

Below is the main code behind KPCA. This is a direct, naive implementation of the algorithm that may not scale very well for very large datasets. It is implemented on top of the previous standard PCA code and thus on top of the AForge.NET Framework.

/// <summary>Computes the Kernel Principal Component Analysis algorithm.</summary>
public override void Compute()
{
    int rows = Source.GetLength(0);
    int cols = Source.GetLength(1);

    // Center (adjust) the source matrix
    sourceCentered = adjust(Source);


    // Create the Gram (Kernel) Matrix
    double[,] K = new double[rows, rows];
    for (int i = 0; i < rows; i++)
    {
        for (int j = i; j < rows; j++)
        {
            double k = kernel.Kernel(sourceCentered.GetRow(i), sourceCentered.GetRow(j));
            K[i, j] = k; // Kernel matrix is symmetric
            K[j, i] = k;
        }
    }

    // Center the Gram (Kernel) Matrix
    if (centerFeatureSpace)
        K = centerKernel(K);

    // Perform the Eigen Value Decomposition (EVD) of the Kernel matrix
    EigenValueDecomposition evd = new EigenValueDecomposition(K);

    // Gets the eigenvalues and corresponding eigenvectors
    double[] evals = evd.RealEigenValues;
    double[,] eigs = evd.EigenVectors;

    // Sort eigen values and vectors in ascending order
    int[] indices = new int[rows];
    for (int i = 0; i < rows; i++) indices[i] = i;
    Array.Sort(evals, indices, new AbsoluteComparer(true));
    eigs = eigs.Submatrix(0, rows - 1, indices);

    // Normalize eigenvectors
    if (centerFeatureSpace)
    {
        for (int j = 0; j < rows; j++)
        {
            double eig = System.Math.Sqrt(System.Math.Abs(evals[j]));
            for (int i = 0; i < rows; i++)
                eigs[i, j] = eigs[i, j] / eig;
        }
    }

    // Set analysis properties
    this.SingularValues = new double[rows];
    this.EigenValues = evals;
    this.ComponentMatrix = eigs;

    // Project the original data into principal component space
    double[,] result = new double[rows, rows];
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < rows; j++)
            for (int k = 0; k < rows; k++)
                result[i, j] += K[i, k] * eigs[k, j];

    this.Result = result;


    // Computes additional information about the analysis and creates the
    //  object-oriented structure to hold the principal components found.
    createComponents();
}

/// <summary>Projects a given matrix into the principal component space.</summary>
/// The matrix to be projected. The matrix should contain
/// variables as columns and observations of each variable as rows.
public override double[,] Transform(double[,] data, int components)
{
    int rows = data.GetLength(0);
    int cols = data.GetLength(1);
    int N = sourceCentered.GetLength(0);

    // Center the data
    data = adjust(data);

    // Create the Kernel matrix
    double[,] K = new double[rows, N];
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < N; j++)
            K[i, j] = kernel.Kernel(data.GetRow(i), sourceCentered.GetRow(j));

    // Center the Gram (Kernel) Matrix
    if (centerFeatureSpace)
        K = centerKernel(K);

    // Project into the kernel principal components
    double[,] result = new double[rows, components];
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < components; j++)
            for (int k = 0; k < rows; k++)
                result[i, j] += K[i, k] * ComponentMatrix[k, j];

    return result;
}
///

The main code behind the reversion of the projection. It is a direct, naive implementation of the closed-form solution algorithm as proposed by Kwok and Tsang. Currently, it is completely not optimized, but it gives an better overview of the algorithm on its essential form. The source code available to download may be better optimized in the future.

/// <summary>
///   Reverts a set of projected data into it's original form. Complete reverse
///   transformation is not always possible and is not even guaranteed to exist.
/// </summary>
/// <remarks>
///   This method works using a closed-form MDS approach as suggested by
///   Kwok and Tsang. It is currently a direct implementation of the algorithm
///   without any kind of optimization.
///   
///   Reference:
///   - http://cmp.felk.cvut.cz/cmp/software/stprtool/manual/kernels/preimage/list/rbfpreimg3.html
/// </remarks>
/// <param name="data">The kpca-transformed data.</param>
/// <param name="neighbors">The number of nearest neighbors to use while constructing the pre-image.</param>
public double[,] Revert(double[,] data, int neighbors)
{
    int rows = data.GetLength(0);
    int cols = data.GetLength(1);

    double[,] reversion = new double[rows, sourceCentered.GetLength(1)];

    // number of neighbors cannot exceed the number of training vectors.
    int nn = System.Math.Min(neighbors, sourceCentered.GetLength(0));

    // For each point to be reversed
    for (int p = 0; p < rows; p++)
    {
        // 1. Get the point in feature space
        double[] y = data.GetRow(p);

        // 2. Select nn nearest neighbors of the feature space
        double[,] X = sourceCentered;
        double[] d2 = new double[Result.GetLength(0)];
        int[] inx = new int[Result.GetLength(0)];

        // 2.1 Calculate distances
        for (int i = 0; i < X.GetLength(0); i++)
        {
            inx[i] = i;
            d2[i] = kernel.Distance(y, Result.GetRow(i).Submatrix(y.Length));
        }

        // 2.2 Order them
        Array.Sort(d2, inx);

        // 2.3 Select nn neighbors
        inx = inx.Submatrix(nn);
        X = X.Submatrix(inx).Transpose(); // X is in input space
        d2 = d2.Submatrix(nn);       // distances in input space

        // 3. Create centering matrix
        // H = eye(nn, nn) - 1 / nn * ones(nn, nn);
        double[,] H = Matrix.Identity(nn).Subtract(Matrix.Create(nn, 1.0 / nn));

        // 4. Perform SVD
        //    [U,L,V] = svd(X*H);
        SingularValueDecomposition svd = new SingularValueDecomposition(X.Multiply(H));
        double[,] U = svd.LeftSingularVectors;
        double[,] L = Matrix.Diagonal(nn, svd.Diagonal);
        double[,] V = svd.RightSingularVectors;

        // 5. Compute projections
        //    Z = L*V';
        double[,] Z = L.Multiply(V.Transpose());

        // 6. Calculate distances
        //    d02 = sum(Z.^2)';
        double[] d02 = Matrix.Sum(Matrix.DotPow(Z,2));

        // 7. Get the pre-image using z = -0.5*inv(Z')*(d2-d02)
        double[,] inv = Matrix.PseudoInverse(Z.Transpose());
        double[] z = (-0.5).Multiply(inv).Multiply(d2.Subtract(d02));

        // 8. Project the pre-image on the original basis
        //    using x = U*z + sum(X,2)/nn;
        double[] x = (U.Multiply(z)).Add(Matrix.Sum(X.Transpose()).Multiply(1.0 / nn));

        // 9. Store the computed pre-image.
        for (int i = 0; i < reversion.GetLength(1); i++)
            reversion[p, i] = x[i];
    }

    // if the data has been standardized or centered,
    //  we need to revert those operations as well
    if (this.Method == AnalysisMethod.Correlation)
    {
        // multiply by standard deviation and add the mean
        for (int i = 0; i < reversion.GetLength(0); i++)
            for (int j = 0; j < reversion.GetLength(1); j++)
                reversion[i, j] = (reversion[i, j] * StandardDeviations[j]) + Means[j];
    }
    else
    {
        // only add the mean
        for (int i = 0; i < reversion.GetLength(0); i++)
            for (int j = 0; j < reversion.GetLength(1); j++)
                reversion[i, j] = reversion[i, j] + Means[j];
    }

    return reversion;
}

Using the Code

As with PCA, code usage is simple. First, choose a kernel implementing the IKernel interface, then pass it to the KernelPrincipalComponentAnalysis constructor together with the data points.

    // Create a Gaussian kernel with sigma = 10.0
    Gaussian kernel = new Gaussian(10.0);

    // Creates a new Kernel Principal Component Analysis object
    var kpca = new KernelPrincipalComponentAnalysis(data, kernel);

Then, simply call Compute() to perform the analysis. Once done, data about the analysis will be available through the object properties and transformations on additional data can be computed by calling Transform().

    // Compute the Kernel Principal Component Analysis
    kpca.Compute();

    // Perform the transformation of the data using two components
    double[,] result = kpca.Transform(data, 2);

Sample Application

To demonstrate the use of KPCA, I created a simple Windows Forms Application which performs simple statistical analysis and KPCA transformations. It can also perform projection reversion by solving the approximate pre-image problem.

On left: Scholkopf Toy Example. Middle: Kernel principal components detected by the KPCA. Right: The two first dimensions obtained by the two first Kernel principal components.

Left: Wikipedia example. Middle: Projection using a non-centered Gaussian kernel. Right: Reconstruction of the original data using all Kernel principal components, showing some denoising ability.

Linear PCA as a Special Case

Additionally, we may check that linear PCA is indeed a special case of kernel PCA by using a linear kernel on the example by Lindsay Smith on his A Tutorial On Principal Component Analysis.

References

BAKIR, Gokhan. Extensions to Kernel Dependency Estimation. Doctorate Thesis.
Available in: http://opus.kobv.de/tuberlin/volltexte/2006/1256/pdf/bakir_goekhan.pdf

FASEL, Ian. Scholkopf, Smola and Muller: Kernel PCA.
Available in: http://cseweb.ucsd.edu/classes/fa01/cse291/kernelPCA_article.pdf

HOFFMANN, Heiko. Unsupervised Learning of Visuomotor Associations. Dissertation.
Available in: http://www.heikohoffmann.de/htmlthesis/hoffmann_diss.html.

KWOK, James; TSANG, Ivor. The Pre-Image Problem in Kernel Methods. Proceedings of the Twentieth International Conference on Machine Learning (ICML-2003), Washington DC, 2003.
Available in: http://www.hpl.hp.com/conferences/icml2003/papers/345.pdf

KWOK, James; TSANG, Ivor. The Pre-Image Problem in Kernel Methods. Proceedings of the Twentieth International Conference on Machine Learning (ICML-2003), Washington DC, 2003.
Available in: http://www.cse.ust.hk/~jamesk/papers/icml03_slides.pdf

MIKA, Sebastian; SCHOLKOPF, Benhard; et al. Kernel PCA and De-Noising in Feature Spaces.
Available in: http://www.cs.dartmouth.edu/farid/teaching/cs88/nips99.pdf

WIKIPEDIA; Kernel Principal Component Analysis.
Available in: http://en.wikipedia.org/wiki/Kernel_principal_component_analysis

SMITH, Lindsay. A Tutorial on Principal Component Analysis. 2002.
Available in: http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf

See Also