Another, and (this depends on who you ask, and also on the application context) superior (ahem), way to do it is using Hidden Conditional Random Fields. Despite the fancy name, HCRFs can be seen as generalizations of the hidden Markov model classifier. They can be used for classification in the same fashion, and are able to learn from instances in almost the same way. They are also applicable to much more general structures and are not constrained to chained observations. But in this article we will be focusing on linear-chain HCRFs, which are almost equivalent to the previously seen hidden Markov models.
I have an example for you.
The example I gave above is an extreme example of generative learning used for discrimination. In the real world it is not as extreme as I made it to be, but it serves well to point out some difficulties in executing it properly. In a generative learning setting, we create one model for each class we are trying to distinguish. But we do not tell the models they will be used to distinguish things. They are not aware they are going to be used for classification.
It is interesting to note that this comes into direct conflict with a key learning principle, famously advocated by Vapnik: when trying to solve a given problem, one should avoid solving a more general problem as an intermediate step.
Please note, however, that I am not saying generative learning should always be avoided. It will always depend on the application and the problem we are trying to solve. Learning a simple discriminative rule is not always what we want, and in those cases, we may be willing to learn quite general descriptions of our data before blindly trying to solve a classification problem. As with everything in life, any extreme should likely be avoided. An interesting combination of both generative and discriminative learning, which has lately yielded quite surprising results, has come to be known as the deep learning paradigm. But I will leave this topic for a future article :-P
The (linear-chain) Conditional Random Field is the discriminative counterpart of the Markov model. While I did not discuss (visible) Markov models in the previous article, they are not much different in nature. An observable Markov model assumes the sequence of states y to be visible, rather than hidden. Thus it can be used on a different set of problems than the hidden Markov models.
Those models are often used for sequence component labeling, also known as part-of-sequence tagging. After a model has been trained, it is mostly used to tag the parts of new sequences using the Viterbi algorithm. This is very handy for, for example, labeling the parts of a speech utterance, such as classifying the phonemes inside an audio signal.
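To make the tagging step more concrete, here is a minimal, self-contained sketch of Viterbi decoding over a small chain. It is not code from the framework; the probability matrices and the observation sequence are made-up values used only for illustration.

using System;

class ViterbiSketch
{
    // Returns the most likely state path for a discrete observation sequence,
    // given initial, transition (A) and emission (B) probabilities.
    static int[] Viterbi(double[] initial, double[,] A, double[,] B, int[] x)
    {
        int states = initial.Length, T = x.Length;
        var delta = new double[T, states]; // best log-score ending in each state
        var psi = new int[T, states];      // back-pointer to the best predecessor

        for (int j = 0; j < states; j++)
            delta[0, j] = Math.Log(initial[j]) + Math.Log(B[j, x[0]]);

        for (int t = 1; t < T; t++)
            for (int j = 0; j < states; j++)
            {
                double best = double.NegativeInfinity; int arg = 0;
                for (int i = 0; i < states; i++)
                {
                    double s = delta[t - 1, i] + Math.Log(A[i, j]);
                    if (s > best) { best = s; arg = i; }
                }
                delta[t, j] = best + Math.Log(B[j, x[t]]);
                psi[t, j] = arg;
            }

        // Backtrack from the most likely final state
        var path = new int[T];
        for (int j = 1; j < states; j++)
            if (delta[T - 1, j] > delta[T - 1, path[T - 1]]) path[T - 1] = j;
        for (int t = T - 2; t >= 0; t--)
            path[t] = psi[t + 1, path[t + 1]];
        return path;
    }

    static void Main()
    {
        // Two hidden states, two observable symbols; all numbers are made up
        double[] pi = { 0.6, 0.4 };
        double[,] A = { { 0.7, 0.3 }, { 0.4, 0.6 } };
        double[,] B = { { 0.9, 0.1 }, { 0.2, 0.8 } };

        int[] tags = Viterbi(pi, A, B, new[] { 0, 0, 1, 1, 0 });
        Console.WriteLine(string.Join(", ", tags)); // the inferred label for each position
    }
}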
A general formulation for a Conditional Random Field can be stated using the horrifying-at-first formulae:

    p(y | x) = (1 / Z(x)) · Π_A Ψ_A(y_A, x_A)

where each Ψ_A is a non-negative potential function defined over a clique A of the graph, typically written in exponential form as

    Ψ_A(y_A, x_A) = exp( Σ_k λ_k · f_k(y_A, x_A) )

and Z(x) is the partition function that normalizes the product by summing over every possible configuration of the states:

    Z(x) = Σ_y Π_A Ψ_A(y_A, x_A)
Now, the conditional random field has a particularly interesting form. While it may seem complicated at first, its linear-chain, single maximum-clique formulation is essentially identical to a hidden Markov model; it has just been buried under some mathematical transformations. Let's start by writing the HMM formula, and then I will gradually walk towards a linear-chain CRF formulation so we can see this more clearly. So, let's begin:

    p(y, x) = Π_t p(y_t | y_{t-1}) · p(x_t | y_t)

(for convenience, the initial state distribution has been folded into p(y_1 | y_0) by assuming a fixed dummy starting state y_0).
Now let's write the emission and transition probabilities in terms of the matrices A and B:

    p(y, x) = Π_t A_{y_{t-1}, y_t} · B_{y_t, x_t}
Now we can apply an exponential-logarithm identity. Note that nothing has changed:

    p(y, x) = exp( log Π_t A_{y_{t-1}, y_t} · B_{y_t, x_t} )
Now we use the fact that the logarithm of a product is the sum of the logarithms:

    p(y, x) = exp( Σ_t log A_{y_{t-1}, y_t} + Σ_t log B_{y_t, x_t} )
We proceed simplifying (or perhaps complicating even further) the expression, rewriting the matrix lookups as sums over all possible state and symbol indices (here 1{·} equals 1 when its argument is true and 0 otherwise):

    p(y, x) = exp( Σ_t Σ_{i,j} log A_{i,j} · 1{y_{t-1} = i} · 1{y_t = j}  +  Σ_t Σ_{j,o} log B_{j,o} · 1{y_t = j} · 1{x_t = o} )
Note that the two parts of the formula have an extremely similar form. How could we remove this duplication? We can think of this as if we were refactoring the formula to reduce duplicated code. Remember that before, we had two matrices of parameters to specify our model: the transition matrix A and the emission matrix B.
If we "linearize" those matrices into a single parameter vector, we will only have one structure to maintain:

But now, how could we access individual elements of this vector as if it were the original matrices? Well, we may do so by using indicator functions. Those are feature functions that activate only when their inputs are equal to some pre-defined values. Let's consider two classes of functions, one that activates for elements of the transition matrix and one that activates for elements of the emission matrix:

    f^A_{i,j}(y_{t-1}, y_t, x_t) = 1{y_{t-1} = i} · 1{y_t = j}
    f^B_{j,o}(y_{t-1}, y_t, x_t) = 1{y_t = j} · 1{x_t = o}
Next, we instantiate a member from those functions for each corresponding parameter in our parameter vector:

    λ = ( λ_1, λ_2, …, λ_K )        f = ( f_1, f_2, …, f_K )

so that each pair (λ_k, f_k) stands for one entry of the original matrices: λ_k holds the value log A_{i,j} or log B_{j,o}, and f_k is the indicator function for that same entry.
And now, when things seem to have gotten even hairier than before....

    p(y, x) = exp( Σ_t Σ_{i,j} λ^A_{i,j} · f^A_{i,j}(y_{t-1}, y_t, x_t)  +  Σ_t Σ_{j,o} λ^B_{j,o} · f^B_{j,o}(y_{t-1}, y_t, x_t) )
.... we suddenly realize the structures are so similar that we can process them in the same way, as if we had created a common "interface" for accessing the matrix values:

    p(y, x) = exp( Σ_t Σ_{k=1..K} λ_k · f_k(y_{t-1}, y_t, x_t) )
Up to this point, nothing has changed in the formula: this form is still completely equivalent to an observable Markov model. With this model, we may be interested in detecting, for example, the most likely sequence of states y given an observation sequence x. But because the parameter values are no longer constrained to form probabilities, we need to normalize by summing over all possible state sequences y, which requires some extra effort. And this is where the partition function Z(x) comes in:

    p(y | x) = exp( Σ_t Σ_k λ_k · f_k(y_{t-1}, y_t, x_t) ) / Z(x)

    Z(x) = Σ_{y'} exp( Σ_t Σ_k λ_k · f_k(y'_{t-1}, y'_t, x_t) )
Finally we have a model completely equivalent to an observable Markov model, but in a more general form, with a single parameter vector whose entries do not need to be regarded as probabilities. We can see that every Markov model is a CRF, but not every CRF is a Markov model, as the CRF's parameters are not necessarily probabilities.
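To make the refactoring analogy more concrete, here is a toy, self-contained sketch, not framework code and with all numbers made up, that flattens the log-transition and log-emission matrices of a small Markov model into a single weight vector with matching indicator features, scores a state path, and computes the partition function Z(x) by brute force:

using System;
using System.Collections.Generic;

class LinearizationSketch
{
    static void Main()
    {
        const int states = 2, symbols = 3;

        // Made-up log-transition (A) and log-emission (B) matrices
        double[,] logA = { { Math.Log(0.7), Math.Log(0.3) },
                           { Math.Log(0.4), Math.Log(0.6) } };
        double[,] logB = { { Math.Log(0.5), Math.Log(0.4), Math.Log(0.1) },
                           { Math.Log(0.1), Math.Log(0.2), Math.Log(0.7) } };

        // "Linearize" both matrices into a single weight vector, pairing each weight
        // with an indicator feature over (previous state, current state, observation)
        var weights = new List<double>();
        var features = new List<Func<int, int, int, double>>();

        for (int i = 0; i < states; i++)
            for (int j = 0; j < states; j++)
            {
                int pi = i, pj = j; // capture copies of the loop variables
                weights.Add(logA[i, j]);
                features.Add((prev, curr, obs) => prev == pi && curr == pj ? 1 : 0);
            }

        for (int j = 0; j < states; j++)
            for (int o = 0; o < symbols; o++)
            {
                int pj = j, po = o;
                weights.Add(logB[j, o]);
                features.Add((prev, curr, obs) => curr == pj && obs == po ? 1 : 0);
            }

        int[] x = { 0, 2, 1 }; // a made-up observation sequence

        // Unnormalized score of a state path (the initial-state term is ignored for brevity)
        double Score(int[] y)
        {
            double sum = 0;
            for (int t = 1; t < x.Length; t++)
                for (int k = 0; k < weights.Count; k++)
                    sum += weights[k] * features[k](y[t - 1], y[t], x[t]);
            return Math.Exp(sum);
        }

        // Partition function Z(x): brute-force sum over every possible state path
        double z = 0;
        for (int a = 0; a < states; a++)
            for (int b = 0; b < states; b++)
                for (int c = 0; c < states; c++)
                    z += Score(new[] { a, b, c });

        // Conditional probability of one particular path given the observations
        Console.WriteLine(Score(new[] { 0, 1, 1 }) / z);
    }
}

In practice, of course, Z(x) is never computed by enumerating every path; for linear chains it is obtained with dynamic programming, just as in the forward algorithm for HMMs.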
Now suppose we would like to devise a CRF model for classification. In order to do so, the first logical step would be to incorporate class variables in the CRF formulation. We could do it, for example, by expanding the aforementioned model, the potential functions and the partition function to accommodate a new input, the class label w:

    p(w, y | x) = exp( Σ_t Σ_k λ_k · f_k(y_{t-1}, y_t, x_t, w) ) / Z(x)

    Z(x) = Σ_{w'} Σ_{y'} exp( Σ_t Σ_k λ_k · f_k(y'_{t-1}, y'_t, x_t, w') )
Now, in order to perform sequence classification as we were doing with the Hidden Markov Models, we can again set the sequence of states y to be hidden. By doing this, we assume those variables are latent, and we achieve a model called the Latent-variable Conditional Random Field, or simply a Hidden Conditional Random Field.
Again, we can marginalize the joint probability over y by summing over all its possible configurations, obtaining p(w | x) = Σ_y p(w, y | x). In general this sum ranges over exponentially many state sequences; however, when the graphical model obeys a known structure, such as in the case of a linear-chain HCRF, we can use the same techniques shown in the previous article for HMMs to compute those probabilities efficiently. Besides, since linear-chain HCRFs and HMM classifiers share the same structure and parameters, we can always use the parameters of an HMM to initialize an HCRF. The reverse is not always possible, since learned HCRFs are not required to form probabilities as HMMs are.
Let's take a look at the class diagrams for a linear-chain Hidden Conditional Random Field. The diagram below shows how an HCRF is associated with a single potential function.
However, this potential function can be (and typically is) divided into several potential factors, each of them defined over a maximum clique of the graph. In the linear-chain case, those factors are analogous to the individual hidden Markov models associated with each class label in an HMM classifier.

The potential function has a set of weights and features, with each element in the Weights array corresponding to an element in the Features array. Those arrays are also segmented, and each segment is associated with a single factor potential.
The overall model visualization is shown below.

Note, however, that this exact structure is not always followed by the framework. For performance reasons, the framework provides specialized classes for evaluating some factors, such as factors modeled after Markov models.

In the last article about HMMs, we saw some examples of sequence classification. Let's start by reproducing the same examples, but now using HCRFs. In order to reproduce the functionality of a discrete hidden Markov classifier, we can use the <a href="http://accord-framework.net/docs/html/T_Accord_Statistics_Models_Fields_Functions_MarkovDiscreteFunction.htm">MarkovDiscreteFunction</a> as the potential function for our HCRF, as shown in the example below.
// Observation sequences drawn from three different classes
int[][] inputSequences =
{
    // Class 0: sequences that start and end with zeros
    new[] { 0, 1, 1, 1, 0 },
    new[] { 0, 0, 1, 1, 0, 0 },
    new[] { 0, 1, 1, 1, 1, 0 },

    // Class 1: sequences that start with twos and end with ones
    new[] { 2, 2, 2, 2, 1, 1, 1, 1, 1 },
    new[] { 2, 2, 1, 2, 1, 1, 1, 1, 1 },
    new[] { 2, 2, 2, 2, 2, 1, 1, 1, 1 },

    // Class 2: sequences that end with threes
    new[] { 0, 0, 1, 1, 3, 3, 3, 3 },
    new[] { 0, 0, 0, 3, 3, 3, 3 },
    new[] { 1, 0, 1, 2, 2, 2, 3, 3 },
    new[] { 1, 1, 2, 3, 3, 3, 3 },
    new[] { 0, 0, 1, 1, 3, 3, 3, 3 },
    new[] { 2, 2, 0, 3, 3, 3, 3 },
    new[] { 1, 0, 1, 2, 3, 3, 3, 3 },
    new[] { 1, 1, 2, 3, 3, 3, 3 },
};

// The class label associated with each of the sequences above
int[] outputLabels =
{
    0, 0, 0,
    1, 1, 1,
    2, 2, 2, 2, 2, 2, 2, 2
};

// Create a potential function equivalent to a discrete hidden Markov
// model classifier with 3 states, 4 symbols and 3 output classes
var function = new MarkovDiscreteFunction(states: 3, symbols: 4, outputClasses: 3);

// Create the Hidden Conditional Random Field using the potential function
var classifier = new HiddenConditionalRandomField<int>(function);

// Create a learning algorithm based on Resilient Backpropagation
var teacher = new HiddenResilientGradientLearning<int>(classifier)
{
    Iterations = 50
};

// Run the learning algorithm on the labeled sequences
teacher.Run(inputSequences, outputLabels);

// After training, we can compute the most likely class of new sequences
int y1 = classifier.Compute(new[] { 0, 1, 1, 1, 0 });    // expected: 0
int y2 = classifier.Compute(new[] { 0, 0, 1, 1, 0, 0 }); // expected: 0
int y3 = classifier.Compute(new[] { 2, 2, 2, 2, 1, 1 }); // expected: 1
int y4 = classifier.Compute(new[] { 2, 2, 1, 1 });       // expected: 1
int y5 = classifier.Compute(new[] { 0, 0, 1, 3, 3, 3 }); // expected: 2
int y6 = classifier.Compute(new[] { 2, 0, 2, 2, 3, 3 }); // expected: 2
Note that in the above example, we have used the Resilient Backpropagation (RProp) method to learn our model. However, it would also be possible to use standard Stochastic Gradient Descent or other optimization algorithms. Resilient Backpropagation, however, has been found to be one of the best algorithms for this task.
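If you would like to experiment with a plain gradient-based teacher instead, a sketch is shown below. I am assuming here that the framework's HiddenGradientDescentLearning class follows the same pattern as the resilient learner above (a LearningRate property, Iterations and Run), so please check the documentation of your framework version for the exact members.

// Hypothetical alternative: stochastic gradient descent instead of RProp.
// Class and property names are assumed to mirror the resilient learner above.
var sgd = new HiddenGradientDescentLearning<int>(classifier)
{
    LearningRate = 0.01,
    Iterations = 100
};

sgd.Run(inputSequences, outputLabels);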
As the HCRF is a generalization of the HMM classifier, we can use discrete, continuous, or even mixed feature functions in our model. This allows the HCRF to replicate any Markov model, even continuous-density ones. In this section, we will see how we can use the HCRF to perform mouse gesture recognition, as we did in our last article.
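As a starting point, the sketch below shows one way such a continuous-density (Gaussian) hidden Markov classifier could be learned from the recorded gestures before it is used to initialize the field. This is an illustrative sketch rather than code from the sample application: the variables gestureSequences, gestureLabels and numberOfGestures are hypothetical placeholders, and the class names follow the Accord.NET API as I recall it, so check the framework documentation for the exact signatures.

// Illustrative only: learn a Gaussian HMM classifier from the recorded gestures.
// gestureSequences (double[][][]), gestureLabels (int[]) and numberOfGestures
// are hypothetical placeholders standing for the data gathered by the application.
var density = new MultivariateNormalDistribution(2); // two dimensions: the (X, Y) coordinates

// One forward-topology HMM per gesture class
var hmm = new HiddenMarkovClassifier<MultivariateNormalDistribution>(
    numberOfGestures, new Forward(5), density);

// Learn each class model with Baum-Welch
var hmmTeacher = new HiddenMarkovClassifierLearning<MultivariateNormalDistribution>(hmm,
    i => new BaumWelchLearning<MultivariateNormalDistribution>(hmm.Models[i])
    {
        Tolerance = 0.001,
        Iterations = 0, // zero means: iterate until the tolerance is reached

        // regularization helps avoid degenerate covariance matrices
        FittingOptions = new NormalOptions() { Regularization = 1e-5 }
    });

double hmmError = hmmTeacher.Run(gestureSequences, gestureLabels);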

After a hidden Markov model has been learned, we can use its probabilities as an initial guess for initializing our HCRF. Note that since we are initializing the field with a continuous-density hidden Markov model, all our feature functions will operate directly on the observed sequence points, here represented as double[] arrays of size two, in which the first element is the X coordinate of the point and the second element is the Y coordinate.
hcrf = new HiddenConditionalRandomField<double[]>(new MarkovMultivariateFunction(hmm));
After the model has been created, we can create a learning algorithm to train the classifier.
var teacher = new HiddenResilientGradientLearning<double[]>(hcrf)
{
    Iterations = iterations,
    Tolerance = tolerance
};
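Training and classification then follow the same Run/Compute pattern used in the discrete example earlier. The call below uses the hypothetical placeholders from the sketch above, and the points of the new gesture are made-up values shown only to illustrate the call.

// Train the field on the recorded gestures (placeholder variables from the sketch above)
teacher.Run(gestureSequences, gestureLabels);

// Classify a newly drawn gesture, given as a sequence of (X, Y) points
double[][] newGesture =
{
    new[] { 0.1, 0.9 }, new[] { 0.3, 0.7 }, new[] { 0.5, 0.5 }, new[] { 0.8, 0.2 }
};

int gestureClass = hcrf.Compute(newGesture);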
And this is all that is needed to create our recognition models. Now, if you wish, please download the sample application available at the top of this article so we can get started. The application performs gesture recognition using the mouse. In fact, you could use it for other things as well, such as signature recognition. It is just a sample application showing how to use hidden Markov models and hidden conditional random fields. The application is shown below.

This is the main window of the application.
After the application has been opened, you will be presented with a drawing canvas. You can click on the canvas and start drawing any gesture you would like. Gestures start red, then gradually switch to blue. This gives information about the time orientation of the gesture. After you have finished drawing your gesture, tell the application what you have drawn by writing in the box right after the "What's this" text mark.

It has a canvas you can draw on, and a box to specify what you have drawn.
After you have registered a few gestures, you are ready to learn a classifier. In order to do so, click the button "Learn a Hidden Markov Model classifier". The classifier will learn (using default learning settings) and will immediately attempt to classify the set of gestures you created. Gestures marked in green have been successfully recognized by the classifier.

Learning and recognizing new gestures. Try and see how it performs.
Now, if you start drawing another gesture, you will notice that once you release the mouse button the application will attempt to classify your gesture and will also ask if the answer was correct. You can try a few times and reinforce learning by adding more examples of gestures it didn't classify correctly. Click the "Learn a Hidden Markov Model Classifier" button to learn the gestures again.

Improving recognition using HCRFs.
Now, you may have noticed that eventually some gestures just can't be recognized properly. In order to improve the recognition rates, you can click the "Create a Hidden Conditional Random Field" button to use the already created HMM to initialize and learn a new HCRF model.
In this article we presented what a Hidden Conditional Random Field is and what it can do. After a brief theoretical introduction, the article showed how to create, learn and use HCRFs created by the Accord.NET Framework. The sample application shown in the article is part of the Accord.NET Framework Sample Applications, and demonstrates how to create and initialize HCRFs using continuous-density (Gaussian) HMMs to recognize gestures drawn on the screen with the mouse.
Now that you are an expert on sequence classification using Markov models and conditional
random fields, feel free to create your own useful applications with the Accord.NET Framework!
History
- March 11th, 2013 - Article submission.
- December 9th, 2014 - Fixing all links that were broken after the project webpage moved.
References
- V. Vapnik, Statistical Learning Theory. Wiley-Interscience, September 1998.
- C. M. Bishop, Pattern Recognition and Machine Learning. Springer, 2006.
- Charles Sutton and Andrew McCallum (2012). "An Introduction to Conditional Random Fields", Foundations and Trends® in Machine Learning, Vol. 4, No. 4, pp. 267-373.
- Phone images from Wikipedia's page on "Mobile Phone" and chair images from Wikimedia Commons.
See also