In this article, you will learn what a Hidden Conditional Random Field is and what it can do. After a brief theory section, the article will show how to create, learn, and use HCRFs with the Accord.NET Framework.

## Contents

- Introduction
- Conditional Random Fields
- Fields with Hidden Variables
- Sequence Classification
- Using the Code
- Conclusion
- History
- References
- See Also

In the previous article, we saw how to approach the sequence classification problem using hidden Markov models. This approach worked and seemed to perform fine. However, what else could be done? How else could we have solved this classification task?

Another and (this depends on who you ask, and also on the application context) superior (ahem) way to do it is using **Hidden Conditional Random Fields**. Despite the fancy name, HCRFs can be seen as generalizations of the hidden Markov model classifier. They can be used for classification in the same fashion, and are able to learn from instances in almost the same way. They are also applicable to much more general structures and are not constrained to chained observations. In this article, however, we will be focusing on linear-chain HCRFs, which are almost equivalent to the previously seen hidden Markov models.

This article aims to introduce the reader to the inner workings of the Accord.NET Machine Learning and Statistics Framework: show where the Hidden Conditional Random Fields are located within the framework, walk through the HCRF models' and learning algorithms' source code, and explain how the Fields namespace is organized and the general ideas behind this organization.

### Generative Learning and Discriminative Learning

Remember that to create a hidden Markov model classifier for `n` classes, we first had to learn `n` models, training each model to recognize one class of sequences separately. The models did not know they were going to be used to solve a classification problem. Thus, all they could do was learn **everything they could**, hoping it would be useful for whatever you might need. Too bad that learning everything is a far more general problem than learning only what is actually needed for classification.

I have an example for you.

Suppose I give you a number of **cellphones**, perhaps about three. I ask you to learn everything you can about them. I will not be telling you what I am going to do with the information, but I'll be relying on you to tell me whether any new object in the world is similar to the ones I gave you. What could you do? You could annotate common details about the architectures of the phones, about how many cores they have, which apps they support, their memory capacities, their predominant color, the case material, pixel densities... There are so many things to learn! But I am sure you will handle it well.

Then, I ask another person - your friend - to learn about some other objects. Note that I won't be telling him what I gave you, nor will I tell you what I am about to give him. I will give your friend **five chairs**. Yes, chairs, the ones we sit on. And I will ask him to learn everything he can about those chairs.

And now, I finally tell you and your friend that I will be counting on both of you to help me *differentiate between cellphones and chairs*. It seems absurd, but this is what we were doing before. Each person put immense effort into learning everything they could about their own set of objects, without knowing the goal of doing so. If they had been told beforehand what we were going to do, choosing a discriminating rule to differentiate between a phone and a chair would have been quite trivial. All we would need to learn, for example, is that one of them is bigger.

The example I gave above is an extreme example of generative learning used for discrimination. In the real world, it is not as extreme as I made it out to be, but it serves well to point out some difficulties in executing it properly. In a generative learning setting, we create one model for each class we are trying to distinguish. But *we do not tell the models they will be used to distinguish things*. They are not aware they are going to be used for classification.

It is interesting to note that this comes in direct conflict with a key learning principle:

*When solving a problem of interest, do not solve a more general problem as an intermediate step. There may be enough information to solve the original problem directly, but not enough information to solve the more general intermediate problem*. (Vapnik, 1998)

This is one of the main ideas behind the Structural Risk Minimization principle. The SRM was created by Vladimir Vapnik, who practically started the field of Statistical Learning Theory. One of his most fruitful creations originating from a proper application of this principle was the famous Support Vector Machine (although there is more to the SRM than simply the principle stated above). And while the stated principle may seem obvious, solving a bigger intermediate problem is exactly what we do when we apply Bayes' theorem to invert probabilities and estimate *a posteriori* probabilities for a class label in a classification problem.

Please note, however, that I am not saying generative learning should always be avoided. It will always depend on the application and the problem we are trying to solve. Learning a simple discriminative rule is not always what we want, and in those cases, we may wish to learn quite general descriptions of our data before blindly trying to solve a classification problem. As with everything in life, extremes should likely be avoided. An interesting combination of both generative and discriminative learning, which has yielded quite surprising results lately, is the *deep learning paradigm*. But I will leave this topic for a future article. :-P

The (linear-chain) Conditional Random Field is the discriminative counterpart of the Markov model. While I did not discuss (visible) Markov models in the previous article, they are not much different in nature. An observable Markov model assumes the sequence of states **y** to be visible rather than hidden. Thus, they can be used on a different set of problems than hidden Markov models.

Those models are often used for sequence component labeling, also known as part-of-sequence tagging. After a model has been trained, it is mostly used to tag parts of a sequence using the Viterbi algorithm. This is very handy for performing, for example, classification of parts of a speech utterance, such as classifying the phonemes within an audio signal.
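To make the tagging step concrete, here is a minimal Viterbi sketch. It is written in Python rather than C# to stay self-contained, the transition and emission matrices are made-up toy values, and it is not the Accord.NET API — just the underlying dynamic program:

```python
import numpy as np

def viterbi(log_A, log_B, log_pi, x):
    """Most likely state sequence for observations x under a Markov
    model given in log-space (log transitions, emissions, initials)."""
    S, T = log_A.shape[0], len(x)
    delta = np.zeros((T, S))            # best log-score of a path ending in each state
    back = np.zeros((T, S), dtype=int)  # best predecessor for each state at each time
    delta[0] = log_pi + log_B[:, x[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A  # scores[i, j]: come from i, land on j
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_B[:, x[t]]
    # backtrack from the best final state
    y = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        y.append(int(back[t][y[-1]]))
    return y[::-1]

# toy 2-state, 2-symbol model: state 0 mostly emits symbol 0, state 1 symbol 1
log_A = np.log(np.array([[0.9, 0.1], [0.1, 0.9]]))
log_B = np.log(np.array([[0.9, 0.1], [0.1, 0.9]]))
log_pi = np.log(np.array([0.5, 0.5]))
print(viterbi(log_A, log_B, log_pi, [0, 0, 1, 1]))  # [0, 0, 1, 1]
```

The recursion keeps, for each state, only the best-scoring path ending there, so the search over all `S^T` candidate sequences collapses to `O(T * S^2)` work.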

A general formulation for a Conditional Random Field can be stated using the horrifying-at-first formula:

$$ p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \prod_{a=1}^{A} \Psi_a(\mathbf{y}_a, \mathbf{x}_a) $$

where the $\Psi_a$ are potential functions, each defined over a clique of the underlying graph, and

$$ Z(\mathbf{x}) = \sum_{\mathbf{y}} \prod_{a=1}^{A} \Psi_a(\mathbf{y}_a, \mathbf{x}_a) $$

is the partition function that normalizes the distribution.

Now, the conditional random field has a particularly interesting form. While it may seem complicated at first, its linear-chain, single-maximum-clique formulation is **identical** to a hidden Markov model; it has just been buried under some mathematical transformations. Let's start by writing the HMM formula, and then I will gradually walk towards a linear-chain CRF formulation so we can see this more clearly. So, let's begin with the joint probability of an observation sequence **x** and a state sequence **y** under an HMM (taking $p(y_1 \mid y_0)$ to be the initial state probability):

$$ p(\mathbf{x}, \mathbf{y}) = \prod_{t=1}^{T} p(y_t \mid y_{t-1})\; p(x_t \mid y_t) $$

Now let's write the emission and transition probabilities in terms of the matrices A and B:

$$ p(\mathbf{x}, \mathbf{y}) = \prod_{t=1}^{T} A_{y_{t-1},\, y_t}\; B_{y_t,\, x_t} $$

Now we can apply an exponential-logarithm identity, $z = \exp(\log z)$. Note that nothing has changed:

$$ p(\mathbf{x}, \mathbf{y}) = \prod_{t=1}^{T} \exp\!\left(\log A_{y_{t-1},\, y_t}\right) \exp\!\left(\log B_{y_t,\, x_t}\right) $$

Now, since a product of exponentials is the exponential of a sum (the logarithm of a product is the sum of the logarithms):

$$ p(\mathbf{x}, \mathbf{y}) = \exp\!\left( \sum_{t=1}^{T} \log A_{y_{t-1},\, y_t} + \sum_{t=1}^{T} \log B_{y_t,\, x_t} \right) $$

We proceed simplifying (or perhaps complicating even further) the expression, renaming the log-probabilities as unconstrained parameters $\theta_{ij} = \log A_{ij}$ and $\mu_{io} = \log B_{io}$:

$$ p(\mathbf{x}, \mathbf{y}) = \exp\!\left( \sum_{t=1}^{T} \theta_{y_{t-1},\, y_t} + \sum_{t=1}^{T} \mu_{y_t,\, x_t} \right) $$

Note that the two parts of the formula have an extremely similar form. How could we remove this duplication? We can think of this as if we were refactoring the formula to reduce duplicated code. Remember that before, we had two matrices of parameters specifying our model:

$$ \boldsymbol{\theta} = \left[\theta_{ij}\right]_{S \times S}, \qquad \boldsymbol{\mu} = \left[\mu_{io}\right]_{S \times O} $$

If we "linearize" those matrices into a single parameter vector, we will have only one structure to maintain:

$$ \boldsymbol{\lambda} = \left( \theta_{11}, \ldots, \theta_{SS},\; \mu_{11}, \ldots, \mu_{SO} \right) $$

But now, how could we access individual elements of this vector as if it were the original matrices? Well, we may do so by using *indicator functions*. Those are feature functions that activate only when their inputs are equal to some pre-defined values. Let's consider two classes of functions, one that activates for elements in the transition matrix and one that activates for elements of the emission matrix:

$$ f_{ij}(y_{t-1}, y_t, x_t) = \mathbf{1}\{y_{t-1} = i\}\, \mathbf{1}\{y_t = j\}, \qquad g_{io}(y_{t-1}, y_t, x_t) = \mathbf{1}\{y_t = i\}\, \mathbf{1}\{x_t = o\} $$

Next, we instantiate a member from those function families for each corresponding parameter in our parameter vector:

$$ p(\mathbf{x}, \mathbf{y}) = \exp\!\left( \sum_{t=1}^{T} \sum_{i,j} \theta_{ij}\, f_{ij}(y_{t-1}, y_t, x_t) + \sum_{t=1}^{T} \sum_{i,o} \mu_{io}\, g_{io}(y_{t-1}, y_t, x_t) \right) $$

And now, when things seem to have gotten even hairier than before....

.... we suddenly realize the structures are so similar, we can process them in the same way, as if we had created a common "interface" for accessing the matrix values. Collecting all indicator functions into a single indexed family $f_k$ with weights $\lambda_k$, we get:

$$ p(\mathbf{x}, \mathbf{y}) = \exp\!\left( \sum_{t=1}^{T} \sum_{k} \lambda_k\, f_k(y_{t-1}, y_t, x_t) \right) $$
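To make this refactoring concrete, here is a toy Python sketch (random made-up weights, not Accord.NET code) showing that the linearized parameter vector accessed through indicator features reproduces exactly the same score as the direct matrix lookups:

```python
import numpy as np

rng = np.random.default_rng(0)
S, O = 3, 4                        # number of states and observation symbols
log_A = rng.normal(size=(S, S))    # "log-transition" parameters (unconstrained)
log_B = rng.normal(size=(S, O))    # "log-emission" parameters (unconstrained)

# linearize both matrices into a single parameter vector
lam = np.concatenate([log_A.ravel(), log_B.ravel()])

def features(y_prev, y, x):
    """Indicator feature vector for one time step (y_prev, y, x)."""
    f = np.zeros(S * S + S * O)
    f[y_prev * S + y] = 1.0          # activates exactly one transition weight
    f[S * S + y * O + x] = 1.0       # activates exactly one emission weight
    return f

y = [0, 2, 1]
x = [3, 0, 2]
# matrix form: sum of direct log A and log B lookups
score_matrix = sum(log_A[y[t-1], y[t]] + log_B[y[t], x[t]] for t in range(1, len(y)))
# linearized form: dot product of the weight vector with the indicator features
score_linear = sum(lam @ features(y[t-1], y[t], x[t]) for t in range(1, len(y)))
print("scores match:", bool(np.isclose(score_matrix, score_linear)))  # scores match: True
```

Each feature vector is all zeros except at the two positions picked out by the current transition and emission, so the dot product is just a disguised pair of matrix lookups.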

Up to now, nothing has changed in the formula. This form is still completely equivalent to an observable Markov model. With this model, we may be interested in detecting, for example, the most likely sequence of states **y** given an observation sequence **x**. But because the parameter values are no longer constrained to form probabilities, we need to normalize over all possible state sequences **y**, which requires some extra effort. And this is where the partition function `Z(x)` comes in:

$$ p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\!\left( \sum_{t=1}^{T} \sum_{k} \lambda_k\, f_k(y_{t-1}, y_t, x_t) \right), \qquad Z(\mathbf{x}) = \sum_{\mathbf{y}'} \exp\!\left( \sum_{t=1}^{T} \sum_{k} \lambda_k\, f_k(y'_{t-1}, y'_t, x_t) \right) $$

Finally, we have a model completely equivalent to an observable Markov model, but in a more general form, with a single parameter vector whose entries do not necessarily need to be regarded as probabilities. We can see that every Markov model is a CRF, but not every CRF is a Markov model, as its parameters are not necessarily probabilities.
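The sum over all state sequences inside `Z(x)` looks intractable, but for a linear chain it can be computed with the same forward recursion used for HMMs. The Python sketch below (toy random weights, not the framework's code) checks the brute-force sum against the forward recursion:

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
S, O, T = 2, 3, 4                  # states, symbols, sequence length
W_trans = rng.normal(size=(S, S))  # unconstrained weights, not probabilities
W_emit = rng.normal(size=(S, O))
x = [0, 2, 1, 1]

def score(y):
    """Unnormalized log-score of a state sequence y for observations x."""
    return W_emit[y[0], x[0]] + sum(
        W_trans[y[t-1], y[t]] + W_emit[y[t], x[t]] for t in range(1, T))

# brute force: sum exp(score) over all S**T possible state sequences
Z_brute = sum(np.exp(score(y)) for y in itertools.product(range(S), repeat=T))

# forward recursion: the same quantity in O(T * S^2) time
alpha = np.exp(W_emit[:, x[0]])
for t in range(1, T):
    alpha = np.exp(W_emit[:, x[t]]) * (alpha @ np.exp(W_trans))
Z_forward = alpha.sum()

print(bool(np.isclose(Z_brute, Z_forward)))  # True
```

The forward pass distributes the sum over sequences into per-step sums over states, which is exactly why linear-chain CRFs remain tractable even though their weights are unnormalized.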

Now suppose we would like to devise a CRF model for classification. In order to do so, the first logical step would be to incorporate class variables into the CRF formulation. We could do it, for example, by expanding the aforementioned model, the potential functions, and the partition function to accommodate a new input, the class label `w`:

$$ p(\mathbf{y}, w \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\!\left( \sum_{t=1}^{T} \sum_{k} \lambda_k\, f_k(w, y_{t-1}, y_t, x_t) \right) $$

Now, in order to perform sequence classification as we were doing with the hidden Markov models, we can again set the sequence of states **y** to be hidden. By doing this, we assume those variables are *latent*, and we achieve a model called the Latent-variable Conditional Random Field, or simply a *Hidden Conditional Random Field*:

$$ p(w \mid \mathbf{x}) = \sum_{\mathbf{y}} p(\mathbf{y}, w \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \sum_{\mathbf{y}} \exp\!\left( \sum_{t=1}^{T} \sum_{k} \lambda_k\, f_k(w, y_{t-1}, y_t, x_t) \right) $$

Again, we can marginalize the joint probability over **y** by summing over *all possible variations of* **y**, as shown above. However, when the graphical model obeys some known structure, as in the case of a linear-chain HCRF, we can use the same techniques shown in the previous article for HMMs to get those probabilities. Besides, since linear-chain HCRFs and HMM classifiers share the same structure and parameters, we can always use the parameters of an HMM to initialize an HCRF. But the reverse is not always possible, since learned HCRFs are not required to form probabilities as HMMs are.
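The marginalization can be sketched directly. In the toy Python example below (random made-up weights, not the Accord.NET API), each class label gets its own weight set, analogous to the one-HMM-per-class arrangement, and the class posterior is obtained by summing over the hidden state sequences. For brevity it uses brute-force enumeration, which is exponential in the sequence length; the framework uses the forward algorithm instead:

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
S, O, T, C = 2, 3, 4, 2            # hidden states, symbols, length, classes
Wt = rng.normal(size=(C, S, S))    # per-class transition weights
We = rng.normal(size=(C, S, O))    # per-class emission weights
x = [0, 2, 1, 0]

def score(w, y):
    """Unnormalized log-score of class w with hidden states y for x."""
    s = We[w, y[0], x[0]]
    for t in range(1, T):
        s += Wt[w, y[t-1], y[t]] + We[w, y[t], x[t]]
    return s

# p(w | x): marginalize the hidden state sequence y, then normalize over classes
unnorm = np.array([
    sum(np.exp(score(w, y)) for y in itertools.product(range(S), repeat=T))
    for w in range(C)
])
posterior = unnorm / unnorm.sum()
predicted = int(posterior.argmax())
print("posterior over classes:", posterior)
```

Because the hidden states are summed out, the classifier's answer depends only on the observed sequence and the class label, just as with the HMM classifier's likelihood comparison.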

Let's take a look at the class diagrams for a linear-chain Hidden Conditional Random Field. The diagram below shows how an HCRF is associated with a single potential function.

However, this potential function can be (and typically is) divided into several potential factors, each of them defined over a maximum clique of the graph. In the linear-chain case, those are analogous to the individual hidden Markov models associated with each class label in an HMM classifier.

The potential function has a set of weights and features, with each element in the Weight array corresponding to an element in the Features array. Those arrays are also segmented, and each segment of those arrays is associated with a single factor potential.

The overall model visualization is shown below:

Note, however, that this structure is not always obeyed by the framework. For performance purposes, the framework provides specialized classes for evaluating some factors, such as factors modeled against Markov models.

In the last article about HMMs, we saw some examples of sequence classification. Let's start by reproducing the same examples, this time using HCRFs. In order to reproduce the functionality of a discrete hidden Markov classifier, we can use the MarkovDiscreteFunction as the potential function for our HCRF, as given in the example below:

```csharp
// Observation sequences drawn from three different classes
int[][] inputSequences =
{
    // Class 0: sequences of 0s and 1s starting and ending with 0
    new[] { 0, 1, 1, 1, 0 },
    new[] { 0, 0, 1, 1, 0, 0 },
    new[] { 0, 1, 1, 1, 1, 0 },
    // Class 1: sequences of 2s followed by 1s
    new[] { 2, 2, 2, 2, 1, 1, 1, 1, 1 },
    new[] { 2, 2, 1, 2, 1, 1, 1, 1, 1 },
    new[] { 2, 2, 2, 2, 2, 1, 1, 1, 1 },
    // Class 2: sequences ending with a run of 3s
    new[] { 0, 0, 1, 1, 3, 3, 3, 3 },
    new[] { 0, 0, 0, 3, 3, 3, 3 },
    new[] { 1, 0, 1, 2, 2, 2, 3, 3 },
    new[] { 1, 1, 2, 3, 3, 3, 3 },
    new[] { 0, 0, 1, 1, 3, 3, 3, 3 },
    new[] { 2, 2, 0, 3, 3, 3, 3 },
    new[] { 1, 0, 1, 2, 3, 3, 3, 3 },
    new[] { 1, 1, 2, 3, 3, 3, 3 },
};

// The class label for each of the sequences above
int[] outputLabels =
{
    0, 0, 0,
    1, 1, 1,
    2, 2, 2, 2, 2, 2, 2, 2
};

// Create a potential function mimicking a discrete HMM classifier
var function = new MarkovDiscreteFunction(states: 3, symbols: 4, outputClasses: 3);

// Create the HCRF classifier over the potential function
var classifier = new HiddenConditionalRandomField<int>(function);

// Create a learning algorithm and train the classifier
var teacher = new HiddenResilientGradientLearning<int>(classifier)
{
    Iterations = 50
};

teacher.Run(inputSequences, outputLabels);

// Compute the classifier's answers for new sequences
int y1 = classifier.Compute(new[] { 0, 1, 1, 1, 0 });
int y2 = classifier.Compute(new[] { 0, 0, 1, 1, 0, 0 });
int y3 = classifier.Compute(new[] { 2, 2, 2, 2, 1, 1 });
int y4 = classifier.Compute(new[] { 2, 2, 1, 1 });
int y5 = classifier.Compute(new[] { 0, 0, 1, 3, 3, 3 });
int y6 = classifier.Compute(new[] { 2, 0, 2, 2, 3, 3 });
```

Note that in the above example, we have used the Resilient Backpropagation method to learn our models. However, it would also be possible to use standard Stochastic Gradient Descent or other optimization algorithms. The Resilient Backpropagation method, however, has been found to be one of the best algorithms for this task.
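The Rprop update rule itself is simple. Below is a minimal sign-based sketch in Python (not Accord.NET's implementation) applied to a toy quadratic objective; all step-size constants are the usual textbook defaults:

```python
import numpy as np

def rprop_minimize(grad, w0, steps=60, eta_plus=1.2, eta_minus=0.5,
                   delta0=0.1, delta_min=1e-6, delta_max=50.0):
    """Resilient propagation: per-weight step sizes grow while the
    gradient sign stays stable and shrink when it flips; the gradient
    magnitude itself is ignored, only its sign is used."""
    w = np.asarray(w0, dtype=float).copy()
    delta = np.full_like(w, delta0)   # one adaptive step size per weight
    g_prev = np.zeros_like(w)
    for _ in range(steps):
        g = grad(w)
        same = g * g_prev             # > 0: sign stable, < 0: sign flipped
        delta = np.where(same > 0, np.minimum(delta * eta_plus, delta_max), delta)
        delta = np.where(same < 0, np.maximum(delta * eta_minus, delta_min), delta)
        w -= np.sign(g) * delta
        g_prev = g
    return w

# sanity check on a convex toy objective f(w) = ||w - 3||^2
w_star = rprop_minimize(lambda w: 2 * (w - 3.0), np.zeros(4))
print(np.allclose(w_star, 3.0, atol=0.01))  # True
```

Because only the gradient's sign is used, Rprop is insensitive to the wildly different gradient scales that HCRF feature weights can have, which is part of why it works so well here.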

As the HCRF is a generalization of the HMM classifier, we can use discrete, continuous, or even mixed feature functions in our model. This allows the HCRF to replicate any Markov model, even continuous ones. In this section, we will see how we can use the HCRF to perform mouse gesture recognition, as we did in our last article.

After a hidden Markov model is learned, we can use its probabilities as an initial guess for initializing our HCRFs. Note that since we are initializing the field with a continuous-density hidden Markov model, all our feature functions will operate directly on the observed sequence points, here represented as `double[]` arrays of size two, in which the first element is the X coordinate of the point and the second element is the Y coordinate.

```csharp
// Initialize the HCRF using the potentials of a previously trained HMM classifier
hcrf = new HiddenConditionalRandomField<double[]>(new MarkovMultivariateFunction(hmm));
```

After the model has been created, we can create a learning algorithm to train the classifier.

```csharp
// Create the learning algorithm for the HCRF
var teacher = new HiddenResilientGradientLearning<double[]>(hcrf)
{
    Iterations = iterations,
    Tolerance = tolerance
};
```

And this is all that is needed to create our recognition models. Now, if you wish, please download the sample application available at the top of this article so we can get started. The application performs gesture recognition using the mouse. In fact, you could use it for other things as well, such as signature recognition. It is just a sample application showing how to use hidden Markov models. The application is shown below:

This is the main window of the application.

After the application has been opened, you will be greeted with a drawing canvas. Click on the canvas and start drawing any gesture you would like. Gestures start red, then gradually switch to blue. This gives information about the time-orientation of the gesture. After you have finished drawing your gesture, tell the application what you have drawn by writing in the box right after the "**What's this**" text mark.

It has a canvas you can draw on, and a box to specify what you had drawn.

After you have registered a few gestures, you are ready to learn a classifier. In order to do so, click the "**Learn a Hidden Markov Model classifier**" button. The classifier will learn (using default learning settings) and will immediately attempt to classify the set of gestures you created. Gestures marked in green have been successfully recognized by the classifier.

Learning and recognizing new gestures. Try and see how it performs.

Now, if you start drawing another gesture, you will notice that once you release the mouse button, the application will attempt to classify your gesture and will also ask if the answer was correct. You can try a few times and reinforce learning by adding more examples of gestures it didn't classify correctly. Click the "**Learn a Hidden Markov Model Classifier**" button to learn the gestures again.

Improving recognition using HCRFs.

Now, you may have noticed that eventually some gestures just can't be recognized properly. In order to improve the recognition rates, you can click the "**Create a Hidden Conditional Random Field**" button to use the already created HMM to initialize and learn a new HCRF model.

In this article, we presented what a Hidden Conditional Random Field is and what it can do. After a brief theory section, the article showed how to create, learn, and use HCRFs with the Accord.NET Framework. The sample application shown in the article is part of the Accord.NET Framework Sample Applications, and demonstrates how to create and initialize HCRFs using continuous-density (Gaussian) HMMs to recognize gestures drawn on the screen with the mouse.

Now that you are an expert on sequence classification using Markov models and conditional random fields, feel free to create your own useful applications with the Accord.NET Framework!

- 11th March, 2013 - Article submission
- 9th December, 2014 - Fixed all links that were broken after the project webpage moved

- V. Vapnik, Statistical Learning Theory. Wiley-Interscience, September 1998
- C. M. Bishop, Pattern Recognition and Machine Learning. Springer, 2006
- Charles Sutton and Andrew McCallum (2012). "An Introduction to Conditional Random Fields", Foundations and Trends in Machine Learning: Vol. 4, No. 4, pp. 267-373
- Phone images from Wikipedia's page on "Mobile Phone" and chair images from Wikimedia Commons