Part 1 of 2: Theory. Go to Part 2 of 2: C# Source code.
When dealing with systems, methods, or tests that detect, diagnose, or predict results, it is very important to validate the obtained results in order to quantify how good their discriminative power is for a given analysis. However, simply counting hits and misses on a test group does not necessarily reflect how good a system is, because such a count depends fundamentally on the quality and distribution of the test group data.
To illustrate the paragraph above, let's consider the example of a diabetes detection system whose output is 1 or 0, indicating whether a patient has the condition or not. Now, let's suppose we applied this system to a test group of 100 patients for whom we (but not the system) already know who has diabetes, and the system correctly identified 90% of the conditions. Seems a rather good performance, doesn't it?
Not exactly. What has not been said is how the condition is distributed over the test group. Let's now reveal that 90% of the patients actually had diabetes. The system could therefore have answered "1" to any input whatsoever and still obtained a 90% hit rate, since only the remaining 10% of patients were healthy. At this point, we cannot be sure whether the system was really good or just acted by chance, declaring everyone ill without any prior knowledge or calculation.

Contingency Table (confusion matrix) for a binary classifier:

                       Actual positive        Actual negative
Predicted positive     True Positive (TP)     False Positive (FP)
Predicted negative     False Negative (FN)    True Negative (TN)
For situations like this, other measures have been created that take such imbalance in the test group into account. Before going further into those measures, let's discuss the so-called contingency table (or confusion matrix), which will act as a base for the measures shown next. Its mechanics are rather simple: positive cases that the system predicted as positive count as true positives (hits); positive cases predicted as negative, as false negatives (misses); negative cases predicted as negative, as true negatives (hits); and negative cases predicted as positive, as false positives (misses).
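The counting rules above can be sketched in code. The article's full implementation (Part 2) is in C#; the snippet below is only a minimal Python illustration of how the four cells of the table are filled, assuming labels encoded as 1 (condition present) and 0 (condition absent):

```python
def confusion_counts(actual, predicted):
    """Count TP, FP, TN, FN for binary labels (1 = condition present)."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    return tp, fp, tn, fn
```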
Now I'll present some measures derived from this rather simple table.
Accuracy
The proportion of correct predictions, ignoring what is positive and what is negative. This measure is highly dependent on the data set distribution and can easily lead to wrong conclusions about the system's performance.
ACC = TOTAL HITS / NUMBER OF ENTRIES IN THE SET
= (TP + TN) / (P + N)
Sensitivity
The proportion of actual positives the system predicts correctly: the ability of the system to correctly detect the condition in cases where it is really present.
SENS = POSITIVE HITS / TOTAL POSITIVES
= TP / (TP + FN)
Specificity
The proportion of actual negatives the system predicts correctly: the ability of the system to correctly predict the absence of the condition in cases where it is not present.
SPEC = NEGATIVE HITS / TOTAL NEGATIVES
= TN / (TN + FP)
Efficiency
The arithmetic mean of Sensitivity and Specificity. In practical situations, sensitivity and specificity vary in opposite directions. Generally, when a method is too responsive to positives, it tends to produce many false positives, and vice versa. Therefore, a perfect decision method (with 100% sensitivity and 100% specificity) is rarely achieved, and a balance between the two must be sought.
EFF = (SENS + SPEC) / 2
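To make the trade-off concrete, here is a small Python sketch of the four measures so far (the article's own implementation, in Part 2, is in C#), applied to the diabetes example from the introduction, where the system answers "1" for all 100 patients (90 ill, 10 healthy):

```python
def accuracy(tp, fp, tn, fn):
    # ACC = (TP + TN) / (P + N)
    return (tp + tn) / (tp + fp + tn + fn)

def sensitivity(tp, fn):
    # SENS = TP / (TP + FN)
    return tp / (tp + fn)

def specificity(tn, fp):
    # SPEC = TN / (TN + FP)
    return tn / (tn + fp)

def efficiency(tp, fp, tn, fn):
    # EFF = (SENS + SPEC) / 2
    return (sensitivity(tp, fn) + specificity(tn, fp)) / 2

# "Always say 1" on 90 ill and 10 healthy patients:
tp, fp, tn, fn = 90, 10, 0, 0
print(accuracy(tp, fp, tn, fn))    # 0.9 -- looks good...
print(efficiency(tp, fp, tn, fn))  # 0.5 -- ...but no better than chance
```

The 90% accuracy hides a specificity of zero; the efficiency of 0.5 exposes the imbalance immediately.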
Positive Predictive Value
The proportion of true positives among all positive predictions. This measure is highly susceptible to the prevalence of the condition in the data set, but it gives an estimate of how good the system is when it makes a positive claim. It, too, can easily lead to wrong conclusions about system performance.
PPV = POSITIVE HITS / TOTAL POSITIVE PREDICTIONS
= TP / (TP + FP)
Negative Predictive Value
The proportion of true negatives among all negative predictions. This measure is highly susceptible to the prevalence of the condition in the data set, but it gives an estimate of how good the system is when it makes a negative claim. It can easily lead to wrong conclusions about system performance.
NPV = NEGATIVE HITS / TOTAL NEGATIVE PREDICTIONS
= TN / (TN + FN)
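The two predictive values can be sketched the same way (Python used here for brevity; Part 2 has the C# version):

```python
def ppv(tp, fp):
    # PPV = TP / (TP + FP)
    return tp / (tp + fp)

def npv(tn, fn):
    # NPV = TN / (TN + FN)
    return tn / (tn + fn)
```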
Matthews Correlation Coefficient or Phi (φ) Coefficient
The Matthews correlation coefficient is a measure of the quality of binary classifications that can be used even when the two classes have very different sizes. It returns a value between −1 and +1: a coefficient of +1 represents a perfect prediction, 0 a random prediction, and −1 an inverse prediction. This statistic is equivalent to the phi coefficient and, like the efficiency measure, attempts to summarize the quality of the contingency table in a single comparable value.
MCC = φ = (TP*TN − FP*FN) / sqrt((TP + FP)*(TP + FN)*(TN + FP)*(TN + FN))
Note that, if any of the sums in the denominator equals zero, the denominator can be considered 1, resulting in a MCC of 0, which is also the correct limit for this situation.
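A Python sketch of the coefficient, including the zero-denominator convention just described:

```python
import math

def matthews_cc(tp, fp, tn, fn):
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # If any sum in the denominator is zero, take MCC = 0 (the correct limit).
    return num / den if den != 0 else 0.0
```

Note that the "always say 1" system from the introduction (TP = 90, FP = 10, TN = FN = 0) gets an MCC of 0, correctly exposing it as no better than chance despite its 90% accuracy.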
The Receiver Operating Characteristic (ROC) Curve
The ROC curve was developed by electrical and radar engineers during World War II to detect enemy objects on the battlefield. ROC analysis has been used in medicine, radiology, psychology, and other areas for many decades and, more recently, has been introduced to areas such as machine learning and data mining.
Because the outputs of classification systems are generally continuous, it is necessary to define a cutoff value, or discriminatory threshold, to classify and count the number of positive and negative predictions (such as positive or negative diagnoses in the case of a pathology's occurrence). Because this threshold can be determined arbitrarily, the best practice for comparing the performance of different systems is to study the effect of selecting diverse cutoff values over the output data.
For many cutoff values, it is possible to calculate a set of (sensitivity, 1 − specificity) pairs, which can then be plotted as a curve. This curve is the ROC curve for the system, with sensitivity values on its ordinate (y-axis) and the complement of specificity (1 − specificity) on its abscissa (x-axis).
A standard measure for system comparison is the area under the ROC curve (AUC), which can be obtained by numerical integration, for example with the trapezoidal rule. In theory, the higher the AUC, the better the system.
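As a sketch of that numerical integration, the trapezoidal rule over the (1 − specificity, sensitivity) pairs can be written in a few lines of Python (the article's C# version is in Part 2):

```python
def auc_trapezoid(points):
    """points: (1 - specificity, sensitivity) pairs; input order does not
    matter, as they are sorted by the x coordinate before integrating."""
    pts = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0  # trapezoid between two points
    return area
```

A perfect classifier's curve passes through (0, 1) and yields an area of 1; the chance diagonal from (0, 0) to (1, 1) yields 0.5.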
Calculating the area of a ROC curve in Microsoft Excel®
Put the sensitivity and (1 − specificity) pairs in columns A and B, respectively. If you have 10 points (from A1 to B10), you can use the following formula to calculate the ROC area:
=SUMPRODUCT((A2:A10+A1:A9)*(B1:B9-B2:B10))*0.5
This assumes the points are sorted so that the (1 − specificity) values in column B decrease down the sheet; if they increase instead, swap the operands of the subtraction.
Determining the standard error when calculating the area
The standard error of a ROC curve reflects the variability incurred by applying our system to a sample of the population rather than to the entire population. It comes from the fact that, depending on which samples of a population we take to perform the ROC analysis, the area under the curve will vary according to the particular sample distribution.
The error calculation is, up to a point, simple, as it comes from only three known values: the area A under the ROC curve, the number N_{a} of samples which have the investigated condition (i.e., have diabetes), and the number N_{n} of samples which do not have the condition (i.e., do not have diabetes).
ERROR = sqrt((A*(1 − A) + (N_{a} − 1)*(Q1 − A²) + (N_{n} − 1)*(Q2 − A²)) / (N_{a} * N_{n}))
where:
Q1 = A / (2 − A)
Q2 = 2*A² / (1 + A)
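A sketch of this calculation in Python, using Hanley and McNeil's expressions Q1 = A/(2 − A) and Q2 = 2A²/(1 + A) for the two intermediate quantities:

```python
import math

def roc_standard_error(a, n_pos, n_neg):
    """Hanley-McNeil standard error of the area A under a ROC curve.
    a: area under the curve; n_pos/n_neg: counts of samples with and
    without the investigated condition."""
    q1 = a / (2 - a)
    q2 = 2 * a * a / (1 + a)
    return math.sqrt((a * (1 - a)
                      + (n_pos - 1) * (q1 - a ** 2)
                      + (n_neg - 1) * (q2 - a ** 2)) / (n_pos * n_neg))
```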
Source code
For C# code implementing ROC curve creation and analysis, please proceed to the next part of this article, Discriminatory Power Analysis using Receiver-Operating Characteristic Curves (Part 2 of 2: C# Source Code).
Most recent modifications will be available on the original blog entry. However, a local hosted copy can be downloaded directly from CodeProject here:
Download source code - 1.08 MB
Further Reading
Receiver Operating Curves: An Introduction
Excellent page about ROC curves and its applications. Includes excellent applets for experimentation with the curves, allowing for better understanding of its workings and meaning.
BBC NEWS MAGAZINE, A scanner to detect terrorists: a very interesting article about how statistics are usually misinterpreted when published by the media. "To find one terrorist in 3000 people, using a screen that works 90% of the time, you'll end up detaining 300 people, one of whom might be your target". Written by Michael Blastland.