You have just written some code and now you sit with your favorite test framework and wonder what tests to use to confirm that it works as expected, without flaws. In this article, I will walk you through how to get the input space under control, how a combinatorial approach provides a measure of strength and at the same time keeps your test suites reasonably sized. Finally, I will explore a useful shortcut to validation.
Years ago, I repeatedly found myself with a unit test framework and a critical system needing integration test, wondering how I could make sure that the system did not contain any serious bugs. To me at that time, unit testing was simply a framework that could iterate a list of function pointers (delegates in C#) and report any exceptions that occurred when calling the functions, but it did not give me any clue what to throw at the system or how to validate that it worked as expected. Later I learned that unit testing also involves a process and a particular way of working (TDD) and about its extension into specification / acceptance testing by BDD (AKA Specification by Example), e.g. in the SpecFlow framework. I also came across test frameworks that provide data drivers so that each test can run multiple times with different data each time, but they still do not help me choosing how to test or how to validate my code.
I decided to dig deeper, a lot deeper, and ended up building my own fully featured test application called “Quality Gate One Studio” (QS) to help me test not only correctness but also state, asynchronous behavior, performance and concurrency. In this article, I will share some of what I learnt. This article briefly covers abstract classification, Covering Arrays, Random Arrays and validation. The next article will cover stateful and asynchronous systems and simulation of scenarios. The final article will be about performance, timing and such.
I hope you will enjoy the article and gain some deeper insights into the topic of test.
Overview: How Unit Test, TDD, BDD, Fakes and Integration Test Fit Together
What you normally do with unit tests and with TDD in particular is to perform a white-box test (AKA a “structural test”): You inspect your code and identify the structure of it, and then you write tests to ensure that you cover the branches, loops etc. in the code. With TDD you reverse it and start with the tests but then you iterate the process and refine and extend the tests as the code develops, so in effect you achieve the same thing – a white-box test that covers the implementation.
TDD will keep your mind bouncing between writing code and verifying code and you normally end up with something quite solid, but being a white-box approach it will validate what you built and not necessarily, what your users expect.
Enter BDD, which provides a way for users to express the expectation towards the code. BDD provides a straightforward language that you can use to express your expectation, and you do not even need to be a developer to understand it. The format is commonly something like “Given [some description of state] When [some action is performed] Then [some validation]”. BDD is a black-box approach (AKA a “specification based test”).
BDD will keep your code to spec but will usually not comprise a thorough test of every detail.
In unit testing (which underlies both TDD and BDD) it is good practice to test just one thing in a test, but when your test requires a predefined state or relies on a certain behavior from a dependency, that dependency has to be set up before executing the test. While it may be possible to achieve the desired state through calls to other parts of the code than the code you are actually testing (for example, to create clients and accounts before testing a transfer between accounts), it is discouraged to do so. Adding dependencies to unrelated functionality makes the test hard to debug; if it fails then was it because of a dependency or the code you wrote? Having good maintainability and good performance are other reasons for not having dependencies like this.
Enter fakes / mocks, testable code design and dependency injection. The purpose of these are to be able to simulate and control the dependencies of the code you are about to test and to do it without your test code bloating too much. Common to these approaches is that they provide mechanisms whereby you can replace dependencies with alternative implementations that are controllable from your test. This way your test code will start with a preamble configuring the dependencies, then a call to the functionality you wish to test and finally a validation of results. That is the Arrange-Act-Assert pattern as we know it.
From this, a combination of TDD, BDD, testable code design and appropriate fakes will achieve solid code that works according to spec and a reliable and maintainable test. Right?
Unfortunately not. If you do not fake dependencies, you get into all sorts of trouble but if you do fake them, then your test relies on how you assume the dependencies work, not how they actually work. That might be the same thing initially, but all bets are off unless you continuously update your tests to reflect all modifications in all dependencies, which is next to impossible especially when multiple teams manage your dependencies or if you are working in a distributed environment.
So what am I saying here, does TDD and BDD not work? Sure, but TDD only works if your assumptions are complete and remain valid over time, and BDD will only keep your code to spec as far as the spec has been implemented as tests. What is missing from the equation is a solid integration test that works against real dependencies and that covers the parts that the BDD spec is missing. When the integration test fails on your TDD or BDD tested code, you get a hint that an assumption no longer holds true.
Assuming this powerful integration test is available, why then also use TDD and BDD at all? It is about timing and independence: An integration test requires a substantial amount of working code. With TDD and BDD, you can start before you code and the tests for your code are independent of the state of the rest of the system.
Building Powerful Integration Tests
Building a powerful integration test can be a complex task because here you will normally find yourself working with complex data structures, state and maybe even timing issues and asynchronous behavior. Before we dig in to all of that, let us go back to the original question: How do you design a good test?
For the remainder of this article I will describe the techniques in a stateless and synchronous scenario. I will cover stateful and asynchronous systems in my next articles about test.
As an example, let us use something we all know: The
TimeSpan structures and let us consider how we could test the method
DateTime.Add(TimeSpan) against the properties of
We could start by working out a list of things we would like to test: Add a year to a date and the Year property should increment while the other properties (Month, Day etc.) should remain unchanged. If we add one year to Feb 29 of a leap year Month and Day should also change so we add a special case for that, and so we continue.
To understand this approach, envisage the program as a set of execution paths where every input leads down its own separate path to its own separate output. In the test, we select a few inputs, call the program with these inputs, compute an expected result and check that the program produces the expected result.
The problem with this approach is that the total number of paths is equal to the size of the input space and enormous. There is no structure to our choices and no notion of completeness, making it impossible to give a high-level description of what we do and do not cover in such a test. This is neither powerful nor solid!
To get a notion of completeness and coverage we need to put a cap on the number of possibilities, and the standard trick in the book is to use categorization to group tests that essentially do the same thing.
Let us start with DateTime.Year. Valid values are from 1 to 9999 but do we need tests for every single year? Surely not! Min and Max values in a range are always good candidates so let us include 1 and 9999. Leap years are special so we need at least one leap year and one that is not. Let us include 2011 and 2012. Centuries are also special, 1900 was not a leap year while 2000 was not so let us include these as well. That is it.
|Factor ||Categories ||Cardinality |
|Year ||1, 1900, 2000, 2011, 2012, 9999 ||6 |
|Month ||1, 2, 6, 7, 8, 12 ||6 |
|Day ||1, 28, 29, 30, 31 ||5 |
|Hour ||0, 12, 13 ||3 |
|Minute ||0, 30, 59 ||3 |
|Second ||0, 30, 59 ||3 |
|Millisecond ||0, 500, 999 ||3 |
By following the same line of thought, we can identify values for the remaining properties.
The formal names for the techniques used above are “Boundary Value Analysis” (BVA) and “Equivalence Class Partitioning” (ECP). BVA applies to ranges while ECP applies to natural categories (like leap years and leap centuries). Standard practice in BVA is to use min and max values for each range but I like to also include a “normal” value (e.g. a mid value or a random value inside the range) to be sure that the most common scenarios are also covered by the test.
With categorization in place, we can get an idea about how many
DateTime instances we at most need to pass through our test. That is all combinations of the above values and amounts to 6*6*5*3*3*3*3 = 14.580 different values. Some of these combinations are invalid but let us leave that aside for a moment and keep 14.580 as a ballpark upper bound on how many
DateTime instance we would need to test. While it is a lot, it has to get worse before it gets better!
|Categories ||Cardinality |
|Days ||0, 1, 31, 365, 366, 367 ||6 |
|Hours ||0, 11, 23 ||3 |
|Minutes ||0, 29, 58 ||3 |
|Seconds ||0, 29, 58 ||3 |
|Milliseconds ||0, 499, 998 || |
If we do a similar classification for the
TimeSpan properties, we might get these values.
Note that the line of thought is a little different, as we must consider the effect of adding the
TimeSpan to a
DateTime. For Seconds the logic is “A value that does not change the second”, “a value that does not change the minute when second is mid” and “a values that does change the minute when second is mid”.
The total number of
TimeSpan values we must use is 6*3*3*3*3 = 486. If we use all
TimeSpan values against all the chosen
DateTime values, we get a total of 7.085.880 tests to execute.
It’s trivial to code something that passes through all these combinations, with a bit of logic we can filter out the invalid combinations and since
DateTime operations are fast, an implementation of this test would probably complete in a few seconds. However, if we did this for say some web service or some complex business logic operating on a bigger timescale we would be in deep trouble. More on how to deal with that below.
The first takeaway from this example is to use categorization by applying BVA or ECP. It is straightforward to use and provides
a structured approach to your tests. Using classifications as part of your tests is key to efficient testing.
The second takeaway from this example is that exhaustive testing is infeasible and even if you apply categorization,
you often end with an infeasible amount of categories if you try to test them all. Nevertheless, as I will show below you can still do well.
As we stand now we have open issues:
- How do we reduce the number of tests to something more reasonable?
- How do we deal with invalid combinations?
- How do we validate?
Reducing the Number of Tests
One obvious approach to reducing the number of tests is to consider whether it is necessary to try all
TimeSpan variants against all
DateTime variants. Following this line of thought, you might pick a few representative
DateTime variants where you apply all
TimeSpan variants and a few
TimeSpan variants that you apply to all remaining
DateTime variants. The problem with this approach is that it requires a lot of work to find good representatives, and it will only bring down the number of tests to a small multiple of the maximum number of variants of any factor (in this case the number of
DateTime variants). For the
DateTime.Add(TimeSpan) test you will still be facing more than 10.000 tests. It gets exponentially worse the more complex your data is, the more factors you need to describe your data and the more categories you have for each factor.
Let’s take a step back and consider what we are doing: We are facing somewhere between 10.000 and a few million tests for a
DateTime class, but how many branches and how many real execution paths (that is, those you actually code) could there possibly be inside it? The source code for
System.DateTime is a few hundred LOC and there has to be some balance between the number of possible execution paths and the number of tests, otherwise the test
is either too large or too small.
Following this line of thought, there must be a relationship between the categorizations we made earlier, and the branching inside the code: If one of the chosen values does not affect the execution flow then all that value does is to force multiple traversals of same path. Conversely, uncovered paths means that the categorization is incomplete.
Assume there is a piece of code inside the
DateTime class that looks like this:
If our test covers all combinations of
IsLeapYear (true/false) and
IsFebruary (true/false) then we can be sure it traverses all paths. Assume instead that we do not know what the code looks like but we know that the maximum nesting level is two, what do we have to do to cover all paths? We need to cover all combinations of any two factors describing our data, and that is definitely not the same as all combinations of all factors!
The general term for a list of factor combinations that covers all combinations of any t factors is a Covering Array of strength t. A Covering Array is an extremely powerful tool for test design both because the size of the resulting test suite is low, and because there is an empirical connection between the strength of the covering array and how many percent of the defects that are left undetected.
As rule of thumb #1, the percentage of defects left undetected by a test design using a Covering Array of strength t is 0.2(t-1). According to this rule, a test design of strength 2 finds around 80% of the bugs, while a design of strength three finds around 96% of the bugs and so on. This rule of thumb is based on a few empirical studies (see
http://csrc.nist.gov/groups/SNS/acts/documents/SP800-142-101006.pdf) reporting detection rates per factor between just below 70% to above 95%.
As for any good rule of thumb the disclaimer should be bigger than the rule itself: 80% is only used because it’s easy to remember (but using 70% does not change much), the rule only applies if your categorization matches the code and only if your code has normal branching behavior (i.e. many branches on one or two variables, fewer depending on three variables and so on).
Rule of thumb #2 is about the size of the Covering Array. There are no precise bounds but if all k factors have the same cardinality v then the size grows like vtln(k). With varying cardinality you take the t largest factors describing your data and multiply them together with ln(k) to get a ballpark estimate.
DateTime example above, k is 7 and the largest factors are Year, Month and Day yielding 6*6*ln(7) = 70 combinations, while for
TimeSpan, k is 5 and the largest factors are Days and Hours yielding 6*3*ln(5) = 24, so we are definitely getting somewhere with this.
Computing a (short) Covering Array is far from trivial and finding the shortest possible Covering Arrays is NP-Complete. If you decide to go for covering arrays, there are tools available on the Internet to do this (and so can my tool, including the handling of constraints) but if you just need something quick and dirty using a randomly generated array is not bad. Assume you would like to do a test involving a CA of strength 2 and the rule of thumb #2 above tells you that you need 100 combinations to do this, then if you simply generate 100 random factor combinations, then your test will contain a high percentage of all factor pairs (something like 80-90%). If you need to get close to 100% then create more tests (Rule of thumb #3 says that a factor 5 will get you very close). This also works if you need a higher strength.
You can read more about Covering Arrays vs. Random Arrays in chapter 7 of this document: http://csrc.nist.gov/groups/SNS/acts/documents/SP800-142-101006.pdf
The third takeaway from this article is that Covering Arrays are extremely powerful for test design.
It provides an empirical assessment of how powerful your tests will be in terms of detecting bugs and it provides very compact test designs (i.e. less work!).
As a poor man’s alternative you can use Random Arrays instead and use the provided rules of thumb to decide how many tests you need.
Using Abstract Classification to Remove Invalid Combinations
Choosing a classification scheme for test data is similar to data modeling (e.g., OO or relational): Once you get it right, everything else becomes a lot easier.
|Factor ||Categories ||Cardinality|
Max, CommonYear, CommonCentury, LeapYear, LeapCentury
|Month ||Jan, Feb, Jun, Jul, Aug, Dec ||6 |
|Day ||First, SecondLast, Last ||3 |
|Hour ||Min, Mid, Max ||3 |
|Minute ||Min, Mid, Max ||3 |
|Second ||Min, Mid, Max ||3 |
|Millisecond ||Min, Mid, Max ||3 |
One of the problems we created for ourselves when classifying our data comes from the fact that we used a concrete classification instead of an abstract classification. Using abstract classification, we get values like “FirstDayInMonth”, SecondLastDayInMonth” and “LastDayInMonth” for days that we need to resolve for any specific year and month. The abstract classifications highlight the semantics of what we try to achieve and if we choose to employ randomization it becomes clear where it is appropriate (e.g. the "Mid" categories for the time part).
One, Year, YearAndOneDay
|Hours ||Wrap, NoWrap ||2 |
|Minutes ||Wrap, NoWrap ||2 |
|Seconds ||Wrap, NoWrap ||2 |
|Milliseconds ||Wrap, NoWrap || |
Doing the same for
TimeSpan leaves us with the table to the right. Again semantics are much clearer and reflects behavior we want to achieve.
Notice also how the number of categories have dropped across the board.
The forth takeaway from this article is to work with abstract classifications and use runtime resolution to get concrete values.
It prevents many problems with invalid combinations (but not all) and once you start working with dynamic data, it makes a huge difference. The trouble you
avoid easily offsets the extra work needed to do runtime resolution.
Sometimes, invalid combinations are inevitable no matter how hard we try to adapt the classification. If you use Covering Arrays, your generator should support constraints, as it is very hard to get anything useful out of post-filtering generated Covering Arrays for invalid combinations. It is different with Random Arrays where you can simply retry the random generation until you get a combination that does not conflict with your constraints. Naturally, the stricter your constraints are, the more trials will be required until you get a successful combination.
The ballpark estimate for the number of different
TimeSpan variations with the updated classification are 70 and 13. If we use every
TimeSpan with every DateTime we get 910 tests. Is that the right number or can we make it lower?
Let us consider our data. One of the things we definitely
must test is the logic for leap years so we need at least to consider all
DateTime combinations of Years, February,
LastDayInMonth against each possible
TimeSpan.Day value while we can keep the time aspect out of the picture. Now, instead of going into details with a very clever test design, let’s simply stay with the observation that the above involves four variable factors, so if we go with a Covering Array of strength 4 across the combined factors from
TimeSpan, then we can be sure that the test covers all of the above.
|2 ||90 ||39 ||80% |
|3 ||358 ||185 ||96% |
|4 ||1073 ||705 ||99% |
|5 ||3219 ||2546 ||99.8% |
|6 ||9657 ||8087 ||99.97% |
Using rule of thumb #2 from above and assuming
we have shifted to abstract classifications, the estimate for the size of the test is now | DateTime.Years| * | DateTime.Months| * |TimeSpan.Days| * |DateTime.Days| = 6*6*4*3*ln(12) = 1073 tests (an actual array produced by QS contains 705 tests).
Using rule of thumb #1 from above we can expect a detection rate of somewhere around 1 - 0.23 » 99%, provided the classification covers the functionality. With access to a code coverage tool, it is straightforward to validate coverage to see if the classification is good.
So far, we have looked only on the input side of the test but tests without validation are useless, so how do we do that? There are two extremes in how you can do this and then something in between. The first extreme is to go through every single combination and figure out what the expected result should be. This is very common in both manual and automated testing but tedious for large tests. The other extreme is to build a reference implementation of the code you test and verify that your code and the reference (not made by yourself) yield identical results.
The figure to the right illustrates the in-between approach, which is to build upon the abstract classifications to provide as many details of the validation as possible and then add just enough logic to produce expected results. The basic idea in this approach is that compared to random input, the classification of the input side provides information that can be leveraged to simplify validation.
For the DateTime.Add() scenario validation is partly arithmetic but we can assist the validation by determining where there are overflows: With abstract classifications in place it is straightforward to determine when there are overflows from Milliseconds to Seconds, from Seconds to Minutes and so on, and with overflows in place, it is again straightforward to compute expected values for Milliseconds, Seconds etc.
In my experience, straightforward dependencies between input and validation is the rule rather than the exception and the resulting validations can be surprisingly simple, with the above example being to the complicated side of things. I did an implementation of the test for DateTime.Add() in C# and with QS for the combinatorial part. The total amount of code for the test is 140 LOC (as counted by VS2010 code analysis) and 55 lines of QS expressions. The code for the Random Array approach (pure C#) is similar with 205 LOC. Code and expressions work for all CA strengths.
It is worth noticing that this style of validation is very different from building a reference implementation, which makes it safe to use even when you validate your own code.
The fifth and final takeaway from this article is that abstract classifications eases the task of producing expected results for validation, because the abstract classification of the input provides information about what to expect from the result.
Using the Code
The attached code contains two VS2010 projects and a project file for QS. The project "DateTimeTests.csproj" requires QS to be installed to run while the second project
"RandomArray.csproj" essentially does the same thing as the first project but with Random Arrays instead of Covering Arrays. This project does not require QS. There is a note in the sample "AboutTheCode.pdf" that explains the code in greater detail.
In this article I have motivated the need for integration test and demonstrated how the concept of (abstract) classification provides an operational (albeit empirically based) notion of coverage. I also demonstrated how classification fits with Combinatorial test in the form of Covering Arrays and Random Arrays and how a well-designed abstract classification scheme simplifies both test design and validation.
The methods described in this article are directly applicable to stateless systems. To test stateful systems it is necessary to add separate support for the state component. Achieving this will be the goal of my next article.
I hope you enjoyed the article.