Designing and developing real-time software today is a sophisticated juggling act. Today’s systems employ more functionality than ever: elaborate GUIs, network connectivity, and interprocessor communication; at the same time, the timing constraints must still be met. This paper explores some of the different types of RTOSs and the consequences of using them.
This paper is meant to provide the reader with a general overview of real-time operating systems (RTOS) including some basic definitions, fundamental formulas, and basic taxonomy of RTOSs. The paper is broken down into the general categories of:
- review of vocabulary and concepts, and
- RTOS selection considerations.
Review of Vocabulary and Concepts
This section will explain exactly what an RTOS is, define some terminology, and discuss the difference between soft and hard real-time operating systems.
First of all, there is a common misconception that an RTOS must be fast. While this is usually the case, it is not necessarily true. In all cases, however, the system must be predictable. In other words, the correctness of an answer is a function of time: the right answer at the wrong time is wrong. A good basic definition of an RTOS is the following: “A real-time system is one in which the correctness of the computations not only depends upon the logical correctness of the computation but also upon the time at which the result is produced. If the timing constraints are not met, system failure is said to have occurred.”[1]
Example: Consider the bottling equipment in a beverage plant. It is not sufficient for the equipment to attempt to cap bottles at regular time intervals: if it is off by just a fraction of a second, the bottle may not be capped properly and the beer will be spoiled. That is, the time at which the machine caps the bottle is a critical part of the process.
Some definitions that may be helpful to the reader are:
- Scheduler - A typical RTOS may have a kernel, a hardware abstraction layer, an executive, and some device drivers. The scheduler is the part of the kernel responsible for handling events and deciding which task executes next.
- Task Execution Time - The CPU time required to execute a task.
- Task Period - The time interval between successive activations of a task. If a task has a period of 50ms, then that task must be started and completed every 50ms.
- Task Deadline - A task may have a period of T, but a completion requirement of D, where D<T. Such a requirement is called a task deadline. The default deadline is the task period.
- Task Rate - The reciprocal of the task period.
- Transport Lag - The elapsed time between a sensor measurement and the output of the corresponding actuator. This term is borrowed from control systems engineering. Generally speaking, the smaller the transport lag, the better the stability and accuracy of a system.
Soft vs. Hard RTOS
Practically speaking, RTOSs fall into two groups: those that are able to respond in “hard” real time, and those that respond in “soft” real time, where requirements are less severe. Simply stated, a hard real-time operating system must, without fail, provide a response to some kind of event within a specific time window. This response must be predictable and independent of other activities undertaken by the operating system. Often, hard RTOSs employ specific hardware and special device drivers. Examples of hard RTOSs include: Chimera II, QNX, and LynxOS.
In contrast, a soft RTOS is one that has reduced constraints on “lateness” but still must operate quickly with fairly constant time constraints. It must be “good enough” to service events so that, on average, the response should be satisfied. Windows NT is a popular example of a soft RTOS.
A simple way for the end-user to decide whether a soft or hard RTOS is required is to ask the question, “What are the implications of a failure?” If the answer is catastrophic, you need a hard RTOS; for everything else, a soft RTOS will probably do.
RTOS Selection Considerations
This section extends some basic concepts to include an explanation of interrupt latency and context switch time and how they affect performance, an analysis of static vs. dynamic systems, and a look at several different scheduling algorithms, including implementation and general failure handling in an RTOS.
Interrupt Latency and Context Switch Time
Two metrics which are frequently used to compare real-time operating systems are interrupt latency and context switch time.
Interrupt latency is defined as the worst-case time that can pass between the assertion of an interrupt and the execution of the first instruction of the corresponding interrupt service routine (ISR). This latency depends on three things:
- the interrupt latency of the hardware (which is usually negligible),
- the time it takes the RTOS to handle the interrupt and pass control to the ISR, and
- the RTOS’s disable time, which is the worst-case length of time for which the RTOS disables interrupts in order to protect its own critical code segments.
All RTOSs have some interrupt latency associated with them; combined with context switch time, it provides an indication of the responsiveness of a system.
Context switch time is defined as the amount of time it takes an RTOS to save the context of one task (save its pointers and registers to a control block) and load the context of another task from a second control block. Context switch time never shows up in isolation in a system; it is always overhead that results from an RTOS call.
Be careful here, though: while context switch time and interrupt latency often provide some measure for comparison, the vendors publishing the data frequently do not even include the time it takes the RTOS to decide which task to execute. Also, context switch times can vary dramatically depending on whether the FPU registers are saved as well.
Dynamic vs. Static Systems
There are two further sub-classifications of RTOSs, dynamic and static. By dynamic, we mean reconfigurable. A dynamically reconfigurable system can change in time without the need to halt the system. Consider, as an example, a tactile sensor on the end of a robotic manipulator used to explore an object. The resolution of the sensor is 2^n by 2^m taxels, where n and m can vary dynamically between 1 and 4. When the robot is exploring uninteresting parts of an object, such as the edge of a table, one would want to use low resolution for reduced computation. But when exploring more interesting parts, such as the rounded corner of the table, one would want to use the more demanding high-resolution mode with more computational overhead. Such dynamic systems may have many sensors or actuators, only a subset of which may be used at any one time. Or, hardware could be used in a different configuration, as in the example above. In any case, it is crucial that critical tasks do not fail, even if the tasks or frequency of tasks change in the system. This requires a dynamic scheduler.
Before we discuss dynamic schedulers, a brief review of a static scheduler, the rate monotonic (RM) algorithm is in order. RM is a fixed priority or static algorithm. The algorithm basically says to:
- assign the highest priority to the tasks that occur most frequently (i.e., those with the shortest periods), and
- at any time, execute the ready task with the highest priority.
By specifying the period and computational overhead time of each task, the behavior of the system can be characterized a priori.
RM is not without problems. For instance, the schedulable bound is less than 100%. What does this mean? The CPU utilization of a given task is computed as the ratio of its worst-case computing time C to its period T. The total CPU utilization for n tasks is computed as the sum:

U = C1/T1 + C2/T2 + ... + Cn/Tn
For RM, the worst-case schedulable bound for n tasks is:

W(n) = n(2^(1/n) − 1)
and for 1, 2, 3, and 4 tasks, is found to be 100%, 82.8%, 78.0%, and 75.7%, respectively; as n grows large, W(n) approaches ln 2 ≈ 69.3%.
Thus, a set of tasks for which CPU utilization is less than 69% will always meet its deadlines. All tasks can be guaranteed to meet deadlines if the total utilization U is no greater than the bound W(n). If U exceeds W(n), then the subset S of highest-priority tasks whose utilization satisfies U_S ≤ W(|S|) will be guaranteed to meet all deadlines and will form the critical set. As a note, the worst-case bound for RM is quite pessimistic; it has been shown that, for the average case, the schedulable bound is approximately 88%.
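The RM bound and critical-set test above can be sketched in a few lines. This is a minimal illustration; the task set and all numbers are hypothetical.

```python
def rm_bound(n):
    """Worst-case RM schedulable bound: W(n) = n * (2**(1/n) - 1)."""
    return n * (2 ** (1.0 / n) - 1)

def utilization(tasks):
    """Total CPU utilization U = sum of Ci/Ti over (Ci, Ti) pairs."""
    return sum(c / t for c, t in tasks)

def critical_set(tasks):
    """Largest prefix of the rate-ordered task set whose utilization stays
    within the RM bound; these tasks meet deadlines even under overload."""
    ordered = sorted(tasks, key=lambda ct: ct[1])  # shortest period first
    subset = []
    for task in ordered:
        if utilization(subset + [task]) > rm_bound(len(subset) + 1):
            break
        subset.append(task)
    return subset

# Hypothetical (Ci, Ti) pairs in milliseconds
tasks = [(10, 30), (10, 50), (20, 100)]
print(round(utilization(tasks), 3))              # 0.733
print(utilization(tasks) <= rm_bound(len(tasks)))  # True: all tasks guaranteed
```

Here U ≈ 73.3% is below W(3) ≈ 78.0%, so the whole set is schedulable even though it exceeds the 69% asymptotic bound.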
Another problem with RM is that it does not support dynamically changing periods. For example, take a set of three tasks, P1, P2, P3, with periods T1=30ms, T2=50ms, T3=100ms. Under RM, these tasks would have the following fixed priority assignment (from highest to lowest): P1, P2, P3. Now suppose the period of P1 changes to T1=75ms. Under RM, this would require a reassignment of priorities to P2, P1, P3, which violates the condition that the priorities are static.
This brings us to dynamic priority algorithms. Though many such algorithms exist, we will restrict our discussion to only a few: Earliest-Deadline-First Algorithm (EDF), Minimum-Laxity-First Algorithm (MLF), and Maximum-Urgency-First Algorithm (MUF).
Earliest-Deadline First (EDF)
As the name implies, this algorithm uses the deadline of a task as its priority: the task with the earliest deadline has the highest priority, and the task with the latest deadline has the lowest. The advantages of EDF are that one can achieve a 100% schedulable bound and that it supports dynamic priorities.
The disadvantages of EDF are:
- there is no way to guarantee which tasks will fail during transient overload, and
- in many systems, although average CPU utilization is less than 100%, worst-case CPU utilization is above 100%. Generally, it is desirable to be able to control which tasks fail and which succeed during transient overload.
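As a concrete illustration of the selection rule (not a full scheduler), an EDF dispatcher simply picks the ready task with the earliest absolute deadline. The task names and deadlines below are hypothetical.

```python
def edf_pick(ready):
    """Return the ready task with the earliest absolute deadline."""
    return min(ready, key=lambda task: task[1])

# Hypothetical (name, absolute deadline in ms) pairs
ready = [("telemetry", 120), ("control", 45), ("logging", 300)]
print(edf_pick(ready))  # ('control', 45)
```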
Minimum-Laxity First (MLF)
This algorithm assigns a laxity to each task; the scheduler then selects the task with the minimum laxity to execute next. Laxity is defined as:
laxity = deadline time - current time - CPU time needed
Laxity is a measure of flexibility: a laxity of t means that even if the task is delayed by t time units, it will still meet its deadline. A laxity of 0 means that the task must be executed now or it will fail to meet its deadline. This observation is the basis for the MUF algorithm that follows. The main difference between MLF and EDF is that MLF takes the execution time of a task into consideration. Like EDF, MLF has a schedulable bound of 100%, and there is no way to guarantee which task(s) will fail during transient overload.
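Using the definition above, a minimal MLF selection step can be sketched as follows; the task names and times are illustrative only.

```python
def laxity(deadline, now, cpu_needed):
    """laxity = deadline time - current time - CPU time still needed"""
    return deadline - now - cpu_needed

def mlf_pick(ready, now):
    """Select the ready task with the minimum laxity."""
    return min(ready, key=lambda t: laxity(t[1], now, t[2]))

# Hypothetical (name, deadline, cpu_needed) triples, times in ms
ready = [("a", 100, 30), ("b", 60, 10), ("c", 80, 50)]
# At now=20 the laxities are a: 50, b: 30, c: 10, so "c" runs next
print(mlf_pick(ready, now=20))  # ('c', 80, 50)
```

Note that EDF would have chosen "b" (earliest deadline); accounting for remaining execution time changes the decision.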
Maximum-Urgency First (MUF)
MUF is a combination of fixed and dynamic priority scheduling (a.k.a. mixed priority). Each task is assigned an urgency, which is defined as a combination of two fixed priorities and one dynamic priority. One of the fixed priorities, the criticality, has precedence over the dynamic priority, while the dynamic priority has precedence over the other fixed priority, called the user priority. The dynamic priority is inversely proportional to the laxity of a task.
MUF consists of two parts: the assignment of the criticality and user priority, done a priori, and the actions of the MUF scheduler, done at run-time.
Assigning criticality and user priority:
- As with RM, order the tasks from shortest to longest period.
- Define the critical task set as the first N tasks such that worst-case CPU utilization does not exceed 100% (these will be the tasks that will not fail even during transient overload). If a critical task does not fall within the critical task set, then a period transformation, as with RM, can be used.
- Assign high criticality to all the tasks in the critical set, and low criticality to all the other tasks.
- Optionally, assign a unique user priority to every task in the system.
Note that static priorities are assigned once and do not change during execution. The dynamic priority is assigned at run-time, inversely proportional to the laxity. Before its cycle, each task must specify its desired start time, deadline time, and worst-case execution time. Whenever a task becomes ready to run, a reschedule operation is performed.
The MUF scheduler is used to determine which task is to be performed next, using the following algorithm:
1. Select the task with the highest criticality.
2. If two or more tasks share the same criticality, choose the one with the highest dynamic priority (minimum laxity). Only tasks with pending deadlines have a non-zero dynamic priority; tasks with no deadlines have a dynamic priority of 0.
3. If two or more tasks with the same criticality also share the same dynamic priority, then the task with the highest user priority is chosen.
4. If all parameters are equal, the tasks are served on a first-come, first-served basis.
The optional assignment of unique user priorities assures we never reach the first-come, first-served case, and thus provides a deterministic scheduling algorithm.
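The selection rules above amount to a lexicographic comparison, which can be sketched as a sort key. The field layout and task names here are hypothetical illustrations, not part of the original algorithm's data structures.

```python
def muf_key(task):
    """Sort key implementing the MUF ordering: criticality first, then
    dynamic priority (inverse laxity), then user priority, then
    first-come, first-served on arrival order."""
    _, criticality, laxity, user_priority, arrival = task
    return (criticality, -laxity, user_priority, -arrival)

# Hypothetical (name, criticality, laxity, user_priority, arrival) tuples
ready = [
    ("pump_ctrl",  1, 5, 2, 0),
    ("valve_ctrl", 1, 3, 1, 1),
    ("display",    0, 1, 3, 2),
]
# Same criticality as pump_ctrl, but smaller laxity: valve_ctrl runs next;
# display loses despite minimum laxity because its criticality is lower.
print(max(ready, key=muf_key))
```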
MUF provides many advantages over EDF or RM:
- You can guarantee which tasks will meet deadlines even during transient overloads.
- Scheduling is dynamic.
- Improved performance.
- You can detect failures.
There are three types of errors MUF can detect:
- A task has not completed its cycle when the deadline time has been reached.
- A task was given as much CPU time as requested in the worst-case, yet it did not meet its deadline.
- The task will not meet its deadline because the minimum CPU time requested cannot be granted.
The first case is the standard notion of a missed deadline. The second case detects bad worst-case estimates. The third case allows the scheduler to make the most of CPU time (it will not start executing a task if it knows it will not complete it) and thus provides early detection of missed deadlines.
One possible problem with the MUF approach is the computational overhead associated with it. The solution is to encode the three priorities into a single urgency value, with the criticality in the most significant bits, the dynamic priority in the middle bits, and the user priority in the least significant bits.
For such an encoding scheme, using c, d, and u bits for the respective fields, the ranges are:
- 0 to 2^c − 1 for criticality,
- 0 to 2^d − 1 for dynamic priorities, and
- 0 to 2^u − 1 for user priorities.
This scheme will work as long as c + d + u does not exceed the number of bits in a machine word.
This efficient encoding scheme then allows the MUF scheduler to calculate only a single dynamic priority for each task, then select the task with the highest urgency.
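A sketch of such an encoding, assuming an illustrative split of c = 3, d = 8, and u = 5 bits (these widths are assumptions, not values from the original scheme):

```python
C_BITS, D_BITS, U_BITS = 3, 8, 5  # illustrative field widths

def urgency(criticality, dynamic, user):
    """Pack the three priorities into one integer, with criticality in the
    most significant bits so it dominates any lower-field combination."""
    assert criticality < 2**C_BITS and dynamic < 2**D_BITS and user < 2**U_BITS
    return (criticality << (D_BITS + U_BITS)) | (dynamic << U_BITS) | user

# A higher criticality beats even the maximum dynamic and user priorities,
# so a plain integer comparison implements the whole MUF ordering.
print(urgency(2, 0, 0) > urgency(1, 2**D_BITS - 1, 2**U_BITS - 1))  # True
```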
RTOS Failure Handling
The use of deadline failure handlers is recommended for all tasks in a system, for several reasons:
- Estimating worst-case execution times of tasks is difficult at best, and bad estimates can cause havoc with systems.
- Most commercially available hardware is geared towards increasing average performance through the use of caches and pipelines. Because such hardware is often used in real-time applications, execution time is not necessarily predictable.
So how do failure handlers work?
Whenever a task fails to meet a deadline, a failure handler is invoked. It should be possible to program the handler to execute at the same or a different priority than the failing task, to prevent problems with priority inversion.
Possible actions by the handler include:
- aborting the task and preparing it to restart the next period,
- sending a message to some part of the system to handle the error,
- modifying the priority of the task and continuing the execution,
- performing emergency handling such as a graceful shut-down of the system,
- maintaining statistics on failures to help aid in analyzing the system,
- in the case of iterative algorithms, returning the approximate value regardless of precision.
All modern RTOSs should provide some type of failure handling.
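The mechanism can be sketched as a wrapper that invokes a user-supplied handler when a task cycle overruns its deadline. This is a simplified, assumed illustration, not a real RTOS API; all names are hypothetical.

```python
import time

def run_with_deadline(task_fn, deadline_s, on_failure):
    """Run one task cycle; invoke the failure handler on a missed deadline."""
    start = time.monotonic()
    result = task_fn()
    if time.monotonic() - start > deadline_s:
        on_failure()  # e.g. abort/restart, send a message, or shut down
    return result

stats = {"missed": 0}

def handler():
    stats["missed"] += 1  # maintain statistics on failures

# A 10 ms task against a 1 ms deadline triggers the handler.
run_with_deadline(lambda: time.sleep(0.01), deadline_s=0.001, on_failure=handler)
print(stats)  # {'missed': 1}
```

A real RTOS would detect the overrun preemptively (the third error case above) rather than after the task returns, but the handler's responsibilities are the same.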
Designing and developing real-time software today is a sophisticated juggling act. Today’s systems employ more functionality than ever: elaborate GUIs, network connectivity, and interprocessor communication; at the same time, the timing constraints must still be met. The consequences of failure range from the severe to the unthinkable. This paper has explored some of the different types of RTOSs and the consequences of using them, with the hope that the reader will be in a better position to select the appropriate system.
[1] This definition is located in the real-time FAQ in the Internet’s comp.realtime newsgroup.