Managing a Concurrent System Design

Jek Platform

Nov 12, 2006

CPOL

This article discusses concurrent software design issues that can be addressed by the proposed design platform.

Introduction

The success of a concurrent system depends on well designed hardware, flexible software that controls the hardware, and a clear marketing vision. To adapt to changing marketing requirements, both hardware and software need flexible architectures. This article focuses on software development issues, and discusses concurrency design at the application layer as opposed to concurrency inside the Operating System.

Any multi-threaded system can be considered a concurrent system; for example, a lengthy task can be implemented as a background thread so that it will not block the graphical user interface. Here, we are discussing concurrent systems which can be characterized by the following traits:

System input and output can be clearly identified.
The system internals consist of system resources, such as hardware modules, which process system input and generate system output.
One or more execution steps are needed for a system input to be processed by system resources and to become a system output.
The system resources and their relationship are identified by system analysis. The concurrency properties of system resources determine the constraints between execution steps, which ultimately define the system concurrency behavior.

The above description is illustrated in figure 1. The system consists of three resources, and it has two inputs and two outputs. Each input needs to go through two steps to become an output. Resources 1 and 2 are independent and can run in parallel. The outputs of resources 1 and 2 are the inputs of resource 3, which is independent of 1 and 2.

Figure 1. A Sample Concurrent System
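
The structure described for figure 1 can be sketched in a few lines of Python. This is only an illustration under assumptions: the names resource_1, resource_2, resource_3, and run_system, and the arithmetic they perform, are hypothetical stand-ins for real resource control code, and the standard concurrent.futures thread pool is used here in place of the JEK SDK.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical resource control functions; real ones would drive hardware.
def resource_1(x):
    return x + 1

def resource_2(y):
    return y * 2

def resource_3(a, b):
    # Second execution step: consumes the outputs of resources 1 and 2
    # and produces the two system outputs.
    return (a + b, a - b)

def run_system(input_1, input_2):
    # First step: resources 1 and 2 are independent, so they may overlap.
    with ThreadPoolExecutor(max_workers=2) as pool:
        future_1 = pool.submit(resource_1, input_1)
        future_2 = pool.submit(resource_2, input_2)
        a, b = future_1.result(), future_2.result()
    # Second step: resource 3 starts only when both results are available.
    return resource_3(a, b)

print(run_system(10, 20))
```

The only concurrency constraint in this sketch is that resource_3 waits for both futures; everything else is plain sequential code.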

Design Goals

Most concurrent systems have these design goals:

Have an easy to understand software architecture so that the desired concurrency can be implemented and verified quickly.
Have a solid system concurrency kernel that adapts to changes in the system environment, such as inconsistent hardware responses, and still achieves high system reliability.
Have a scalable architecture that adapts to new requirement changes.
Have system concurrency and throughput well understood by all teams involved in the system specification and design, not just by a few key software engineers. Therefore, the concurrent software should expose how the system works internally at minimum cost, so that team communication can be conducted effectively.
Allow different system concurrencies to be achieved with different execution configurations without major disruption to system reliability.

For a complex concurrent system, the design cost to achieve such goals could be very high for inexperienced engineers. Most systems end up with only a few engineers who can understand and maintain the fragile concurrent kernels.

What is needed to meet the concurrent system design goal from a management perspective?

To have the capability to quickly understand the marketing or hardware concurrency requirement, and to provide a clear road map on how to achieve the desired software system concurrency at an early stage of development, not when delivering the alpha or beta product. This requires the software engineer to have a clear understanding of the system resource concurrency at the very beginning.
To shorten the cycle of turning the desired concurrency into a real functioning software system.
To communicate the achievable concurrency goal to other teams frequently, and to adjust the concurrency based on new marketing input or on changes in resource constraints, such as hardware improvements or limitations.

Common Issues

The following are common issues found in concurrent software design:

The software team members are not very experienced in concurrency design. Most teams have engineers who know threads, critical sections, semaphores, and events, but this knowledge alone does not guarantee that the design goals listed above are achieved.
The understanding of system concurrency develops very slowly. The software engineer cannot present a full picture of how the system concurrency design is going to work until the alpha or beta stage. Nobody questions how the system concurrency is designed, since there is no good method to communicate the software design. The engineer usually explains his or her understanding of the system concurrency in small pieces, which makes it hard to convince the software team manager or other teams that the software team has fully understood the system and will be able to deliver on schedule.
The marketing group has a wrong system throughput assumption and commitment at the beginning of a project, with false understanding of the system resource constraints, or the complexity for available engineers to achieve the desired high throughput without sacrificing software system reliability. The marketing group might assume that software engineers could just achieve it, but have no way to verify it during the process until it is too late.
Almost no design makes a clear distinction between the code controlling the system resource operation and the code performing the system resource concurrency. This makes the architecture very hard to enhance for new concurrency requirements. With only synchronization objects from the Operating System, such as critical sections, events, or semaphores, it is almost impossible to perform such a partition without a major investment in the system architecture design. Unfortunately, most applications do not separate the two domains and let one engineer, who is already overwhelmed by the concurrency choreography, handle both. The software manager usually does not understand the importance of such a design, or does not have time to spend on infrastructure building, and just wants to see something begin to work. The result is that more time is wasted during debugging and feature enhancement.
The fragile concurrency architecture is hard to understand. It is almost impossible for new engineers to take over the design; instead, they abandon the old one and propose a "better architecture", which usually goes through the same design cycle and delays the schedule. The software manager usually has little insight into the engineer's redesign approach and can only accept it, since neither the engineer nor the manager has a way to improve the old architecture.
The manager and software engineer mistakenly think that object oriented analysis of the concurrent hardware modules will guarantee a good concurrent software design which delivers a flexible concurrent software architecture. Most OOA just helps engineers to identify objects in a system without concurrency analysis, and engineers still have to use synchronization objects in the Operating System to address concurrency. This approach does not help achieve the concurrent system design goals listed above. Unfortunately, most systems are designed with such an approach.
Engineers begin to experience unexplained hangs, and begin to insert sleep calls to work around weird timing problems, simply because the understanding of system resource concurrency was not complete at the beginning, and the design cannot adapt to a different running environment. When switching to a different platform, such as a faster machine, the software needs a major retest, or possibly an overhaul. The engineer and the manager then begin to hide facts from upper management. The development cost goes up, and the software always needs major "improvements" to adapt to new hardware with newly tuned concurrency, which should not happen if it is well designed at the beginning.
Typically, the design of a system concurrency is architected by a senior person in a team, and it is very hard for other people to challenge the delicate design. System maintenance and enhancement for the concurrency part is a major issue with such a design approach.

How to Address Those Issues

The cost of making a complex concurrent system flexible and reliable is extremely high for average engineers who simply use the Operating System's critical section, semaphore, event, and thread. To address the above problems, we need to develop a platform to help engineers in modeling a concurrent system with an easily understood object model, communicating the design by a user friendly graphical user interface, and verifying the internal concurrency of the design quickly by simulation.

A simple concurrent object model is needed. An object oriented analysis method based on the model should be easy to perform.
An inter-task communication mechanism is provided based on the object model to allow task synchronization.
A design development toolkit is needed to support the object oriented analysis, and to help software engineers spend more time on understanding the system concurrency and on system resource control during implementation, instead of struggling with multithreaded code built on Operating System synchronization objects such as critical sections, semaphores, and events.
The design platform provides a graphic presentation of the system concurrent execution status that helps software engineers to present and to validate a design effectively. Eventually, it will help the whole team, even different groups, to understand the system concurrency internals.
The object model and its development toolkit allow separation of the code performing system resource concurrency and the code performing system resource control. In figure 1, the code controlling Resource 1 is a resource control domain. The code controlling independent Resources 1 and 2 to operate in parallel is a system resource concurrency domain. This architecture helps the manager to partition the concurrent system design work into two domains so that it can be assigned to different engineers to improve team productivity and product reliability.
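
The separation of the two domains can be illustrated with a minimal Python sketch. It is not JEK SDK code: the functions control_resource_1, control_resource_2, and run_in_parallel are hypothetical names, and plain threading stands in for the platform. The point is only that the control functions contain no synchronization, while the concurrency function knows nothing about how a resource is operated.

```python
import threading

# Resource control domain: each function operates one resource and
# contains no threads, locks, or events.
def control_resource_1(log):
    log.append("resource 1 done")

def control_resource_2(log):
    log.append("resource 2 done")

# Resource concurrency domain: decides which controls may overlap,
# without knowing what each control does.
def run_in_parallel(controls):
    shared_log = []
    log_lock = threading.Lock()      # protects shared_log only

    def guarded(control):
        local_log = []
        control(local_log)           # control code runs lock-free
        with log_lock:
            shared_log.extend(local_log)

    threads = [threading.Thread(target=guarded, args=(c,)) for c in controls]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared_log

result = run_in_parallel([control_resource_1, control_resource_2])
print(sorted(result))
```

With this split, one engineer could change how Resource 1 is driven while another changes which resources overlap, and neither edit touches the other's code.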

Usually, a manager will not commit resources to implementing the above environment to help the design in the long term, since it is very time consuming and shows no immediate results.

A Solution

JEK Platform is designed to address these issues and to meet the above concurrent system design goals, making concurrent problems easier for software engineers to model. JEK SDK automatically turns a modeled application job into a concurrent execution engine. The object model also separates the resource synchronization code from the resource control code, so that the engineer can spend more time understanding system concurrency, instead of dealing directly with Operating System synchronization objects, as most software engineers do. It also helps the engineer spend more time communicating their understanding of the system concurrency within a team, or with other teams.

Two Samples

Here, two samples are presented (please go to www.jekplatform.com/CodeProjectSamples.htm to get the source code) to demonstrate how the JEK Platform works:

Sample 1: Philosophers dining problem.
Sample 2: Automated coffee machine.

Sample 1. Philosophers Dining Problem

In the philosophers dining problem, five philosophers sit around a table doing what they do best: thinking and eating. In the middle of the table is a plate of food, and between each pair of neighboring philosophers is a fork. The philosophers spend most of their time thinking, but when they get hungry, they reach for the two forks next to them and start eating. A philosopher cannot begin eating until he has both forks. When he is done eating, he puts the forks down and continues thinking.

To solve the problem with JEK Platform, five routes are defined to represent the actions of five philosophers. Each route has two tasks: eat and think. Obviously, the eat tasks of all philosophers' routes cannot all be active at the same time because of resource constraints: neighboring philosophers share a fork. A Mutex synchronization resource is used to restrict the eat task of each philosopher. The resource allocation scheduling algorithm in the JEK Kernel has an important feature to avoid a deadlock in this sample: if one philosopher gets a fork and finds that the other is already taken, it releases the fork it holds and notifies the tasks in other routes so that another philosopher can continue to eat.
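
The release-and-retry idea described above can be sketched with standard Python locks. This is not the JEK Kernel scheduler: the names philosopher, forks, and meals_eaten, the number of meals, and the sleep times are all hypothetical, and a short random pause replaces the kernel's notification of other routes.

```python
import random
import threading
import time

N = 5
MEALS = 3                                   # hypothetical meals per philosopher
forks = [threading.Lock() for _ in range(N)]
meals_eaten = [0] * N                       # slot i is written by thread i only

def philosopher(i):
    left, right = forks[i], forks[(i + 1) % N]
    while meals_eaten[i] < MEALS:
        left.acquire()
        # Try the second fork without blocking.
        if right.acquire(blocking=False):
            time.sleep(0.001)               # eat
            meals_eaten[i] += 1
            right.release()
            left.release()
            time.sleep(0.002)               # think
        else:
            # Second fork is taken: give the first one back so that a
            # neighbor can proceed, then retry after a short random pause.
            left.release()
            time.sleep(random.uniform(0.0, 0.002))

threads = [threading.Thread(target=philosopher, args=(i,)) for i in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join(timeout=30)
print(meals_eaten)
```

Because no philosopher ever waits while holding a fork, the circular wait that causes the classic deadlock cannot form. As in the JEK sample, this sketch does not guarantee fairness between philosophers.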

Figure 2 shows the timing diagram of the philosopher dining job execution engine. The application code is pretty simple: it does not need to handle threads and thread synchronization, which are handled in the JEK SDK, but simply describes the resources (forks) and tasks (philosophers' actions).

Figure 2. JEK Studio monitors the philosophers dining job execution

In figure 2, the JEK Studio GUI components are marked by yellow bubbles:

The task matrix presents the application engine internal structure and the real-time execution activity status.
The task timing diagram presents a more detailed real-time execution status for tasks, which helps developers to understand and to validate concurrent system behavior quickly.
The activity resource matrix presents real-time task activity resource status.
The synchronization resource matrix presents the real-time status of which tasks occupy the synchronization resources.
The task trace window displays log status, which is also saved in a log file.

In JEK Studio, five routes are shown in the task matrix. Its execution is shown in the task execution timing diagram. The four blue bubbles are explained as follows:

Blue bubble 1. Job execution engine starts to execute job. Philosophers 1 and 4 start to eat.
Blue bubble 2. Philosophers 3 and 5 start to eat at the same time. Philosophers 1 and 4 start to rest at the same time. Philosophers 3 and 5 can start to eat at the same time because the application code is configured so that the eating time is the same for all philosophers. The resting time is also the same for all philosophers.
Blue bubble 3. It's interesting to observe the job execution status after a few loops. Philosophers 1, 2, and 4 are resting. Philosophers 3 and 5 are eating. If observed carefully, philosophers 3 and 5 do not start to eat at exactly the same time, and philosophers 1, 2, and 4 do not start to rest at the same time.
Blue bubble 4. This is another interesting job execution status. Only one philosopher, #2, is eating at this moment. Philosophers 1, 3, 4, and 5 are all resting. The reason is that the resting times of all philosophers are longer than the eating times.
Another observation is that fairness is not guaranteed for each route. It's unpredictable which philosopher will get a chance to eat next time based on the scheduler used.
Route starvation is possible. In other words, some philosophers might never get a chance to eat. This is not demonstrated in the graph since the result is random. You can try to start the engine a few times and the results could be different each time.

Without simulation, it is very hard for a software engineer to answer whether the scenarios marked by blue bubbles 3 and 4 are possible.
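
One part of that question can be checked by enumeration rather than by running the engine. The sketch below lists every group of philosophers that the fork constraint allows to eat at the same time. It covers only the resource constraint, not timing, and the helper names neighbors and can_eat_together are hypothetical.

```python
from itertools import combinations

N = 5  # philosophers 1..5 sit in a circle; neighbors share a fork

def neighbors(a, b):
    return (a - b) % N in (1, N - 1)

def can_eat_together(group):
    # A group can eat together only if no two members share a fork.
    return not any(neighbors(a, b) for a, b in combinations(group, 2))

feasible = [group
            for size in range(1, N + 1)
            for group in combinations(range(1, N + 1), size)
            if can_eat_together(group)]
largest = max(len(group) for group in feasible)

print(largest)
print((1, 4) in feasible, (3, 5) in feasible, (1, 2) in feasible)
```

At most two philosophers can eat at once, and the pairs seen in figure 2 (1 and 4, 3 and 5) are among the allowed ones. Whether a particular state is actually reached still depends on the eating and resting times, which is what the timing diagram shows.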

Sample 2. Automated Coffee Machine

An automated coffee machine mixes milk, sugar, and coffee in a cup, and serves the cup to the customer when it is done.

Figure 3. Coffee machine model analysis

The coffee machine has five robots:

Platform robot. It holds the coffee cup, so that milk, sugar, and coffee can be poured into it and mixed. After the coffee is mixed, it moves the cup with the mixed coffee to a customer.
Cup robot. It puts an empty cup onto the platform robot.
Milk robot. It pours milk into the coffee cup on the platform robot.
Sugar robot. It pours sugar into the coffee cup on the platform robot.
Coffee robot. It pours coffee into the coffee cup on the platform robot.

Coffee machine operating procedure:

Step 1. All robots are in initial positions.
Step 2. Cup robot puts an empty cup onto the platform robot.
Step 3. Milk robot pours milk into the cup.
Step 4. Sugar robot pours sugar into the cup. Steps 3 and 4 can run in parallel.
Step 5. Coffee robot pours coffee into the cup.
Step 6. Platform robot moves the mixed coffee to the customer.

Important operating requirements of the coffee machine are as follows:

The above operating procedure has to be followed. Otherwise, the robots might be in the wrong positions, which can result in robot damage.
Milk and sugar need to be poured into the cup before coffee, so that the coffee mixes properly without adding a coffee stir robot, which would increase the complexity of the machine.
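
The operating procedure and its ordering requirements can be written down as a dependency graph. The sketch below uses Python's standard graphlib module (Python 3.9 or later) to derive which steps may run together. The step names and the helper parallel_waves are hypothetical labels for this illustration, not JEK identifiers.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Each step maps to the set of steps that must finish before it starts.
procedure = {
    "put_cup": set(),
    "pour_milk": {"put_cup"},
    "pour_sugar": {"put_cup"},
    "pour_coffee": {"pour_milk", "pour_sugar"},
    "serve": {"pour_coffee"},
}

def parallel_waves(graph):
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())   # steps that may run together
        waves.append(ready)
        sorter.done(*ready)
    return waves

waves = parallel_waves(procedure)
print(waves)
```

The second wave contains both pouring steps for milk and sugar, which matches the statement that they can run in parallel, and coffee appears only after both.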

Solution 1

To solve the problem with JEK Platform, two routes are defined. One route is designed to control the cup robot and the platform robot. Another route is designed to control the milk, sugar, and coffee robots. These routes are chosen because the task steps inside each route are sequential. The synchronization resource between routes is the platform robot. For detailed analysis and code, please go to http://www.jekplatform.com/CodeProjectSamples.htm to download the complete JEK Platform and to look for sample section 7: Machine Control.

Figure 4 is the coffee machine execution timing diagram, implemented with the JEK SDK and presented in JEK Studio. The X axis is time and the Y axis is tasks. A bar is the execution time of a task. Tasks within one route are displayed in one color. Different tasks in one route are displayed at different Y axis values. Multiple bars at the same Y axis value represent the same task executed at different times.

Route 1, controlling the milk, sugar, and coffee robots (orange), has three tasks from bottom to top:

Task1_1: control milk robot to pour milk.
Task1_2: control sugar robot to pour sugar.
Task1_3: control coffee robot to pour coffee.

Route 2, controlling the cup robot and the platform robot (blue), has two tasks from bottom to top:

Task2_1: control cup robot to put cup onto platform robot.
Task2_2: control platform robot to serve mixed coffee to customer.

The six blue bubbles in figure 4 are explained as follows:

Blue bubble 1. Task2_1 puts a cup onto the platform robot.
Blue bubble 2. Task1_1 and Task1_2 pour milk and sugar into an empty cup at the same time.
Blue bubble 3. Task1_2 finishes pouring sugar and Task1_1 is still pouring milk.
Blue bubble 4. Task1_3 starts pouring coffee.
Blue bubble 5. Task2_2 controls the platform robot to serve mixed coffee.
Blue bubble 6. Repeat the same process.

Figure 4. Coffee machine execution status of solution 1
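
The shape of the timing diagram can be approximated with simple arithmetic. The sketch below assumes that one cup is completed before the next one starts, and it uses invented task durations; none of the numbers are read from figure 4, and the function name solution_1_cycle is hypothetical.

```python
# Hypothetical task durations in seconds; not taken from figure 4.
durations = {"put_cup": 2, "pour_milk": 4, "pour_sugar": 2,
             "pour_coffee": 3, "serve": 3}

def solution_1_cycle(d):
    # Milk and sugar overlap, so only the longer of the two counts.
    pour = max(d["pour_milk"], d["pour_sugar"]) + d["pour_coffee"]
    # One full cycle: place the cup, pour everything, serve.
    cycle = d["put_cup"] + pour + d["serve"]
    return cycle, pour

cycle, route_1_busy = solution_1_cycle(durations)
route_1_idle = cycle - route_1_busy
print(cycle, route_1_busy, route_1_idle)
```

Under these assumed durations, the route that pours milk, sugar, and coffee is idle for 5 of every 12 seconds, namely while the cup is being placed and served.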

Solution 2

This machine is not very efficient: route 1 is idle after blue bubble 5. To increase the throughput, another independent platform robot is added to keep route 1 as busy as possible. The position of platform robot 1 is different from that of platform robot 2. Therefore, the control code for pouring milk, sugar, and coffee differs in context between the two platforms, but has the same structure. Figure 5 is the new robot diagram.

Figure 5. Platform 2 robot is added

Route 3 (burgundy red) is added to serve the second cup; it appears as red in figure 6. It has the same tasks as route 2. Since a new platform robot is added, routes 2 and 3 are redefined as follows:

Route 2 controlling cup robot (blue) has two tasks:

Task2_1: control cup robot to put cup onto platform robot 1.
Task2_2: control platform robot 1 to serve mixed coffee to customer 1.

Route 3 controlling cup robot (burgundy red) has two tasks:

Task3_1: control cup robot to put cup onto platform robot 2.
Task3_2: control platform robot 2 to serve mixed coffee to customer 2.

Figure 6. Solution 2 coffee machine has two platform robots

The five blue bubbles in figure 6 are explained as follows.

Blue bubble 1. Task3_1 puts the cup onto platform robot 2. Routes 2 and 3 are both started at the same time, but only one route can use the cup robot at a time. Which route gets the cup robot first is random; in this run, route 3 gets it.
Blue bubble 2. Task1_1 and Task1_2 pour milk and sugar into an empty cup at the same time after a cup is put on platform robot 2. Note: the execution context of route 2 is platform robot 1 since it has a different location than platform robot 2. In other words, the control code is different when the context is different.
Blue bubble 3. Task2_1 starts to put a cup on platform robot 1. The coffee robot uses Task1_3 to pour coffee into the cup on platform robot 2. Both tasks are started at the same time.
Blue bubble 4. Platform robot 2 uses Task3_2 to serve mixed coffee to customer 2. Task1_1 and Task1_2 pour milk and sugar into the empty cup on platform robot 1. The route 1 execution context (which platform robot's cup receives the milk, sugar, and coffee) is not visible from the timing diagram. It is only visible from the trace window or log file.
Blue bubble 5. Task1_3 pours coffee into the cup on platform robot 1.

Comparing figures 4 and 6, route 1 in figure 6 is busy almost all the time. Therefore, the throughput of the coffee machine with two platform robots is higher.

Solution 3

On closer inspection, however, the machine with two platform robots can be even faster. The coffee robot can serve one platform robot while the milk and sugar robots are serving the other, provided that the time to put a cup on the platform is shorter than the time the coffee robot takes to pour coffee.

To increase the speed of the coffee machine, an engine design with a different route configuration is used. Route 1 represents the actions of serving sugar, milk, and coffee to platform robot 1. Route 2 represents the actions of serving sugar, milk, and coffee to platform robot 2. Routes 3 and 4 control the cup robot and the platform robots, and their design is the same as before. Figure 7 is the timing diagram of the new coffee machine.

Figure 7. New concurrency of higher system throughput for the coffee machine with 2 platform robots
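
The route-per-platform idea can be sketched with one lock per shared robot. This is a simplification and not the JEK implementation: it merges the cup, pouring, and serving actions for a platform into a single hypothetical route function, uses invented durations, and relies on Python's threading module instead of the JEK SDK.

```python
import threading
import time

# One lock per shared robot; a route holds the lock while using the robot.
robots = {name: threading.Lock() for name in ("cup", "milk", "sugar", "coffee")}
events = []                        # (platform, step) in completion order
events_lock = threading.Lock()

def use(robot, platform, seconds):
    with robots[robot]:
        time.sleep(seconds)        # hypothetical duration of the robot motion
        with events_lock:
            events.append((platform, robot))

def route(platform):
    use("cup", platform, 0.01)
    # Milk and sugar are separate robots, so both pours may overlap.
    pours = [threading.Thread(target=use, args=(robot, platform, seconds))
             for robot, seconds in (("milk", 0.03), ("sugar", 0.01))]
    for t in pours:
        t.start()
    for t in pours:
        t.join()
    use("coffee", platform, 0.02)  # only after both milk and sugar are in
    with events_lock:
        events.append((platform, "serve"))

routes = [threading.Thread(target=route, args=(p,)) for p in (1, 2)]
for t in routes:
    t.start()
for t in routes:
    t.join()
print(events)
```

Because each robot is locked individually, the sugar robot can move on to the second cup as soon as it finishes the first, even while the milk robot is still pouring. The per-cup ordering (cup, then milk and sugar, then coffee, then serve) is preserved for both platforms.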

The five blue bubbles in figure 7 are explained as follows:

Blue bubble 1. Task4_1 puts cup onto platform robot 1.
Blue bubble 2. Task1_1 and Task1_2 pour milk and sugar at the same time into the empty cup on platform robot 1. The cup robot starts to put a cup on platform robot 2 since it is free at this time.
Blue bubble 3. Task2_2 starts to add sugar into cup 2 on platform robot 2. This can happen because the sugar robot has just finished adding sugar to cup 1 on platform robot 1. The milk robot is still busy pouring milk into cup 1. Therefore, cup 2 only has sugar for now.
Blue bubble 4. Task2_1 starts to pour milk into cup 2 since the milk robot has just finished pouring milk into cup 1 on platform robot 1. The coffee robot begins to add coffee into cup 1 on platform robot 1.
Blue bubble 5. Platform robot 1 begins to serve cup 1 to the customer since coffee is done for cup 1.

It is obvious that the choreography of this new coffee machine is different from that of the previous one. To a user, its robots appear smarter and seem to work more intelligently, since each starts its next job more promptly.

Conclusion

The above two samples, including the three coffee machine solutions, demonstrate the following:

It is easy to model and to analyze the concurrency of machine control with the JEK Platform.
To adapt to a new hardware configuration, the JEK SDK helps achieve new system concurrency with minimum code changes.
JEK Studio can visually identify system throughput potential quickly so that better throughput can be achieved.
The solutions demonstrated in these samples have a pretty simple architecture (see the downloaded code).

Without the JEK Platform, can we solve the problem quickly? A few questions are raised here.

If several teams, such as marketing, hardware, and software, are working on the product, they might stop when solution 2 is working. Do they know that solution 3 is the best solution? Can they figure it out quickly?
Team members might be satisfied when they see that robots are working in parallel. If it is found that improvements can be made, how much code change is needed from solution 2 to solution 3 if implemented with synchronization objects in the Operating System?
How much competitiveness would a company gain from having a more efficient and reliable machine?

How to Get the Two Samples

Please go to http://www.jekplatform.com/CodeProjectSamples.htm to download the complete JEK Platform which includes the two samples.