In the first article about the Unified Concurrency framework and the implemented synchronization primitives, I skipped all the complex parts about benchmarking methodologies, and yet I dared to present some basic performance assessments of each synchronization primitive without any proof. Now is the time and place to dive deep into the dark magic of benchmarking and the methodologies used to understand the properties of a general synchronization primitive. I am going to present the benchmarking methodology I have used to understand the properties and behavior of the synchronization primitives implemented or wrapped in the Unified Concurrency framework, and the methodology will be applied to and studied on the C# lock (Monitor class).
We will have to consider different scenarios, the problems of comparing multiple data sequences against each other, the calibration of data sequences to ensure relevant comparison, and even the selection of the type of data. The cherry on the cake is then putting lots of data into a reasonable, self-explanatory chart.
The Unified Concurrency framework is implemented in the open-source GreenSuperGreen library, available on GitHub and NuGet (3 packages: lib, unit tests, benchmark).
We need to measure three important parameters:
- CPU load, taken from the processor time performance counter.
- The number of times we have gained lock access.
- The time span between synchronization primitive entry and exit.
The CPU load requires one measurement per second from the performance counter. The access counter is easy. The time span is the problematic one. Without special hardware, we have no direct means of precisely measuring short intervals below a millisecond: we need a time source that works reliably in single-threaded and multi-threaded scenarios, does not affect the measurement, and has microsecond uncertainty. Calling into the system to get the time is not free of charge. Obviously, the systems we are working with are not made for precise real-time measurements.
We have to abandon hope that we can measure the time span of each access directly. To gain sub-millisecond precision, we have to adapt: measure a large number of accesses for about ten seconds and then divide the total time. Can we gain a precise time span measurement this way? Yes. With a large enough number of accesses, we can measure these time spans precisely, and we can neglect some irregularities caused by operating system threading if we use the median instead of a simple average. Of course, there are some limitations; I will address these with SpinLock.
The time span computed from these measurements describes the median Throughput Period for the given computational complexity of the locked code protected by the given synchronization primitive. The median Throughput Period effectively describes the median time span necessary to gain lock access, execute the code in the locked area, and exit the lock. Measured for n threads and a given computational complexity of the locked code, it tells us how much work, and how often, we can push through a given synchronization primitive under a specific configuration.
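The amortized measurement described above can be sketched in a few lines. This is an illustrative sketch, not the library's actual code; the class and method names are mine:

```csharp
using System;
using System.Diagnostics;
using System.Linq;

public static class ThroughputPeriod
{
    // One sample: time a large batch of lock accesses with Stopwatch and
    // divide, so the cost of reading the clock is amortized over the batch.
    public static double MeasureSample(object gate, long accesses)
    {
        var sw = Stopwatch.StartNew();
        for (long i = 0; i < accesses; i++)
        {
            lock (gate) { /* unit of work would go here */ }
        }
        sw.Stop();
        return sw.Elapsed.TotalMilliseconds / accesses; // ms per access
    }

    // Median of repeated samples; robust against scheduling outliers
    // that would skew a simple average.
    public static double Median(double[] samples)
    {
        var s = samples.OrderBy(x => x).ToArray();
        int mid = s.Length / 2;
        return s.Length % 2 == 1 ? s[mid] : (s[mid - 1] + s[mid]) / 2.0;
    }
}
```

The key point is that `Stopwatch` is queried only twice per batch, so its own overhead does not pollute the per-access figure.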
The Unified Concurrency framework is currently capable of unifying locking types of synchronization primitives. The interesting part is that the ultimate goal of these types of synchronization primitives is to mimic sequential, single-threaded execution. We can therefore use a sequential single-threaded measurement to describe the ideal synchronization primitive and create the basis for comparison.
Let's assume that the ideal linear speedup for the sequential single-threaded case is one. For an n-core CPU running n threads contending an ideal synchronization primitive, we can assume that the speedup would be 1/n. Real synchronization primitives will be much worse than 1/n: they have to lose some time on communication and coordination of threads and data over the CPU/cache/memory buses.
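Assuming the 1/n speedup translates into an n-times longer per-thread throughput period, the calibration of the ideal baseline for n contending threads is a single multiplication. A minimal sketch (illustrative naming, not the library's API):

```csharp
public static class IdealCalibration
{
    // The ideal speedup for n threads contending one lock is 1/n:
    // each thread can hold the lock at most 1/n of the time, so its
    // ideal throughput period is n times the sequential one.
    public static double AdjustedIdealPeriodMs(double sequentialPeriodMs, int threads)
        => sequentialPeriodMs * threads;
}
```

For example, a 2 ms sequential throughput period calibrated for 24 threads yields a 48 ms ideal per-thread period; any real primitive can only be slower.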
These relations are important for understanding that throwing too much parallelism on a sequential problem is not going to improve performance.
The unit of work relates to the given hardware. It is sufficient if the unit of work, executed many times in a cycle on a single thread on an uncontended CPU, has a reasonably similar execution time. On modern CPU architectures it is pointless to count instructions and operations; it is better to understand the relative differences between synchronization primitives. There is one restriction on the unit of work: it will be an integer operation over one integer value on the stack.
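Such a unit of work can be sketched as follows (illustrative, not the library's code). Note that in a real benchmark the result must be consumed, otherwise the JIT may eliminate the loop entirely:

```csharp
public static class UnitOfWork
{
    // The unit of work: an integer operation over a single stack value,
    // repeated `spins` times to form the computational complexity
    // of the locked region.
    public static int Spin(int spins)
    {
        int value = 0; // lives on the stack
        for (int i = 0; i < spins; i++)
        {
            value++; // one integer operation per spin
        }
        return value; // returning it keeps the JIT from removing the loop
    }
}
```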
To fully understand the behavior of each synchronization primitive, we will create spectral data based on the number of spins of the unit of work. A given number of spins of the unit of work creates a defined computational complexity and relates to the median Throughput Period specific to a given configuration: synchronization primitive, number of threads, and scenario. We will create a list of spin counts. For each number of spins, we will measure multiple CPU loads and Throughput Periods and compute the median CPU load and the median Throughput Period for that number of spins.
To understand the relation, we will compare median Throughput Periods and median CPU loads for the same number of spins. On the same hardware, we can compare measured data for configurations with the same number of spins of the unit of work. This gives us the ability to compare median Throughput Periods and median CPU loads for the same spin count between different synchronization primitives and/or configurations.
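The spectral sweep described above can be sketched like this, a hypothetical outline rather than the framework's implementation (the spin series is log2-spaced, matching the chart scaling discussed below):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SpectralSweep
{
    // Log2-spaced spin counts driving axis-x: 1, 2, 4, ... up to maxSpins.
    public static long[] SpinSeries(long maxSpins)
    {
        var spins = new List<long>();
        for (long s = 1; s <= maxSpins; s *= 2) spins.Add(s);
        return spins.ToArray();
    }

    // For each spin count, run the measurement several times and keep the
    // median, yielding one (spins -> median value) point per configuration.
    public static Dictionary<long, double> Sweep(
        long maxSpins, int repeats, Func<long, double> measure)
    {
        var result = new Dictionary<long, double>();
        foreach (long spins in SpinSeries(maxSpins))
        {
            var samples = Enumerable.Range(0, repeats)
                                    .Select(_ => measure(spins))
                                    .OrderBy(x => x)
                                    .ToArray();
            result[spins] = samples[samples.Length / 2]; // median (upper for even counts)
        }
        return result;
    }
}
```

The same sweep is run twice per configuration, once with a CPU-load `measure` callback and once with a throughput-period one, producing the two data series of each chart.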
Axis-x is the number of spins of the unit of work and all other data sequences are driven by this number.
Axis-y-left is the median of a 10-second measurement of CPU usage; it is the solid-line axis: always represented by a solid data line, with round measured data points filled with the same color as the data line.
Axis-y-right is the median of a 10-second measurement of the Throughput Period; it is the dotted-line axis: always represented by a dotted data line, with round measured data points having a different color in the center.
Note: Axis-x and Axis-y-right are scaled by log2, which helps to avoid wide areas where the data series do not change in large enough steps.
It is important to understand that any vertical line constructed on the chart, crossing a median CPU data point, a median Throughput Period data point, and a number-of-spins data point, actually represents a 10-second measurement during which the number of spins is constant and equal to the number of spins where the vertical line crosses axis-x.
Each chart is constructed from a large amount of data gathered over minutes long measurements for each synchronization primitive.
Based on the described methodology, we can construct the first benchmark chart for the ideal synchronization primitive.
Chart 1: Ideal Synchronization Primitive with single thread ideal throughput period and adjusted ideal throughput period for 24 threads.
In Chart 1, we can see one constant data line: the median CPU usage, which is constant at about 4% CPU. It is not clearly visible, but CPU data are always solid data lines, with round measurement points filled with the same color as the data line.
The median Throughput Periods for the ideal synchronization primitive create a diagonal line, thanks to the log2 scaling of axis-x and axis-y-right, with some differences for the non-ideal throughput periods of all non-ideal synchronization primitives. In Chart 1 we can see two diagonal lines. One is the clean measurement; the other is calibrated for the case of the ideal synchronization primitive contended by 24 threads. This kind of calibration will be usual for all synchronization primitive benchmarks.
The number of measurements was designed for larger charts in Excel, almost twice as large. Forcing the same charts into 640 pixels is challenging, but all parts are still recognizable, and the data series that are not exactly constant won't cause any issues.
Any multi-threaded software depending on synchronization hides some part of its behavior, visible only in edge cases. The more complex the software gets, the less knowledge we have about our own software, because it becomes increasingly difficult to push the software into all possible edge cases, and sometimes even making the complete list of all the edge cases can be an issue. Performance testing outside the production phase is not always successful at detecting issues, exactly because it is difficult to prepare all the edge cases (peak loads) that occur randomly and naturally in the production phase.
I strongly believe that the solution to these issues lies in detailed knowledge of synchronization primitives, because they drive software behavior in edge cases. If we think about software execution pathways in a highly concurrent multi-threaded environment in terms of contention in the time domain, we can recognize two types of contention:
- Direct contention – shared synchronization primitive, multiple threads knocking on the same door.
- Indirect contention – multiple unrelated groups of threads, each contending at a minimal level; each group has its own synchronization primitive, multiple doors.
Now the interesting part comes from the question: what is the source of contention in the case of Indirect Contention? The groups do not share a synchronization primitive with each other; they can actually be divided into different processes. The source of contention here is the shared CPU, memory bandwidth, and bus, especially in cases where atomic instructions are used to acquire the lock (hybrid locking approaches). Busy spin-waiting also burns CPU cycles, which is significantly reduced on modern CPUs with Simultaneous Multi-Threading, but the memory bandwidth cost cannot be avoided: interlocked operations repeatedly attempting to acquire entry come with a high rate of cache synchronizations, affecting other services on a shared server box.
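To make the hardware cost concrete, here is a minimal test-and-test-and-set spin lock sketch (illustrative only, not Monitor's or the library's implementation). Even with the cheap read in front, every failed `CompareExchange` forces cache-line synchronization, which is exactly the traffic that unrelated neighbors on the same box end up paying for:

```csharp
using System.Threading;

public sealed class SpinOnlyLock
{
    private int _taken; // 0 = free, 1 = held; lives on a shared cache line

    public void Enter()
    {
        while (true)
        {
            // Cheap volatile read first (test-and-test-and-set)
            // to limit bus traffic while the lock is visibly held...
            if (Volatile.Read(ref _taken) == 0 &&
                Interlocked.CompareExchange(ref _taken, 1, 0) == 0)
                return;
            // ...but each CAS attempt above still causes a cache-line
            // ownership transfer between cores.
            Thread.SpinWait(1);
        }
    }

    public void Exit() => Volatile.Write(ref _taken, 0);
}
```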
Figure 1: Direct contention
Figure 2: Indirect contention
The pictures above describe a simple benchmark, far from edge cases: contention is average, and the resources wasted by synchronization are usually small. It is the easiest testing scenario and usually does not require complex changes to the tested code, but it cannot be used to extrapolate or reason about behavior during peak loads in edge cases. The cost of synchronization during peak loads is capable of dissipating 99% of CPU resources with very little work being done, while in a simple benchmark the probability of synchronization contention is greatly decreased, so we often cannot observe edge cases and heavy contention in our code without code changes. The simple benchmark can describe only an incomplete picture of the synchronization primitive's impact. We therefore have to be interested in heavy direct contention, where the cost of the synchronization primitive will be visible, measurable, and adjustable in the time domain, so that we can study the spectral behavior of two relations: A) number of spins × CPU usage, B) number of spins × Throughput Period.
Figure 3: Heavy direct contention
It is similar for the Heavy indirect contention scenario, which makes it possible to study hardware side effects between multiple unrelated synchronization primitives that share nothing but the hardware.
Figure 4: Heavy indirect contention
There will be two types of test scenarios:
- Heavy Contention – representing the hardest case of Heavy Direct Contention.
- Bad Neighbor – representing the hardest case of Heavy Indirect Contention, where the cumulative effect of minimally contending thread pairs, and the ability to play well with others, will be made as visible as possible.
These two scenarios, in edge cases under peak loads, can be viewed as compression in the time domain, which increases the level of contention, as it becomes more probable to catch another thread holding exclusive access.
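The shape of the two scenarios can be sketched as follows; this is a hypothetical outline of the thread layouts, not the benchmark harness itself. Heavy Contention puts all threads on one door; Bad Neighbor gives each pair its own door, so any wastage comes from the hardware they share:

```csharp
using System;
using System.Threading;

public static class Scenarios
{
    // Heavy Contention: every worker contends the same lock.
    public static long HeavyContention(int threads, int iterations)
    {
        var gate = new object();
        long counter = 0;
        var workers = new Thread[threads];
        for (int t = 0; t < threads; t++)
            workers[t] = new Thread(() =>
            {
                for (int i = 0; i < iterations; i++)
                    lock (gate) { counter++; } // one shared door
            });
        foreach (var w in workers) w.Start();
        foreach (var w in workers) w.Join();
        return counter;
    }

    // Bad Neighbor: independent pairs, one private lock per pair;
    // contention is indirect, through shared CPU, caches, and buses.
    public static long BadNeighbor(int pairs, int iterations)
    {
        var gates = new object[pairs];
        var counters = new long[pairs];
        for (int p = 0; p < pairs; p++) gates[p] = new object();
        var workers = new Thread[pairs * 2];
        for (int w = 0; w < workers.Length; w++)
        {
            int pair = w / 2; // two threads share each door
            workers[w] = new Thread(() =>
            {
                for (int i = 0; i < iterations; i++)
                    lock (gates[pair]) { counters[pair]++; }
            });
        }
        foreach (var t in workers) t.Start();
        foreach (var t in workers) t.Join();
        long total = 0;
        foreach (long c in counters) total += c;
        return total;
    }
}
```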
Figure 5: Tested Hardware and Software
The throughput of Monitor, the C# lock, behaves very well: monotonically, similar to the sequential throughput, without surprises. But the amount of CPU resources spent is troubling. The shorter the throughput period gets, the more CPU resources are wasted in heavy contention scenarios! That is significant, as it is contrary to common sense!
The explanation is very simple. As stated, Monitor, the C# lock, is a hybrid lock. During an attempt to gain access, the thread is not suspended immediately when another thread already owns the access; rather, it spin-waits in an attempt to gain access sooner, in the hope of avoiding the thread suspend/resume costs, and only if that does not work is the thread suspended. This hybrid approach was introduced in the times of the first multi-core boxes, and it made throughput faster then, but now, with 24 cores or more, it comes with heavy costs, and with each extra core it gets worse.
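The two-phase idea can be sketched as below. This is an illustrative toy, emphatically not Monitor's actual implementation: phase 1 burns CPU and memory bandwidth on atomic operations hoping the owner leaves quickly; phase 2 gives up and suspends the thread on a semaphore:

```csharp
using System;
using System.Threading;

public sealed class HybridLock
{
    private int _taken;                                  // 0 = free, 1 = held
    private int _waiters;                                // threads in phase 2
    private readonly SemaphoreSlim _parked = new SemaphoreSlim(0);

    public void Enter()
    {
        // Phase 1: bounded spin-waiting, hoping to skip suspend/resume costs.
        for (int i = 0; i < 64; i++)
        {
            if (Interlocked.CompareExchange(ref _taken, 1, 0) == 0) return;
            Thread.SpinWait(16 << Math.Min(i, 6));       // back off a little
        }
        // Phase 2: register as a waiter and block until an owner wakes us.
        Interlocked.Increment(ref _waiters);
        while (Interlocked.CompareExchange(ref _taken, 1, 0) != 0)
            _parked.Wait();                              // suspend the thread
        Interlocked.Decrement(ref _waiters);
    }

    public void Exit()
    {
        Volatile.Write(ref _taken, 0);
        if (Volatile.Read(ref _waiters) > 0)
            _parked.Release();                           // resume one waiter
    }
}
```

With few cores, phase 1 often succeeds and the lock is cheap; with many cores, many threads can sit in phase 1 simultaneously, all hammering the same cache line, which is the wastage the charts make visible.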
Scenarios where we have only short work inside Monitor, the C# lock, but with many requests, lose lots of CPU resources! Obviously, we cannot neglect legacy code-bases and aging software architectures lagging behind hardware architectures anymore.
Based on the chart, throughput periods below 64 ms on the given box should be avoided, because over 60% of the CPU resources will be dissipated into heat! The number can get lower with a lower number of threads. Avoiding high levels of concurrency is always a good checkpoint: are we throwing too much parallelism at a sequential problem? But the throughput level at which 60% or more of CPU resources are wasted will rise with more CPU cores, and that is where hardware is currently heading.
Chart 2: Heavy Contention, Monitor (C# lock)
Please remember that the Bad Neighbor scenario is about the cumulative effects of multiple unrelated thread pairs, each with minimal contention. As mentioned in the Heavy Contention report above, with low contention the throughput period accessible through Monitor, the C# lock, can be significantly lower without great CPU wastage. But even here we can see that Monitor plays a role in CPU wastage: it is twice as high as expected for throughput periods below 16 ms. Monitor, the C# lock, is thus really a Bad Neighbor: a few unrelated services/threads with low contention can cumulatively dissipate over 50% of CPU resources into heat, with only 48% of useful work being done. This statement is based on the ideal sequential trend lines in the chart.
Chart 3: Bad Neighbor, Monitor (C# lock)
The main purpose of this article was to introduce the benchmarking methodologies used in the Unified Concurrency framework / GreenSuperGreen library. We studied the important ideal synchronization primitive, used as the basis for comparison, against the main synchronization primitive: Monitor, the C# lock.
A few issues with Monitor, the C# lock, have been noted.
In the next article, we will finally discuss all the benchmarks at once, and we will have the chance to compare the primitives against each other to find their weaknesses and strengths.