Identify Scalability Problems in Parallel Applications

Intel

0/5 (0 vote)

Apr 17, 2017

CPOL

15 min read

15141

How to Improve Scalability for Intel® Xeon and Intel® Xeon Phi™ Processors Using New Intel® VTune™ Amplifier Memory Analysis

Click here to register and download your free 30-day trial of Intel® Parallel Studio XE.

Vladimir Tsymbal, Software Technical Consulting Engineer, Intel Corporation

With the constantly increasing number of computing cores in modern systems, we expect well-parallelized software to increase performance―preferably linearly―with the number of cores. However, there are some factors limiting parallelism and scalability on multicore systems. We are not going to cover all of them in this article. But in most cases, the limitation is due to the implementation of parallelism:

Load imbalance that leads to idle threads and CPU cores.
Excessive synchronization and, as a result, wasted CPU time in spin-waiting and other nonproductive work.
Parallel runtime library overhead, which might be due to misuse of the library API.

When those limiting factors are eliminated, parallel efficiency improves significantly―with all CPU cores busy doing useful work. Near-linear speedup is observed on well-tuned benchmarks like STREAM or LINPACK. However, with the increasing number of cores on your system (or as you run your code on a newer system with more cores), you might notice that the performance of your application is not increasing linearly―or that parallelism begins to plateau (Figure 1).

Figure 1. Performance changes according to the number of cores

According to the top-down performance analysis approach,¹ you should first check if other components are limiting performance. Make sure that:

Your system is not constantly busy with something else that might consume resources, such as other applications or services consuming compute time.
Your application is not bound to system I/O (e.g., waiting for disk or other file system or network system operations to complete).
Your system has enough physical memory to avoid frequent swapping to the hard disk drive.

As a common recommendation, you are expected to be aware if your hardware is configured properly and the memory subsystem provides expected performance characteristics. For example, you have all memory slots filled with DIMMs that correspond to the motherboard characteristics (e.g., number of channels, memory speed). You can easily check performance of your hardware with known benchmarks. It’s important to do such a check, since it’s easier to fix the problem with hardware than with software optimizations.

Once all these checks are done, look at memory latency as one of the main reasons for poor parallel scaling. In the x86 systems architecture, the CPU retrieves data from its cache subsystem. Ideally, data resides in the cache closest to the CPU (the L1 data cache) by the time it is needed by instructions (Figure 2). The farther requested data is from the CPU, the longer it takes to travel to the CPU core execution units. A CPU hardware prefetcher should help to bring data in faster, but it’s not always possible. Often, data is delayed, which can stall the CPU.²

Figure 2. Retrieving data from the memory subsystem

Basically, there are two reasons why data is late:

When data is requested by an instruction being executed in an EXE unit of the CPU, data bits make the long trip from the main memory or other caches to the CPU’s L1D (i.e., the prefetcher didn’t work). This creates a memory latency problem.
Data is requested in advance (i.e., the prefetcher did its work), but the bits got stuck in traffic on the way to the CPU because of the transport infrastructure capacity. This creates a memory bandwidth problem.

Of course, there might be a combination of both problems if several requests are made from several sources. To avoid these problems, it’s important to make smart usage of data. To solve the memory latency problem, ensure that data is accessed incrementally by its address. Sequential data access (or even unit stride, with a constant small distance) makes the prefetcher’s life easier―and data access faster. To solve the memory bandwidth problem, reuse data and keep it hot in cache as much as possible. Either solution requires reconsidering data access patterns or even the whole algorithm implementation.

What’s Limiting the Scalability of Your Application?

Once we have identified that our code execution is inefficient on a CPU, and we have observed that most stalls are memory bound, we need to define the specific memory problem because the solution is different depending on whether the problem is due to memory latency or bandwidth. We will use Intel^® VTune™ Amplifier’s embedded memory access analysis for a detailed investigation of memory problems.

Let’s consider several iterations of improving a simplified matrix multiplication benchmark. In spite of its simplicity, it effectively demonstrates the possible memory problems that can occur depending on how the algorithm is implemented. For measurements, we will be using an Intel^® Xeon^® processor E5-2697 v4 (code-named Broadwell, 36 cores) system with known theoretical parameters of memory bandwidth = 76.8 GB/s and double-precision (DP) floating point operations per second (FLOPS) = 662 GFLOPS/s.

Naïve Implementation of the Matrix Multiplication Algorithm

The naïve matrix multiplication implementation (multiply1, in Figure 3) will never scale linearly to a large number of CPU cores. Nevertheless, for educational purposes, it’s a good example to illustrate how to identify causes of inefficient performance. The only improvement we would make is adding the –no-alias compiler option in order to allow vectorization. Otherwise, a scalar implementation would be roughly 10 times slower. The results of running the vectorized benchmark multiply1 on matrix size 9216 x 9216 can be found in Table 1. Note that the best performance is well below the theoretical maximum FLOPS.

Table 1. Performance and scaling of the naïve matrix multiplication (36 cores, Intel^® Xeon^® processor E5-2697 v4, two sockets @ 2300 MHz)

No. Threads	Elapsed Time, Seconds	DP FLOPS, GFLOPS/Second
4	208	7.7
8	102	15.1
16	59	26.8
36	42	37.8
72 HT	24	66.1

As Table 1 shows, the parallel benchmark is scaling almost linearly with increasing numbers of threads. Scaling begins to plateau when more than 30 cores are involved. The data in Table 1 might create a false confidence regarding the performance and scaling of the multiply1 benchmark. It’s extremely important to understand how much your benchmark is using the compute power of a machine. In our case, the reported FLOPS (determined in the benchmark) is far from the theoretical number calculated for the machine earlier (approximately 10x smaller). The parallel scalability is not limited but the serial performance is. Note that Intel VTune Amplifier indicates the code execution within the loop is inefficient (Figure 4). The low Retiring and high CPI rates help estimate how far we are from practical limits.

Figure 4. Performance of the naïve, parallel matrix multiplication benchmark

Next, we’ll look at an optimized implementation of the matrix multiplication algorithm (multiply2 in Figure 3). If the algorithm is simple enough, and if your compiler is smart enough, it will recognize the inefficient index strides and generate a version with interchanged loops automatically (or you can do that manually).

void multiply2(int msize, int tidx, int numt, TYPE a[][NUM], TYPE b[][NUM], TYPE c[][NUM], TYPE t[][NUM])
{
	int i,j,k;

// Loop interchange
	for(i=tidx; i<msize; i=i+numt) {
		for(k=0; k<msize; k++) {
#pragma ivdep
			for(j=0; j<msize; j++) {
				c[i][j] = c[i][j] + a[i][k] * b[k][j];
			}}}}

Figure 3. Optimized implementation of the matrix multiplication algorithm (multiply2)

No. Threads	Elapsed Time, Seconds	DP FLOPS, GFLOPS/Second
4	208.8	7.8
8	103.3	15.1
16	58.8	26.4
36	38.4	40.5
72 HT	24.7	63.0

Table 2. Performance and scaling of the optimized matrix multiplication (36 cores, Intel^® Xeon^® processor E5-2697 v4, two sockets @ 2300 MHz)

As you might have noticed from Table 2, the absolute numbers are slightly better, but still far from ideal.

Let’s try to understand what’s limiting performance and scalability. The General Exploration profiling results (Figure 5) implement yet another top-down analysis approach, this time for CPU microarchitecture.³ We might notice a couple of interesting things.

Figure 5. General Exploration profiling results

First, notice that the memory latency for bringing data from DRAM to CPU decreased. This is expected, since we implemented contiguous address access in the algorithm. But the memory bandwidth metric is very high. With that in mind, we should check the bandwidth numbers of the main data paths to make sure the DRAM controller and Intel^® QuickPath Interconnect (QPI) are not bottlenecks. Second, notice the L3 latency is high as well, even though the data access has a contiguous pattern. This requires additional considerations. High L3 latency meant that we frequently have L2 misses, which is strange because the hardware L2 prefetcher should work (and does work, since the DRAM latency does not decrease with contiguous access). Third, the remote DRAM latency is significant. This indicates that there are nonuniform memory access (NUMA) effects, and some portion of data is fetched from remote DRAM for each node. So, to make the whole picture of data transfers clearer, we need to measure data traffic on the DRAM memory controller and the QPI bus between sockets. For that purpose, we use VTune memory access profiling.

Figure 6 shows the profiling results for the example with 72 threads. Only one DRAM controller is loaded with data (package_1), and the average data rate is almost 50 GB/second, which is roughly two-thirds of the maximum bandwidth. On the memory controller of package_0, the traffic is negligible.

Figure 6. Collecting the Memory Access profile on multiply2 with 72 threads

In the same time period, we observe that half of the data traffic in the outgoing QPI lane formed package_1. This explains how the data gets from package_1 DRAM to the package_0 CPU cores (Figure 7). This cross-QPI traffic creates extra latency for data that is being fetched either from remote DRAM by the prefetcher or from remote LLC by a CPU core. Eliminating the NUMA effect might be easy for the benchmark, since the data is well structured and evenly distributed among threads. We just set thread affinities to the CPU cores and have each thread initialize the a, b, and c matrices. But we need to be careful in assuming that allocating data within each thread would eliminate all NUMA effects.

Figure 7. Cross-QPI traffic

Figure 8 shows an example that fails to improve performance under the previous assumption, and a way to detect the problem using Intel VTune Amplifier. In the benchmark source code, we introduce a function that represents threads pinned to enumerated CPUs. Figure 8 shows a part of the code.

CreateThreadPool( … )
{
pthread_t ht[NTHREADS];
pthread_attr_t attr;
cpu_set_t cpus;
pthread_attr_init(&attr);

for (tidx=0; tidx<NTHREADS; tidx++) {
              CPU_ZERO(&cpus);
              CPU_SET(tidx, &cpus);
              pthread_attr_setaffinity_np(&attr, sizeof(cpu_set_t), &cpus);
pthread_create( &ht[tidx], &attr, (void*)start_routine, (void*) &par[tidx]);

}
for (tidx=0; tidx<NTHREADS; tidx++)
pthread_join(ht[tidx], (void **)&status);
}

Figure 8. Threads pinned to enumerated CPUs

In a data initialization function, the arrays should be distributed between threads in the same way the arrays are multiplied in the multiplication function. Figure 9 shows the modification done in the functions to simplify NUMA awareness. In the initialization function, the data array is divided by chunks of size msize/numt, which is the size of the matrix divided by the number of threads. The same is done in the multiplication function shown in Figure 10. Surprisingly, the runtime for the benchmark is not much better than the NUMA-unaware version, so let’s analyze with an Intel VTune memory access profile (Figure 11).

InitMatrixArrays (int msize, int tidx, int numt,  … )
{
    int i,j,k,ibeg,ibound,istep;
    istep = msize / numt;
    ibeg = tidx * istep;
    ibound = ibeg + istep;

    for(i=ibeg; i<ibound; i++) {
      for (j=0; j<msize;j++) {
            a[i][j] = 1.0*i+2.0*j+3.0;
            b[i][j] = 2.0*i+1.0*j+3.0;
            c[i][j] = 0.0;
        }
    }
}

Figure 9. Simplifying NUMA awareness

multiply2(int msize, int tidx, int numt,  … )
{
    int i,j,k,ibeg,ibound,istep;
    istep = msize / numt;
    ibeg = tidx * istep;
    ibound = ibeg + istep;

        for(i=ibeg; i<ibound; i++) {
                for(k=0; k<msize; k++) {
                        for(j=0; j<msize; j++) {
                                c[i][j] = c[i][j] + a[i][k] * b[k][j];
}}}}

./matrix.icc
Threads #: 72 Pthreads
Matrix size: 9216
Using multiply kernel: multiply2
Freq = 2.30100 GHz
Execution time = 20.162 seconds
MFLOPS: 72826.877 mflops

Figure 10. Multiplication function

Figure 11. Memory access profiling

The summary page notifies us that the application is still memory bound (with stalls due to data latencies from memory and data traffic), but the latencies are mostly caused by LLC and less by DRAM. Also, the ratio between local and remote access is very high, which means that the NUMA awareness approach didn’t work. If we check the timeline for traffic over the DRAM controller and QPI (Figure 12), we see that the data stream from DRAM is hardly reaching 30 percent of peak bandwidth, but the QPI is saturated at approximately 90 percent of its capacity in each direction (the practical limit for QPI is 29.2 GB/s for this system).

Figure 12. Checking the timeline for traffic over the DRAM controller and QPI

Remote access (whether DRAM or LLC) is increasing latency for reading memory blocks and making the CPU stall. Those latencies can be measured by Intel VTune Amplifier’s memory access, which allows us to identify which data (matrix) is still being accessed in an inefficient, remote way. If we examine the memory analysis summary (Figure 13), we can observe which memory objects created most of the latency.

Figure 13. Top memory objects by latency

Among the three top memory objects (represented by their allocation function), we notice that one clearly represents the biggest portion of latencies and is responsible for a large number of load operations (Figure 14). Note that only one object has an average latency high enough to conclude that the data is from the remote DRAM of LLC. We can confirm this conclusion by the numbers in the Remote DRAM Access columns.

Figure 14. Memory objects by allocation function

It’s easy to figure out that those three objects are the three a, b, and c matrices. The one with high Stores is matrix c. To identify which matrix data is creating high latencies, you need to check the stack for the memory object in the Intel VTune Amplifier stack pane (Figure 15). Going by the stack in the user code, we can drill down to the source line of data allocation presented in the Intel VTune Amplifier Source View (Figure 16). In this case, it’s a matrix b data that creates latency chatter and an increased number of loads. Now we need to understand why it’s happening despite the fact that the data arrays were allocated and initialized within pinned threads.

Figure 15. Memory objects by stack pane

Figure 16. Intel^® VTune™ Amplifier source view

A quick investigation of the algorithm with transposed matrices reveals a fundamental inefficiency in the data access pattern (Figure 17). For each matrix a row, the whole matrix b has to be read entirely from memory.

Figure 17. Algorithm with transposed matrix

The matrices include about 9K elements in a column/row. So, the whole matrix memory block size will exceed any CPU cache capacity, generating constant cache data eviction and reloads from DRAM. Even though the distributed rows of matrices c and a are accessed by threads on the CPU cores on which they were allocated, it doesn’t completely apply to matrix b. Half of the matrix b data will be read by threads from a remote socket in this implementation of the algorithm. Even worse, reading the whole matrix b for each row of matrix a creates a redundant data load operation (N times more than needed) and generates excessive traffic on QPI for accessing remote data.

Similarly, you can define which data objects were contributing to increased traffic for DRAM or MCDRAM on Intel Xeon Phi processor-based systems. You just need to select which memory domain traffic you want to analyze. You can get the objects’ reference and allocating stack information (Figure 18) and, when grouped by bandwidth Domain and bandwidth Utilization Type, you can observe the objects and identify those that contribute most to the L2 Miss Count (Figure 19).

Figure 18. Analyzing memory domain traffic

Figure 19. Bandwidth domain

Data Blocking

We can decrease the data latencies for eliminating CPU stalls by yet another modification of the multiplication algorithm. We want all data in the three matrixes being accessed by threads running on a local socket. One of the well-known and frequently used modifications is data blocking (Figure 20). It allows working with smaller blocks of arrays from each of the matrices, keeping them hot in caches and reused by CPU (which, in turn, gives opportunities for further performance improvements through optimizing the blocks for CPU cache sizes). Also, this makes it easier to distribute the blocks among threads and prevent massive remote accesses and reloads.

Figure 20. Matrix data blocking

If we look at the results of the cache blocking modification (Figure 21), we can observe that even without fixing NUMA effects, memory latencies are much smaller and execution is much faster.

./matrix.icc
Threads #: 72 Pthreads
Matrix size: 9216
Using multiply kernel: multiply3
Freq = 2.3 GHz
Execution time = 12.08 seconds
MFLOPS: 128710.367 mflops

Figure 21. Cache blocking modification (multiply3)

According to the General Exploration profile (Figure 22), the Retiring pipeline slots increased up to 20 percent, while the rest of CPU stalls are shared between Memory Bound and Core Bound execution.

Figure 22. General Exploration profile (multiply3)

According to the Latency Histogram (Figure 23), most of the latencies are concentrated around L2 access values, while the rest are in the zone of 50 to 100 cycles, which is in the area of LLC Hit latency numbers. The Bandwidth timeline diagram (Figure 24) shows that most of data is taken from a local DRAM, and traffic on QPI is slightly increased. This is still a smaller performance than the Intel^® Math Kernel Library (Intel^® MKL) implementation of the double-precision matrix multiplication (dgemm), but closer to it for this size of matrix (Figure 25). So, the final optimization that we could do is modifying the algorithm to be blocked and fully NUMA aware. The final performance is shown in Table 3 and Figure 26.

Figure 23. Latency Histogram (multiply3)

Figure 24. Bandwidth timeline diagram (multiply3)

$./matrix.mkl
Threads #: 72 requested OpenMP threads
Matrix size: 9216
Using multiply kernel: multiply5
Freq = 2.799980 GHz
Execution time = 2.897 seconds
MFLOPS: 540032.936 mflops

Figure 25. Performance measurement for Intel^® MKL-based multiply5

Table 3. Performance of the final matrix3 optimization (36 cores, Intel® Xeon® processor E5-2697 v4, two sockets @ 2300 MHz)

No. Threads	Elapsed Time, Seconds	DP FLOPS, GFLOPS/s
4	104.8	14
8	60.1	25
16	31.3	49
36	17.85	87
72 HT	12.08	128

Figure 26. Matrix multiplication benchmark results

Note on the scalability graph:

The matrix3 line goes beyond the ideal line due to cache blocking effects, which make the single-threaded version execute faster than a naïve implementation.
Until the number of threads is equal to the number of physical cores, the matrix3 line goes closer to the ideal line, while adding hyper-threading does not improve scaling.

Conclusion

Some memory access patterns might appear to be causing poor scalability in parallel applications due to CPU microarchitecture constraints. To avoid the constraints, you need to identify exactly which data arrays cause the CPU to stall while waiting for data. With Intel VTune Amplifier memory access profiling, you can identify the data objects that cause the biggest delays, as well as the amount of this delay measured in CPU clock ticks, the level of the cache subsystem in which the data reside, and the source code for data object allocation and delayed access. This information should help you reconsider your algorithm to deliver better memory access patterns.

References

Charlie Hewett. Top Down Methodology for Software Performance Analysis
Intel^® 64 and IA-32 Architectures Optimization Reference Manual.
Ahmad Yasin. "A Top-Down Method for Performance Analysis and Counters Architecture." IEEE Xplore: 26 June 2014. Electronic ISBN: 978-1-4799-3606-9.

Try Intel^® VTune™ Amplifier today