
We live in a world where humans rely more and more on computers to solve a variety of engineering problems―ranging from weather prediction to the discovery of lifesaving drugs. We are on the verge of another dramatic change where machines are capable of reaching and even exceeding humans in their ability to make decisions and solve complex problems. Computers have already beaten the best human players in Jeopardy* and Go*, and autonomous cars drive on the roads of California. This is all possible due to petaflop levels of compute power (thanks to Moore’s Law) and the vast amounts of data available for training machine learning algorithms.

At Intel, we work in close collaboration with our leading academic and industry fellow travelers to solve the hardware and software architectural challenges for Intel’s upcoming multicore/manycore compute platforms. To help innovators tackle the complexities of machine learning, we are making performance optimizations available to developers through familiar Intel® software tools, specifically through the Intel® Data Analytics Acceleration Library (Intel® DAAL) and enhancements to the Intel® Math Kernel Library (Intel® MKL).

## The Challenge of Machine Learning

In the last decade, machine learning has been an extremely fast-growing discipline. This growth is fueled by the Internet, which generates immense amounts of data. A desire to extract patterns from the data and apply this knowledge to make predictions has resulted in the development of new approaches and algorithms. The exponential growth of compute power has made it possible to apply these algorithms to enormous data sets and make useful predictions.

Deep neural networks (DNNs) are on the cutting edge of the machine learning domain. These algorithms, which received wide industry adoption in the late 1990s, were initially applied to tasks such as handwriting recognition on bank checks. Deep neural networks have been widely successful at this task, matching, and even exceeding, human capabilities. Today, DNNs are used for image recognition, video and natural language processing, and complex visual comprehension problems such as those posed by autonomous driving. DNNs are very demanding in terms of compute resources and the volume of data they must process. To put this into perspective, the modern image recognition topology AlexNet takes a few days to train on modern compute systems and uses slightly over 14 million images. Tackling this complexity requires well-optimized building blocks to decrease training time and meet the needs of industrial applications.

## Intel MKL

Intel MKL is the high-performance math library for Intel and compatible architectures (**Figure 1**). This library provides implementations of common dense (BLAS and LAPACK) and sparse (Sparse BLAS and Intel® MKL PARDISO) linear algebra routines, discrete Fourier transform, vector math, and statistical functions optimized for current and future Intel® processors. Intel MKL leverages instruction-, thread-, and cluster-level parallelism to boost the performance of numerous scientific, engineering, and financial applications on workstations, servers, and supercomputers.

While Intel MKL was designed for high-performance computing (HPC), the functionality is as universal as mathematics itself. Functions such as matrix-matrix multiplication, fast Fourier transform (FFT), or Gaussian elimination create not only the foundation for many scientific and engineering problems, but also the foundation for machine learning algorithms.

Intel MKL 2017 includes optimized functionality to benefit key machine learning algorithms, along with new DNN extensions to address the unique computational needs of machine learning. We will consider two of these: Deep Neural Network extensions and matrix-matrix multiplication improvements.

## Accelerating DNNs with Intel MKL

### DNN Primitives in Intel MKL

Deep learning, a branch of machine learning that uses deep graphs with multiple processing layers to model high-level abstractions, is rapidly conquering data centers. This approach was inspired by the way living organisms perceive reality through the visual cortex. It has become widely successful in applications like image and video recognition, natural language processing, and recommender systems. These workloads rely on many algorithms, most notably multidimensional convolutions and matrix-matrix multiplications. While convolutions can be expressed as matrix multiplications, it is sometimes more effective to implement a direct approach that produces significantly better performance on modern architectures.
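
To make the two approaches concrete, here is a minimal sketch (not the Intel MKL implementation) of a single-channel "valid" convolution computed directly and via the matrix formulation, often called im2col. The function names `conv2d_direct` and `conv2d_im2col`, the row-major layout, and the single-channel restriction are assumptions made for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Direct "valid" 2D convolution (cross-correlation, as used in DNNs) of an
// H x W single-channel image with a K x K filter. All arrays are row-major.
std::vector<float> conv2d_direct(const std::vector<float>& img, std::size_t H, std::size_t W,
                                 const std::vector<float>& flt, std::size_t K) {
    assert(img.size() == H * W && flt.size() == K * K && K >= 1 && K <= H && K <= W);
    const std::size_t OH = H - K + 1, OW = W - K + 1;
    std::vector<float> out(OH * OW, 0.0f);
    for (std::size_t y = 0; y < OH; ++y)
        for (std::size_t x = 0; x < OW; ++x)
            for (std::size_t i = 0; i < K; ++i)
                for (std::size_t j = 0; j < K; ++j)
                    out[y * OW + x] += img[(y + i) * W + (x + j)] * flt[i * K + j];
    return out;
}

// Same result via im2col: copy every K x K patch into one row of an
// (OH*OW) x (K*K) matrix, then multiply that matrix by the flattened filter.
// The extra copy is the data-transformation overhead the text refers to.
std::vector<float> conv2d_im2col(const std::vector<float>& img, std::size_t H, std::size_t W,
                                 const std::vector<float>& flt, std::size_t K) {
    assert(img.size() == H * W && flt.size() == K * K && K >= 1 && K <= H && K <= W);
    const std::size_t OH = H - K + 1, OW = W - K + 1, KK = K * K;
    std::vector<float> cols(OH * OW * KK);
    for (std::size_t y = 0; y < OH; ++y)
        for (std::size_t x = 0; x < OW; ++x)
            for (std::size_t i = 0; i < K; ++i)
                for (std::size_t j = 0; j < K; ++j)
                    cols[(y * OW + x) * KK + i * K + j] = img[(y + i) * W + (x + j)];
    // Matrix-vector product; with several filters this becomes a GEMM.
    std::vector<float> out(OH * OW, 0.0f);
    for (std::size_t r = 0; r < OH * OW; ++r)
        for (std::size_t c = 0; c < KK; ++c)
            out[r] += cols[r * KK + c] * flt[c];
    return out;
}
```

Both functions return the same values. The im2col variant needs a temporary buffer roughly K*K times larger than the image, which is one reason a direct implementation can be preferable.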

In addition to convolution, deep learning workloads include several types of layers that operate on matrices with small dimensions. To minimize the overhead of data transformations, we introduced optimized implementations of these key functions in Intel MKL 2017 in the new Deep Neural Networks (DNN) domain.

The DNN domain includes the functions necessary to accelerate the most popular image recognition topologies, including AlexNet, VGG, GoogLeNet, and ResNet.

These DNN topologies rely on a number of standard building blocks, or primitives, that operate on data in the form of multidimensional sets called tensors. The primitives include convolution, normalization, activation, and inner product functions along with functions necessary to manipulate tensors. Performing computations effectively on Intel architectures requires taking advantage of SIMD instructions via vectorization and of multiple compute cores via threading. Vectorization is extremely important, since modern processors operate on vectors of data up to 512 bits long (16 single-precision numbers) and can perform up to two multiply and add (fused multiply-add, or FMA) operations per cycle. Taking advantage of vectorization requires data to be located consecutively in memory. Since typical dimensions of a tensor are relatively small, changing the data layout introduces significant overhead. We strive to perform all the operations in a topology without changing the data layout from primitive to primitive.

Intel MKL provides primitives for the most widely used operations, implemented for a vectorization-friendly data layout:

- Direct batched convolution
- Inner product
- Pooling: Maximum, minimum, average
- Normalization: Local response normalization (LRN) across channels, batch normalization
- Activation: Rectified linear unit (ReLU)
- Data manipulation: Multidimensional transposition (conversion), split, concat, sum, and scale
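
As a point of reference for what two of the primitives listed above compute, the following sketch gives plain, unoptimized versions of ReLU activation and maximum pooling. The names `relu` and `max_pool`, the row-major single-channel layout, and the non-overlapping pooling windows are illustrative assumptions; they are not the Intel MKL interfaces.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// ReLU activation: y = max(x, 0), applied element-wise.
std::vector<float> relu(const std::vector<float>& x) {
    std::vector<float> y(x.size());
    std::transform(x.begin(), x.end(), y.begin(),
                   [](float v) { return std::max(v, 0.0f); });
    return y;
}

// Maximum pooling over non-overlapping P x P windows of an H x W
// row-major image. H and W are assumed to be multiples of P.
std::vector<float> max_pool(const std::vector<float>& img, std::size_t H, std::size_t W,
                            std::size_t P) {
    assert(P > 0 && H % P == 0 && W % P == 0 && img.size() == H * W);
    const std::size_t OH = H / P, OW = W / P;
    std::vector<float> out(OH * OW);
    for (std::size_t y = 0; y < OH; ++y)
        for (std::size_t x = 0; x < OW; ++x) {
            float best = img[(y * P) * W + x * P];
            for (std::size_t i = 0; i < P; ++i)
                for (std::size_t j = 0; j < P; ++j)
                    best = std::max(best, img[(y * P + i) * W + (x * P + j)]);
            out[y * OW + x] = best;
        }
    return out;
}
```

Each of these operations does very little arithmetic per element, which is why avoiding layout conversions between them matters for overall performance.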

Execution flow for the neural network topology includes two phases: setup and execution. During the setup phase, the application creates descriptions of all DNN operations necessary to implement scoring, training, or other application-specific computations. To pass data from one DNN operation to the next, some applications create intermediate conversions and allocate temporary arrays if the appropriate output and input data layouts do not match. This phase is performed once in a typical application and followed by multiple execution phases where actual computations happen.

During the execution step (**Figure 2**), the data is fed to the network in a plain layout like NCWH (batch, channel, width, height) and is converted to a SIMD-friendly layout. As data propagates between layers, the data layout is preserved and conversions are made when it is necessary to perform operations that are not supported by the existing implementation.
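
The article does not specify the internal SIMD-friendly layout, so the sketch below only illustrates the general idea under an assumption: channels are grouped into blocks of `B` (for example, 16 single-precision values for a 512-bit vector) so that the `B` channel values of one pixel sit next to each other in memory. The function name `to_blocked` and the plain channel-major source layout are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Convert one image from a plain [C][H][W] layout to a channel-blocked
// [C/B][H][W][B] layout. After the conversion, the B channels of a pixel
// are contiguous, so they can be loaded with a single vector instruction.
std::vector<float> to_blocked(const std::vector<float>& src, std::size_t C, std::size_t H,
                              std::size_t W, std::size_t B) {
    assert(B > 0 && C % B == 0 && src.size() == C * H * W);
    std::vector<float> dst(src.size());
    for (std::size_t c = 0; c < C; ++c)
        for (std::size_t h = 0; h < H; ++h)
            for (std::size_t w = 0; w < W; ++w) {
                const std::size_t cb = c / B;  // index of the channel block
                const std::size_t ci = c % B;  // position inside the block
                dst[((cb * H + h) * W + w) * B + ci] = src[(c * H + h) * W + w];
            }
    return dst;
}
```

The conversion touches every element once, so doing it only at the entry and exit of the network, rather than around every layer, keeps its cost small relative to the compute.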

## Case Study: Caffe*

Caffe*, a deep learning framework developed by Berkeley Vision and Learning Center (BVLC), is one of the most popular community frameworks for image recognition. Together with AlexNet, a neural network topology for image recognition, and ImageNet, a database of labeled images, Caffe is often used as a benchmark.

Caffe already takes advantage of the optimized mathematical routines provided in Intel MKL using the standard BLAS interface. However, the original implementation does not use parallelism effectively and does not take full advantage of the GEMM function in the convolution implementation. By restructuring the compute flow to use GEMM on larger matrices, and extracting additional parallelism with the help of OpenMP*, we achieved a 5.8x speedup on AlexNet topology training using a two-socket Intel® Xeon® processor E5-2699 v4-based system. As discussed before, a direct convolution implementation is more effective. By using the new Intel MKL 2017 primitives, we achieved an 11x speedup.

Intel architecture and Intel MKL continue to develop. In June, we launched the Intel® Xeon Phi™ processor x200 series featuring a massively parallel architecture with up to 72 cores, Intel® Advanced Vector Extensions 512 (Intel® AVX-512), and fast on-chip memory. Using Intel’s Caffe fork enabled with new Intel MKL primitives on a single socket, the Intel Xeon Phi processor 7200 provides an additional 2x speedup (**Figure 3**).

### Better performance in Deep Neural Network workloads with Intel® Math Kernel Library (Intel® MKL)

## Faster Matrix-Matrix Multiplication

Matrix-matrix multiplication (GEMM) is a fundamental operation in many scientific, engineering, and machine learning applications. There is a continuing demand to optimize this operation. Intel MKL offers parallel, high-performing GEMM implementations. To provide optimal performance, the Intel MKL implementation of GEMM typically transforms the original input matrices into an internal data format best suited for the targeted platform. This data transformation (also called packing) can be costly, especially for input matrices with one or more small dimensions. Intel MKL 2017 introduces [S,D]GEMM packed application program interfaces (APIs) to avoid redundant packing. This can improve performance on some matrix sizes and use cases found in machine learning.

The APIs allow users to explicitly transform the matrices into an internal packed format and pass the packed matrix (or matrices) to multiple GEMM calls. With this approach, the packing costs can be amortized over multiple GEMM calls if the input matrices (A or B) are reused between these calls, as may occur in recurrent neural networks.


### Example

Three GEMM calls shown below use the same A matrix, while B/C matrices differ for each call:

    float *A, *B1, *B2, *B3, *C1, *C2, *C3, alpha, beta;
    MKL_INT m, n, k, lda, ldb, ldc;
    // initialize the pointers and matrix dimensions (skipped for brevity)
    sgemm("T", "N", &m, &n, &k, &alpha, A, &lda, B1, &ldb, &beta, C1, &ldc);
    sgemm("T", "N", &m, &n, &k, &alpha, A, &lda, B2, &ldb, &beta, C2, &ldc);
    sgemm("T", "N", &m, &n, &k, &alpha, A, &lda, B3, &ldb, &beta, C3, &ldc);

Here the A matrix is transformed into internal packed data format within each sgemm call. The relative cost of packing matrix A three times can be high if n is small (number of columns for B/C). This cost can be minimized by packing the A matrix once and using its packed equivalent for the three consecutive GEMM calls as shown below:

    // allocate memory for packed data format
    float *Ap;
    Ap = sgemm_alloc("A", &m, &n, &k);
    // transform A into packed format
    sgemm_pack("A", "T", &m, &n, &k, &alpha, A, &lda, Ap);
    // SGEMM computations are performed using the packed A matrix: Ap
    sgemm_compute("P", "N", &m, &n, &k, Ap, &lda, B1, &ldb, &beta, C1, &ldc);
    sgemm_compute("P", "N", &m, &n, &k, Ap, &lda, B2, &ldb, &beta, C2, &ldc);
    sgemm_compute("P", "N", &m, &n, &k, Ap, &lda, B3, &ldb, &beta, C3, &ldc);
    // release the memory for Ap
    sgemm_free(Ap);

The code sample above uses four new functions introduced to support packed APIs for GEMM: sgemm_alloc, sgemm_pack, sgemm_compute, and sgemm_free. First, the memory required for packed format is allocated using sgemm_alloc, which accepts a character argument identifying the packed matrix (A in this example) and three integer arguments for the matrix dimensions. Then, sgemm_pack transforms the original A matrix into the packed format Ap and performs the alpha scaling. The original A matrix remains unchanged. The three sgemm calls are replaced with three sgemm_compute calls that work with packed matrices and assume that alpha=1.0. The first two character arguments to sgemm_compute indicate that the A matrix is in packed format ("P"), and the B matrix is in nontransposed column major format ("N"). Finally, the memory allocated for Ap is released by calling sgemm_free.

GEMM-packed APIs eliminate the cost of packing the matrix A twice for the three matrix-matrix multiplication operations shown in this example. These packed APIs can be used to eliminate the data transformation costs for A and/or B input matrices if A and/or B are reused between GEMM calls.
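
The internal packed format is not documented here, so the following sketch only mimics the pack-once, compute-many pattern with a stand-in format. It assumes a column-major, non-transposed A (unlike the "T" case above) with no padding in the leading dimension, and it "packs" A by scaling it with alpha and storing it row-major so the inner loop reads it contiguously. The names `PackedA`, `pack_a`, and `compute_packed` are hypothetical and are not Intel MKL functions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for a packed matrix: a row-major m x k copy of alpha * A.
struct PackedA {
    std::size_t m, k;
    std::vector<float> data;
};

// Pack step, done once: A is m x k, column-major. Alpha is applied here,
// so the compute step can assume alpha = 1.
PackedA pack_a(const std::vector<float>& A, std::size_t m, std::size_t k, float alpha) {
    assert(A.size() == m * k);
    PackedA p{m, k, std::vector<float>(m * k)};
    for (std::size_t i = 0; i < m; ++i)
        for (std::size_t q = 0; q < k; ++q)
            p.data[i * k + q] = alpha * A[i + q * m];
    return p;
}

// Compute step, called many times: C = Ap * B + beta * C, where B is k x n
// and C is m x n, both column-major.
void compute_packed(const PackedA& Ap, const std::vector<float>& B, std::size_t n,
                    float beta, std::vector<float>& C) {
    assert(B.size() == Ap.k * n && C.size() == Ap.m * n);
    for (std::size_t j = 0; j < n; ++j)
        for (std::size_t i = 0; i < Ap.m; ++i) {
            float acc = 0.0f;
            for (std::size_t q = 0; q < Ap.k; ++q)
                acc += Ap.data[i * Ap.k + q] * B[q + j * Ap.k];
            C[i + j * Ap.m] = acc + beta * C[i + j * Ap.m];
        }
}
```

The cost of `pack_a` is proportional to m*k and each `compute_packed` call costs about m*n*k multiply-adds, so when n is small the packing is a noticeable fraction of one call and reusing the packed copy pays off quickly.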

### Performance

**Figure 4** shows the performance gains with the packed APIs on Intel® Xeon Phi™ processor 7250. It is assumed that the packing cost can be completely amortized by a large number of SGEMM calls that use the same A matrix. The performance of regular SGEMM calls is also provided for comparison.

### Implementation Notes

For best performance, call gemm_pack and gemm_compute with the same number of threads. Note that if only a small number of GEMM calls share the same A or B matrix, the packed APIs may provide little performance benefit.

The gemm_alloc routine allocates memory approximately as large as the original input matrix. This means that the memory requirement of the application may increase significantly for a large input matrix.

GEMM-packed APIs are only implemented for SGEMM and DGEMM in Intel MKL 2017. They are functional for all Intel architectures, but they are only optimized for 64-bit Intel® AVX2 and above.

## Intel DAAL

In large part, the increased usage and adoption of machine and deep learning algorithms has become possible due to the rapid growth of data available for algorithm training. The term big data has a number of definitions; we will cover big data in terms of its three main attributes: variety, velocity, and volume. Each of these big data attributes requires special computational solutions, and Intel DAAL was designed to address all of them.

### Variety

By *variety*, we refer to the many sources and types of structured and unstructured data that create certain challenges for data extraction and further analysis. To address this attribute of big data, Intel DAAL introduces a data management component as an essential part of the library. This includes classes and utilities for data acquisition, initial preprocessing and normalization, data conversion into numeric formats from one of several supported data sources, and model representation. Since in real-world analytical applications you can discover bottlenecks at any part of the application, Intel DAAL focuses on optimizing the entire data flow. It provides optimized building blocks covering all stages of data analytics: data acquisition from a data source, preprocessing, transformation, data mining, modeling, validation, and decision-making (**Figure 5**).

In addition, Intel DAAL provides support for both homogeneous and heterogeneous data and performs intermediate data conversion as necessary. It also supports both dense and sparse data types and provides algorithms that can work with noisy data.

### Velocity

Computational speed is critical in today’s business landscape. One of the most important aspects of big data is that it usually comes in real time and requires a fast response time. Intel DAAL is designed to help software developers reduce the time it takes to develop their applications and deliver them with improved performance. Intel DAAL helps applications make better predictions faster, scaling with the available compute resources. It relies on Intel MKL building blocks and provides additional algorithmic optimizations to bring maximum performance gains on Intel architecture. Intel DAAL provides APIs for C++, Java*, and Python* languages. This allows users to quickly prototype their models and easily deploy them to a large cluster environment.

### Volume

Big data implies enormous volumes of data, possibly so large that they cannot fit into memory. It is also very common that the data set is distributed across different nodes on a cluster or at various devices. To address this, Intel DAAL algorithms support Batch, Online, and Distributed processing computation modes.

In the Batch processing mode, the algorithm works with the entire data set to produce the final result. A more complex scenario occurs when the entire data set is not available at the moment, or when the data set does not fit into the device memory, necessitating the other two processing modes.

In the Online processing mode, the algorithm processes a data set in blocks streamed into device memory by incrementally updating partial results, which are finalized upon processing the last data block.
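
The pattern can be sketched without the library. In the hypothetical class below (`OnlineMoments` is not an Intel DAAL class), each `compute` call folds one block into a small partial result, and `finalize` turns the partial result into the mean and population variance. Accumulating raw sums is chosen here for brevity, not numerical robustness.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Online computation of mean and variance: data arrives in blocks and only
// a constant-size partial result is kept between blocks.
class OnlineMoments {
public:
    struct Result {
        double mean;
        double variance;  // population variance
    };

    // Update the partial result with one block of observations.
    void compute(const std::vector<double>& block) {
        for (double x : block) {
            ++n_;
            sum_ += x;
            sum_sq_ += x * x;
        }
    }

    // Turn the partial result into the final result after the last block.
    Result finalize() const {
        assert(n_ > 0);
        const double mean = sum_ / static_cast<double>(n_);
        return {mean, sum_sq_ / static_cast<double>(n_) - mean * mean};
    }

private:
    std::size_t n_ = 0;
    double sum_ = 0.0;
    double sum_sq_ = 0.0;
};
```

Feeding the blocks {1, 2} and {3, 4} gives the same mean and variance as processing {1, 2, 3, 4} at once, while never holding more than one block in memory.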

In the Distributed processing mode, the algorithm operates on a data set distributed across several devices (compute nodes). The algorithm produces partial results on each node, which are ultimately merged into the final result on the master node.
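
A minimal sketch of this local-step and master-step structure, independent of any communication layer, is shown below. Each node summarizes its data as a count, a mean, and a sum of squared deviations, and the master merges the summaries pairwise using the well-known parallel variance update. The names `Partial`, `local_step`, `merge`, and `master_step` are illustrative assumptions, not Intel DAAL APIs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Partial result produced on one node.
struct Partial {
    std::size_t n;  // number of observations
    double mean;    // mean of the observations
    double m2;      // sum of squared deviations from the mean
};

// Local step: summarize the node's own data (Welford's update).
Partial local_step(const std::vector<double>& data) {
    Partial p{0, 0.0, 0.0};
    for (double x : data) {
        ++p.n;
        const double d = x - p.mean;
        p.mean += d / static_cast<double>(p.n);
        p.m2 += d * (x - p.mean);
    }
    return p;
}

// Merge two partial results without revisiting the raw data.
Partial merge(const Partial& a, const Partial& b) {
    if (a.n == 0) return b;
    if (b.n == 0) return a;
    const double na = static_cast<double>(a.n), nb = static_cast<double>(b.n);
    const double delta = b.mean - a.mean;
    return {a.n + b.n,
            a.mean + delta * nb / (na + nb),
            a.m2 + b.m2 + delta * delta * na * nb / (na + nb)};
}

// Master step: fold the partial results received from all nodes.
Partial master_step(const std::vector<Partial>& parts) {
    Partial total{0, 0.0, 0.0};
    for (const Partial& p : parts) total = merge(total, p);
    return total;
}
```

Only the three numbers in `Partial` travel between nodes, which is what makes the scheme independent of whether the transport is MPI, Hadoop, or Spark.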

Since there are many platforms available for distributed computation, distributed algorithms in Intel DAAL are abstracted from underlying cross-device communication technology. This enables use of the library in a variety of multidevice computing and data transfer scenarios, including (but not limited to) MPI-, Hadoop*-, or Spark*-based cluster environments. While the user is required to implement cross-device communication, Intel DAAL provides samples that demonstrate how to use this library with the most common analytical platforms.

## Intel DAAL Usage in Different Computation Modes

### Batch Processing Mode

All Intel DAAL algorithms support the Batch processing computation mode, and it is relatively easy to use it in this manner. In the Batch processing mode, only the compute method of a particular algorithm class is used. Simply select an algorithm to use, set input data and algorithm parameters, run the compute method, and get access to the result. The example below demonstrates how to use Intel DAAL in the Batch processing mode with the Cholesky decomposition algorithm:

    // Create an algorithm to compute Cholesky decomposition using the default method
    cholesky::Batch<> cholesky_alg;
    // Set input data
    cholesky_alg.input.set(cholesky::data, dataSource.getNumericTable());
    // Run compute
    cholesky_alg.compute();
    // Get access to results of computation
    services::SharedPtr<cholesky::Result> res = cholesky_alg.getResult();
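
For readers unfamiliar with the algorithm itself, the sketch below shows what a Cholesky decomposition computes: for a symmetric positive-definite matrix A, a lower-triangular L with A = L * L^T. This is a textbook reference version written for illustration; the function name `cholesky_lower` and the row-major storage are assumptions and say nothing about how Intel DAAL implements it.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Returns the lower-triangular factor L (n x n, row-major) with A = L * L^T.
// A must be symmetric positive definite; only its lower triangle is read.
std::vector<double> cholesky_lower(const std::vector<double>& A, std::size_t n) {
    assert(A.size() == n * n);
    std::vector<double> L(n * n, 0.0);
    for (std::size_t j = 0; j < n; ++j) {
        // Diagonal entry: remove the contribution of the columns already computed.
        double d = A[j * n + j];
        for (std::size_t k = 0; k < j; ++k) d -= L[j * n + k] * L[j * n + k];
        assert(d > 0.0);  // fails if A is not positive definite
        L[j * n + j] = std::sqrt(d);
        // Entries below the diagonal in column j.
        for (std::size_t i = j + 1; i < n; ++i) {
            double s = A[i * n + j];
            for (std::size_t k = 0; k < j; ++k) s -= L[i * n + k] * L[j * n + k];
            L[i * n + j] = s / L[j * n + j];
        }
    }
    return L;
}
```

For the 2 x 2 matrix with rows (4, 2) and (2, 5), the factor has rows (2, 0) and (1, 2).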

### Online Processing Mode

Some Intel DAAL algorithms enable processing of data sets in blocks. The API for the Online processing computation mode is similar to Batch mode, with the difference being that you need to run the compute method each time data becomes available and then call `finalizeCompute` to combine results at the end. The example below demonstrates how to use Intel DAAL in the Online processing mode with the singular value decomposition (SVD) algorithm:

    // initialize SVD algorithm
    svd::Online<double, defaultDense> algorithm;
    // Process data blocks within a loop
    Status loadStatus;
    while((loadStatus = dataSource.loadDataBlock(nRowsInBlock)) == success)
    {
        // set input data
        algorithm.input.set( svd::data, dataSource.getNumericTable() );
        // run compute
        algorithm.compute();
    }
    // Finalize the computations and retrieve SVD results
    algorithm.finalizeCompute();
    SharedPtr<svd::Result> res = algorithm.getResult();
    // Access results
    printNumericTable(res->get(svd::singularValues), "Singular values:");

### Distributed Processing Mode

Some Intel DAAL algorithms enable processing of data sets distributed across several devices. Depending on the algorithm, you may need to apply a different scheme for computation. This can range from a simple MapReduce, where the first step is executed on local nodes and the final step on the master node, to more complex schemes like MapReduce-Map. The Intel DAAL programming guide includes samples that demonstrate how to use Intel DAAL with Hadoop, Spark, or MPI. The example below demonstrates how to use Intel DAAL in the Distributed processing mode with the principal component analysis (PCA) algorithm:

Step 1: On local nodes (Mapper)

    // Create algorithm to compute PCA decomposition using SVD method on local nodes
    DistributedStep1Local pcaLocal = new DistributedStep1Local(daalContext, Double.class, Method.svdDense);
    // Set input data
    pcaLocal.input.set( InputId.data, ntData );
    // Compute PCA on local nodes
    PartialResult pres = pcaLocal.compute();
    // Prepare data for sending to master node
    context.write(new IntWritable(0), new WriteableData(index, pres));

Step 2: On Master node

    // Create algorithm to calculate PCA decomposition using SVD method on master node
    DistributedStep2Master pcaMaster = new DistributedStep2Master(daalContext, Double.class, Method.svdDense);
    // Add input data as it arrives from local nodes
    for (WriteableData value : values) {
        PartialResult pr = (PartialResult)value.getObject(daalContext);
        pcaMaster.input.add( MasterInputId.partialResults, pr);
    }
    // Compute PCA on master node
    pcaMaster.compute();
    // Finalize the computations and retrieve PCA results
    Result res = pcaMaster.finalizeCompute();
    HomogenNumericTable eigenValues = (HomogenNumericTable)res.get(ResultId.eigenValues );

### Performance Data

**Figure 6** shows the performance gains of Intel Data Analytics Acceleration Library (Intel DAAL) versus Spark MLlib on an 8-node Hadoop cluster.

## What’s New in Intel DAAL 2017

Intel DAAL 2017 introduces a number of new features, most notably the implementation of neural network functionality. This consists of a number of components that allow the user to easily construct and run various neural network topologies ranging from image classification and object tracking to financial market predictions. To support these various use cases, Intel DAAL introduces several key building blocks for neural networks (**Figure 7**):

- **Layer**: NN building block, forward and backward versions of a single layer.
- **Topology**: The structure to represent NN configuration (e.g., AlexNet).
- **Model**: A set of layers connected according to the defined topology, their parameters (weights and biases), and service routines.
- **Optimization solver**: Updates weights and biases after forward-backward pass according to specified objective function.
- **NN**: Driver managing NN computational flow. During training, the driver executes forward and backward passes over the topology and updates parameters using the optimization solver. During scoring, it returns prediction results by applying forward pass.
- **Multidimensional data structure (tensor)**: Structure used to represent complex data (e.g., stream of 3D images).
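
How these pieces interact can be sketched in a few lines. The toy code below is not the Intel DAAL API: `DenseLayer`, `SgdSolver`, and `train` are hypothetical names, the "topology" is a single fully connected layer with one output, and the objective is assumed to be the squared error 0.5 * (y - t)^2.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Layer: forward and backward versions of one fully connected layer
// with a single output, y = w . x + b.
struct DenseLayer {
    std::vector<double> w;
    double b = 0.0;

    explicit DenseLayer(std::size_t n_in) : w(n_in, 0.0) {}

    double forward(const std::vector<double>& x) const {
        assert(x.size() == w.size());
        double y = b;
        for (std::size_t i = 0; i < w.size(); ++i) y += w[i] * x[i];
        return y;
    }

    // Given dLoss/dy, compute the gradients for the weights and the bias.
    void backward(const std::vector<double>& x, double dy,
                  std::vector<double>& gw, double& gb) const {
        gw.resize(w.size());
        for (std::size_t i = 0; i < w.size(); ++i) gw[i] = dy * x[i];
        gb = dy;
    }
};

// Optimization solver: plain stochastic gradient descent.
struct SgdSolver {
    double lr;  // learning rate
    void update(DenseLayer& layer, const std::vector<double>& gw, double gb) const {
        for (std::size_t i = 0; i < layer.w.size(); ++i) layer.w[i] -= lr * gw[i];
        layer.b -= lr * gb;
    }
};

// Driver: runs forward and backward passes and lets the solver update
// the parameters after every sample.
void train(DenseLayer& layer, const SgdSolver& solver,
           const std::vector<std::vector<double>>& xs, const std::vector<double>& ts,
           int epochs) {
    assert(xs.size() == ts.size());
    std::vector<double> gw;
    double gb = 0.0;
    for (int e = 0; e < epochs; ++e)
        for (std::size_t s = 0; s < xs.size(); ++s) {
            const double y = layer.forward(xs[s]);
            const double dy = y - ts[s];  // derivative of 0.5 * (y - t)^2
            layer.backward(xs[s], dy, gw, gb);
            solver.update(layer, gw, gb);
        }
}
```

Trained on samples of t = 2x + 1, the layer's forward pass then serves as the scoring step for unseen inputs. Keeping the solver separate from the layer is what allows the update rule to be swapped without touching the layers.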

With the same APIs as traditional machine learning algorithms, it becomes relatively easy for a data scientist to compare the accuracy of a model created with neural networks to that of traditional machine learning algorithms like support vector machine (SVM) or linear regression, while getting great performance on Intel architecture. The features of the library reveal an opportunity to combine neural networks with other algorithms (e.g., injecting the Neural Net model into boosting algorithms as weak learners).

## Meeting the Challenges of Big Data and Machine Learning

Big data analytics and machine learning usually require developers to work with several programming languages, tools, and libraries. It is quite common for data to come from a variety of sources and be heterogeneous, sparse, or noisy. Scripting languages such as Python are commonly used during the prototyping stage, while C/C++ is applied to optimize the most critical parts of the data flow. Real data arrives over time, and it often does not fit into the memory of one computer, so distributed systems like MPI, Hadoop, or Spark are required for effective data processing.

Because of all of this, developing, testing, and supporting such applications requires knowledge from various fields. It takes time to integrate many solutions into a single application. As a result, development time increases significantly and adoption of big data analytics is delayed.

Intel DAAL was designed to address all of these challenges and cover most of the possible use cases that a developer may deal with while working with big data. The library focuses on optimization of the entire data flow, not just the algorithmic part, and provides optimized building blocks that cover all stages of data analytics: data acquisition from a data source, preprocessing, transformation, data mining, modeling, validation, and decision-making. It is possible to scale a model from a single node to a large cluster while using the same APIs. Intel DAAL, together with Intel MKL, can help unleash the power of big data analytics and machine learning.