Ranjan Anantharaman, Viral Shah, and Alan Edelman, Julia Computing
High-performance computing (HPC) has been at the forefront of scientific discovery. Scientists routinely run simulations on millions of cores in distributed environments. The software stack involved has typically been a statically compiled language such as C/C++ or Fortran, in conjunction with OpenMP*/MPI*. This stack has stood the test of time and is almost ubiquitous in the HPC world, chiefly because of its efficiency: it can use all available compute resources while limiting memory usage. This level of efficiency has always been beyond the reach of higher-level scripting languages such as Python*, Octave*, or R*, primarily because such languages are designed as high-productivity environments that facilitate rapid prototyping, not to run efficiently on large compute clusters. However, there is no technical reason why this should be the case, and it represents a tooling problem for scientific computing.
Julia,* a new language for technical computing, is meant to address this problem. It reads like Python or Octave, but performs as well as C. It has built-in primitives for multithreading and distributed computing, allowing applications to scale to millions of cores.
The first natural questions that spring to mind are: Why can't other high-level languages perform this well? What's unique about Julia that lets it achieve this level of efficiency while still allowing the user to write readable code?
The answer is that Julia uses a combination of type inference and code specialization to generate tight machine code. To see how, let's use the Julia macro @code_native, which displays the assembly code generated for a function call. For example:
julia> @code_native 1 + 1
.section __TEXT,__text,regular,pure_instructions
Filename: int.jl
pushq %rbp
movq %rsp, %rbp
Source line: 32
leaq (%rdi,%rsi), %rax
popq %rbp
retq
Source line: 32
nopw (%rax,%rax)
This example shows the assembly code generated when Julia adds two integers. Notice that Julia picks the right assembly instruction for the operation, and there is virtually no overhead.
The next code snippet shows the addition of two double-precision floating-point numbers:
julia> @code_native 1. + 1.
.section __TEXT,__text,regular,pure_instructions
Filename: float.jl
pushq %rbp
movq %rsp, %rbp
Source line: 240
addsd %xmm1, %xmm0
popq %rbp
retq
Source line: 240
nopw (%rax,%rax)
Julia automatically picks a different instruction, addsd, since we are now adding two floating-point numbers. Two versions of the function “+” have been compiled, one for each input type inferred by the compiler. The user writes high-level, type-generic code; Julia infers the input and output types and compiles a specialized version of the function for those types. This is what makes Julia fast (Figure 1).
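To see the specialization directly, a single generic definition compiles to a different native method for each input type (double here is a hypothetical example function, not part of the listings above):

double(x) = x + x   # one high-level, type-generic definition

@code_native double(1)     # specialized for Int: integer add instructions
@code_native double(1.0)   # specialized for Float64: addsd again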
Figure 1: Julia benchmark times relative to C (smaller is better; C performance = 1.0). Benchmark code can be found in the GitHub repository.
Julia’s unique features solve the two-language problem in data science and numerical computing. This has led to demand for Julia in a variety of industries from aerospace to finance.
This led to the formation of Julia Computing, which consults with industry and fosters adoption of Julia in the community. One of Julia Computing’s flagship products is JuliaPro*, a Julia distribution that includes an IDE, a debugger, and more than 100 tested packages. JuliaPro will soon ship with Intel® Math Kernel Library (Intel® MKL) for accelerated BLAS operations and optimizations for multicore and the latest Intel® processors.
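As a rough sketch of what an accelerated BLAS buys: dense linear algebra in Julia dispatches to the linked BLAS library, so swapping in Intel MKL speeds up code like the following without any source changes (peakflops() is the 0.6-era Base function; names may differ in other versions):

A = rand(2000, 2000)
B = rand(2000, 2000)
C = A * B      # dense matrix multiply dispatches to the linked BLAS (dgemm)
peakflops()    # estimate achievable GFLOPS via a BLAS matrix multiply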
Parallel Computing in Julia
Julia has several built-in primitives for parallel computing at every level: vectorization (SIMD), multithreading, and distributed computing.
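As a quick illustration of the first two levels (the vecsum helper and the sizes here are illustrative; distributed computing is shown in the next listing):

# SIMD: ask the compiler to vectorize the inner loop
function vecsum(a)
    s = zero(eltype(a))
    @simd for i in eachindex(a)
        @inbounds s += a[i]
    end
    return s
end

# Multithreading: launch Julia with JULIA_NUM_THREADS=4 (for example)
ids = zeros(Int, 8)
Threads.@threads for i in 1:8
    ids[i] = Threads.threadid()   # record which thread ran each iteration
end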
Consider an embarrassingly parallel simulation as an example. A Monte Carlo calculation of π generates random points with coordinates between 0 and 1 and counts the fraction that land inside the unit circle. Because the points are uniform over the unit square, that fraction approximates the ratio of the area of the quarter circle to the area of the square, which is π/4. The following Julia code makes use of the @parallel construct. The reducer “+” sums the values of the final expression computed in parallel on all Julia processes:
addprocs(2) # Add 2 Julia worker processes

function parallel_π(n)
    in_circle = @parallel (+) for i in 1:n # <-- partition the work
        x = rand()
        y = rand()
        Int((x^2 + y^2) < 1.0)
    end
    return (in_circle / n) * 4.0
end

parallel_π(10000)
Julia offloads work to its worker processes that compute the desired results and send them back to the Julia master, where the reduction is performed. Arbitrary pieces of computation can be assigned to different worker processes through this one-sided communication model.
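One of the primitives underneath this model is remotecall/fetch: the master asks a worker to run a function and fetches the result later. A minimal sketch, assuming worker 2 was added by addprocs as above:

r = remotecall(rand, 2, 1000)   # worker 2 starts building a random vector; returns a Future immediately
# ... the master is free to do other work here ...
v = fetch(r)                    # block until worker 2's result arrives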
A more interesting use case is to have different processes each solve, for example, a set of linear equations, and then serialize their outputs and send them back to the master process. Scalable machine learning is one such example, involving many independent backsolves. Figure 2 shows the scaling characteristics of RecSys.jl, which processes millions of movie ratings to make predictions. It is based on an algorithm called alternating least squares (ALS), a simple, iterative method for collaborative filtering. Since each solve is independent of the others, the code can be parallelized and scaled easily, as sketched below.
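A minimal sketch of this pattern, using the same 0.6-era API as the listing above (independent_solve is a hypothetical helper, not the RecSys.jl code):

@everywhere function independent_solve(n)
    A = rand(n, n) + n * eye(n)   # well-conditioned random system
    b = rand(n)
    return A \ b                  # local LU factorization and backsolve
end

# Each solve runs on whichever worker is free; results return to the master.
results = pmap(independent_solve, fill(500, 16))   # 16 independent solves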
Figure 2. Scaling chart of the ALS recommender system: “nprocs” refers to the number of worker processes in the shared-memory and distributed settings, and to the number of threads in the multithreaded setting.
In Figure 2, we look at the scaling performance of Julia in multithreaded mode (Julia MT), distributed mode (Julia Distr), and shared-memory mode (Julia Shared). Shared-memory mode allows independent Julia worker processes (as opposed to threads) to access a shared address space. The chart also compares Julia's distributed capabilities against Apache Spark*, a popular framework for scalable analytics that is widely adopted in industry; the notable result is that Julia scales well to a higher number of nodes. Let's now move on to a real experiment in supercomputing.
Bringing It All Together (at Scale): Celeste*
The Celeste* Project is a collaboration of Julia Computing (Keno Fischer), Intel Labs (Kiran Pamnany), JuliaLabs@MIT (Andreas Noack, Jarrett Revels), Lawrence Berkeley National Laboratory (LBNL; David Schlegel, Rollin Thomas, Prabhat), and the University of California, Berkeley (Jeffrey Regier, Maximilian Lam, Steve Howard, Ryan Giordano, Jon McAuliffe). Celeste is a fully generative hierarchical model that uses statistical inference to mathematically locate and characterize light sources in the sky. This model allows astronomers to identify promising galaxies for spectrograph targeting, define galaxies for further exploration, and help understand dark energy, dark matter, and the geometry of the universe. The data set used for the experiment is the Sloan Digital Sky Survey (SDSS) (Figure 3), with more than five million images comprising 55 terabytes of data.
Figure 3: A sample from the Sloan Digital Sky Survey (SDSS)
Using Julia’s native parallel computing capabilities, researchers were able to scale their application to 8,192 Intel® Xeon® processor cores on the National Energy Research Scientific Computing Center (NERSC) Cori* supercomputer at LBNL. This resulted in parallel speedups of 225x in image analysis, processing more than 20,000 images (or 250 GB), an increase of three orders of magnitude compared to previous iterations.
Note that this was achieved with Julia's native parallel capabilities, which scaled to 256 nodes with four threads per node. This kind of scaling puts Julia firmly in the HPC realm, joining the elite league of languages such as C/C++ and Fortran.
Pradeep Dubey of Intel Labs summarized this nicely: “With Celeste, we are closer to bringing Julia into the conversation because we’ve demonstrated excellent efficiency using hybrid parallelism―not just processes, but threads as well―something that’s still impossible to do with Python or R.”
The next step in the project is scaling even further and processing the entire data set. This will involve making use of the latest Intel® Xeon Phi™ processors and coprocessors.
Breaking New Boundaries
The Julia project has come a long way since its inception in 2012, pushing the boundaries of scientific computing. Julia was conceived as a language that retains the productivity of Python while being as fast and scalable as C/C++. The Celeste Project shows that this approach makes Julia the first high-productivity language to scale well on clusters.
The Julia project continues to double its user base every year and to gain adoption in a variety of industries. Such advances in tooling for computational science not only bode well for scientific research in the coming age, but will also shorten innovation cycles and turnaround times in industry.
Learn More
Julia Computing