I had the chance to catch up with Henry Gabb at Intel. Henry is a Senior Principal Engineer in the Intel Software and Services Group and is one of those intimidatingly smart, yet incredibly nice guys who you could talk to over a beer or 6 for hours. His current work, and his passion for the last many years, is high performance computing. Basically: making your code faster by parallelising and vectorising it.
Parallelising your code simply means writing your code in a way that allows the workload to be spread across multiple CPU cores simultaneously. Instead of having all the work done on 1 core while the rest sit idle, write your code so the load is spread across all cores.
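Ed. note: here is a minimal sketch of that idea in Python, assuming a CPU-bound job (summing squares) that can be cut into independent chunks. The function names and the `executor_cls` parameter are invented for illustration and have nothing to do with the Intel tools. Each chunk is handed to a separate worker process so that several cores can work at once.

```python
from concurrent.futures import ProcessPoolExecutor


def sum_of_squares(chunk):
    """The work done by one worker: process a single chunk of the data."""
    return sum(x * x for x in chunk)


def split_into_chunks(data, n_chunks):
    """Cut data into at most n_chunks contiguous pieces of similar size."""
    size = max(1, -(-len(data) // n_chunks))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def parallel_sum_of_squares(data, workers=4, executor_cls=ProcessPoolExecutor):
    """Spread the chunks across a pool of workers and combine the results.

    executor_cls is a hypothetical hook: the default uses separate
    processes, which is what lets CPU-bound Python code use several cores.
    """
    chunks = split_into_chunks(data, workers)
    with executor_cls(max_workers=workers) as pool:
        partial_sums = pool.map(sum_of_squares, chunks)
        return sum(partial_sums)
```

When using the default process pool from a script, call `parallel_sum_of_squares` from inside an `if __name__ == "__main__":` block so the worker processes can import the module safely. For small inputs the cost of starting processes will outweigh any gain, so this only pays off when each chunk carries real work.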
Vectorisation means applying a single operation to every element of an array at once instead of iterating through the elements one at a time.
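Ed. note: a rough illustration, assuming NumPy is installed. The two functions below (names invented for this sketch) compute the same thing. The first visits each element in a Python loop; the second states the operation once for the whole array and leaves the element-by-element work to NumPy's compiled code. Whether that code ends up using SIMD instructions depends on the NumPy build and the processor, so treat this as the array-programming view of vectorisation rather than a guarantee about the hardware.

```python
import numpy as np


def scale_and_shift_loop(values, scale, shift):
    """Element-by-element version: one multiply and add per loop iteration."""
    result = []
    for v in values:
        result.append(v * scale + shift)
    return result


def scale_and_shift_vectorised(values, scale, shift):
    """Whole-array version: a single expression covers every element."""
    arr = np.asarray(values, dtype=np.float64)
    return arr * scale + shift
```

Both versions return the same numbers; the difference is who runs the loop. In the second, there is no Python-level iteration left for the interpreter to execute.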
Working at Intel, Parallel Computing and the State of Tools Today.
Chris: Your initial work at Intel was in driving parallel computing inside Intel. Was that work aimed at using parallel computing to help Intel's design and manufacturing efforts, or was it aimed at providing developers outside of the company with the hardware and software tools to make parallel computing more available?
Henry: I joined Intel at an interesting time. I was working for Computer Sciences Corporation in their High-Performance Computing section and was trying to tempt an employee at Kuck & Associates to come and join me. They neatly turned it around and had me come out and interview for them. However, by the time I got there Intel had acquired Kuck & Associates and so I was told I was now interviewing for Intel. Bait and switch.
At the time Kuck & Associates had some amazing computing technology: parallel programming tools and runtimes and an excellent C++ compiler. The parallel programming tools, Assure and Guide, were part of a package called the KAP/Pro Toolset that my team at CSC was using to migrate legacy code from a soon-to-be-decommissioned Cray vector supercomputer to our new parallel computers. Intel had just released their first dual-processor motherboard and Assure and Guide were early ancestors of the Intel Inspector and Intel Advisor tools in Intel Parallel Studio XE.
My work at Intel at the time was to help our users adopt and use these tools to take advantage of parallelism.
An interesting aside is that in 1993 when I was doing my postdoctoral research, CNRS had just purchased a Cray PVP system for all CNRS institutes in France to share. Because the Cray was so crowded, CNRS was strict about the code that ran on this system. If it didn’t take advantage of the vector architecture and have at least 93% of computations inside the vector registers the job was killed.
Ed. note: If you've never programmed on a Cray you've not lived. You'd write your FORTRAN or C code, run it through the analyser, and with a very polite sniff, English Butler style, the analysers would suggest maybe this change here, that loop unrolling there, and maybe take that entire function over there out back and have it shot.
Chris: I miss those days of having a Cray sneer at my code. It was a very high level of sneering, but the reports you got from the analysers were magic.
Henry: And this is what Intel® Parallel Studio does for you.
Chris: You get Vectorisation reports on Intel chips? I'm showing my ignorance here but I didn't know Intel chips supported vectorisation.
Henry: Absolutely. The SIMD extensions and the AVX2 and AVX-512 instructions support vectorisation. The Intel compilers can do a lot of vectorisation for you.
Ed. note: Turns out SIMD has been in Intel chips since the Pentium III. Welcome to 1999, Chris.
Henry: Remember the reports from the Cray HPM* tool? Essentially a report that would tell you how close to the theoretical peak performance your code was able to reach. The Intel® Advisor tool within Intel® Parallel Studio has a new capability called cache-aware roofline analysis that shows how close your loops are getting to the theoretical peak performance of the processor. The roofline display pinpoints the loops that are on the critical path to performance and will give the most benefit from optimization. Intel Advisor, as the name implies, gives advice on how to optimize the code to get it closer to peak performance.
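Ed. note: the basic idea behind a roofline chart fits in one line. A loop can go no faster than the processor's peak compute rate, and no faster than memory bandwidth multiplied by the number of floating-point operations it performs per byte moved. The sketch below is the classic roofline bound, not the cache-aware variant Henry describes and not how Intel Advisor computes its chart. The function names are invented for illustration.

```python
def attainable_gflops(peak_gflops, bandwidth_gb_per_s, flops_per_byte):
    """Classic roofline bound: the lower of the compute and memory ceilings."""
    memory_ceiling = bandwidth_gb_per_s * flops_per_byte
    return min(peak_gflops, memory_ceiling)


def fraction_of_roof(measured_gflops, peak_gflops, bandwidth_gb_per_s,
                     flops_per_byte):
    """How close a measured loop gets to its own roofline bound (0.0 to 1.0)."""
    roof = attainable_gflops(peak_gflops, bandwidth_gb_per_s, flops_per_byte)
    return measured_gflops / roof
```

With made-up numbers, a machine with a 100 GFLOPS peak and 10 GB/s of bandwidth caps a loop doing 2 operations per byte at 20 GFLOPS, so that loop is limited by memory rather than compute. A loop measured at 5 GFLOPS under that cap is running at a quarter of what the bound allows.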
We had one client who had 14 million lines of C++ code. It would typically take about 2 years for new employees to get to the point where they could be trusted to make changes to the code. Intel® Parallel Studio enabled even someone unfamiliar with the code to immediately get actionable advice about parallel correctness and performance.
Being able to run code through a tool and be shown where concurrency errors are is like magic. Imagine a thread writing a variable in one part of the code while 10,000 lines away another thread is reading the same variable. Manually debugging these kinds of data races is tedious and difficult but Intel® Inspector finds them automatically. Click on the error and you're taken directly to the source locations where the data race occurs. The Intel tools make parallel debugging and tuning so much easier.
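Ed. note: for anyone who hasn't met a data race, here is a small Python sketch of the kind of bug Henry is describing. The class and function names are invented, and it says nothing about how Intel Inspector detects races. The first counter updates shared state with an unprotected read-modify-write, so two threads can read the same old value and one update can be lost. The second guards the update with a lock.

```python
import threading


class UnsafeCounter:
    """Shared counter with no synchronisation: increments can be lost."""

    def __init__(self):
        self.value = 0

    def increment(self):
        current = self.value      # read
        self.value = current + 1  # write; another thread may have run between


class SafeCounter:
    """Same counter, but the read-modify-write is protected by a lock."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        with self._lock:
            self.value += 1


def run_workers(counter, n_threads, n_increments):
    """Hammer one counter from several threads and return its final value."""
    def work():
        for _ in range(n_increments):
            counter.increment()

    threads = [threading.Thread(target=work) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value
```

With `SafeCounter`, the final value always equals the number of threads times the number of increments. With `UnsafeCounter` it may come up short, and whether it does on any given run depends on timing, which is exactly what makes these bugs so tedious to find by hand.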
Chris: Can you give us a quick rundown of what you're working on now?
Henry: I work with the Intel Parallel Studio team making sure it does what it needs to do for our users. I act like a user myself and ensure that if we're adding a feature it makes sense, and also that if a customer needs a feature then I'll work with the team to get it added if it makes sense. That's really my main focus: to ensure Intel® Parallel Studio is always useful and relevant. I’m also the editor of The Parallel Universe, Intel’s quarterly magazine devoted to software innovation.
Chris: How big is the team?
Henry: It's a sizeable team of developers and engineers. There are also a number of technical consulting engineers who help customers modernise their code.
Chris: I understand the benefits of spreading the computing load across multiple CPU cores simultaneously (parallelising) and having a single operation applied to all members of an array at once (vectorisation). If you have cores sitting there idle then it's great to put them to work.
But, to me, it's like eating your veggies. I know I should do it but it's easier to be lazy and I prefer chocolate cake and beer. So why would I bother?
Henry: The only reason to bother is if it gives you a competitive advantage. Some people don't need it. But lots of applications *do* need it, especially if their competition can complete computations faster. For example if you can only build 2 models a week but your competitor can do 5 a week then they will beat you.
That's how I got into parallel computing originally. Each of my research projects needed successively more computing power. The only way to complete the computations in a reasonable time was to modernise the code, and that meant vectorisation and parallelisation.
Chris: Do you think parallel computing will become mainstream?
Henry: It already is mainstream and has been for some time. We’re well into the multicore era. For a lot of developers there's typically a notion of good-enough performance. Many developers will use productivity languages such as Python but reach a point where performance suddenly becomes an issue. Languages like Julia are a response to this desire for productivity programming with C-like performance.
The overarching concept is the separation of concerns between domain experts and tuning experts. Domain experts want to focus on the problems they need to solve, whether it’s R&D, engineering, design, digital music, finance, etc. They care about performance but regard code optimization as a distraction from their primary goal. Also, they may not be very good at code optimization. Low-level code optimization requires a different skillset so it is best left to tuning experts.
Intel Parallel Studio XE aims to make code modernization easier for domain experts. However, it also contains performance libraries that encapsulate common computations: The Intel® Math Kernel Library, the Intel Integrated Performance Primitives, and the Intel Data Analytics Acceleration Library. These libraries are tuned by experts. If your code takes advantage of these libraries, you’re getting the benefit of all of their tuning expertise. That’s separation of concerns in action.
Ultimately the goal is to have parallelism abstracted into APIs so developers can focus on the problems they’re trying to solve rather than code tuning. That's what we're trying to do with the performance libraries in Intel Parallel Studio.
Chris: Other than Python and FORTRAN what else do you support?
Henry: We currently focus on FORTRAN, C, and C++ because apps in those languages typically require high performance. The Intel Distribution for Python was released about a year ago in response to clients' needs and we actively watch the trends because things always change. Years ago lots of work was prototyped in Perl with the expectation it would be rewritten later in C. It never was. The same thing happens with Python so we work to provide the tools that will mean a rewrite isn't necessary. A big part of that is making sure that Python modules take advantage of the Intel performance libraries.
Chris: I've gone all-in with Visual Studio and Visual Studio Code so I can write apps on Windows, macOS and Linux and I've just started developing on iOS and Android. Sure, I'd love to have my code "modernised" but I don't want to leave the comfy arm-chair that is Visual Studio. My recollection is that Intel Parallel Studio was a standalone tool that you had to pop in and out of. Is this still the case?
Henry: Not anymore. Intel Parallel Studio is fully integrated with Visual Studio. It used to be separate, but now it fully integrates and on install you get a bunch of new buttons and options in the Visual Studio interface. You now have access to all Parallel Studio tools within the Visual Studio interface. Visual Studio is definitely the comfy chair so we work with our customers to make it easy and useful.
The other thing is that Microsoft and Intel compilers are object compatible. You can mix and match.
Let's say you compile your code and your profiler says it’s found a performance hotspot. If you want to test how the Intel compiler handles this code, just right click on the file and choose "Compile with Intel compiler." You can get a vectorisation report and experiment with different optimization flags.
You don't need to compile the entire application with the Intel compilers if you don’t want to. Maybe just a handful of components, be it a project or even just a single source file. Compile as much or as little as you want with the Intel compiler.
The important thing is that your compiled code will have multiple code paths to allow the code to execute differently on different processors. The Intel compiler allows you to either lock the executable to a specific processor, or you can have it provide a default code path that gives good performance on most processors and optimised code paths for specific processors.
Chris: Back to you. What programming challenges do you most enjoy?
Henry: I still do a fair amount of side programming on my weekends and nights. I'm currently working on a research project to measure and classify chemical exposure from everyday consumer products. The framework is mostly written in Python but there’s a good amount of SQL and a smattering of R, awk, sed, and shell scripting as well. I’m a big fan of regular expressions too.
Back in the day I programmed in FORTRAN and I knew it well enough to program blind and would instinctively know how the code would perform before it was compiled. These days I need the internet to help me jump between languages. Without Q&A sites, I can barely program. At least, that’s how it feels sometimes. Even so it's Python I use in my spare time. Data wrangling, cleaning, and manipulation is what I love doing. Data science involves lots of fun stuff like machine learning, but the real work is in ensuring the data is clean and ready for modelling. That's what I love doing because data wrangling offers plenty of opportunities for creativity. By the time you get to the machine learning phase, it's mostly rote.
Others may disagree.
Chris: And least enjoy?
Chris: What's given you the greatest sense of achievement?
Henry: When I compare Intel® Parallel Studio now to what it was 15 years ago, it’s such a massive leap and improvement with simple options, wizards and integration. I love that vectorisation matters again. Obviously I'm not personally responsible for all of that, but being part of it is deeply rewarding.
Chris: Final question: What piece of advice would you give to someone looking to become a software developer?
Henry: The standard answer is "Go learn algorithms and data structures" but I never did formal Computing Science at school. I was writing code long before I studied algorithms and data structures so my real advice would be: try to have an application in mind when you start learning to program. Think about the problem you’re trying to solve. Lay it out before you start coding and think through which way you'll tackle it.
The best way to learn to program is simply to do.
* Here’s the documentation for the Hardware Performance Monitor (hpm) and a sample HPM report. The hpm report showed the application’s FLOPS, which told you how close you were getting to the theoretical peak performance of the system.