Moore’s law has finally started to taper off, but demand for performance has not. In response, the industry seems to have committed itself to a path of multicore processors and their bigger, bulkier cousins on steroids: heterogeneous multicore systems. This path could profoundly change the landscape for software developers, but there has been little discussion of what–if anything–software or system programmers should do about it. This is scary!
One thing that is perfectly clear is that the demand for high-performance computing is not going away anytime soon. Applications and systems can consume all the computing horsepower a system delivers today, and are already waiting on whatever you can deliver next. If you need convincing, here are three examples from three different worlds:
And those are traditional embedded environments–we haven’t even talked about data centers yet. The ITRS 2.0 roadmap projects growth from 29 cores per socket in 2017 to 48 cores per socket just two years later.
Multicore Solutions in Many Shapes and Sizes
If you are on the software side of the design world, should you be worried? Will your world descend into a new generation of parallel programming languages? Are you looking at a transition where you have to learn new paradigms, as we did with client-server computing, object-oriented programming, or threaded programming? Those were relatively complicated at first, but we all eventually got it.
The question comes down to a matter of what exactly we will need to change. Will adjusting to this new scale of multicore computing be a process of learning some new coding techniques? Or will we have to adopt new languages for design capture that can express how we want to use all those cores? Or are we looking at a more fundamental change in how we develop systems?
Operating system vendors were the first to have to deal with this new world of multicores. They headed off a crisis by ensuring that simple additions to programming models we were already familiar with–threads and object models–would be sufficient from the application programmer’s point of view. The OS vendors back-filled the details of dealing with multiple cores via symmetric multiprocessing (SMP), hiding the complexities of these architectures. This is where we are today for the bulk of the programming world that is demanding high-performance computing (HPC).
But this may not be the end of the story. It seems that when the programming world undergoes one of these transitions, there is always a plethora of new languages. Anybody remember Objective-C, or the start of X Window System programming? Remember when hardware jocks would swear that Verilog (a C-like hardware description language) or VHDL (an Ada-like hardware description language popular among hardware designers) was the way the world of programming would go? That is, of course, until MATLAB and its ilk of graphical programming languages said they would be the way it would go.
It almost feels like we are in the early stages of such a transition. Developers on the bleeding edge have already gone down the path of some form of inherently parallel capture–OpenCL™ or OpenVX come to mind, or AUTOSAR if you are in the automotive world, or POSIX threads or MPI if you are in the HPC world–and it seems like some form of parallel capture is somewhat imminent.
The compiler world has been researching this space for at least the last ten years. But apart from low-hanging fruit like task-level parallelism (TLP), data-level parallelism (DLP), or pipeline-level parallelism (PLP) in very special cases where the code is readily parallelizable–as in pixel-level or image-level repetition–there hasn’t been the breakthrough that is going to deliver (C++)++. Maybe it will yet happen. But unfortunately, the proverbial cart has arrived and the horse isn’t here yet. So what exactly can you do? What should you do? To answer that, we need to look deeper into the innards of multicore systems.
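To make the data-level parallelism idea concrete, here is a minimal sketch in C using POSIX threads: the same saturating brighten operation applied to independent slices of a pixel buffer, one slice per thread. The buffer layout, thread count, and brighten() kernel are illustrative assumptions, not taken from any particular framework.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_THREADS 4

typedef struct {
    uint8_t *pixels;
    size_t   begin, end;   /* half-open slice [begin, end) */
} slice_t;

/* Per-thread kernel: saturating brighten of one slice of the buffer.
 * Slices do not overlap, so no locking is needed. */
void *brighten(void *arg)
{
    slice_t *s = arg;
    for (size_t i = s->begin; i < s->end; i++) {
        unsigned v = s->pixels[i] + 32u;
        s->pixels[i] = v > 255u ? 255u : (uint8_t)v;
    }
    return NULL;
}

/* Fork/join over equal slices; returns 0 on success, -1 on failure. */
int brighten_parallel(uint8_t *pixels, size_t n)
{
    pthread_t tid[NUM_THREADS];
    slice_t   sl[NUM_THREADS];
    size_t chunk = n / NUM_THREADS;

    for (int t = 0; t < NUM_THREADS; t++) {
        sl[t].pixels = pixels;
        sl[t].begin  = (size_t)t * chunk;
        sl[t].end    = (t == NUM_THREADS - 1) ? n : (size_t)(t + 1) * chunk;
        if (pthread_create(&tid[t], NULL, brighten, &sl[t]) != 0)
            return -1;
    }
    for (int t = 0; t < NUM_THREADS; t++)
        pthread_join(tid[t], NULL);
    return 0;
}
```

This is exactly the kind of regular, pixel-level repetition a parallelizing compiler can sometimes find on its own; anything less regular still has to be decomposed by hand.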
Homogeneous Cores with SMP
Look at most system implementations today: racks of servers, each card carrying a combination of CPU and GPU chips, layered with operating systems that provide some kind of SMP. It is great if such systems can address your immediate problems. However, remember that these SMP systems are built as shared-memory systems and work only for applications that can live with the shared-memory model. If the performance you get is not quite what you need, you simply throw more cores at the problem. Scant attention is paid to whether the return on investment (ROI) of the additional cores is really good, or whether you are heating up your data center unnecessarily. As long as more cores solve the problem, nobody is really worried about it. Yet.
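One back-of-the-envelope way to sanity-check that ROI is Amdahl’s law: for a parallel fraction p of the workload, speedup on n cores is 1 / ((1 − p) + p/n). The sketch below shows how quickly the marginal gain of each added core collapses; the 0.9 parallel fraction is an illustrative assumption, not a measurement.

```c
/* Amdahl's law: predicted speedup of a workload whose parallel
 * fraction is `parallel_fraction` when run on `cores` cores. */
double amdahl_speedup(double parallel_fraction, int cores)
{
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores);
}

/* Marginal speedup from adding one more core: a crude ROI signal.
 * When this dips below what a core costs you in power and dollars,
 * "just add cores" has stopped paying off. */
double marginal_gain(double parallel_fraction, int cores)
{
    return amdahl_speedup(parallel_fraction, cores + 1)
         - amdahl_speedup(parallel_fraction, cores);
}
```

With p = 0.9, sixteen cores yield only about a 6.4x speedup, and each further core adds a small fraction of that; the serial 10% caps the whole exercise at 10x no matter how many cores you burn.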
But if you are running Linux as your SMP OS, there are inherent issues that make performance prediction–figuring out that ROI–hard. Here are five specific issues that must be understood, or at least accounted for.
These are the ugly pieces of the problem. Nobody really talks about them, and everyone would rather sweep them under the rug and hope they go away by system debug time. But to be able to predict performance improvement and analyze ROI before you have a complete system prototype, you have to at least take these realities into consideration.
Beginnings of True Heterogeneity
Such complexities are already realities for SMP systems. But if you are going to explore true heterogeneous computing–like adding one or more DSPs to your architecture to speed up signal processing, or adding a hard-core IP like a fast Fourier transform (FFT) core or a Viterbi decoder for a wireless application–a single OS will no longer work. Each kind of core will have its own OS, kernel, or bare-metal environment. Managing all these different operating systems and coding for each compute infrastructure is challenging.
A third approach that has been gaining attention is hyperscale: architectures using one or more FPGAs as part of the design. There are two big areas where the use of pooled FPGAs is discussed most often. One is improving specific network latency where disk access is a significant component of the application’s execution time. Millions of accesses to a relational database on network-attached storage devices are one example. Tack on an OS-abstracted network file system (NFS) and its associated unpredictability, and your ability to predict computation time is compromised. The second major area is machine learning and inferencing, exploiting the tradeoffs between floating-point and fixed-point arithmetic in the FPGAs themselves.
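To make that floating- vs. fixed-point tradeoff concrete, here is a sketch of a Q8.8 fixed-point multiply of the kind an FPGA inference datapath might use in place of a floating-point unit. The Q8.8 format choice and helper names are assumptions for illustration, not taken from any specific toolchain.

```c
#include <stdint.h>

/* Q8.8 fixed point: 8 integer bits, 8 fractional bits in an int16_t.
 * Cheap in FPGA fabric; trades dynamic range and precision for area. */
typedef int16_t q8_8_t;

q8_8_t q8_8_from_double(double x) { return (q8_8_t)(x * 256.0); }
double q8_8_to_double(q8_8_t x)   { return x / 256.0; }

/* Multiply in a wider register, then shift the extra fractional
 * bits back out. A single DSP slice can implement this directly. */
q8_8_t q8_8_mul(q8_8_t a, q8_8_t b)
{
    return (q8_8_t)(((int32_t)a * (int32_t)b) >> 8);
}
```

Values like 1.5 and 2.0 are represented exactly and multiply exactly, while a value like 0.1 is quantized to the nearest 1/256–precisely the accuracy-for-area trade that makes fixed-point inference attractive on FPGAs.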
Top Down and Bottom Up
So far we have looked at this computation-demand problem from both a top-down and a bottom-up perspective. The top-down approach comprises the multiple parallel capture mechanisms that are gaining popularity. The bottom-up approach includes the several kinds of multicore architectures being offered by semiconductor vendors. In effect, this bottom-up approach is forcing programmers to find a matching top-down approach. Alternatively, if you start from the top down, systems programmers have no easy way to bind the parallel specifications to the multicore architectures while also maintaining portability across architectures–usually a critical requirement. Either way, the mapping process is cumbersome, highly manual, and iterative.
For example, if you target Linux SMP, parallel code has to be written using OS tasking facilities directly (such as POSIX threads). This code is at most portable to other platforms running the same OS (and version). Given that most compute environments are tending toward Linux, this is less of an issue, but taking this approach precludes using other OSes that might emerge, or reusing that code on a truly heterogeneous platform.
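A small example of how direct use of OS tasking facilities ties code to one platform: pinning a thread to a core with pthread_setaffinity_np. The _np suffix literally means “non-portable”–this is a GNU/Linux extension, and the sketch below will not build against other pthreads implementations, let alone a bare-metal DSP kernel.

```c
#define _GNU_SOURCE   /* required for the affinity extensions */
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to the given CPU. Returns 0 on success.
 * Linux/glibc only: cpu_set_t, CPU_ZERO/CPU_SET, and
 * pthread_setaffinity_np are not part of POSIX. */
int pin_self_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}
```

Code like this is often exactly what you need to get predictable performance out of an SMP box, and exactly what keeps that code from moving anywhere else.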
If systems programmers choose to program in the OpenCL language, the hope is that vendors support platform-independent libraries so you can truly compare performance and portability across architectures. This is usually not the case.
But even before we look at the binding between the top-down and bottom-up approaches, we are missing a fundamental first step before we start choosing between possible solutions. We lack clear criteria for measuring the goodness of different implementations, and hence we lack tools for estimating goodness before we have a complete prototype system to measure.
Each problem space may have its own preferred criteria for evaluating goodness. For example, system parameters like cost, throughput, and latency may be obvious. Power consumption may not be. In a wireless base station, where hundreds of cores are likely to be the norm, different solutions can show peak-power profiles that differ by 40% to 60%. Such differences certainly affect the type, and thus the price, of the power source the network operator must sustain. An automotive multicore ADAS may, as we mentioned earlier, end up consuming around 200 W at runtime on a single board. If a system has multiple boards, this causes a heat problem–but heat is primarily a function of average power, not peak power.
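The peak/average distinction matters because the two can diverge widely. A simple duty-cycle model makes the arithmetic explicit; the 200 W peak below echoes the ADAS figure above, while the idle power and duty cycle are illustrative assumptions.

```c
/* Time-weighted average power for a workload that runs at peak power
 * for a fraction `duty_cycle` of the time and idles otherwise.
 * Thermal design tracks this average; the power supply must still be
 * sized for the peak. */
double average_power_w(double peak_w, double idle_w, double duty_cycle)
{
    return peak_w * duty_cycle + idle_w * (1.0 - duty_cycle);
}
```

A board that peaks at 200 W but is busy only a quarter of the time, idling at 20 W, averages 65 W–a very different thermal problem from 200 W sustained, even though the supply and the operator’s tariff still see the 200 W peak.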
It’s not just system developers who face these prediction issues. Platform developers also have to understand the implications of their architectural choices. From a bottom-up, or architectural, point of view, it is important to quantify each approach in terms of what it does well. We need to measure how a particular solution’s architecture serves, or falls short of, the needs of a particular problem.
At the end of the day, implementation is the embodiment of behavior onto architecture. So as one evaluates the implementation of behavior on architecture, there are some questions to consider. Are you measuring the actual parameters of interest? How exactly do we do that before the system exists? Can you tune the solutions to optimize for the parameters of interest, or do you get useful estimates of the parameters so late in the design cycle that retuning the system is no longer practical? Is it possible to optimize for more than one parameter simultaneously, or do you risk having successive approximation turn into infinite regression? Is there a way to consider the constraints inherent in the problem domain?
We have seen that physical realities are pushing system developers deeper and deeper into use of symmetric, and now heterogeneous, multicore architectures. These architectures, in turn, have imposed new requirements on OS vendors, and are starting to change the way we capture software designs. But what has not changed yet is the way we model these multicore systems so that we can predict the impact of our task mapping, memory allocation, and coding decisions on the performance of the finished system.
Increasingly, these predictions need to comprehend not just simple measures of speed, but attributes across a wide space, ranging from bandwidth to average power to latency of specific sub-tasks. Our predictions need to reflect what is actually going on beneath the API level, where, even though it is supposed to be hidden from application developers, OS activity can significantly influence system behavior. And because we are making decisions that influence the implementation at fundamental levels—such as what tasks will go on what kinds of hardware—we need the predictions to be accurate earlier and earlier in the design process.
In short, we need tools that will work with early, behavior-level code, not with optimized modules. We need the tools to show us less-than-obvious factors that will be important in final system performance, including not just traditional code-segment profiling, but data and control transfers between tasks and between processors, behavior of interfaces between tasks, and—importantly—dynamic power behavior. Such tools are emerging, but they are coming none too soon for this generation of system designs.