Exploring a Parallel Universe—It’s Coming to a Design Near You

Moore’s law has finally started to taper off, but demand for performance has not. In response, the industry seems to have committed itself to a path of multicore processors and their bigger, bulkier cousins: heterogeneous multicore systems. This path could profoundly change the landscape for software developers, yet there has been little discussion of what, if anything, software or system programmers should do about it. This is scary!

One thing that is perfectly clear is that the demand for high-performance computing is not going away anytime soon. Applications and systems can consume all the computing horsepower a system delivers today, and they are already waiting for whatever you can deliver next. If you need convincing, here are three examples from three different worlds:

  • In the Internet of Things (IoT) world, a case study at last year’s Linley conference outlined the design of a smart watch (yes, a watch!) with tens of GPUs alongside its CPUs.
  • In the telecom world, sitting on the threshold of LTE Advanced (what most people call 4G or 4G+) wireless technology, architects estimate that next-generation base stations will need around 500 cores to meet the 1 ms latency the specification calls for. 5G is likely to be even more demanding.
  • In the automotive industry, researcher Edwin Olson, Associate Professor of Computer Science and Engineering at the University of Michigan, warned that to meet the computation demand of advanced driver assistance systems (ADAS) alone, prototype computation platforms already carry about 40 cores, and it is not unreasonable to expect them to dissipate more than 500 watts of heat. Don’t forget this is still a mobile platform.

And those are traditional embedded environments; we haven’t even talked about data centers yet. The ITRS 2.0 roadmap projects growth from 29 cores per socket in 2017 to 48 cores per socket just two years later.

Multicore Solutions in Many Shapes and Sizes

If you are on the software end of the design world, should you be worried? Will your world descend into a new generation of parallel programming languages? Are you looking at a transition where you have to learn new paradigms, as we did with client-server computing, object-oriented programming, or threaded programming? Those were relatively complicated at first, but we all eventually got it.

The question comes down to what exactly we will need to change. Will adjusting to this new scale of multicore computing be a process of learning some new coding techniques? Will we have to adopt new design-capture languages that can express how we want to use all those cores? Or are we looking at a more fundamental change in how we develop systems?

Operating-system vendors were the first to have to deal with this new world of multicores. They headed off a crisis by ensuring that simple additions to the programming models we already knew (threads and object models) would be sufficient from the application programmer’s point of view. The OS vendors back-filled the details of dealing with multiple cores through symmetric multiprocessing (SMP), hiding the complexities of these architectures. This is where we are today for the bulk of the programming world that demands high-performance computing (HPC).

But this may not be the end of the story. It seems that when the programming world undergoes one of these transitions, there is always a plethora of new languages. Anybody remember Objective-C, or the start of X Window System programming? Remember when hardware jocks would swear that Verilog (a C-like hardware description language) or VHDL (an Ada-like one popular among hardware designers) was the way the programming world would go? That is, of course, until MATLAB and its ilk of graphical programming languages claimed they would be the way instead.

It almost feels like we are in the early stages of such a transition. Developers on the bleeding edge have already gone down the path of some form of inherently parallel capture: OpenCL™ or OpenVX come to mind, or AUTOSAR in the automotive world, or POSIX threads and MPI in the HPC world. Some form of parallel capture seems imminent.

Figure 1. Different parallel capture mechanisms are starting to appear. (Graphic courtesy of www.pixabay.com)

The compiler world has been researching this space for at least the last ten years. But apart from low-hanging fruit like task-level parallelism (TLP), data-level parallelism (DLP), or pipeline-level parallelism (PLP) in special cases where the code is readily parallelizable (as in pixel-level or image-level repetition), there hasn’t been the breakthrough that will deliver (C++)++. Maybe it will yet happen. But unfortunately, the proverbial cart has arrived and the horse isn’t here yet. So what exactly can you do? What should you do? To answer that, we need to look deeper into the innards of multicore systems.

Homogeneous Cores with SMP

Look at most systems’ implementations today: racks of servers, each card carrying a combination of CPU and GPU chips, layered with operating systems that provide some kind of SMP. That is great if such systems address your immediate problems. Remember, however, that these SMP systems are built as shared-memory systems and work only for applications that can live with the shared-memory model. If the performance you get is not quite what you need, you simply throw more cores at the problem. Scant attention is paid to whether the return on investment (ROI) of the additional cores is really good, or whether you are heating up your data center unnecessarily. As long as more cores solve the problem, nobody is really worried about it. Yet.

But if you are running Linux as your SMP OS, there are inherent issues that make performance prediction (figuring out that ROI) hard. Here are five specific issues that must be understood, or at least accounted for:

  • Context switches: the load/store overhead of each switch, which can occur at unwanted and uncontrolled points in time
  • The amount of payload data sent across the communication channel between cores
  • Interrupt periodicity in a real-time OS
  • The overhead built into the inter-task communication application programming interfaces (APIs) used to simplify programming
  • The multiple programs and daemons that run in the background and can move or reschedule user threads

These are the ugly pieces of the problem. Nobody really talks about them; everyone would rather sweep them under the rug and hope they go away by system-debug time. But to predict performance improvement and analyze ROI before you have a complete system prototype, you have to at least take these realities into consideration.

Beginnings of True Heterogeneity

Such complexities are already realities for SMP systems. But if you are going to explore true heterogeneous computing (adding one or more DSPs to your architecture to speed up signal processing, say, or a hard IP block like a fast Fourier transform (FFT) core or a Viterbi decoder for a wireless application), a single OS will no longer work. Each kind of core will have its own OS, kernel, or bare-metal environment. Managing all these operating systems, and coding for each compute infrastructure, is challenging.

Figure 2.  A salad bar of choices for computing needs. (Graphic courtesy of www.pixabay.com)

A third approach that has been gaining attention is hyperscale: architectures using one or more FPGAs as part of the platform. Pooled FPGAs come up most often in two areas. One is improving network latency where disk access is a significant component of the application’s execution time; millions of accesses to a relational database on network-attached storage are one example. Tack on an OS-abstracted network file system (NFS) with its associated unpredictability, and your ability to predict computation time is further compromised. The second area is machine learning and inference, which exploits the tradeoffs between floating-point and fixed-point arithmetic in the FPGAs themselves.

Top Down and Bottom Up

So far we have looked at this computation-demand problem from both a top-down and a bottom-up direction. The top-down view comprises the multiple parallel capture mechanisms that are becoming popular. The bottom-up view comprises the several kinds of multicore architectures being offered by semiconductor vendors. In effect, the bottom-up side is forcing programmers to find a matching top-down approach. Starting from the top-down side instead, systems programmers have no easy way to bind parallel specifications to multicore architectures while maintaining portability across architectures, usually a critical requirement. Either way, the mapping process is cumbersome, highly manual, and iterative.

For example, if you target Linux SMP, parallel code has to be written directly against OS tasking facilities such as POSIX threads (pthreads). At best, this code is portable to other platforms running the same OS (and version). Given that most compute environments are tending toward Linux, this is less of an issue, but it precludes using other OSes that might emerge, or reusing that code on a truly heterogeneous platform.

If systems programmers choose to program in OpenCL, the hope is that vendors support platform-independent libraries so you can truly compare performance and portability across architectures. This is usually not the case.

But even before we look at the binding between the top-down and bottom-up approaches, we are missing a fundamental first step before we start choosing between possible solutions. We lack clear criteria for measuring the goodness of different implementations, and hence we lack tools for estimating goodness before we have a complete prototype system to measure.

Each problem space may have its own preferred criteria for evaluating goodness. System parameters like cost, throughput, and latency may be obvious; power consumption may not be. In a wireless base station, where hundreds of cores are likely to be the norm, different solutions can have peak-power profiles that differ by 40 to 60 percent. Such differences certainly affect the type, and thus the price, of the power source the network operator has to sustain. An automotive multicore ADAS platform may end up consuming around 200 W at runtime on a single board. If a system has multiple boards, this causes a heat problem, but heat is primarily a function of average power, not peak power.

It’s not just system developers who face these prediction issues. Platform developers must also understand the implications of their architectural choices. From the bottom-up, architectural point of view, it is important to quantify what each approach does well, and to measure how a particular solution’s architecture serves, or fails, the needs of a particular problem.

Figure 3. Efficient mapping is not easy! (Graphic courtesy of www.pixabay.com)

At the end of the day, implementation is the embodiment of behavior on architecture. So as you evaluate the implementation of behavior on architecture, there are questions to consider. Are you measuring the actual parameters of interest? How exactly do you do that before the system exists? Can you tune the solution to optimize for those parameters, or do useful estimates arrive so late in the design cycle that retuning the system is no longer practical? Can you optimize for more than one parameter simultaneously, or do you risk successive approximation turning into infinite regression? Is there a way to account for the constraints inherent in the problem domain?


We have seen that physical realities are pushing system developers deeper and deeper into use of symmetric, and now heterogeneous, multicore architectures. These architectures, in turn, have imposed new requirements on OS vendors, and are starting to change the way we capture software designs. But what has not changed yet is the way we model these multicore systems so that we can predict the impact of our task mapping, memory allocation, and coding decisions on the performance of the finished system.

Increasingly, these predictions need to comprehend not just simple measures of speed but attributes across a wide space, from bandwidth to average power to the latency of specific sub-tasks. Our predictions need to reflect what is actually going on beneath the API level, where OS activity, even though it is supposed to be hidden from application developers, can significantly influence system behavior. And because we are making decisions that shape the implementation at fundamental levels, such as which tasks go on which kinds of hardware, we need the predictions to be accurate earlier and earlier in the design process.

In short, we need tools that work with early, behavior-level code, not just optimized modules. We need tools that show us the less-than-obvious factors that will matter in final system performance: not just traditional code-segment profiling, but data and control transfers between tasks and between processors, the behavior of interfaces between tasks, and, importantly, dynamic power behavior. Such tools are emerging, but they are coming none too soon for this generation of system designs.


CATEGORIES: All / AUTHOR: Kumar Venkatramani, VP Business Development, Silexica Inc.

10 comments to “Exploring a Parallel Universe—It’s Coming to a Design Near You”

  1. Dear Mr. Venkatramani,

    I was delighted to read your article – I’m not sure Intel will like it …:)

  2. Dear Dr Sender,
    Altera is already Intel…

  3. parallel processing?

  4. Looks like the true days of AI, decision-making trees, and predictive analysis are really upon us.

    I am wondering who will rise to be the Alan Turing of the 21st century. From some reports it may be one of Microsoft’s latest additions, who appears to be a true tool maker.

    As always I am confident that the science of the industry will come up with yet another paradigm shift, and all will be lovely in our garden.

  5. Some years ago I was working with some guys doing research using Transputers – massively parallel (they were designed that way anyway). There was not much software around, but one of the programmers found a parallel Fortran from Lahey (I think that is the spelling).

    It was not very adaptive and all had to be recompiled whenever the transputer configuration was changed. Final output at that time was to simulate explosions in a confined space which led to a better understanding of explosions in coal mines.

    Way bleeding edge at the time (late 80s). Have we moved forward much since then?

    • I believe that you Transputer veterans should get together and document that whole story, from the original architectural thinking to the tools and applications that grew up around the chips. It would make a great read, and it might save the current generation of folks who think they are discovering unknown lands a lot of trouble and, now and then, embarrassment.

      • I also would like to hear the transputer story – that was a topic that captured my imagination as a student as it struck me as such an elegant solution going forward. However, it all seemed to disappear (and we all got distracted by rising clock speeds anyway). No doubt there were issues – I can only assume the programming model was somewhat non-humanistic as you scale.
        In many ways, the big issue is to find a humanistic approach to massively parallel computation (in the way that multi-threaded programming is not). I am encouraged to see Intel clearly take an interest in FPGAs – truly parallel bespoke hardware is the ultimate concurrent processor of course – it’s just hard and mostly limited to the domain of us electronic engineers.


  6. Fantastic article.

    I agree with Nicholas Outram in that multi-threaded programming seems a bit kludgy/hacky right now, not that natural.
    I think if OOP were to be more formally state-oriented, things could be done without mutexes and the like. An object could be either busy, or available.

    Also I think the natural progression would be to just extend OOP with parallelism as ad-hoc optimizations requested by the programmer at different levels.
    I wouldn’t convolve the modeling of the problem with the system model.

    I think functions, loops, objects and object arrays could be tagged with a request to run on a particular core or cores, or this be automatically determined by an administrator, either a simplified routine for bare-metal or as part of a full OS.

    I’m not aware of all the approaches out there, but I would propose something like this (oversimplified):

    //Per core main loop, on fixed task system:
    if(core_id==1) { main_loop_1(); }
    if(core_id==2) { main_loop_2(); }

    //Per object, explicit core assignment
    ball ball1; core: 1
    ball ball2; core: 2

    agent agents[500]; core: all, auto

    for(i=0;i<1000;i++){ core: all, auto

    int sum(int x, int y){ core: 3

    I don't know if I'm missing something, nowhere near as experienced as everyone here.

  7. In the early ’90s I was a PhD student doing a comparison between two parallel architectures: Transputers (T800) and the Texas Instruments TMS320C40 DSP. I had at that time a 10-processor Transputer board and another board of 4 DSPs. The good thing about Transputers at the time was the Occam language, which was designed specifically for Transputers and really made writing parallel code simple. However, the DSP processor was superior in performance (FFT, filtering, etc.) but really lacked a good development tool. What I found at that time is that understanding the underlying architecture of the DSP processor, learning its assembly language, and using it to develop your programs for a single processor is usually better than learning to deal with all the parallel stuff and writing code in a high-level language. I ended up tuning all my algorithms in assembly for the DSP board while keeping the (3L parallel C) development tool as the higher abstraction to provide the functions needed to communicate with the operating system. I programmed all the communications between the processors, and it was a nightmare when you had a deadlock somewhere. To debug such an error at that time, I used to print the code for each processor, lay it on the building floor, and walk through it (in parallel), only to find that at some moment one processor was expecting a 64-byte block of data from its neighbor while the sender was only sending 32 bytes.

    Parallel processing’s main objective is achieving higher performance. Learning to develop code for multiprocessor systems, and being able to test, debug, and measure its efficiency, is not an easy task. Compiler developers should exploit the architectural features of emerging processors as soon as possible so that programmers do not waste time learning their assembly. To illustrate: the TMS320C40 could do a multiplication and an addition at the same time provided the operands were stored in certain registers, something higher-level-language compilers should explore and make use of. Going to a parallel paradigm to achieve higher efficiency when the same could have been achieved with a finely tuned single-processor system is really a waste of time, money, and effort.

    Nowadays, there are many development tools to help developers write code for multiprocessor systems. Most general-purpose computers (e.g., PCs) are multicore systems, and there is no escape from learning how to use such hardware efficiently. Educational institutions must develop their computing curricula to include the concepts and means of parallel programming.
