The scenario is becoming increasingly familiar. You have a working embedded design, perhaps backed by years of deployment with customers and hundreds of thousands of lines of debugged code. Along comes marketing with a new set of performance specifications, or R&D with a new computer-crushing algorithm. Your existing CPU family just can’t handle it.
At this point your options can look uniformly dismal. Perhaps you can move to a higher-performance CPU family without totally losing instruction-set compatibility. But there will almost certainly be enough differences to require a new operating-system (OS) version and re-verification of the code. And the new CPU will have new hardware implications, from increased power consumption to different DRAM interfaces.
Of course you can also move to a faster CPU with a different instruction set. But a new tool chain, from compiler to debugger, plus the task of finding all those hidden instruction-set dependencies in your legacy code, can make this move genuinely frightening. And changing SoC vendors will have system-level hardware implications too.
Or you could try a different approach: you could identify the performance hot spots in your code—or in that new algorithm—and break them into multiple threads that can be executed in parallel. Then you could execute this new code in a multicore CPU cluster. Unless you are currently using something quite strange, there is a good chance that there is a chip out there that has multiple instances of the CPU core you are using now. Or, if there is inherent parallelism in your data, you could rewrite those threads to run on a graphics processing unit (GPU), a multicore digital signal processing (DSP) chip, or a purpose-built hardware accelerator in an FPGA or ASIC. All these choices require new code—but only for the specific segments you are accelerating. The vast majority of that legacy code can remain unexamined.
If you decide to take the parallel route, the obvious next question is how to do it. As you would expect, the answer depends on the nature of the parallelism you found in your code and on your hardware decisions.
There is always the manual approach. Suppose you are going to exploit task parallelism. For example, the requirement that is causing you trouble is a maximum-latency spec, and there are just too many instructions on the worst-case path through the routine. If you have N CPUs available in a multicore configuration, you may be able to break the critical routine into N independent threads that can run in parallel on separate CPU cores. You may have to execute some of those threads speculatively, so you don’t have most of the threads stalled waiting for the result of one computation somewhere.
With a multicore-aware OS, you can associate each thread with a different core, launch them all at once, and reduce the task latency to the latency of the slowest thread, plus a bit of overhead. If you are lucky, the improvement may be nearly 1/N. Of course you are responsible for identifying the threads, and for avoiding contention problems in shared memory or I/O. If someone changes the algorithm or the hardware, you may have to start over.
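A minimal sketch of this manual approach, using POSIX threads in C (the function and type names here are illustrative, not from any particular vendor's code): the hot loop is split into fixed slices, each slice runs in its own thread, and the main thread joins and combines the partial results.

```c
#include <pthread.h>
#include <stddef.h>

#define NTHREADS 4

typedef struct {
    const int *data;
    size_t start, end;
    long partial;
} slice_t;

/* Each thread sums its own slice of the array; no shared writes,
   so no locking is needed until the final combine step. */
static void *sum_slice(void *arg) {
    slice_t *s = (slice_t *)arg;
    long acc = 0;
    for (size_t i = s->start; i < s->end; i++)
        acc += s->data[i];
    s->partial = acc;
    return NULL;
}

/* Break the critical routine into NTHREADS independent threads that
   the OS can place on separate cores, then join and combine. */
long parallel_sum(const int *data, size_t n) {
    pthread_t tid[NTHREADS];
    slice_t slice[NTHREADS];
    size_t chunk = n / NTHREADS;

    for (int t = 0; t < NTHREADS; t++) {
        slice[t].data = data;
        slice[t].start = (size_t)t * chunk;
        slice[t].end = (t == NTHREADS - 1) ? n : slice[t].start + chunk;
        pthread_create(&tid[t], NULL, sum_slice, &slice[t]);
    }

    long total = 0;
    for (int t = 0; t < NTHREADS; t++) {
        pthread_join(tid[t], NULL);
        total += slice[t].partial;
    }
    return total;
}
```

Note how much is on the programmer here: the decomposition, the slice boundaries, and the contention-free combine are all hand-built, and all of it must be revisited if the algorithm or the core count changes.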
Another manual opportunity arises with hardware accelerators. If you have access to a GPU you can use it to exploit a high level of data parallelism. You would need to recode your task in a vendor-developed GPU language such as Nvidia’s CUDA. Similarly, if you are accelerating a task with an FPGA, you will use Verilog or VHDL to describe the necessary state machines and data paths to perform the accelerated calculations. In both GPU and FPGA cases there are translators that can at least get you started on moving your C/C++ into the necessary hardware-specific language. But there will still be manual steps on the front and back ends of the process.
There is an obvious need for automation here: a tool that would traverse your code, compare paths against timing constraints, identify parallelizable hot spots, and transform them into code for the target hardware. No such tool exists. But there are two tools, both developed outside the embedded-computing world, that can do a big part of the job.
The challenges of parallelization arose in other areas long before they struck embedded computing. In high-performance computing, for example, the shift from massive single-thread processors like the early Cray machines to arrays of smaller multicore processors forced programmers to rework their code. They had to go from one giant thread with great locality of reference to many semi-independent threads that could run on separate cores. Because really big jobs often have to move to the first available supercomputer, programmers sought a framework that would allow them to do this rework in a machine-independent way. One of the most successful answers to that need has been OpenMP.
True to its heritage, OpenMP was conceived to assist a programmer in adapting code to a homogeneous, shared-memory multiprocessing environment (Figure 1). It has recently been extended to work with accelerators as well as multicore clusters, but it remains based on shared memory. Ideally, once you have adapted your code to OpenMP it will execute correctly on any system that has an OpenMP platform, though the performance will naturally vary enormously.
OpenMP works by reading pragmas—compiler directives you put into your C source, or structured comments in FORTRAN—to create parallel regions in the code, separate out multiple threads in these regions, and assign the threads to separate processors. You mark the beginning and (in FORTRAN) the end of a parallel region with pragmas. Then—the following is a great simplification of a very broad range of capabilities—you can select from two different models. You can tell OpenMP that the enclosed code is a for-loop (or DO loop, if you are a FORTRAN person) and that OpenMP should divide the loop’s iterations among threads to be executed in parallel. This is the approach you would normally take to exploit data parallelism. Or you can tell OpenMP that the code contains a number of independent sections—each of which you mark with more pragmas—that may be executed on separate processors. This is the approach for task parallelism.
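The two models can be sketched in a few lines of C (function names are illustrative). A useful property, noted below, is that a compiler without OpenMP support simply ignores the pragmas and runs the code serially.

```c
#include <stddef.h>

/* Model 1, data parallelism: OpenMP divides the loop iterations
   among threads. Without an OpenMP-enabled compiler the pragma is
   ignored and the loop runs serially, producing the same result. */
void scale_array(float *x, size_t n, float k) {
    #pragma omp parallel for
    for (size_t i = 0; i < n; i++)
        x[i] *= k;
}

/* Model 2, task parallelism: independent sections, each marked with
   its own pragma, that may execute on separate processors. */
void run_independent(void (*task_a)(void), void (*task_b)(void)) {
    #pragma omp parallel sections
    {
        #pragma omp section
        task_a();

        #pragma omp section
        task_b();
    }
}
```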
The OpenMP platform assigns the threads you have identified to processors based on directives you put into the pragmas, on hardware availability, and on run-time variables. So you can achieve anything from a purely static mapping of tasks onto processors to a fully dynamic system. If you tell OpenMP to stand down, your original program runs in its original form, as a single thread on one CPU. If you ask OpenMP to use more CPUs, it will try to do so in the manner you request. Additional directives govern synchronization of threads, ownership of and access to variables, and other ugly necessities of parallel programming.
Note that OpenMP deals with the mechanics, not the analysis. It is entirely up to you to decide which tasks should be parallelized, how to prevent contention issues, how to avoid deadlocks, and so forth. OpenMP does not specify an inter-task communications model or provide a debug environment—you are free to choose your own. OpenMP basically sits on top of your existing tool chain and OS—although there has been some work on creating bare-metal OpenMP platforms.
Separate from the work on OpenMP in the supercomputing world, a team at Apple began some years ago to study the same challenge—parallel code—in a very different context. Rather than working with systems of many identical cores sharing a common virtual memory, the Apple team focused on data-parallel problems running on smaller, asymmetric architectures in which one CPU was attached to some number of hardware accelerators. That is, they were looking at personal computers. Intel, Nvidia, and AMD, all of which were working on CPU+GPU SoCs, quickly joined the effort. Their result was the Open Computing Language (OpenCL).
Like OpenMP, OpenCL provides a platform for parallelization. Unlike original OpenMP, OpenCL envisions a heterogeneous environment in which each processor has access to a clearly-defined memory hierarchy grounded in local memory (Figure 2). And unlike OpenMP, OpenCL requires that some of your C code be converted to a subset dialect with some extensions. (Sorry, FORTRAN programmers.) In exchange for rewriting in a more restricted language, OpenCL platforms translate the parallel portions of your code to execute on supported accelerator hardware. Today, that list includes some GPUs, multicore DSP SoCs, FPGAs, and fixed-function accelerators such as video processors—or a mixture of these devices.
The OpenCL development flow is rather different from OpenMP’s development flow. Instead of one program punctuated by pragmas, you develop a set of programs. The main program—the equivalent of the master thread in OpenMP—you write in C/C++, Python, Java, or one of a few lesser-known languages. In many cases, this will simply be your legacy program with a few segments pulled out and replaced by application programming interface (API) calls. You compile this host program normally, along with an OpenCL API library, and it executes on your host CPU. (Note that the thinking in OpenCL is oriented toward a host CPU and coprocessors, even though it can be used with homogeneous multicore hardware if you wish.)
The portions of the code you want to accelerate—the pieces you would have flagged with pragmas in OpenMP—you rewrite as separate small programs in OpenCL’s dialect of C. These small programs are called kernels. Your main program invokes them through the OpenCL API calls you inserted in place of the original code.
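As a sketch, a kernel for a simple vector add might look like this in OpenCL's C dialect (the kernel name `vadd` is illustrative). Each work-item handles one array index, so the data parallelism is explicit in the code rather than inferred from a loop:

```c
/* vadd.cl: a kernel in OpenCL's C dialect. The host launches one
   work-item per element; get_global_id(0) tells each work-item
   which element is its responsibility. The __global qualifier
   places the buffers in the device's global memory. */
__kernel void vadd(__global const float *a,
                   __global const float *b,
                   __global float *c)
{
    int i = get_global_id(0);
    c[i] = a[i] + b[i];
}
```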
At run time, the OpenCL platform looks to see what hardware is actually available to it. If there are no accelerators available, the kernels may use the main CPU (if it is an Intel or AMD architecture) as if it were an accelerator, so you essentially get your original single-thread program. If there is additional hardware, OpenCL compiles the kernels to exploit it. This might include unrolling loops or unpacking vector operations and spreading the resulting threads across multiple CPU cores, or across the many small processors in a GPU. Or it could mean binding the kernel’s input data to a fixed-function accelerator in an SoC or FPGA. When you run the main program, it invokes the kernels, which in turn execute on the accelerators.
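The host side of that flow can be sketched with standard OpenCL API calls. This is an illustration rather than production code: all error checking is omitted, the names (`run_vadd`, `vadd_src`) are invented for the example, and it needs an OpenCL runtime and at least one device to actually execute.

```c
#include <CL/cl.h>
#include <stddef.h>

/* Kernel source carried as a string; the platform compiles it at run
   time for whatever device it discovers. */
static const char *vadd_src =
    "__kernel void vadd(__global const float *a,"
    "                   __global const float *b,"
    "                   __global float *c) {"
    "    int i = get_global_id(0);"
    "    c[i] = a[i] + b[i];"
    "}";

void run_vadd(const float *a, const float *b, float *c, size_t n) {
    cl_platform_id plat;
    cl_device_id dev;
    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_DEFAULT, 1, &dev, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
    cl_command_queue q = clCreateCommandQueueWithProperties(ctx, dev, NULL, NULL);

    /* Stage the input data in device-visible buffers. */
    cl_mem a_buf = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                  n * sizeof(float), (void *)a, NULL);
    cl_mem b_buf = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                  n * sizeof(float), (void *)b, NULL);
    cl_mem c_buf = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY,
                                  n * sizeof(float), NULL, NULL);

    /* Run-time compilation of the kernel for the discovered device. */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &vadd_src, NULL, NULL);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "vadd", NULL);

    clSetKernelArg(k, 0, sizeof(cl_mem), &a_buf);
    clSetKernelArg(k, 1, sizeof(cl_mem), &b_buf);
    clSetKernelArg(k, 2, sizeof(cl_mem), &c_buf);

    /* One work-item per array element, then read the result back. */
    clEnqueueNDRangeKernel(q, k, 1, NULL, &n, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(q, c_buf, CL_TRUE, 0, n * sizeof(float), c,
                        0, NULL, NULL);
}
```

Note that nothing in this host code names a particular accelerator; whether `vadd` lands on a GPU, a DSP, the host CPU, or an FPGA is decided by the platform at run time.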
The FPGA presents an interesting special case, because OpenCL tools for FPGAs can compile a kernel into Verilog and synthesize a hardware accelerator for the FPGA off-line, before run time. The OpenCL driver then loads the accelerator into the FPGA at run time. By creating custom hardware you can often eliminate instruction fetches and decodes, eliminate internal data movement, and exploit fine-grained parallelism that might not be available to a conventional CPU.
Since the initial release of OpenCL there have been numerous additions to the spec, including more data types, more flexible memory organization, ability to partition devices such as large GPUs or FPGAs into several different accelerators, and the addition of pipes—allowing a program to forego buffers and to stream data from an input directly through an accelerator.
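A hedged sketch of that pipe feature in OpenCL 2.0 kernel code (kernel and parameter names are illustrative): a producer kernel streams values directly to a consumer kernel, with no intermediate global buffer in between.

```c
/* OpenCL 2.0 pipes: the host creates a pipe object and passes it to
   both kernels. Values flow producer-to-consumer without an
   intermediate buffer in global memory. */
__kernel void producer(__write_only pipe int out)
{
    int v = get_global_id(0);
    write_pipe(out, &v);            /* push one value into the stream */
}

__kernel void consumer(__read_only pipe int in,
                       __global int *result)
{
    int v;
    if (read_pipe(in, &v) == 0)     /* 0 indicates a value was read */
        result[get_global_id(0)] = v;
}
```

On an FPGA target this maps naturally onto on-chip streaming channels, which is part of why the pipe addition matters for the FPGA case described above.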
So how to choose: manual parallelization, OpenMP, or OpenCL? (Yes, there are other, less-publicized options out there as well.) The first question, of course, is which platforms are available for the hardware you are considering. There are multicore OSs available for most multicore SoCs—although not all of them support asymmetric multiprocessing. So the manual route is usually viable without having to create your own bare-bones RTOS.
OpenMP is arguably the best solution for task parallelism or for homogeneous multicore SoCs, and it offers a beguilingly simple programming model. But it is just beginning its migration from supercomputing into the embedded world. There appears to be vendor support for the Texas Instruments Keystone ARM®-plus-multicore-DSP SoCs. And suggestively, in February ARM joined the OpenMP Architecture Review Board, giving the embedded giant a say in the direction of the platform. But both of these facts may be more closely related to those companies’ strategies in the server world than to their plans for embedded computing.
OpenCL, with its natural affinity for data-parallel acceleration on heterogeneous systems, has moved more quickly into the embedded infrastructure. Some SoCs combining CPUs with GPUs, some DSP SoCs, and some FPGAs offer OpenCL platforms. And an increasing number of board support packages for these chips provide turnkey OpenCL environments.
None of these solutions offers automatic parallelization of legacy code. And the entire industry is still working on effective debug methodology for multiprocessing systems. No one is suggesting that platforms make these designs easy. But as a way out of a box canyon—requirements beyond the roadmap of your current architecture—moving to parallel execution can be vastly easier and less risky than moving a huge legacy code base to a new instruction-set architecture. And the resulting performance may be superior, with lower energy consumption. It’s worth a look.
Explore a turnkey implementation of OpenCL for FPGAs.
See the range of multicore operating system support available for Altera® SoCs.
Read a case study of parallelizing a critical routine using an RTOS, Linux, and bare-metal programming.
See how OpenCL is used to implement a Map/Reduce algorithm for big-data analysis.