Hardware Acceleration for Embedded Computing

The Hot Chips conference for 2015 has come and gone, leaving us with a snapshot of some of the best contemporary thinking on silicon architectures. And while the scope of this year’s conference was wide—from wearables to supercomputers—a handful of papers, taken together, sketched out a strategy for the near future of embedded computing.

Why would we look to huge, ambitious chips to divine the direction of a market that loves microcontrollers? Two explanations come to mind. First is the growing virtualization of embedded systems. As more tasks migrate from dedicated hardware to the cloud, architects of server processors and the hardware accelerators that accompany them have to face the realities of the embedded systems world.

Second, embedded applications themselves are evolving. Growing use of machine vision and other robotic algorithms; of state estimators, Kalman filters, and similar techniques to augment direct measurements in feedback loops; of computed executable system models in the loop; and of machine-learning algorithms is generating new, heavyweight computing tasks in even familiar embedded applications. In many designs these tasks threaten to swamp the capabilities of even multicore embedded SoCs.
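Even the simplest of these techniques hints at the load. A scalar Kalman filter update is only a few lines of arithmetic, yet it must run on every pass through the control loop, and production estimators replace the scalars below with full matrix operations. The sketch that follows is purely illustrative; the noise terms and readings are invented, not taken from any Hot Chips paper.

```cpp
// Minimal scalar Kalman filter: fuse a noisy sensor reading into a state
// estimate on every control-loop pass. All constants are illustrative.
#include <cstdio>

struct Kalman1D {
    double x = 0.0;    // state estimate
    double p = 1.0;    // estimate variance
    double q = 0.01;   // process noise variance
    double r = 0.1;    // measurement noise variance

    double update(double z) {
        p += q;                    // predict: uncertainty grows
        double k = p / (p + r);    // Kalman gain
        x += k * (z - x);          // correct with measurement z
        p *= (1.0 - k);            // uncertainty shrinks
        return x;
    }
};

int main() {
    Kalman1D kf;
    const double readings[] = {1.2, 0.9, 1.1, 1.05, 0.95};
    for (double z : readings)
        std::printf("estimate = %f\n", kf.update(z));
    return 0;
}
```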

A Taxonomy of Accelerators

Against this background of change, the Hot Chips papers sketched a range of alternative futures, each represented by an architectural approach. Perhaps the most senior of these was the vector digital signal processing (DSP) accelerator, presented in its latest incarnation by Qualcomm. But FPGAs were there as well, with Altera and Xilinx sharing their latest thinking on programmable logic in advanced geometries. GPUs showed up in both commercial and open-source versions. And homogeneous many-core arrays appeared to be gaining momentum, with Intel showing a second generation of its huge and much-debated Xeon Phi.

There were also papers comparing the relative merits of the alternative architectures in particular applications, including a candid discussion from many-core vendor Kalray and a tabulation of results by a research team from Microsoft.

When to Use Which

Kalray CTO Benoit Dupont de Dinechin began his paper, a description of his firm’s latest many-core processor, with a succinct tour of the alternatives open to today’s designers. FPGAs, he said, excel at bit-level operations and can provide low-latency, deterministic performance. But their programming model generally requires register transfer level (RTL) hardware description languages such as Verilog. DSP cores excel at fixed-point, repetitive arithmetic operations across streams of organized data. But to achieve their efficiency they require careful coding by an expert.
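The careful coding de Dinechin mentions is mostly fixed-point bookkeeping: explicit scaling, rounding, and saturation. The fragment below is a generic Q15 multiply-accumulate loop in C++, not code from any of the presenters, but it shows the kind of detail a DSP programmer takes on.

```cpp
// Generic Q15 fixed-point multiply-accumulate loop: the repetitive,
// stream-oriented arithmetic DSP cores are built for. Coefficients and
// samples are illustrative.
#include <cstdint>
#include <cstdio>

int16_t fir_q15(const int16_t* x, const int16_t* h, int taps) {
    int32_t acc = 0;                                  // wide accumulator guards against overflow
    for (int i = 0; i < taps; ++i)
        acc += static_cast<int32_t>(x[i]) * h[i];     // Q15 * Q15 -> Q30
    acc = (acc + (1 << 14)) >> 15;                    // round and rescale back to Q15
    if (acc > 32767)  acc = 32767;                    // saturate, as DSP MAC hardware does
    if (acc < -32768) acc = -32768;
    return static_cast<int16_t>(acc);
}

int main() {
    const int16_t samples[4] = {1000, 2000, 3000, 4000};
    const int16_t coeffs[4]  = {8192, 8192, 8192, 8192};   // 0.25 in Q15
    std::printf("%d\n", fir_q15(samples, coeffs, 4));
    return 0;
}
```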

GPUs, in comparison, are best at highly regular computations with dense patterns of memory access, de Dinechin said. But their architectures make them unsuitable for real-time computations. Unsurprisingly, de Dinechin’s taxonomy favored large arrays of simple, general-purpose CPU cores for real-time work—not for any inherent superiority of many-core architectures, but because of their underlying simplicity.

“Today real-time systems are very demanding,” he declared. “They must be deterministic, predictable, and composable. For demands of functional safety, they must be certifiable through static analysis.” The advantage of simple CPU cores, he explained, is in the features they don’t have. “Non-determinism comes from modern features to capture the last little bit of performance, such as out-of-order execution, speculation, and threads that share physical resources.” This taxonomy neatly organizes accelerator architectures into categories. And it provides useful generalizations to compare against the claims in subsequent papers.

Another, more cloud-centric survey came from Microsoft researcher Eric Chung. Chung presented results on various ways to accelerate deep-learning convolutional neural networks (CNNs) in a CPU-based server environment. His team considered bare CPUs, CPUs with a remote pool of GPU or ASIC accelerators, a GPU or ASIC attached to each CPU, and an FPGA attached to each CPU.

Focusing on the FPGA alternative, Chung described creating a systolic array of arithmetic logic units (ALUs), with accompanying control logic, in an FPGA. He said that the design could be scaled for numerical precision, array dimensions, and data-access parameters.
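In spirit, each processing element in such an array holds a weight, multiplies the value streaming past it, and hands a growing partial sum to its neighbor. The sketch below is a software model of that dataflow, templated on data and accumulator types to echo the precision scaling Chung described; it is purely illustrative and is not Microsoft’s design.

```cpp
// Software model of a systolic dot-product chain: each processing element
// holds a weight, multiplies the activation passing through it, and adds to
// the partial sum flowing alongside. Illustrative only.
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstdio>

template <typename T, typename Acc, std::size_t N>
Acc systolic_dot(const std::array<T, N>& weights, const std::array<T, N>& activations) {
    Acc partial = 0;
    // In hardware all N elements work concurrently on a pipeline of inputs;
    // this loop models what one wave of data sees as it marches through.
    for (std::size_t pe = 0; pe < N; ++pe)
        partial += static_cast<Acc>(weights[pe]) * static_cast<Acc>(activations[pe]);
    return partial;
}

int main() {
    std::array<int8_t, 4> w = {1, -2, 3, -4};    // low-precision weights
    std::array<int8_t, 4> a = {10, 20, 30, 40};  // activations streaming in
    std::printf("%d\n", systolic_dot<int8_t, int32_t, 4>(w, a));
    return 0;
}
```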

In initial tests, the FPGA design on an Altera® Arria® 10 device ran seven times faster, at 1/40th the energy per task, than the same CNN evaluation task in software on dual 8-core Xeon CPUs. But the FPGA design was less than a tenth the speed, and half the energy efficiency, of a highly tuned version of the workload on an Nvidia Titan X GPU board. Chung said his team projects that an optimized FPGA design, scaled up to fill the chip, would reach about one-fifth the performance of the Titan X at nearly 50% better peak GOPS/joule.

Chung concluded that CNN evaluation proved an ideal task for GPU architectures. FPGAs, however, could deliver better energy efficiency and could be reconfigured to serve other tasks. Microsoft has already demonstrated using the devices to accelerate Bing searches and as a smart network offload adapter on the server board.

While these two papers provided an index of sorts into the range of accelerator architectures possible in today’s semiconductor and packaging processes, further Hot Chips papers explored particular options in detail. Let’s start out with perhaps the oldest of the ideas, the DSP core.

DSP Lives On

The DSP chip business may no longer be exciting, but DSP cores still play key roles in many application-specific SoCs. Small cores are embedded almost invisibly inside functional blocks like modems and audio processors. But by adapting techniques from high-performance CPUs, DSP cores have made themselves essential for more demanding tasks—such as image processing and machine vision—as well.

For example, Qualcomm senior director of technology Lucian Codrescu described the latest version of the Hexagon HVX DSP core used in his company’s Snapdragon SoCs (Figure 1). A summary reads like a survey of modern architectural ideas: a four-slot very long instruction word (VLIW) organization, scalar and 1,024-bit vector SIMD pipelines, and four-way hardware-supported multithreading.

Figure 1. The Hexagon DSP borrows concepts from advanced CPU cores.


In addition, the microarchitects have made some choices that seem specific to the HVX’s role as a vision-processing accelerator. One is that the scalar and vector pipelines are integer, not floating-point. “Floating point isn’t needed for the majority of these applications. So we chose to save the power by not implementing it in hardware,” Codrescu explained. Another is the somewhat unusual cache architecture. The scalar unit attaches to conventional L1 instruction and data caches, coherent with a megabyte-scale L2. The vector unit is driven by the L1 instruction cache, but bypasses the L1 data cache to connect directly to the L2 for data reads and writes. The thinking appears to be that the vector SIMD unit will work primarily on streaming pixel data, and that it will be far more efficient to stream pixels through the big, slower L2 than to cascade them, a dribble at a time, through the L1.
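The workload those choices target looks, in plain C++, something like the loop below: integer-only arithmetic over long, regular runs of pixels, with 8-bit data widened just enough to avoid overflow. It is a generic sketch, not Qualcomm code and not HVX intrinsics.

```cpp
// Integer-only 1x3 horizontal box filter over a row of pixels: regular,
// streaming access with 8-bit data widened to 16 bits for the arithmetic.
// A generic sketch of the workload, not Qualcomm code or HVX intrinsics.
#include <cstdint>
#include <cstdio>
#include <vector>

void box1x3(const uint8_t* src, uint8_t* dst, int width) {
    dst[0] = src[0];
    dst[width - 1] = src[width - 1];
    for (int x = 1; x < width - 1; ++x) {
        uint16_t sum = static_cast<uint16_t>(src[x - 1]) + src[x] + src[x + 1];
        dst[x] = static_cast<uint8_t>(sum / 3);   // no floating point anywhere
    }
}

int main() {
    std::vector<uint8_t> row(640, 100), out(640);
    row[320] = 250;                               // one bright pixel to smear
    box1x3(row.data(), out.data(), 640);
    std::printf("%d %d %d\n", out[319], out[320], out[321]);
    return 0;
}
```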

Supporting this supposition, the L2 connects to smart data-movers that transport pixels from a camera sensor to the cache, and from the cache to a dedicated image signal processor elsewhere on the chip. The L2 also connects to an ARM® system memory manager, allowing sharing between the Hexagon L2, the ARM CPU cluster, and main memory.

Any resemblance between this arrangement and ARM’s NEON™ SIMD engine is quite intentional, Codrescu suggested. Qualcomm made every attempt to preserve the NEON programming model, starting with the basic approach of C/C++ code with hand-crafted application libraries, POSIX-ish threads, and an LLVM (low-level virtual machine) compiler tool chain. Interestingly, Qualcomm is also developing a port of Halide, a domain-specific language for parallel image processing.
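Halide’s appeal is that it separates what a kernel computes from how that computation is scheduled onto vector lanes and threads. The sketch below is the language’s canonical separable box blur written against Halide’s standard C++ front end; it is a generic illustration, not one of Qualcomm’s kernels, and the image contents are placeholders.

```cpp
// A small Halide pipeline: the algorithm (a separable 3x3 box blur) is stated
// once; the schedule maps it onto vector lanes and worker threads. Assumes
// Halide's C++ front end; the image contents are placeholders.
#include "Halide.h"
#include <cstdio>
using namespace Halide;

int main() {
    Buffer<uint8_t> input(640, 480);
    input.fill(128);                               // stand-in image data

    Var x("x"), y("y");
    Func clamped = BoundaryConditions::repeat_edge(input);
    Func in16("in16"), blur_x("blur_x"), blur_y("blur_y");

    // The algorithm: what gets computed.
    in16(x, y)   = cast<uint16_t>(clamped(x, y));
    blur_x(x, y) = (in16(x - 1, y) + in16(x, y) + in16(x + 1, y)) / 3;
    blur_y(x, y) = cast<uint8_t>((blur_x(x, y - 1) + blur_x(x, y) + blur_x(x, y + 1)) / 3);

    // The schedule: how it maps onto SIMD lanes and threads.
    blur_y.vectorize(x, 8).parallel(y);

    Buffer<uint8_t> result = blur_y.realize({input.width(), input.height()});
    std::printf("result(0, 0) = %d\n", result(0, 0));
    return 0;
}
```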

GPUs Hear a Different Drum

GPU architectures were well represented at Hot Chips, with both AMD and ARM sharing their latest thinking. But perhaps the most informative paper for our purposes was a presentation on an open-source GPU design from the University of Wisconsin at Madison. Intended as a general-purpose GPU rather than as a rendering engine, the cattily named MIAOW architecture leaves out specialized hardware for vertex generation and texture mapping, and thus reveals the underlying parallel computing engine more clearly.

One way to sort out the complexities of GPU architectures is to imagine them not as they actually emerged from increasingly complex graphics chips, but as a series of evolutionary steps away from advanced DSPs like Hexagon. To begin with, suppose you have a very organized, compute-intensive application that can keep lots of multiply-accumulate units running in parallel: an application, say, like computing a color and intensity for each pixel in a high-resolution image. You might want an engine very much like Hexagon, but with many more ALUs, and floating-point rather than fixed-point. The basic SIMD vector organization of the DSP unit is fine; you just need more of it.

An enormously wide vector DSP would be great, but it presents a couple of challenges. First, the machine would be more flexible if, instead of one giant SIMD engine, you defined several big SIMD engines. They could all work together when you had massive data parallelism to exploit, or they could work independently on a number of more modest tasks. For this reason, as well as the practical difficulty of implementing one giant synchronous block in silicon, GPUs are usually divided into a number of relatively independent compute units (CUs), each of which can have its own set of threads. MIAOW as presented (Figure 2) has eight CUs, for example. More, smaller CUs mean spending more real estate on instruction fetch and decode logic than a single giant SIMD machine would, but that can be a good trade-off for the added flexibility.

Figure 2. The MIAOW GPU architecture offers a view into the mysterious innards of GPU hardware.


The next challenge is gate count. All of those vector pipelines eat up a lot of gates, and the register files to support them get big as well. Nor is the power inconsiderable. The SIMD architecture helps by minimizing duplicate instruction fetch and decode logic: in MIAOW there is only one fetch-decode-schedule pipeline for every collection of 64 vector ALUs, one scalar ALU, and one load-store unit. But we still need to save more.

If we compare this CU to a conventional CPU core, some differences are obvious. The CU has many more ALUs than a CPU—even one with a vector processing unit—of course. It also has less control logic: no branch unit or branch prediction, no speculative execution, no register coloring or renaming, little reordering or out-of-order execution control. But mainly, in comparison, there is very little memory: there are no L1 caches, and only a distant L2 shared by all the CUs. There is a small, software-managed scratchpad.

This lean memory approach means that any time you have to use the load/store unit to load or unload registers, you will face either a long latency to L2 or a nearly interminable latency to DRAM. If the application’s pattern of memory use is quite regular, or if the algorithm does many computations per load/store, you can cover these latencies by clever prefetching and multi-threading. When one thread stops to wait for memory, you launch another into the pipeline.
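The prefetching half of that recipe shows up even in ordinary CPU code. The sketch below uses the GCC/Clang __builtin_prefetch intrinsic to request data a fixed number of iterations ahead of its use; the prefetch distance is illustrative, and the multithreading half of the recipe is what the GPU hardware itself supplies.

```cpp
// Latency hiding by software prefetch: ask for data several iterations ahead
// of its use, so arithmetic on current elements overlaps the memory traffic
// for future ones. Assumes GCC or Clang for __builtin_prefetch; the distance
// of 16 iterations is illustrative.
#include <cstddef>
#include <cstdio>
#include <vector>

float dot(const float* a, const float* b, std::size_t n) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        if (i + 16 < n) {
            __builtin_prefetch(&a[i + 16]);   // request future cache lines now
            __builtin_prefetch(&b[i + 16]);
        }
        acc += a[i] * b[i];                   // useful work covers the latency
    }
    return acc;
}

int main() {
    std::vector<float> a(1 << 20, 1.0f), b(1 << 20, 2.0f);
    std::printf("%f\n", dot(a.data(), b.data(), a.size()));
    return 0;
}
```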

The challenge here is that you may have to launch quite a lot of threads to keep the pipeline full until that first load/store completes. Wisconsin associate professor Karu Sankaralingam, who presented the MIAOW paper, said their calculations called for 40 threads—each with its own set of pending instructions and its own vector register bank—to keep the beast fed. Thus the GPU carries hardware-supported multithreading a lot deeper than do CPUs. Notice also that each CU is a 64-wide SIMD machine. When an instruction issues, it issues across all 64 pipelines. That line of 64 identical op-codes, together with mask bits to block activity on unneeded ALUs and registers, is called a wavefront, or a warp, depending on whether you are an AMD or an Nvidia aficionado.
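To make the mask-bit idea concrete, the sketch below is a scalar software model of one 64-lane wavefront stepping through an if/else: every lane walks both sides of the branch, and the execution mask decides which lanes commit results. It is an illustration of the concept, not AMD or Nvidia hardware behavior in detail.

```cpp
// Scalar model of one SIMD wavefront: 64 lanes execute the same op-code, and
// an execution mask decides which lanes commit results. Both sides of a
// branch are walked with complementary masks, which is how divergence costs
// utilization. A conceptual model, not AMD or Nvidia microarchitecture.
#include <bitset>
#include <cstdio>

constexpr int kLanes = 64;

void wavefront_step(float (&regs)[kLanes]) {
    std::bitset<kLanes> exec;                   // execution mask
    for (int lane = 0; lane < kLanes; ++lane)
        exec[lane] = (regs[lane] < 0.0f);       // "if (x < 0)" evaluated per lane

    for (int lane = 0; lane < kLanes; ++lane)   // then-side: masked-in lanes commit
        if (exec[lane]) regs[lane] = -regs[lane];

    for (int lane = 0; lane < kLanes; ++lane)   // else-side: complementary mask
        if (!exec[lane]) regs[lane] *= 0.5f;
}

int main() {
    float regs[kLanes];
    for (int i = 0; i < kLanes; ++i)
        regs[i] = (i % 2) ? -1.0f * i : 1.0f * i;
    wavefront_step(regs);
    std::printf("%f %f %f\n", regs[0], regs[1], regs[2]);
    return 0;
}
```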

Even from this thumbnail description it should be clear that programming a GPU efficiently is non-trivial. Algorithms must be thread-rich, must exploit the very wide SIMD datapaths, and must have memory access patterns that avoid thrashing the shared L2 cache and that keep the DRAM channels—often the scarcest resource in the system—operating efficiently. Even with languages like CUDA, OpenMP, and OpenCL that allow explicit control of parallelism, the programmer has a lot to keep track of. Do it right, and there is no more efficient concentration of ALUs to apply to your task. Do it wrong, and utilization drops and the GPU can slow down to the speed of a fast CPU at many times the power consumption.
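Access pattern is a good example of what there is to keep track of. The two loops below compute the same reduction; the unit-stride version keeps cache lines and memory bursts fully used, while the strided version touches a fresh line for nearly every element. The OpenMP pragmas stand in for whatever explicit-parallelism notation the target uses; this is a CPU-side illustration of the principle, not a GPU kernel.

```cpp
// The same reduction with two access patterns. Both are parallel (OpenMP
// shown for brevity), but the unit-stride version streams through memory
// while the strided version jumps a full row length per element.
#include <cstddef>
#include <cstdio>
#include <vector>

double sum_unit_stride(const std::vector<double>& m, int rows, int cols) {
    double s = 0.0;
    #pragma omp parallel for reduction(+ : s)
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c)
            s += m[static_cast<std::size_t>(r) * cols + c];   // consecutive addresses
    return s;
}

double sum_strided(const std::vector<double>& m, int rows, int cols) {
    double s = 0.0;
    #pragma omp parallel for reduction(+ : s)
    for (int c = 0; c < cols; ++c)
        for (int r = 0; r < rows; ++r)
            s += m[static_cast<std::size_t>(r) * cols + c];   // jumps cols*8 bytes per step
    return s;
}

int main() {
    const int rows = 4096, cols = 4096;
    std::vector<double> m(static_cast<std::size_t>(rows) * cols, 1.0);
    std::printf("%f %f\n", sum_unit_stride(m, rows, cols), sum_strided(m, rows, cols));
    return 0;
}
```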

It is probably also clear that GPUs are not an ideal environment for hard real-time deadlines, fast context switches, or deterministic timing at task level. There is just too much context and too much going on without central control. It is an interesting speculation, however, that by using external signals to control thread priorities it might be possible to create a very capable real-time system.

The Ascent of Many-Core

While DSP cores are incorporating more CPU-like features, and GPUs are seeking more flexibility for their massive SIMD power, another set of chip architects is moving in a third direction. Many-core architects take as their ideal not DSP chips or graphics chips, but rather massively parallel supercomputers. So their chips are large arrays of independent CPU cores, each with its own complete memory hierarchy, embedded in a high-bandwidth interconnect network.

We have already mentioned Kalray in connection with deterministic computing. But perhaps the best example of this thinking for sheer scale was Knights Landing (KL), the second-generation Xeon Phi device. Intended to team with Xeon CPU chips on a data-center server board, KL is a show of force. One die in the multi-die package comprises 36 CPU tiles—each tile including two x86 Atom-derived CPUs, two vector units, a 1 megabyte (MB) L2 cache, and a coherency controller. The die also has six DDR4 DRAM ports. Also in the package are 16 gigabytes (GB) of fast, Hybrid-Memory-Cube-derived multi-channel DRAM (MCDRAM) and a controller for two channels of Intel’s inter-package Omni-Path interconnect fabric.

In a departure from the original Xeon Phi design, each of the CPU cores in KL is fully Xeon-compatible: able to boot itself and run Xeon code. The memory architecture also changes with the new device. Each compute tile has an L2 cache. All 36 L2s are linked through a 2D on-chip mesh, using an aggressive coherency protocol, so they collectively form a huge virtual cache with highly variable latency. This array of L2s can be subdivided to trade size for latency. And the chip can use the in-package MCDRAM as an L3, or as a portion of main memory.
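When the MCDRAM is configured as addressable memory rather than as a cache, software decides which buffers live in it. The sketch below assumes the memkind library’s hbwmalloc interface, one common way of doing that on Xeon Phi systems; the buffer size and the work done on it are placeholders.

```cpp
// Placing a bandwidth-critical buffer in MCDRAM when the chip exposes it as
// a separate memory pool. Assumes the memkind library's hbwmalloc interface;
// the buffer size and the work on it are placeholders.
#include <hbwmalloc.h>
#include <cstddef>
#include <cstdio>

int main() {
    const std::size_t n = 1 << 20;
    double* hot = static_cast<double*>(hbw_malloc(n * sizeof(double)));
    if (!hot) {
        std::printf("no high-bandwidth memory available\n");
        return 1;
    }
    for (std::size_t i = 0; i < n; ++i)
        hot[i] = 1.0;                     // bandwidth-hungry work would go here
    std::printf("%f\n", hot[n - 1]);
    hbw_free(hot);
    return 0;
}
```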

The programming model for KL is both an easy and a hard question. The easy answer is that the device is just a collection of 36 independent dual-core Xeon CPUs, each core having its own 512-bit AVX-512 vector unit, and all sharing a high-bandwidth L3 cache and main memory. Any code that runs on a large number of Xeon cores should work on KL.

The more complex answer is that to get good utilization out of KL, you need very careful thread engineering. To reduce the impact of data-movement latencies, the CPU cores support up to four active threads, each of which gets the full resources of the core. But four threads won’t be enough to absorb the worst-case (maybe even the expected-case) latencies from an L2 miss. Not only will the application need to be rich in threads—ideally, 288 of them—but the determined programmer will need to mind which threads go on which cores in order to minimize L2-to-L2 forwarding latency when threads are sharing data. She will also have to consider the operation of the MCDRAMs and their location in the mesh, to minimize L3 latency and contention for both the MCDRAM and the mesh paths. And of course she will have to ensure efficient use of the DRAM channels.
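The mechanics of that placement are ordinary enough: pin each thread to a chosen core and keep threads that share data near one another in the mesh. Below is a minimal Linux sketch using std::thread and pthread_setaffinity_np; the core numbering and the per-thread work are placeholders, and a real design would derive both from the mesh topology.

```cpp
// Pinning worker threads to chosen cores so that threads sharing data sit
// near each other. Linux-specific (pthread_setaffinity_np); core numbers and
// the per-thread work are placeholders.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE 1
#endif
#include <pthread.h>
#include <sched.h>
#include <cstdio>
#include <thread>
#include <vector>

void pin_to_core(std::thread& t, int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(t.native_handle(), sizeof(set), &set);
}

int main() {
    std::vector<std::thread> workers;
    for (int i = 0; i < 4; ++i) {
        workers.emplace_back([i] {
            std::printf("thread %d doing its share of the work\n", i);
        });
        pin_to_core(workers.back(), i);   // keep cooperating threads on nearby cores
    }
    for (auto& t : workers)
        t.join();
    return 0;
}
```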

An Embarrassment of Riches

The options sketched out at Hot Chips all offer compact, potentially energy-saving ways to accelerate parallelizable computation in advanced embedded systems. But each makes a different set of compromises to achieve high computing density. The approaches consequently differ in many important respects: the number of ALUs available; the balance between ALUs, general-purpose computing, and memory bandwidth; and above all the programming model necessary to achieve the performance, energy efficiency, and determinism you require. What starts out as a simple C program can turn into a tangle of multi-variable optimizations—or what looks to be a forbidding excursion into hardware-description languages. Which approach will prove the most difficult is not obvious at the outset. It has never been more important to consider all your options.

 

 

