Asymmetric multiprocessing (AMP) is on the short list to be the technology acronym of the year for 2016. But what is it, exactly, and why would you want any? More usefully, what are the considerations and challenges in implementing AMP in an embedded system?
Let’s begin with definitions. A symmetric multiprocessing (SMP) system is one in which all the processors are nearly identical below the application level: essentially the same software stacks, instruction sets, memory configurations, and CPU hardware (Figure 1). In most SMP systems the CPUs will be executing different application threads, and individual CPUs may have different peripheral and interrupt-request connections. But otherwise they are all the same.
SMP is heavily used in data centers, where the uniformity allows great flexibility in allocating tasks across tens of thousands of CPU cores. In the embedded world, SMP can increase the speed of thread-rich tasks by running many threads in parallel. It can also be used in redundant systems together with comparison or voting circuitry to increase reliability.
So what is AMP? Simply put, an AMP system is a multiprocessing system in which the processors are not almost identical below the application level. They may differ in operating system (OS), memory, or processing hardware.
SMP offers simplicity and a certain elegance. Why mess it up? Why bind a task to a particular processor? There are several good reasons, according to Michel Chabroux, a product line manager at Wind River, an Intel company. “In most AMP use cases, the object is to maintain separation between tasks,” Chabroux says. For example, an architect may be consolidating tasks, some of which have real-time deadlines. The designers may choose to use two CPU cores, one running Linux and one a real-time OS (RTOS).
Another situation arises when physical separation of cores is important. For example, designers may move a latency-critical task onto a separate core to protect it from system-level interrupts. Or, Chabroux offers, the safety-monitor task in a self-driving car may be on a separate CPU core to ensure that it continues running even if the rest of the system crashes.
A third motivation is the need for specialized hardware: when there is a task that can’t meet its requirements running on an instance of the main CPU core. ARM’s big.LITTLE technology is an example. By providing two binary-compatible CPU cores—one slow and very low-power, the other fast but power-hungry—big.LITTLE allows supervisory code to move tasks at will to optimize for either performance or energy consumption. The result can be a system that both meets its performance requirements and consumes very little energy.
But often, tasks in an AMP system will not be portable—the processors will be of fundamentally different kinds. Examples include hardware accelerators, such as GPUs, FPGAs, and the function-specific accelerators found in most application-specific SoCs.
No matter the motivation for choosing AMP, there are some central issues—common to any multiprocessing system, actually—that will dictate the implementation. These include how tasks will be executed, how tasks will be controlled, how data will move through the system, and how tasks will access the outside world.
Your choice of hardware for the processors comes out of the original design objectives. If the goal is simply to physically separate some tasks from the rest of the system, then usually the easiest approach will be multiple instances of the same CPU core, but with some cores running different operating systems from others. This could mean different builds of the same OS—for example, two Linux kernels, with one handling all the system calls for both. Or it could mean quite different environments—say, Linux on one core and an RTOS or a bare-metal application on another.
System constraints may dictate differences not just in OS, but in hardware. For instance, a task that in previous generations had run on a particular microcontroller unit (MCU) core may best stay on that core. It is relatively easy to find intellectual property (IP) for legacy MCU implementations on ASICs or FPGAs. And no one really wants to reverse-engineer an ancient file of 8051 assembly code to rewrite it in C for a 64-bit ARM® core.
Timing, power, and energy may also be reasons to turn to heterogeneity. Sometimes just isolating a task from system interrupts is not enough: you also need a CPU with shorter, deterministic task latency. Thus it might make sense under stringent latency constraints to move a control loop off of the main ARM Cortex-A core onto a separate Cortex-R core. And, as previously mentioned, power or energy constraints may mandate a special core for a particular task, such as a low-power core for an undemanding but persistent task, or a very fast, power-gated core for a bursty, compute-intensive task.
Often, though, the issue will be raw performance for a compute-intensive task. That takes us into the world of hardware accelerators (Figure 2). These may be programmable subsystems such as digital signal processing (DSP) cores or graphics processing units (GPUs). They may be fixed-function accelerators—crypto engines, protocol offload engines, or vision processors, for example. Or they may be custom parallel or pipelined engines in FPGAs.
Matters of Control
How to control a task executing on another processor is always a key issue in multiprocessing. There are obvious issues, such as initializing the task, starting and stopping it, and exchanging messages with it. And there are less obvious questions, such as getting status information, possibly passing interrupts to the task, handling exceptions, and—crucially—providing adequate observability and controllability for multiprocessor debug.
In an SMP system these issues are often addressed by an SMP OS. Since there would be a nearly-identical OS instance on each processor, they can just message each other. In an AMP system, in which the execution environment on each processor may be quite different, things get more complicated. There are efforts such as the Multicore Association’s OpenAMP to provide a homogeneous hardware adaptation layer between the processors and a variety of operating systems, creating a common set of resources for inter-task communications—in OpenAMP’s case, based on the association’s Multicore Communications API (MCAPI). Similarly, there are Type-1 hypervisors that will run on bare metal on the various processors and present a set of well-behaved virtual machines to the various operating systems. But still some of the work of implementing control may land on your desk, in the form of a specification for bare-metal functions you must implement on your processors.
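As a concrete illustration of the control plumbing that can land on your desk, here is a minimal bare-metal command mailbox sketched in C. This is not the OpenAMP or MCAPI API: the layout, command codes, and function names are all hypothetical, and a real system would add memory barriers and a doorbell interrupt rather than relying on polling and plain memory.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical control mailbox in memory both processors can see.
 * A real system would place it at a fixed physical address and add
 * barriers or a doorbell interrupt; plain memory keeps the sketch simple. */
enum { CMD_NONE = 0, CMD_START = 1, CMD_STOP = 2 };
enum { ST_IDLE = 0, ST_RUNNING = 1 };

typedef struct {
    volatile uint32_t command; /* written by the main CPU */
    volatile uint32_t status;  /* written by the remote processor */
} ctrl_mailbox;

/* Main-CPU side: post a command for the remote processor to consume. */
void ctrl_post(ctrl_mailbox *mb, uint32_t cmd) {
    mb->command = cmd;
}

/* Remote side: consume one pending command, update status, clear the slot. */
void ctrl_service(ctrl_mailbox *mb) {
    uint32_t cmd = mb->command;
    if (cmd == CMD_START)      mb->status = ST_RUNNING;
    else if (cmd == CMD_STOP)  mb->status = ST_IDLE;
    mb->command = CMD_NONE;    /* acknowledge by clearing the slot */
}
```

The same shared block could carry the less obvious traffic as well: exception codes, forwarded interrupts, or debug-visibility counters, each as another pair of one-writer fields.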
You can look at these requirements from at least two different points of view: what the application sees, and what the silicon sees. The application may see each task in the system as autonomous, exchanging information and synchronizing through a multiprocessing application programming interface (API). Or, if there is a clear hierarchy of control, the main program may see tasks on the other processors as callable functions, or even as I/O operations accessed through a device driver.
Each of these ways of relating to tasks on other processors suggests—but does not mandate—a particular hardware implementation. If the tasks are autonomous, the obvious implementation would be a non-coherent shared-memory system with a mechanism for message passing, perhaps augmented by large private memories attached to some of the processors. If the tasks may be working on the same data structure concurrently, a coherent shared-memory system might be advisable. Such systems are readily supported by most commercial CPU cores, but could be a challenge if you are developing an accelerator that doesn’t natively support shared-memory management or cache coherency.
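One common shape for that message-passing mechanism is a single-producer/single-consumer ring in shared memory: because each index has exactly one writer, no lock is needed. The C sketch below is a host-runnable model under stated assumptions; on a real non-coherent interconnect each side would also need cache flush/invalidate operations and memory barriers around the index updates.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Single-producer/single-consumer message ring in shared memory.
 * The producer writes only `head`, the consumer writes only `tail`. */
#define RING_SLOTS 8u   /* must be a power of two */

typedef struct {
    volatile uint32_t head;       /* next slot to fill (producer) */
    volatile uint32_t tail;       /* next slot to drain (consumer) */
    uint32_t msg[RING_SLOTS];
} msg_ring;

bool ring_send(msg_ring *r, uint32_t m) {
    if (r->head - r->tail == RING_SLOTS) return false;   /* full */
    r->msg[r->head & (RING_SLOTS - 1)] = m;
    r->head++;                    /* publish only after the payload lands */
    return true;
}

bool ring_recv(msg_ring *r, uint32_t *m) {
    if (r->head == r->tail) return false;                /* empty */
    *m = r->msg[r->tail & (RING_SLOTS - 1)];
    r->tail++;
    return true;
}
```

The free-running indices deliberately wrap modulo 2^32; the `head - tail` subtraction stays correct across the wrap, which is why the slot count must be a power of two.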
Treating subsidiary tasks as functions can simplify things, suggesting a hardware implementation less complex than shared memory. Processors that execute the functions could be attached to a high-bandwidth silicon interconnect like AMBA® AXI™ or an off-die bus such as PCI Express® (PCIe®), exposing local memory and control/status registers to the main CPU (Figure 2). Taking one step further, if the task is treated as an I/O operation, the processor can reside on a peripheral bus like AMBA APB™.
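Seen from the main CPU, such a function-style processor boils down to a handful of control and status registers. The register map below is entirely hypothetical; on real hardware `acc` would point at a fixed physical address exposed over AXI or a PCIe BAR, while here an ordinary struct plus a software stand-in for the device keeps the sketch runnable anywhere.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical register map for a function-style accelerator. */
typedef struct {
    volatile uint32_t operand_a;
    volatile uint32_t operand_b;
    volatile uint32_t control;   /* write 1 to start an operation */
    volatile uint32_t status;    /* 1 while busy, 0 when done */
    volatile uint32_t result;
} acc_regs;

/* Main-CPU view: kick off the remote operation... */
void acc_start(acc_regs *acc, uint32_t a, uint32_t b) {
    acc->operand_a = a;
    acc->operand_b = b;
    acc->status  = 1;    /* real hardware would raise busy itself */
    acc->control = 1;
}

/* ...and poll for completion like a non-blocking function call. */
int acc_done(const acc_regs *acc) {
    return acc->status == 0;
}

/* Software stand-in for the accelerator: one "clock" of device work. */
void acc_model_step(acc_regs *acc) {
    if (acc->control == 1) {
        acc->result  = acc->operand_a * acc->operand_b;
        acc->control = 0;
        acc->status  = 0;    /* signal done */
    }
}
```

Wrapping `acc_start`/`acc_done` behind a device driver turns the same registers into the I/O-operation view described above.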
But there is no necessary relationship between the application’s view of a task and the way the hardware is physically connected. If hardware designers take the simplest approach that can deliver adequate latency and bandwidth, software can emulate whatever application-level view is desired.
Data is Critical
One of the most critical decisions in implementing an AMP system is how the processors will fit into the system memory hierarchy. This decision should be driven primarily by the way the tasks on the various processors touch data. In an SMP system, the default choice is a cache-coherent, shared-memory organization in which tasks running on any processor have the same access to a single shared memory space. But in an AMP system, where some tasks only run on a particular processor, there is often the opportunity to tune the memory architecture to the access patterns of the individual tasks.
This tuning depends on the way a task uses memory: in particular on locality of reference, shifts in access over time, and required bandwidth. In the ideal case for conventional shared-memory systems, the task picks up a relatively small set of contiguous data—small enough to fit in its L1 or L2 data cache—works on that set intensively for a while, and then moves on to a different nearby set of data. This pattern allows almost all the loads and stores to hit local caches.
Unfortunately, many important algorithms are less than cooperative. One recent paper has estimated that big-data analysis tasks can experience cache miss rates over 90 percent. Applications that use very large tables or linked lists can show very scattered access patterns as well. In some of these cases it can make sense for a processor running a task to have its own private, very large local memory. This memory may be managed as a cache, but often it may be better managed explicitly in software. Sometimes caches are simply unable to reduce average access time, and you have to find other ways to hide memory-access latency, such as deep multithreading. Such ideas open the door, by the way, to new concepts such as large flash arrays attached directly to processors as local memory.
A special case occurs when data streams continuously into a task, is used over only a short sequence of operations, and is then passed on or flushed. Such situations arise in network packet switching, in signal processing, and in implementing transfer functions in control systems, for example. The best implementation may be streaming direct memory access (DMA) directly into and out of the processor’s local memory, bypassing the main CPU and memory altogether and allowing the streaming task to run almost autonomously from the main CPU.
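This streaming arrangement is usually built on ping-pong (double) buffering: the DMA engine fills one local buffer while the processor drains the other. In the hedged C sketch below, `memcpy` stands in for the DMA transfer and the buffer size is arbitrary, so the model runs on any host.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BUF_LEN 4   /* arbitrary block size for the sketch */

/* Ping-pong buffer pair in the streaming processor's local memory. */
typedef struct {
    int32_t buf[2][BUF_LEN];
    int fill;        /* index of the buffer the DMA fills next */
} stream_ctx;

/* "DMA" one block of input into the current fill buffer
 * (memcpy stands in for the hardware transfer). */
void stream_dma_in(stream_ctx *s, const int32_t *src) {
    memcpy(s->buf[s->fill], src, sizeof(s->buf[0]));
}

/* Drain the freshly filled buffer (here: just accumulate it),
 * then swap so the next DMA transfer targets the other buffer. */
int32_t stream_process(stream_ctx *s) {
    int32_t sum = 0;
    int idx = s->fill;
    for (int i = 0; i < BUF_LEN; i++) sum += s->buf[idx][i];
    s->fill ^= 1;
    return sum;
}
```

In a real system the fill and drain phases overlap in time, with the DMA completion interrupt driving the swap; serializing them here keeps the sketch deterministic and testable.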
That brings up a last point: how do AMP processors relate to the outside world? A processor likely has I/O bus connections to the system, allowing I/O register control and status transactions at least for initialization and exception recovery. In the case of streams processing, or if a processor is doing real-time, interrupt-driven processing of real-world events, that processor may have direct I/O connections to the outside world. But more often, the interrupts and I/O transactions in an AMP system will go to the central CPU, which will then buffer data through main memory.
Open for Virtualization
This rather vast range of hardware-level alternatives illustrates the great strength of AMP: you can tailor a hardware and OS environment for each of the most demanding tasks in your system. But the same range points to AMP’s greatest risk: without care, every important task in your system could face a different environment for execution, communication, and debug, and a different memory model.
Here is where standards like OpenAMP can help. So can an embedded hypervisor.
“Think of a hypervisor as an RTOS that just schedules virtual machines and allows them to talk to each other,” Chabroux advises. In addition to setting up MMUs, the hypervisor can bind the correct code to the correct processor. It can create virtual memory, device, and network connections. It can instantiate soft processors in FPGAs. And it can offer a uniform means for tasks to communicate with each other.
All these services can make AMP systems software-defined, so that each workload may see the virtual system it desires. But there are costs. Hypervisors consume CPU cycles, memory, and power. They can add latency to critical paths, such as interrupt response times. And, as Chabroux notes, to avoid exploding software complexity, hypervisors need hardware support. For example, multithreading in CPUs, live partial reconfigurability in FPGAs, and registers to support multiple active channels in DRAM controllers, bus controllers, and DMA controllers all dramatically reduce the software complexity and latency of the hypervisor.
With or without a hypervisor, AMP can be the best route to meeting your system requirements. But it is still very much a matter of, as the flat-pack furniture folks would say, some assembly required.
For Further Reading
Read a discussion of SMP and AMP implementation choices in an SoC FPGA.
Get an overview of a soft RISC CPU core for AMP in FPGAs.