You could substantially increase the performance of your system with one little change. Or you could take a big bite out of energy consumption. Or you could greatly strengthen the security of your system. All with one little change of habit. If you are a car or bicycle enthusiast, you probably pay attention to custom cars or bespoke bicycle frames. So why not consider a customized CPU (Figure 1)?
Granted, designing a CPU from scratch, along with providing all the necessary verification tools, software development suites, and operating system ports, is out of the question for most design teams. Even mighty Apple starts out with ARM* architecture licenses and infrastructure. But modifying an existing CPU can be a practical, even an important, choice even for design teams of modest means.
So how do you evaluate this alternative? Where do you start, what customizations are feasible, and what can you accomplish? Read on.
Knowledge Comes First
Before you can improve your system’s performance, power or energy efficiency, or security, you need an accurate, quantitative system model. For some decisions, accurate means functionally correct. For others, you will need a cycle-accurate model, which for many teams means a working prototype.
You need this level of detail because of the kinds of questions you will be asking—about code hot spots, cache activity, memory performance, and multitasking/multithreading behavior. You will be profiling to locate overstressed and underutilized resources. But if you find hot or cold spots, what can you do about them—especially if you aren’t designing your own SoC? A lot, as it happens.
The useful customizations you can make to an existing CPU span a wide range of invasiveness, difficulty, cost, and effectiveness. Some, like attaching an external accelerator or power-gating an unused functional block, can often be done on standard-product CPUs, even for small customers. Other mods that alter the layout of the CPU die—like changing cache sizes—obviously require a large up-front payment or volume commitment, even if they aren’t technically challenging. Still other ideas, such as implementing a configurable processor in an ASIC or modifying data paths in a CPU core, require relatively more expertise.
The decision comes down to a cost/benefit analysis with a number of independent factors: specifically what you are trying to optimize; your design team’s skills, budget, and schedule; and the business case you can put before your CPU vendor.
Starting with Performance
System performance is probably the most common reason designers look critically at their processors. Generally the symptom is timing: a hard deadline is not met, or a buffer overflows. Somewhere a task is not executing fast enough. In the data center world, the symptom may be entirely different: a workload may be lingering on the servers longer than expected. Again, same problem.
The normal responses to these symptoms are quick fixes: first, lean on the software team to improve their code, and if that fails get a faster CPU, or find a way to divide more threads among more cores. But faster chips or more cores may not be necessary, and may not in fact solve the problem. Careful code profiling—which is probably where the software team will start when you lean on them—is where CPU decisions need to start too.
Almost always, profiling the offending code segment will reveal that the vast majority of the time is consumed in one or a few small code kernels. If you can make these go faster, you are done. But before looking at accelerating instruction execution, you have one more important question. And this is where that cycle-accurate model comes in.
The question is simple: does the time actually spent in your critical loop make sense given the number of instructions you executed? If so, you are execution-bound. If not, that is, if your CPU is spending a lot of cycles waiting for cache or DRAM, then the CPU core may not be your problem at all. You may be waiting for a thrashing cache or a bottlenecked DRAM port. Whether you are CPU-bound or memory-bound, some small modifications to the CPU could be the best solution.
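As a rough illustration of that check, the Python sketch below classifies a profiled loop from three counters. The names `LoopProfile` and `classify`, the counter fields, and the 50 percent stall threshold are assumptions made for this example and do not come from any particular profiling tool; the counters you can actually read, and a sensible threshold, depend on your CPU.

```python
from dataclasses import dataclass


@dataclass
class LoopProfile:
    """Hypothetical counters gathered for one critical loop."""
    instructions: int         # instructions retired in the loop
    cycles: int               # total core cycles spent in the loop
    memory_stall_cycles: int  # cycles stalled waiting on cache or DRAM


def classify(profile, stall_threshold=0.5):
    """Return (label, cycles per instruction, stall fraction).

    The loop is called memory-bound when at least `stall_threshold`
    of its cycles are memory stalls; the threshold is an assumption.
    """
    cpi = profile.cycles / profile.instructions
    stall_fraction = profile.memory_stall_cycles / profile.cycles
    if stall_fraction >= stall_threshold:
        label = "memory-bound"
    else:
        label = "execution-bound"
    return label, cpi, stall_fraction


# Two made-up profiles with the same instruction count.
busy_loop = LoopProfile(instructions=1_000_000, cycles=1_200_000,
                        memory_stall_cycles=100_000)
waiting_loop = LoopProfile(instructions=1_000_000, cycles=6_000_000,
                           memory_stall_cycles=4_800_000)

print(classify(busy_loop))
print(classify(waiting_loop))
```

Both loops retire the same number of instructions, but the second spends most of its cycles stalled, so speeding up instruction execution would do little for it.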
Let’s look at the CPU-bound case first. The objective is simple: profiling shows that you need to execute a code segment in less time. There are several approaches, ranging from non-invasive to difficult.
The easy way out is just turning up the CPU clock, if you aren’t already running at the maximum frequency for the fastest speed grade you can use. If you are using a modern high-performance multicore CPU chip, there is another variable involved as well. The CPU itself may be throttling the clock on the core you are using. You may want to see if suspending tasks on other cores or increasing the cooling will give you a higher actual clock frequency.
If that won’t get you there, an external hardware accelerator may be your answer (Figure 2). Such an accelerator will normally be an FPGA, unless you have the time, expertise, and volume to justify an ASIC design. If the function you are accelerating is so standardized that there is off-the-shelf IP available for it, designing the accelerator may be relatively easy. If not, you will have to spend some time analyzing your code. Once you understand how the code works, you can unroll loops into parallel structures, you can pipeline sequences of operations, you can compress multi-instruction sequences into single clock cycles, and you can separate independent operations into parallel paths. The result of these transformations, combined with the elimination of the instruction fetching, loading and storing a CPU must go through, should give you a significant performance gain.
Connecting your accelerator to the CPU also requires some thought. It is easy to give back to communications overhead all the performance you gained from acceleration. Once again, you have to understand the code. If your function works on streaming data, or if it takes in a block, processes it intensively with little reference to other data, and then returns a result, then a loosely coupled accelerator connected via PCI Express* (PCIe*) should be sufficient. The data-transfer time should not offset your acceleration gains. But if the operations on the accelerator are entangled with code still running on the CPU cores, you may want a coherent connection to the CPU cache bus. Some high-end CPUs bring such a connection out to external pins, but the protocol may or may not be public.
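The trade-off between acceleration and communications overhead can be estimated before any hardware is built. The sketch below is a simple Amdahl-style model, not a measurement method: the function name `offload_speedup` and all the numbers are hypothetical, and it assumes transfers are not overlapped with computation.

```python
def offload_speedup(total_time, kernel_fraction, kernel_speedup, transfer_time):
    """Estimate overall speedup from moving one kernel to an accelerator.

    total_time      : original run time of the task on the CPU
    kernel_fraction : share of total_time spent in the offloaded kernel (0..1)
    kernel_speedup  : how much faster the accelerator runs the kernel
    transfer_time   : time added to move data to and from the accelerator
    All times use the same unit. Transfers are assumed not to overlap
    with computation, which is a pessimistic simplification.
    """
    kernel_time = total_time * kernel_fraction
    remaining_time = total_time - kernel_time
    new_time = remaining_time + kernel_time / kernel_speedup + transfer_time
    return total_time / new_time


# Hypothetical task: 10 ms total, 80% in the kernel, accelerator 20x faster.
cheap_link = offload_speedup(10.0, 0.8, 20.0, transfer_time=0.5)
costly_link = offload_speedup(10.0, 0.8, 20.0, transfer_time=8.0)

print(f"small transfer cost: {cheap_link:.2f}x")
print(f"large transfer cost: {costly_link:.2f}x")
```

With these made-up numbers, the same 20x kernel speedup yields a net gain when transfers are cheap and a net loss when they take as long as the original kernel did.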
The ultimate in close coupling is to modify the CPU core itself to add instructions, which means changing dispatch, decode, and pipeline hardware. This allows you to replace a frequently occurring sequence of instructions with a single instruction, potentially eliminating many fetches and register or cache transactions without any overhead for communicating with an external accelerator. This level of customization is not accessible if you are using an off-the-shelf CPU. Even vendors who may allow big customers to add or delete major blocks on chips will hesitate to modify critical logic within blocks. They simply won’t undertake the necessary logic verification, regression testing, relayout, physical and timing verification, rule checking, and so on. The closest you can come is to buy an ARM architecture license and implement your own core.
But there are other good options. If you are designing an ASIC, both Cadence and Synopsys offer configurable RISC core platforms: the Tensilica and ARC products respectively. These are not just libraries of RTL, but rather custom CPU generators. You tell the platform what you want, and the tool set generates the RTL, simulation models, test benches, and software development tools necessary for a core with your customizations included. For some tasks like audio signal processing or video analytics you can even license preconfigured cores off the shelf.
If you aren’t doing an ASIC, there are still options with FPGAs. Most ARM cores, like the ARC and Tensilica designs, are not well suited for FPGA implementation. But both Intel and Xilinx offer compact, configurable RISC cores optimized for their FPGAs. And recently, open-source RISC-V cores have become popular. These cores are compact enough to implement in small FPGA families, in effect making a user-definable microcontroller. They can also be combined in many-core arrays to create quite formidable processing systems. And they are configurable, scaling from very simple to large, quite complex cores and easily taking on accelerators written in RTL code.
Is Memory the Problem?
What if the profiler says your CPU is not overworked, but idle? Again you have options. You may be able to reorganize or enlarge an L1 or L2 data cache so that the entire working set for your offending loop fits inside. Or you may be able to simply lock a portion of a cache or add a tightly-coupled local RAM to keep a block of code or data permanently resident. Any of these moves can reduce cache thrashing and ease the pressure on DRAM channels.
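To see why fitting the whole working set matters so much, consider the toy model below. It assumes a fully associative cache with least-recently-used replacement and a loop that sweeps its working set in order on every pass; real caches are set-associative and real access patterns are messier, so treat this as an illustration only. The function name `lru_hit_rate` and the sizes are hypothetical.

```python
from collections import OrderedDict


def lru_hit_rate(cache_lines, working_set_lines, passes):
    """Hit rate of a fully associative LRU cache for a loop that touches
    lines 0..working_set_lines-1 in order, `passes` times."""
    cache = OrderedDict()  # keys are line numbers, ordered oldest first
    hits = 0
    accesses = 0
    for _ in range(passes):
        for line in range(working_set_lines):
            accesses += 1
            if line in cache:
                hits += 1
                cache.move_to_end(line)        # mark as most recently used
            else:
                if len(cache) >= cache_lines:
                    cache.popitem(last=False)  # evict least recently used
                cache[line] = True
    return hits / accesses


# Hypothetical 512-line cache, loop repeated 10 times.
fits = lru_hit_rate(cache_lines=512, working_set_lines=512, passes=10)
one_line_too_big = lru_hit_rate(cache_lines=512, working_set_lines=513, passes=10)

print(f"working set fits:        {fits:.2f}")
print(f"one line over capacity:  {one_line_too_big:.2f}")
```

In this model a working set just one line larger than the cache evicts each line shortly before it is needed again, which is the thrashing pattern that enlarging, locking, or supplementing the cache is meant to break.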
If you are struggling with DRAM bandwidth even after cache optimization, you might want to consider a more sophisticated DRAM controller: one that can group, reorder, and prioritize DRAM requests from multiple clients in order to minimize page misses. Such techniques can increase DRAM effective bandwidth by a factor of 10 or more. If that can’t help, you can add more DRAM channels, either interleaving pages or dedicating a pool of DRAM and a controller to one problem data set.
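The effect of grouping requests can be sketched with a very small model. The code below assumes one open row per DRAM bank and counts a page miss whenever a request targets a different row than the one currently open. The function names, the `(bank, row)` request format, and the reordering window are assumptions for this example; a real controller must also honor read/write ordering, timing constraints, and fairness between clients, all of which are ignored here.

```python
def count_page_misses(requests):
    """Count row-buffer misses for a sequence of (bank, row) requests,
    assuming each bank keeps exactly one row open."""
    open_row = {}
    misses = 0
    for bank, row in requests:
        if open_row.get(bank) != row:
            misses += 1
            open_row[bank] = row
    return misses


def group_by_row(requests, window):
    """Reorder requests inside each window so that requests to the same
    (bank, row) are issued back to back, in order of first appearance."""
    reordered = []
    for start in range(0, len(requests), window):
        groups = {}  # dict keeps insertion order
        for request in requests[start:start + window]:
            groups.setdefault(request, []).append(request)
        for group in groups.values():
            reordered.extend(group)
    return reordered


# Two hypothetical clients alternating between two rows of the same bank.
stream = [(0, 1), (0, 2)] * 4

in_order = count_page_misses(stream)
grouped = count_page_misses(group_by_row(stream, window=8))

print(f"in arrival order: {in_order} page misses")
print(f"grouped by row:   {grouped} page misses")
```

This interleaved pattern is a worst case for in-order service, so the improvement shown is larger than a typical workload would see.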
The good news is that if your intended volume is high, modifications to cache architecture or DRAM channels may be possible even on standard-product CPUs and ASSPs. No vendor publicly discloses this willingness, but it is an open secret that standard chips are sometimes retuned for specific data-center or embedded workloads. If you are doing your own ASIC or FPGA, cache and DRAM-channel optimization is just part of the job.
What About Power?
It might seem odd to think of CPU customization as a way of saving system energy or power. But there are two important strategies in this category: turning off the lights, and shifting tasks from software to hardware.
The first strategy is simple: when you don’t need a block, don’t include it—or at least disconnect it. Just as vendors can sometimes be induced to add cores or resize caches, they can sometimes be induced to leave blocks out or to shrink caches. Maybe you only need three cores instead of four. Or maybe the way you have allocated tasks, only one core needs a vector unit. Or you are processing streaming data, and don’t need data cache at all. If you know your workload and you can make a business case to the chip vendor, you may get a custom layout with just the blocks you need. Or you may be able to have unneeded circuitry disconnected, eliminating both its static and dynamic power consumption. Of course if you are taking the FPGA or ASIC route, these options are just part of the design job.
Less intuitive is the idea that you can save energy or power by adding hardware. This becomes clearer when you look at where the energy actually goes in a computation.
Consider a simple integer add. You fetch an instruction from L1 i-cache. You queue, decode, and dispatch it. The instruction generates a register read or two, maybe requiring address translations to deal with register renaming, a trip through an arithmetic unit, and writes to a write buffer, a general register, and a condition-code register, maybe with more address translation. With today’s speculative, out-of-order, multithreaded CPUs, this is a very simplified picture. The point is that only a tiny fraction of the energy consumed actually goes into adding the two numbers in the arithmetic logic unit (ALU). The rest is overhead.
In contrast, a custom accelerator dispenses with fetching and decoding instructions. It passes interim results through a pipeline rather than writing them back to general registers. It carries no overhead for cache management, multithreading, out-of-order execution or speculation. In short, the accelerator can be far more efficient because it is spending so little of its energy on overhead.
Another potential advantage for accelerators comes from the difference between energy and power. In some applications, notably Internet of Things (IoT) endpoint devices, duty cycles can be quite low, and aggressive sleep modes can eliminate virtually all energy consumption when the node is sleeping. Thus total energy draw is primarily dependent on duty cycle, not instantaneous power consumption. If you can use an accelerator to crunch a block of accumulated data quickly, so the endpoint can wake up, process the block, transmit results, and go back to sleep faster, then the accelerator may save significantly on energy consumption despite its higher power draw, perhaps even making operation on scavenged power possible.
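The arithmetic behind that argument is simple: energy per wake period is active power times active time, plus sleep power times the remaining time. The sketch below works through it with entirely hypothetical figures for power, processing time, and wake interval; only the relationship between the quantities is meant to carry over.

```python
def energy_per_period(active_power_w, active_time_s, sleep_power_w, period_s):
    """Energy in joules used over one wake/sleep period."""
    sleep_time_s = period_s - active_time_s
    return active_power_w * active_time_s + sleep_power_w * sleep_time_s


PERIOD_S = 60.0        # hypothetical: node wakes once a minute
SLEEP_POWER_W = 5e-6   # hypothetical: 5 microwatts asleep

# Hypothetical CPU-only node: 50 mW for 200 ms per wake-up.
cpu_only_j = energy_per_period(0.050, 0.200, SLEEP_POWER_W, PERIOD_S)

# Hypothetical node with accelerator: 150 mW, but done in 20 ms.
with_accel_j = energy_per_period(0.150, 0.020, SLEEP_POWER_W, PERIOD_S)

print(f"CPU only:         {cpu_only_j * 1e3:.3f} mJ per period")
print(f"with accelerator: {with_accel_j * 1e3:.3f} mJ per period")
```

With these assumed figures, the accelerated node draws three times the power while awake yet uses roughly a third of the energy per period, because it is awake for a tenth of the time.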
Some design teams have exploited another benefit of customized processors. If you develop a proprietary accelerator or add hardware to extend the CPU instruction set, you obfuscate—and sometimes totally conceal—your algorithm and its implementation. This is a useful tactic for protecting trade secrets. But it becomes vital for securing a system against a skilled and persistent attacker.
And that brings us to a final point. Often standard CPUs lack hardware features that are necessary to system security. Physically unclonable functions (PUFs), hardware-secured key stores, formally verified atomic operations, tamper detection, and unsnoopable encryption engines will be increasingly demanded in mission-critical operations as cyber attacks become bolder and more damaging. An external FPGA or ASIC functioning as a chip-level hardware security module (HSM) (Figure 3) may be the best interim solution.
We have seen that for performance improvement, for power or energy reduction, and even for security, customizing your processor can be an important alternative. Even if you don’t have the clout of an Amazon or a Facebook, modest ASIC accelerators, custom CPU generation platforms, and FPGAs all offer good options. You aren’t stuck with buying what’s on the shelf.