It is a scenario many embedded-systems designers recognize. An existing design needs an update. That could include Internet connectivity to bring the system into the Internet of Things (IoT). Along with that might come requirements for deeper security. And given the current enthusiasm for all things artificially intelligent, there may be new needs for deep-learning inference or machine vision.
An immediate concern is the impact these changes will have on the system hardware. All of them can be subdued by throwing CPUs at them (Figure 1). But for the smaller embedded design—probably squeezed within an inch of its life to begin with—cost, power, and cooling constraints can make brute force a non-starter. Replacing an aging 32-bit microcontroller—or worse, a legacy microprocessor that makes Smithsonian curators drool—with a server-class CPU or an advanced smart-phone SoC may simply not be feasible.
At this point it is important to look at FPGAs. In fact there is often an old, small FPGA already in the system doing utility work: acting as a port expander or device controller. But a modern low-end FPGA may be able to act as a hardware accelerator to pull the new computing requirements back within the reach of the existing system processor.
Taking the Next Step
But let us go one step more. Does it make sense to absorb the system’s (or a subsystem’s) CPU or microcontroller unit (MCU) into the FPGA as well? The obvious answer is “of course not.” Everyone knows that a soft CPU core in an FPGA will be huge, slow, and expensive. Except that for an important class of embedded systems those generalizations are simply untrue.
We are not talking about systems that already have very substantial CPU horsepower, such as a cluster of Arm* Cortex*-A53 cores. There are midrange FPGAs that include such a CPU cluster in hardware, but they are the subject for a different article. We are talking about systems—or subsystems within the overall design—where the processor is more modest: a Cortex-M class core in a microcontroller, say, or a real legacy CPU like a 68000. Often such older processors can get stranded in a system design, left there from generation to generation out of reluctance to touch ancient, ill-documented code until end-of-life forces someone’s hand. We will show that often it can make great sense to absorb such smaller or older CPUs into even a low-end FPGA (Figure 2).
Just how this absorption can happen depends mainly not on the CPU but on the legacy code. The key question is whether you have the original source code, and in what form. If you have a heavily documented source in C or C++, ideally with the original test bench, you are in an excellent position. You can choose from the entire range of CPU core options available for soft implementation in FPGAs, which, as we will shortly see, is considerable. Then you can recompile and test the code for the CPU you have selected.
Unfortunately, this won’t always be the case. Historically, compilers for microprocessors weren’t always adequate for embedded design—especially for those subsystems with real-time constraints. Very old code—or code written by very conservative engineers—may be entirely in assembly language. More recent code is likely to be mainly in C, with critical routines hand-coded in assembly. Either way, at least some of the code is locked into a particular instruction-set architecture.
A second, closely related consideration is the degree of hardware independence, not in the language, but in the coding style. Bad practices like embedding interrupt handlers, drivers, and physical I/O addresses in application code used to be considered clever, back when the code space and latency they saved were of major importance. They can make porting to new hardware more difficult. Really bad ideas such as writing timing-dependent code used to be thought even cleverer. Such code may have to be mostly rewritten to run on vastly faster modern hardware. But even given assembly language source code and questionable coding style, there may still be practical ways to incorporate the legacy module into an FPGA.
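To see why timing-dependent code is so fragile, consider a software delay loop whose iteration count was tuned by hand on the legacy part. The sketch below simply scales that delay to a faster core. Every number in it (loop count, cycles per iteration, clock rates) is a hypothetical value chosen for illustration, not a measurement of any real device.

```python
def delay_seconds(iterations, cycles_per_iteration, clock_hz):
    """Wall-clock duration of a busy-wait loop on a given CPU."""
    return iterations * cycles_per_iteration / clock_hz


# All values below are assumptions made up for this illustration.
LOOP_COUNT = 8_000             # iteration count tuned on the legacy CPU
LEGACY_CYCLES_PER_ITER = 10    # assumed cost of one loop pass on the old CPU
LEGACY_CLOCK_HZ = 8_000_000    # assumed 8 MHz legacy clock
SOFT_CYCLES_PER_ITER = 2       # assumed cost of one pass on a pipelined soft core
SOFT_CLOCK_HZ = 100_000_000    # assumed 100 MHz soft-core clock

legacy_delay = delay_seconds(LOOP_COUNT, LEGACY_CYCLES_PER_ITER, LEGACY_CLOCK_HZ)
ported_delay = delay_seconds(LOOP_COUNT, SOFT_CYCLES_PER_ITER, SOFT_CLOCK_HZ)
shrink_factor = legacy_delay / ported_delay

print(f"legacy: {legacy_delay * 1e3:.2f} ms, ported: {ported_delay * 1e6:.0f} us, "
      f"{shrink_factor:.1f}x shorter")
```

Under these assumptions a loop that once waited about 10 ms now waits about 160 microseconds. The usual remedy is to replace the loop with a wait on a hardware timer, which is easy to add in FPGA fabric.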
One approach, if the CPU in question is genuinely ancient, is to use an open-source register transfer level (RTL) model to reimplement the legacy microprocessor or microcontroller in your FPGA. Repositories such as GitHub host Verilog models of many legacy processors, including the 6502, Z80, 6809, 68000, and 8086. But there are several issues to consider before designing in one of these cores.
The first question is legal. Just because the Verilog is available doesn’t mean you have the legal right to use the design in a commercial product. Some of these models were coded for researchers or hobbyists, with no thought to intellectual property rights. Some architectures from long ago and far away may actually be in the public domain. Others may not—anything designed by Arm being a case in point. It is your responsibility to find out.
The next question is the intent of the writer. Is the Verilog intended to be an approximate functional description of the architecture, maybe for educational purposes, or is it an instruction-accurate model, or a cycle-accurate model? Is it intended just to execute code in simulation, or to be wrapped in user control logic and I/O? Or does it include all the other hardware that goes into a microprocessor chip, like the original chip’s interrupt and direct memory access (DMA) controller, debug provisions, and memory management? You must match the features of the Verilog model to the needs of your legacy system, or you will be spending no small amount of time learning about the quirks of an old piece of silicon.
Then there are those details in which dwell the devil. As SiFive product manager Jack Kang points out, legacy CPUs, like modern ones, went through many revisions over the product life, each correcting a set of errors or quirks. Which version does the Verilog represent? Or is it an idealized model, representing the way the author assumed the chip should work? Finally, how careful was the designer? Has the model been run cycle-by-cycle against actual legacy silicon? Has it booted the operating system you need to use? Has it … uh … ever actually been synthesized successfully?
If a Verilog model doesn’t work out, there is still another option. Really old CPUs were so slow that an instruction-set simulator running on a tiny modern RISC core in a current FPGA can traverse legacy code in essentially real time—especially if any troublesome sequences are offloaded to a state machine elsewhere in the FPGA. This approach cannot easily be rendered cycle-accurate or timing-accurate, but it can be functionally correct. And it transforms the porting problem from the hardware domain to the software domain where, with access to a full debug bench, it can be far more tractable.
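The core of such an instruction-set simulator is a fetch-decode-execute loop. The following is a minimal sketch, written in Python for readability rather than in the C you would more likely run on the soft core. It interprets only five 6502 opcodes, ignores the zero and negative flags and decimal mode, and makes no attempt at cycle accuracy. The class name `TinySim`, the load address, and the `mmio_write` hook are assumptions introduced here for illustration. The hook stands in for the idea of handing troublesome accesses to logic elsewhere in the FPGA.

```python
class TinySim:
    """Interprets a handful of 6502 opcodes; a teaching sketch, not a full model."""

    def __init__(self, program, load_addr=0x0200, mmio_write=None):
        self.mem = bytearray(0x10000)           # flat 64 KiB address space
        self.mem[load_addr:load_addr + len(program)] = program
        self.pc = load_addr                     # program counter
        self.a = 0                              # accumulator
        self.carry = 0                          # carry flag (other flags omitted)
        self.halted = False
        self.mmio_write = mmio_write or {}      # address -> handler(value)

    def fetch(self):
        byte = self.mem[self.pc]
        self.pc = (self.pc + 1) & 0xFFFF
        return byte

    def store(self, addr, value):
        handler = self.mmio_write.get(addr)
        if handler is not None:
            handler(value)                      # stand-in for hardware in the fabric
        else:
            self.mem[addr] = value

    def step(self):
        op = self.fetch()
        if op == 0xA9:                          # LDA #imm
            self.a = self.fetch()
        elif op == 0x18:                        # CLC
            self.carry = 0
        elif op == 0x69:                        # ADC #imm (binary mode only)
            total = self.a + self.fetch() + self.carry
            self.carry = total >> 8
            self.a = total & 0xFF
        elif op == 0x85:                        # STA zero page
            self.store(self.fetch(), self.a)
        elif op == 0x00:                        # BRK, treated here as "halt"
            self.halted = True
        else:
            raise NotImplementedError(f"opcode {op:#04x}")

    def run(self, max_steps=1000):
        steps = 0
        while not self.halted and steps < max_steps:
            self.step()
            steps += 1
        return steps


# Usage: load 5, add 3, store to RAM at $10 and to a hooked "port" at $20.
port_log = []
program = bytes([0xA9, 0x05, 0x18, 0x69, 0x03, 0x85, 0x10, 0x85, 0x20, 0x00])
sim = TinySim(program, mmio_write={0x20: port_log.append})
steps = sim.run()
```

Because the whole machine state lives in ordinary variables, breakpoints, tracing, and regression tests against captured legacy behavior all become software tasks.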
Implementing the CPU
Having discussed the feasibility and difficulty of moving legacy code into the new system, we turn to the next question: how to implement the CPU core in a low-end FPGA. We have already covered the case of duplicating a legacy CPU, so now we can look at the options for implementing modern, higher-performance CPUs.
The governing concern is that processor cores rely on some hardware structures, implemented in an ASIC as standard-cell or even custom logic, that are not easy to duplicate efficiently in FPGA fabric. As a result, we have to look at three distinct cases (Figure 3). The first is a CPU core that has simply been synthesized from a Verilog model intended for simulation or ASIC synthesis: the so-called out-of-the-box scenario. The second is a core that has had its RTL hand-optimized for FPGA synthesis. The third is a core whose architecture was developed from the beginning to be implemented in an FPGA. Each of these approaches will have different availability, size, and performance. All can be viable for modern low-end FPGAs.
Out of the Box
While not every vendor of CPU core intellectual property (IP) specifically targets FPGAs, there are at least two routes to an FPGA core available from most IP providers. The most obvious route is to license the RTL source for the core and run it through your FPGA vendor’s tool chain yourself. There are challenges along this path, all relating to the fact that this RTL is intended for ASIC synthesis, not FPGA synthesis.
Problem one is that, especially if you are the first one to try this code in an FPGA, there may be things in the source that don’t work with the FPGA synthesis tool. The code could be obscured or encrypted in a way incompatible with your synthesis tool. It could include pragmas your tool doesn’t recognize, or signal-naming or even comment conventions that break something. You can edit such things out, but that raises problem two: licensing.
Odds are that if you are using a Verilog source intended for ASIC development, you will have to edit it. And that means you will need an unobscured source with full documentation, and/or a lot of support from the IP vendor. These are available, but they are intended and priced for customers with deep pockets, huge production volumes, and big legal departments. You can probably work out a suitable contract with a small IP vendor. But negotiating something like that with a company the size of Arm could be infeasible.
There is another path. Some IP vendors provide evaluation or development kits on which their CPU core is implemented in an FPGA. That core might not be highly optimized, but it is at least working and verified, and fast enough for software development.
SiFive product manager Jack Kang says that some of his customers have taken this approach. The company’s CoreDesigner tool allows you to start with any of a range of preconfigured RISC-V cores, adjust the configuration to your specific needs, and then output RTL. But the tool also outputs a programming file for the FPGA on SiFive’s development kit. “Some people have used this in their product,” Kang says. Naturally, there needs to be an agreement about royalties for production use of what was licensed as development IP, but the company can accommodate such changes.
Kang says this FPGA implementation of the RISC-V core is not highly optimized for FPGA use, but still comes in under 20K look-up tables and can hit around 100 MHz, depending heavily, of course, upon the configuration. That size fits in many low-end FPGAs with considerable space left over, providing a quick and easy route to dropping the highly popular open-source core into your system.
There are ways to improve on those figures but they require some work. The opportunity comes from the fact that there are structures in CPUs that don’t fit gracefully into FPGA logic fabric.
FPGAs, remember, use huge arrays of identical logic elements to implement logic. Each element contains some combination of look-up tables (LUTs), usually around four inputs each, to synthesize logic functions, and one or more flip-flops. For most random logic, pipelines, and simple state machines this arrangement works well. For logic with high fan-in, as can occur in arithmetic hardware and address decoders, synthesis tends to generate a long cascade of narrow logic elements, consuming interconnect and causing delay. And for memory-based functions like register files, caches, and associative memory, mapping the function into the logic elements’ flip-flops one or two bits at a time can consume a lot of resources, even though vendor tools are smart enough to try to separate the logic elements’ LUTs from their flip-flops and use them separately.
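The fan-in penalty can be estimated with a little arithmetic. The sketch below gives lower-bound figures for reducing N signals to one output (a wide AND, OR, or XOR, for instance) using a tree of k-input LUTs. Each LUT removes at most k − 1 signals, and each level of the tree multiplies the reachable fan-in by at most k. The function name is an assumption made for this example, and real synthesis results will differ, since tools also use carry chains and other dedicated structures.

```python
def lut_tree_cost(fan_in, lut_inputs=4):
    """Lower-bound (lut_count, levels) for a fan_in-to-1 reduction tree."""
    if fan_in < 1 or lut_inputs < 2:
        raise ValueError("need fan_in >= 1 and lut_inputs >= 2")

    # Each LUT turns up to lut_inputs signals into one, so it removes at
    # most (lut_inputs - 1) signals; we must remove (fan_in - 1) in total.
    lut_count = -(-(fan_in - 1) // (lut_inputs - 1))   # ceiling division

    # Depth: smallest number of levels whose reach covers fan_in.
    levels = 0
    reach = 1
    while reach < fan_in:
        reach *= lut_inputs
        levels += 1
    return lut_count, levels


# A 32-bit-wide comparison result collapsed through 4-input LUTs.
wide_and = lut_tree_cost(32, lut_inputs=4)
print(wide_and)   # (11, 3): at least 11 LUTs and 3 logic levels
```

Three LUT levels plus the routing between them sit directly in the critical path, which is why wide decoders and comparators tend to limit clock rate in unoptimized cores.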
This mismatch became apparent long ago when FPGAs were first being applied to packet switching, digital signal processing, and similar applications. To fix it, FPGA vendors embedded large blocks of segmentable SRAM and hardware multiply-accumulate blocks into the logic fabric. By employing these resources, you can often substantially improve the size, and sometimes the performance, of a CPU implementation. But doing so may require some knowledgeable intervention, either in the RTL source or during the synthesis process. Results may improve even further if a skilled FPGA user goes through the RTL and tunes it for FPGA synthesis using a collection of best-known coding practices for FPGAs.
The good news is that this is not only possible but some vendors have already done it. Arm’s Cortex-M1, for example, is essentially a version of the highly compact Cortex-M0 that has been optimized for use in low-end FPGAs. It is a very simple core implementing the Armv6-M instruction set with a single execution pipeline and no caches, reducing the resource requirements. But it still employs a hardware multiplier, so it is not a trivial core. On an old Altera® Cyclone® III device (the M1 design is almost a decade old), one report from the Arm community states the core requires only 2600 logic elements and can exceed 100 MHz. The speed should be considerably higher in more modern small FPGAs. More recently, Arm has provided an FPGA version of the larger and more sophisticated Cortex-M3.
Industry-standard CPU cores such as the Cortex-M family or RISC-V offer familiarity, established (or, in RISC-V’s case, growing) ecosystems of tools and software, and the opportunity to move easily between FPGA vendors or to an ASIC implementation, or even in some cases to move to a third-party off-the-shelf SoC. But in return they exact a price: in fees, size, and sometimes performance.
If you wish to pursue FPGA optimization to its fullest, you need to go another step: to optimize not just the implementation, but the CPU and instruction set architectures themselves, starting with a clean sheet of paper and adding only structures that are FPGA-friendly. Long ago, when FPGAs first became large enough to accommodate CPU cores, both major FPGA vendors began such efforts. Those efforts (the Nios® II processor in Intel’s case) have continued to thrive and evolve as proprietary CPU architectures, growing their own ecosystems of tools, software, and FPGA peripheral IP.
Today these core families reach from tiny microcores with minimal features, not unlike Arm’s Cortex-M0, to full-blown Linux-capable CPUs. Many of these variants will fit comfortably in their vendor’s low-end FPGAs. For example, the compact Nios II/e core requires only around a thousand logic elements but can reach 75 MHz or more. A core scaled up with caches and memory management and capable of running the Linux* operating system, at the other extreme, requires about five thousand elements. There are many options in between to fit specific needs. Even this full-blown configuration is small enough to put a multi-core CPU cluster inside a single low-end Intel® MAX® 10 device and still have plenty of room left.
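A quick budget calculation shows how much headroom these figures leave. The sketch below uses the approximate core sizes quoted above (about 1,000 and 5,000 logic elements). The 50,000-element device capacity and the 50 percent reserve for accelerators and interface logic are assumptions chosen for illustration, so check the data sheet for the device you actually intend to use.

```python
def cores_that_fit(device_les, core_les, reserved_percent=50):
    """Identical cores that fit after reserving part of the device for other logic."""
    if not 0 <= reserved_percent <= 100:
        raise ValueError("reserved_percent must be between 0 and 100")
    usable_les = device_les * (100 - reserved_percent) // 100
    return usable_les // core_les


DEVICE_LES = 50_000        # assumed device capacity, not a data-sheet figure
COMPACT_CORE_LES = 1_000   # approximate size quoted for the compact core
LINUX_CORE_LES = 5_000     # approximate size quoted for the Linux-capable core

compact_count = cores_that_fit(DEVICE_LES, COMPACT_CORE_LES)
linux_count = cores_that_fit(DEVICE_LES, LINUX_CORE_LES)
print(compact_count, linux_count)   # 25 5
```

Logic elements are only one budget line. Block RAM for caches and tightly coupled memory, multiplier blocks, and I/O pins can run out first, so treat this as a first-pass feasibility check.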
There are, then, many paths to moving a legacy CPU’s functionality into a low-end FPGA, while still having the resources to support legacy interface or controller functions, IoT connectivity, security, or machine learning acceleration. Ancient machine code can run on an FPGA homage to an ancient CPU, or it can be run on an instruction-set simulator on a modern core.
High-level language code can be compiled for a modern core. Particularly challenging blocks of code can potentially be offloaded into an accelerator block in the FPGA. Then a version of the modern CPU core can be licensed and implemented in the FPGA via a number of paths with varying degrees of optimization. And for the greatest resource efficiency, vendor-proprietary CPU cores offer the most compact implementations across a range of performance and capability points. There is a good fit out there for nearly any design situation.
For Further Reading
Learn about Intel’s Nios II CPU family.
Explore the Intel MAX 10 low-end FPGAs.