There is a long tradition in system design for embedding special hardware to observe and manipulate the state of the system. From the beginning of digital computing, central processors have had hardware to support single-step, loading and examination of registers and memory, and setting of breakpoints for software debugging. Much later chronologically but early in their own history, integrated circuits began to include scan hardware for manufacturing test. FPGAs followed this idea with built-in logic analysis capability, allowing designers to examine their circuits in great detail.
As SoCs became more complex and inclusive, it became impractical or impossible to determine what was going on inside the system by merely observing the outside (Figure 1). So, designers experimented with building stimulus generators and checkers into their chip designs—in effect, assertions in silicon. This has become a necessary practice in some kinds of circuits such as high-speed serial transceivers, and has wider application when the SoC is implemented in an FPGA, as the specialized hardware can be removed from the design when it is no longer needed.
Today, the practice is taking on new directions. System designers are grappling with challenges quite different from block-level silicon bring-up or embedded software development. Four areas in particular are demanding new attention: system integration, run-time performance optimization, system security, and functional safety. Each is making its own demands on the observability and controllability of systems increasingly locked within the confines of an SoC die. And designers are responding by embedding more dedicated hardware to open windows of observability into the chips.
The Integration Challenge
Once, most of the effort in SoC verification was at the block level. System architectures tended to be simple and CPU-centric (Figure 2), with the blocks snapped into well-defined receptacles on an industry-standard bus. Once you had the blocks working, most of the work was done.
But today’s SoCs have turned that situation end-for-end. SoCs have several or many CPU cores with no one clear master, so the old CPU-centric organization is gone. Other blocks on the chip may be processing data and sharing memory, so even visibility into every CPU core on the die is no guarantee of success (Figure 3). Many levels of caches may be present, some or all participating in a coherency protocol, obscuring just what is actually going on with the chip, and what data is actually current. Peripherals may have direct memory access (DMA). And the old CPU-controlled synchronous bus has given way to layers of switched busses or to complex, globally asynchronous network on chip (NoC). Further, many of the blocks on the chip will be allegedly pre-verified intellectual property (IP), often from third parties who are reticent about revealing design details.
“Things are reversed now,” says Ultra SoC CEO Rupert Baines. “IP-level design tools and verification flows are excellent. There’s a very high probability the IP blocks you use will work as their designers intended. But systemic complexity has grown so that the challenge now is interactions among the blocks.”
These interactions can cause fatal system errors even when all the individual blocks are working correctly. And they can be astonishingly subtle. Caches can thrash due to interactions between tasks on different CPUs. Minor differences in the sequence of events on different parts of the SoC can cause huge differences in task latencies: two processors can deadlock; a high-priority interrupt service routine on one CPU can call a subroutine on another CPU that happens to be running at a lower priority; or a seemingly minor firmware change can alter the order in which commands arrive at a shared DRAM controller, triggering a string of page misses and slashing effective memory bandwidth.
Against these sorts of time-dependent interactions even the best isolated CPU debug tools and bus monitors can be ineffectual, failing even to isolate the failure, never mind identifying a root cause. You need to be able to capture the full state of the system, set a trigger on a state—or more likely, a sequence of states—that defines the failure’s symptom, and then examine a trace buffer holding state history up to the trigger event. Often you may need to keep the system running at full speed during this process. In other words, you need the facilities of the best CPU hardware debug cores, but for the entire SoC, not just one core at a time.
What we are implying is, in essence, a custom logic analyzer built into the SoC, with state-monitoring or estimating hardware in each functional block of the chip. We are also implying a chip-wide interconnect network capable of bringing the data from these state detectors together, aligning them chronologically, and setting complex triggers on the resulting picture of the system state. Finally, we are suggesting a user interface that makes all of this intelligible to human users.
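The capture-and-trigger behavior described above can be modeled in a few lines. This is a minimal, illustrative sketch, not any vendor's implementation: a rolling trace buffer of state snapshots that freezes once a trigger sequence of states has been observed in order. The state names and buffer depth are invented for the example.

```python
from collections import deque

class TraceMonitor:
    """Toy model of an on-chip trace unit: keeps a rolling window of
    system-state snapshots and freezes capture when a trigger sequence
    of states is seen in order."""

    def __init__(self, trigger_sequence, depth=8):
        self.trigger = list(trigger_sequence)  # states that must occur in order
        self.buffer = deque(maxlen=depth)      # trace buffer: history up to the trigger
        self.matched = 0                       # progress through the trigger sequence
        self.triggered = False

    def capture(self, state):
        if self.triggered:
            return                             # capture is frozen after the trigger fires
        self.buffer.append(state)
        if state == self.trigger[self.matched]:
            self.matched += 1
            if self.matched == len(self.trigger):
                self.triggered = True
        elif state == self.trigger[0]:
            self.matched = 1                   # restart partial match
        else:
            self.matched = 0

# Hypothetical failure symptom: a request is granted, then stalls.
mon = TraceMonitor(trigger_sequence=["REQ", "GRANT", "STALL"], depth=4)
for s in ["IDLE", "REQ", "GRANT", "STALL", "IDLE"]:
    mon.capture(s)
# mon.buffer now holds the state history leading up to the trigger event.
```

The system keeps running at full speed while the monitor records; only the capture window stops at the trigger, which is what lets a developer examine the states leading up to the failure.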
What most system designers have instead, Baines says, is an often-incomplete collection of siloed tools based on individual blocks and varying widely in quality. CPU IP vendors generally provide a debug module for software developers, allowing single-step, breakpoint, trace, and dump via JTAG or a dedicated debug port. Such modules vary in quality from real-time and comprehensive to ad hoc or absent altogether. They often have limited or no ability to see state outside the CPU core without considerable software intervention.
Once you get beyond the CPU cores, the situation even for sensing state gets more challenging. Vendors of DSP cores or dedicated accelerators—for cryptography, video CODECs, vision processing, or neural-network inference—may feel that access to their debug facilities, or even knowledge of the state of their engine, is too sensitive to share with any but the biggest customers. These blocks may be black boxes. Understanding the state of a GPU may be possible, but so difficult and code-dependent as to render it a black box too, for all but skilled GPU programmers.
In-house IP, especially if reused from a previous project, can be even more challenging. If, for instance, a custom dataflow machine ever had a real-time debug module, and if it were adequately documented, it still might not suit a new application. Reuse guidelines aren’t always clear about reusable debug facilities.
Beyond this, there are utility blocks in SoCs—NoC switches and gaskets, DRAM controllers and network interfaces, DMA and interrupt controllers—not always intended to offer much visibility to system developers. Yet knowledge of their state may be vital to system integrators. Altogether, the problem of capturing the state of a full SoC, while technically possible, may be a design problem not a lot smaller than the original design itself.
Once the system is working, the need for deep visibility for system integration is—one hopes—over. But a new set of needs may arise: not for debug access, but for system optimization.
Certainly in the data center world, where workloads can change in milliseconds, it is clear that SoCs can benefit from continuous retuning. There are gross adjustments like how many cores are assigned to a task, which tasks share which cores, and how hardware accelerators are assigned. And there are finer adjustments, such as DRAM allocation, and even finer tweaks like interrupt priorities, client priorities in multi-client DRAM controllers, and the marvelous range of adjustments available in NoC switches.
As embedded systems move from dedicated, single-CPU architectures to dynamically allocated multi-core designs, many of these same considerations begin to apply. One might argue that the workload for an embedded system is known in detail at design time, and that is when the chip optimizations should be done. Often this is still true. But increasingly, the shape of an embedded workload is not obvious until after system integration—particularly with highly data-dependent tasks like neural-network inference. So embedded designs, like data-center servers, may need post-integration tuning.
And this sort of tuning also requires deep visibility into the SoC, but a different kind of visibility than debug or integration. Where integration needs to recognize and record sequences of system-wide state, tuning more often depends on aggregate or statistical data: data rates, device utilization percentages, idle-time profiles, and the like. In order to tune, you look for over- and under-utilized resources.
With one notable exception, this sort of statistical information can be hard to come by. CPU debug hardware is generally designed to gather short bursts of trace data, not utilization or throughput statistics or cache profiles. Statistics may have to come from random sampling of trace data or from external monitors. Which brings us to that exception. NoCs, touching virtually all traffic between blocks in the SoC, can be ideal for collecting traffic and some activity statistics. Once again, much of the data may be directly or indirectly available for this purpose, but it may come down to the design team to collect and assemble it.
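To make the tuning workflow concrete, here is a small sketch of how sampled activity counters might be reduced to the utilization report a tuner would act on. The block names, counter values, and thresholds are all invented for illustration; real designs would read hardware performance counters over a measurement window.

```python
def utilization_report(busy_cycles, window_cycles, high=0.90, low=0.10):
    """Turn raw busy-cycle counts into utilization fractions and flag
    over- and under-utilized resources (thresholds are illustrative)."""
    report = {}
    for block, busy in busy_cycles.items():
        util = busy / window_cycles
        if util >= high:
            status = "over-utilized"
        elif util <= low:
            status = "under-utilized"
        else:
            status = "ok"
        report[block] = (round(util, 3), status)
    return report

# Hypothetical counter samples over a 1000-cycle window.
samples = {"cpu0": 980, "cpu1": 450, "dram_ctrl": 995, "dma0": 40}
report = utilization_report(samples, window_cycles=1000)
```

A tuner would respond to a report like this by, for example, rebalancing tasks away from the saturated DRAM controller or reassigning an idle DMA channel.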
With the growing awareness of cyber security, another set of run-time needs is arising for embedded systems designers. Designers of multitasking systems have long relied on the memory protection units (MPUs) attached to processor cores to protect one task’s memory from inspection or corruption by another task. That works, so long as all the memory accesses in the system go through CPUs and all the MPU registers are set correctly. But in a multicore system with numerous blocks doing DMA, and with cyber attacks, neither of those conditions is guaranteed.
One line of defense has been to make MPU settings accessible only from a secure operating mode such as ARM's TrustZone. Theoretically, a task could only enter this privileged mode by presenting a valid credential. But as publicity about the Meltdown and Spectre vulnerabilities has shown, and as previous, less publicized incidents of attacks through hypervisors had warned, even secure execution modes can be compromised.
Such risks have led some developers to turn to embedded monitoring hardware in the SoC. Monitors on cache and system busses, DRAM controllers, and NoCs can be another line of defense, continuously validating that a task or peripheral is staying within the bounds of its assigned memory.
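The continuous-validation idea can be sketched simply: a monitor on a bus or NoC checks each transaction's address against the regions assigned to its initiator. The initiator names and address ranges below are hypothetical; in a real design they would be programmed through a secure configuration interface, independently of the CPUs' MPUs.

```python
# Hypothetical per-initiator memory regions (inclusive address ranges).
REGIONS = {
    "cpu0": [(0x0000_0000, 0x0FFF_FFFF)],
    "dma0": [(0x2000_0000, 0x200F_FFFF)],
}

def check_access(initiator, addr, regions=REGIONS):
    """Model of a bus/NoC monitor: True if the access falls inside one of
    the initiator's assigned regions, False (a violation to flag) otherwise."""
    for lo, hi in regions.get(initiator, []):
        if lo <= addr <= hi:
            return True
    return False
```

The point of doing this check in the interconnect, rather than in the CPU's MPU, is that it also covers DMA-capable peripherals and survives even if a CPU's privileged mode is compromised.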
If we generalize this notion of monitoring the system for forbidden sequences of states, we get a much more powerful idea. Embedded monitoring, if it comprehends the state of the entire SoC, could recognize when the system is about to do something physically dangerous—like move a tool into an unsecured area or close a switch in an AC power grid without checking for phase matching—and could force the system into a safe state. This ability to anticipate and avoid bad outcomes is the essence of functional safety.
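A functional-safety monitor of this kind amounts to checking preconditions drawn from chip-wide state before a dangerous action is allowed to proceed. Here is a minimal sketch using the article's two examples; the field names and the two-outcome interface are assumptions made for illustration, not a real safety architecture.

```python
def safety_monitor(system_state):
    """Sketch of a functional-safety check: before allowing a dangerous
    action, verify its preconditions against chip-wide state; if any
    fail, command a safe state instead. Field names are illustrative."""
    action = system_state.get("action")
    if action == "close_grid_switch" and not system_state.get("phase_matched", False):
        return "FORCE_SAFE_STATE"
    if action == "move_tool" and not system_state.get("area_secured", False):
        return "FORCE_SAFE_STATE"
    return "ALLOW"
```

The essential property is that the check runs on the assembled system-wide state, before the action, so that the bad outcome is anticipated and avoided rather than detected after the fact.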
We have come full circle now, once again asking the embedded monitor to collect state information from all the significant blocks in the SoC, and to correlate this data into a coherent view of the chip’s overall state. We’ve seen that in some, but far from all, key blocks there is already circuitry in place to collect this data. It remains to bring the data together—a task that often cannot be relegated to software because of unpredictable latencies and contention for system resources, not to mention security questions.
We are left with the alternative of capturing the state in each significant block, time-stamping it at the source, and routing it to a central collection point using dedicated routing resources. The good news is that a number of vendors are working toward this goal.
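The collection step described here (timestamp at the source, merge centrally) can be modeled as merging locally-ordered event streams into one chronological system trace. The record format and block names are invented for the example.

```python
import heapq

def merge_traces(*per_block_traces):
    """Model of a central collection point: merge per-block,
    source-timestamped event streams (each already in local time order)
    into a single chronologically aligned system trace."""
    return list(heapq.merge(*per_block_traces, key=lambda rec: rec[0]))

# Hypothetical (timestamp, source_block, state) records from two blocks.
cpu_trace = [(10, "cpu0", "REQ"), (40, "cpu0", "WAIT")]
noc_trace = [(15, "noc", "ROUTE"), (30, "noc", "DELIVER")]
merged = merge_traces(cpu_trace, noc_trace)
```

Timestamping at the source is what makes the merge meaningful: dedicated routing can then deliver records with arbitrary latency without destroying the chronological picture.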
One such effort is at ARM, where the CoreSight* developers, anticipating the challenges of multi-core debug, have extended the reach of their hardware-based tools across multiple instances of ARM* IP cores and busses. Another movement comes from the NoC vendors—for example, Arteris, Netspeed, and Sonics—who have a natural path to extend the profiling facilities already available in their endpoints and switches into a chip-wide state monitoring and reporting network.
A third source is an IP vendor dedicated to the problem, Ultra SoC. This company has developed the routing and collection stations to bring time-stamped state information together from across the SoC. They have developed gaskets to extract information from CoreSight and some other core vendors’ debug modules. And they are working with at least one NoC provider. Ultra SoC also develops visualization and analysis software so that state information uploaded from the SoC can be useful to humans.
That seems to be the current situation for commercial tools. There is still much to do in improving visibility into a wider range of processors. Ultra SoC is working with the RISC-V architecture, for instance, and there is obvious application to FPGA accelerators.
There are other kinds of system blocks for which there is no agreement about even what part of their internal state is relevant. And there are enticing questions. How much of the SoC’s internal state can be inferred from a few points rather than measured directly? Could the industry agree on a standard interface between processing elements and a monitoring network? Could some form of deep-learning network learn from masses of state data to infer root causes of failures, or to anticipate functional safety faults? There is much to do.