No other area of modern system design seems as perplexing as the apparently trivial subject of system management controllers—or chassis management, or shelf management, or board management, or any of a half-dozen other terms. The trouble begins with that old demon of design, feature creep.
“When I was first involved in this area,” recalls Hewlett Packard senior director of Moonshot Platform engineering Gerald Kleyn, “we controlled a fan with a thermistor and called it system management. Today, you can think of system management as the control plane sitting above the hardware handling the workload in a large system.”
“It has become a lot more than measuring voltages and temperatures and controlling fans,” asserts Pigeon Point Systems president Rich Vasse. “Some of these ‘controllers’ are running Linux, interacting with payloads, and collecting data for big-data analysis.”
This range of concepts helps explain the perplexity. Think of system management as a tangle of independently developed point solutions and industry-specific standards. Now imagine a host of powerful system requirements—physical monitoring and control, remote configuration management, workload management, virtualization, reliability, and security—each grabbing a loose end of rope and pulling—hard. That is how we make a simple thermistor circuit into a microcontroller, an embedded Linux system, and eventually, a Gordian knot.
System management began with the recognition that CPU-intensive systems generate less heat when they are working less—so you could turn the fans down. As CPU boards became multi-SoC designs and as DRAM DIMMs developed heat issues of their own, temperature measurement required multiple sensors and a microcontroller (MCU). And, as Vasse observes, as long as you have an MCU there, you might as well use its pulse-width modulation outputs to control the fan drivers. So our thermistor turned into a much more interesting embedded control design.
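The core of such a design can be very small. The sketch below shows a hypothetical fan-speed policy for this kind of MCU: take the hottest of several sensor readings and map it linearly onto a PWM duty cycle. The thresholds and names are illustrative, and a production controller would add hysteresis and filtering.

```c
#include <stdint.h>

/* Illustrative thresholds: minimum airflow below 40 °C, full speed at 80 °C. */
#define TEMP_MIN_C  40
#define TEMP_MAX_C  80
#define DUTY_MIN    20   /* keep some airflow even when cool */
#define DUTY_MAX   100

/* Map the hottest sensor reading to a PWM duty cycle (0-100 %). */
uint8_t fan_duty_from_temps(const int16_t *temps_c, int n)
{
    int16_t hottest = temps_c[0];
    for (int i = 1; i < n; i++)
        if (temps_c[i] > hottest)
            hottest = temps_c[i];

    if (hottest <= TEMP_MIN_C) return DUTY_MIN;
    if (hottest >= TEMP_MAX_C) return DUTY_MAX;
    return (uint8_t)(DUTY_MIN +
        (DUTY_MAX - DUTY_MIN) * (hottest - TEMP_MIN_C) /
        (TEMP_MAX_C - TEMP_MIN_C));
}
```

The MCU would run this in a periodic loop, writing the result to a PWM compare register that drives the fan circuit.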
A similar evolution has taken place in voltage monitoring. Initially this task was just a matter of holding the CPU in reset until VCC was within spec, and asserting a power-fail halt if it went out of spec again. But SoCs began to require multiple supply rails with different voltage tolerances and often with strict up-down sequencing requirements. These needs spawned a variety of mixed-signal power-management controllers. Another option was to load these functions into the existing MCU.
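A sequencing requirement of this kind reduces to a small state machine in the controller. Below is a minimal sketch, with hypothetical rail names: each rail may be enabled only after every earlier rail in the sequence reports power-good, and the CPU stays in reset until all rails are up.

```c
#include <stdbool.h>

/* Hypothetical three-rail up-sequence: core first, then DDR, then I/O. */
enum { RAIL_CORE, RAIL_DDR, RAIL_IO, NUM_RAILS };

/* Return the index of the next rail to enable, or -1 if all are up.
 * pgood[i] is true once rail i's supply is within tolerance. */
int next_rail_to_enable(const bool pgood[NUM_RAILS])
{
    for (int i = 0; i < NUM_RAILS; i++)
        if (!pgood[i])
            return i;   /* enable this rail, then wait for its power-good */
    return -1;          /* sequence complete: safe to release CPU reset */
}
```

The controller calls this each time a power-good signal changes state; a real sequencer would also enforce per-rail settling timeouts and a matching power-down order.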
A further development, dynamic voltage-frequency scaling, meant the controller might have to supervise changing of the supply voltage and clock frequency for a domain in the SoC on the fly, freezing the clock until the new supply level was stable. This delicate minuet has to be danced in real time upon command from CPU software. Again, the task can go to a dedicated chip or to the system management MCU that is already there. Further, some systems have become so delicate that periodic DC voltage measurements are not enough. Sensors must capture voltage waveforms or spectra and convey them to the controller over something like an I2C bus.
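That minuet has a fixed choreography, which the controller can encode as an explicit step plan. The sketch below is illustrative—step names are invented—but it captures the ordering constraint from the text: the clock is frozen before the regulator moves, and stays frozen until the new supply level is stable.

```c
#include <stddef.h>

/* Hypothetical DVFS transition steps for one voltage-frequency domain. */
typedef enum {
    DVFS_FREEZE_CLOCK,      /* gate the domain clock */
    DVFS_SET_VOLTAGE,       /* command the regulator to the new level */
    DVFS_WAIT_RAIL_STABLE,  /* poll power-good or run a settle timer */
    DVFS_SET_FREQUENCY,     /* reprogram the PLL or clock divider */
    DVFS_UNFREEZE_CLOCK     /* ungate the clock at the new operating point */
} dvfs_step_t;

/* Fill steps[] (capacity >= 5) with the ordered plan; return the count. */
size_t plan_dvfs_transition(dvfs_step_t steps[5])
{
    size_t n = 0;
    steps[n++] = DVFS_FREEZE_CLOCK;
    steps[n++] = DVFS_SET_VOLTAGE;
    steps[n++] = DVFS_WAIT_RAIL_STABLE;
    steps[n++] = DVFS_SET_FREQUENCY;
    steps[n++] = DVFS_UNFREEZE_CLOCK;
    return n;
}
```

Whether this plan executes on a dedicated power-management chip or as a task on the system management MCU, the ordering is the same.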
In some systems, including most mobile devices, the voltage management problem includes a whole new headache: battery management. Modern batteries provide decent energy density and cycle life in exchange for a whole host of behavioral issues, from opacity about their true level of charge, to load and temperature sensitivity, to a rather antisocial tendency to burst into flame when offended. They can require highly accurate voltage and current monitoring, use of complicated state-estimator algorithms such as Kalman filters, load-balancing algorithms, and current-switching within the battery stack during both charging and operation. Again, you can supervise the battery either with a dedicated battery-management controller or with yet another task on the system management controller.
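To make the state-estimator idea concrete, here is a deliberately minimal, one-state Kalman-style update that fuses two imperfect state-of-charge signals: coulomb counting (accurate short-term, drifts long-term) as the prediction, and a voltage-derived reading (noisy under load, but unbiased) as the measurement. The noise constants are illustrative placeholders, not a production algorithm.

```c
/* One-state Kalman-style state-of-charge (SoC) fusion. */
typedef struct {
    double soc;   /* estimated state of charge, 0.0 - 1.0 */
    double p;     /* variance of the estimate */
} soc_filter_t;

#define PROCESS_NOISE  1e-5   /* coulomb-counter drift added per step */
#define MEAS_NOISE     1e-2   /* voltage-based SoC is noisy under load */

void soc_update(soc_filter_t *f,
                double delta_soc_counted,   /* from integrated current */
                double soc_from_voltage)    /* from open-circuit-voltage model */
{
    /* Predict: apply the coulomb count, and grow the uncertainty. */
    f->soc += delta_soc_counted;
    f->p   += PROCESS_NOISE;

    /* Correct: blend in the voltage-derived measurement. */
    double k = f->p / (f->p + MEAS_NOISE);   /* Kalman gain */
    f->soc += k * (soc_from_voltage - f->soc);
    f->p   *= (1.0 - k);
}
```

Each update nudges the coulomb-count estimate toward the voltage reading by an amount proportional to how uncertain the estimate has become, which is exactly the behavior that keeps long-term drift in check.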
Finally, large systems have a host of other physical measurement needs beyond temperatures and voltages. Kleyn cites quantities and events such as fan speeds, cabinet intrusions, and hot-plug events that need to be captured and reported. And there are quasi-physical events such as error flags on DIMMs and SoCs. It is a long list.
In small, autonomous systems all of these monitor and control functions can be purely local. But in larger and in mission-critical systems it is necessary to log routine data, report exceptions, and accept commands from a remote supervisor. For this reason, many board-management controllers have a communications protocol stack and a remote connection. This can be a simple serial port or, more often today, a sideband connection on the board’s system interface, whether PCI Express® (PCIe®) or Ethernet. It is necessary that this sideband port continue to function even if the board’s CPU is disabled.
Not surprisingly, since this is a system-level issue, much of the work on networked board management has been done either by standards organizations, such as the PCI Industrial Computer Manufacturers Group (PICMG), creators of the Advanced Telecommunications Computing Architecture (ATCA) specification, or by data-center server developers such as Dell and HP. These organizations see a network of board-management processors as fundamental to the operation of large switching and computing systems. In short, they are using the network connections to build a control plane over the computing or switching hardware (Figure 1).
Once you have the network, more things become attractive beyond just logging physical sensor data. One idea is post-analysis. “You can realize savings at system level,” Vasse explains. “For example, you can send all that physical data you are logging into big-data analysis to predict failures.” Accurate predictions would allow the system operators to shift loads away from at-risk hardware before a failure, or to preemptively order replacement parts and schedule repairs.
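Even before full big-data analysis, a simple trend check over the logged sensor stream can flag at-risk hardware. The sketch below is a toy example of the idea: a fan whose speed drifts steadily downward at a constant PWM setting is a classic early failure signature. The threshold and heuristic are illustrative only.

```c
#include <stdbool.h>

/* Flag a fan whose logged RPM samples (oldest first) show a sustained
 * decline of at least min_drop_rpm across the window, with most
 * sample-to-sample changes being decreases. */
bool fan_trending_toward_failure(const int *rpm, int n, int min_drop_rpm)
{
    if (n < 2) return false;
    int drops = 0;
    for (int i = 1; i < n; i++)
        if (rpm[i] < rpm[i - 1])
            drops++;
    return (rpm[0] - rpm[n - 1] >= min_drop_rpm) && (drops > n / 2);
}
```

A real deployment would do this analysis centrally, over weeks of logs and across fleets of machines, which is where the predictive-maintenance payoff Vasse describes comes from.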
Another fairly obvious opportunity, once you have connectivity, is remote firmware updating by giving the board-management controller write-access to the flash memory on the board. PICMG provides a standard interface for this—at least for the MCU firmware on the board—through the HPM.1 hardware platform management interface (HPMI).
Another, more controversial opportunity is to open a window into the CPU through the board-management port. Standards allow you to virtually attach an external CD-ROM drive or video/keyboard/mouse console to the CPU through the management port, or to monitor CPU status through a serial output, all the information being packetized and conveyed over the Ethernet sideband. In principle this allows an external device to monitor and direct operating-system and even application activity via the board management unit.
“Some shelf or chassis management processors will run Linux and interface directly into tasks in the payload,” Vasse says. “That allows the shelf manager to coordinate fail-over of an application onto a healthy CPU.”
At this point Kleyn’s analogy of a control plane becomes particularly useful. In network switching equipment, functions are often segregated between two distinct sets of hardware. Functions that must work at wire speed—such as packet buffering, header parsing, routing, prioritization, and content filtering—are done in dedicated, configurable hardware in the data plane. Supervisory functions—such as header mask setting, building routing tables, managing queues, collecting statistics, and handling exceptions—are done in software on CPUs in the control plane.
The analogy to the server world is increasingly strong. Applications run on server CPUs in the data plane. Supervisory functions—maintenance routines, configuration management, statistics, and some hypervisor activity—run on other server CPUs in a virtualized control plane. The connection between the two is the network of board-management processors that also manage the cabinet, cooling, and power.
At this point the board-management processor has grown far beyond a humble MCU monitoring an A-to-D converter. “We are collecting a ton of data through this embedded control system, monitoring configurations, and providing a remote console into the individual CPUs,” Kleyn says. “This control plane is the lifeblood of the data center: it is responsible for the health and provisioning of the system.” HP sees this function as so critical that it implements its board-management processors with proprietary ASICs and uses its own embedded operating system on them.
The more capable board-management processors become, the more tempting it is to give them power over fast-paced local configuration and application-allocation decisions. But the more power the devices have, the greater the risk that they will become targets of attacks. “You cannot allow back doors into application execution,” Kleyn warned. HP, for example, is tight-lipped about even the architecture of its management ASICs.
Even in much smaller systems, security is a major issue. For instance, authentication and encryption are necessary even in a single-board system just to ensure that board-manager firmware updates are safe. “There is a lot of talk about security, but I would say the discussion is still not fully resolved,” Vasse said.
For designers in military and transportation systems, this scenario may be sounding vaguely familiar. We have a physically separate network of processors monitoring both physical quantities and the execution of the application code. The network can assist in resource allocation and in failover of critical tasks. In communications or computing, we could be talking about the growing role of system management hardware. In military or transportation areas, this same description might refer to a functional-safety subsystem.
Such mission-critical systems sometimes use a separate, high-reliability set of hardware to monitor the state of the system, to monitor the external environment for risks that the system might do harm—say, exceeding the safe speed on a railway segment, or starting a cutting tool while a foreign object is on the work surface—and to intervene.
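The railway example reduces to a tiny, independently verifiable interlock check—the kind of function whose simplicity is the point, because the safety subsystem must reach its verdict regardless of what the application CPUs believe. The function and tolerance below are purely illustrative.

```c
#include <stdbool.h>

/* Illustrative overspeed interlock: trip if measured speed exceeds the
 * current segment limit plus a small measurement tolerance. The safety
 * subsystem evaluates this from its own sensors, independent of the
 * application software it is monitoring. */
bool overspeed_trip(double speed_kmh, double segment_limit_kmh,
                    double tolerance_kmh)
{
    return speed_kmh > segment_limit_kmh + tolerance_kmh;
}
```

Code this small and this isolated is what makes formal proof of the safety function's behavior tractable, as the next paragraph discusses.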
Today in larger systems these functional-safety tasks are often grouped with application tasks on the main CPUs, and heavy redundancy is used to ensure that even in cases of hardware failure the functional-safety tasks will make their deadlines. But there are advantages to running the functional-safety tasks in an isolated environment that is simpler, where it may be possible to prove formal assertions about the execution of the code.
Stepping back, then, we see the successive hitches and bends that have gone into this knot. A little embedded control loop becomes a multi-input data logger. It gets a network interface, and remote update and console capability. It begins to monitor software execution, and acquires an operating system. Perhaps it begins to work cooperatively with the system hypervisor to manage virtual machines. With these new powers comes growing attractiveness to attackers—increasing the pull for high-reliability hardware, redundancy, and crypto processing. Perhaps the subsystem begins to assume some responsibility for functional safety. Grab all the loose ends and pull.
Now is it any surprise that system designers struggle to untangle that knot into a roadmap?