Taking Machine Learning to the Edge
Like it or not, machine learning networks are set to be the solution du jour in embedded systems for the foreseeable future. Their remarkable ability to find and classify patterns in noisy data offers designers the option of circumventing difficult algorithm development along with the opportunity to endow their designs with a new level of independence from human operators. And marketing loves to deploy the phrase machine learning in its materials.
Now combine this observation with two others. First, machine learning comes from, and is still in some respects tied to, a world of massive data centers and cloud-computing services—alien to the constraints of embedded systems. Second, as embedded systems are increasingly connected to networks, embedded computing is becoming edge computing. No longer isolated and forced into self-sufficiency, embedded systems are coming to depend on their high-latency, limited-availability umbilical back to the cloud.
Taking these three points together, we have what marketing calls a great opportunity—what system designers call a very difficult partitioning challenge. In response, we are seeing major efforts on three fronts to fit machine learning into the edge environment (Figure 1). New machine learning algorithms are trying to improve accuracy and robustness while making deep-learning models more compact. Compression techniques are working to reduce the size of deep-learning models so they can fit the memory and performance constraints of embedded systems. At the same time, hardware accelerator chips are trying to relax those constraints.
Figure 1. Machine learning is starting to peek over the network edge.
Inside Deep Learning Networks
In order to describe these efforts, we have to start by examining what goes on inside a deep-learning network: the memory requirements, data flows, and computations. This will lead naturally to a discussion of compression and acceleration strategies.
Let’s look at structure first. In its purest form, a deep-learning network is divided into layers, each layer being an array of nodes, or artificial neurons. The input to each node is the collected outputs of all the nodes in the previous layer, each value multiplied by a corresponding weight coefficient (Figure 2). The function of the node may be a summation, a non-linear operation, or a logic operation, depending on the architecture of the network and the particular layer in question. Consider the input to a deep-learning network: say, a pixelated 1280 by 1024 image. In the first layer, each node could have a 1280 by 1024 array of weights and an input from each pixel in the image. Nodes in the second layer would have an input from each of the nodes in the first layer, and so on. At this scale the amount of data and computation involved can be huge.
Figure 2. In a traditional neural network, each node output from one layer becomes an input to every node in the next layer.
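The weighted-sum-per-node structure described above can be sketched in a few lines of NumPy. This is an illustrative toy, not code from the article: the layer sizes, random weights, and function names are all invented for the example, and ReLU stands in for whatever non-linearity a real design would use.

```python
import numpy as np

def dense_layer(inputs, weights, biases):
    """One fully connected layer: each output node computes a weighted
    sum of every input value, then applies a non-linearity."""
    z = weights @ inputs + biases      # one multiply-accumulate chain per node
    return np.maximum(z, 0.0)          # ReLU non-linearity

# Toy network: 4 inputs -> 3 hidden nodes -> 2 outputs
rng = np.random.default_rng(0)
x  = rng.standard_normal(4)
w1 = rng.standard_normal((3, 4)); b1 = np.zeros(3)
w2 = rng.standard_normal((2, 3)); b2 = np.zeros(2)

hidden = dense_layer(x, w1, b1)
output = dense_layer(hidden, w2, b2)
print(output.shape)   # (2,)
```

Scaling the same pattern to a 1280 by 1024 image makes the point: the first weight matrix alone would hold over a million coefficients per node.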
Usually these networks are laid out with non-linear layers interspersed among summation layers. As you get deeper into the network, the number of nodes per layer tends to decrease, and the outputs of the nodes begin to make more sense to humans. In an image-classification network, for instance, the node inputs to the first layer are pixel values from the image. The outputs of this first layer are numbers, inscrutable to humans, representing the presence or absence of tiny fragments of images. As you move deeper into the network, the outputs become more meaningful, until you reach the final layer, where each output corresponds to a tag used to train the network, such as dog, poodle, tractor, or blimp. These final outputs represent the probability that the corresponding tag is relevant to the image at the input. Because the layers tend to become smaller as you move from input to output, the network tapers like a funnel: from a first layer wide enough to accept an entire input image in parallel down to a final layer only wide enough to have an output for each tag.
There are many variations on this general picture, most of them developed to deal with specific applications. For instance, recurrent neural networks (RNNs) and long short-term memory networks (LSTMs) have structures to feed back node outputs from internal layers to previous layers, or to save outputs in memory for later use. Such networks have been useful in applications like handwriting or speech recognition, where correct classification of a pattern requires some contextual data. RNNs and LSTMs are in other ways quite similar to our general picture, and most of what we say below will apply to them.
The architecture of the network—layers, connections, and functions—is selected by humans at the beginning of the design process. It is not altered unless the trained network turns out to perform poorly. The training process only determines the values in each node’s weight arrays, not the structure of the network.
At this point we should make it clear that there are two distinct tasks involved in using a deep learning network: training and inference. In nearly all examples today outside research labs, training is done first in a data center. This creates the trained model—that is, the network with all the weights determined—which is then used, in the data center or, increasingly, at the edge, to compute inferences.
Training is demanding of both human labor and computing resources. Its most common form, supervised learning, begins with the selection of thousands to hundreds of thousands of input records. For example: photos of street scenes for a traffic or security application. Architects then decide what labels are needed to accurately and fully classify each scene for the application. Tags for the street scenes might include the identities of objects that can occur in the scenes, including characteristics such as motion or threat level. Humans then manually tag each record in this training data set with the appropriate tags.
Then, training begins. A record is applied to the inputs of the network, and the outputs computed, layer by layer. The result from the final layer is compared to the human-determined tags for that input. An algorithm, usually gradient descent, is used to adjust the weights of each node in the final layer to reduce the error between the output and tag values. Then the algorithm moves to the next previous layer and adjusts the weights on the inputs to those nodes to further reduce the output error. And so on back to the first layer of the network. Then the next input data record is applied, its tags are compared to the network output, and the process starts all over again.
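The forward-pass, compare-to-tag, adjust-weights cycle can be sketched for a single linear node. This is a deliberately minimal illustration: real training backpropagates through every layer, while here the "network" is one node, the loss is squared error, and all names and values are invented for the example.

```python
import numpy as np

def train_step(w, x, target, lr=0.1):
    """One supervised-learning update: forward pass, compare the output
    to the human-assigned tag, then adjust weights by gradient descent."""
    output = w @ x                 # forward pass (a single linear node)
    error = output - target        # difference from the tag value
    grad = error * x               # gradient of 0.5 * error**2 w.r.t. w
    return w - lr * grad           # step against the gradient

# Fit the node to records generated by a known "true" weight vector.
rng = np.random.default_rng(1)
true_w = np.array([2.0, -1.0])
xs = rng.standard_normal((200, 2))
targets = xs @ true_w

w = np.zeros(2)
for x, t in zip(xs, targets):      # one pass over the training records
    w = train_step(w, x, t)
print(np.round(w, 2))              # converges close to [ 2. -1.]
```

Each record nudges the weights toward values that reduce the output error, which is the essence of the layer-by-layer process described above.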
With potentially millions of nodes, some in early layers with tens or hundreds of thousands of weights, and hundreds of thousands of records, this is a data-intensive task. Its result is a trained network whose weights will lead to accurate classification of even fairly novel inputs. It is this trained inference network that we want to export to the edge computing platform.
Making It Practical
Even though the amount of computing a deep-learning network requires to infer an output is much less than the amount required to train it, inference would still be a crushing load on a typical embedded system, especially in an application that had hard latency requirements. So designers have taken three routes to simplifying the inference computing task: architectural reductions, compression, and hardware acceleration.
An excellent presentation on this topic was made by MIT assistant professor Song Han in a workshop at the 2018 Hot Chips conference. Here we will be considerably briefer and simpler than Professor Han’s 102 slides, but we will look at each of the three avenues.
Anything designers can do to reduce the number of layers or the number of connections between layers in the network directly reduces both the memory requirements and the number of computations for an inference. There is little theory to predict how well a given network design will work on a given problem and training set; prior experience is the main guide. The only way to be sure whether you need all 16 layers in a particular deep-learning network design is to take a few layers out, retrain the network, and test it. The expense of such exploration tends to keep designers using network architectures they are familiar with; however, exploring might yield significant savings.
One example is in static image classification—the now-famous ImageNet challenge. While the general case of a deep-learning network has each node getting weighted input from every node in the previous layer, researchers working on image classification found a huge simplification: the convolutional neural network (CNN) (Figure 3). In its early layers the CNN replaces the fully connected nodes with little convolution engines. Instead of a weight for each input, the convolution engine has only a small convolution kernel. It convolves the kernel with the input image, producing a feature map—a 2D array indicating the degree of similarity between the image and the kernel at each point on the image. This feature map then receives a non-linearization. The output from the convolutional layer is a three-dimensional array: a 2D feature map for each node in the layer. This array then goes through a pooling operation that reduces the size of the 2D feature maps by, in effect, reducing the resolution.
Figure 3. In a convolutional neural network, early layers replace fully connected nodes with small convolution kernels that sweep across the input to produce feature maps.
A modern CNN may have many convolution layers, each followed by a pooling layer. Toward the output end of the network, the convolution and pooling layers end and the remaining layers are fully connected. So the network tapers from the input side to the output side, ending in a fully-connected layer just wide enough to produce an output for each of the desired tags. Compared to a fully-connected deep-learning network of similar depth, the savings in weights, connections, and number of nodes can be very significant.
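The convolution, non-linearity, and pooling steps described above can be sketched directly in NumPy. The image size, kernel values, and function names below are illustrative, and the loop-based convolution trades speed for readability; a production implementation would use an optimized library routine.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide a small kernel over the image, producing a feature map of
    similarity scores at each position (no padding, stride 1)."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.empty((ih - kh + 1, iw - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(image[r:r+kh, c:c+kw] * kernel)
    return out

def max_pool2(fmap):
    """2x2 max pooling: halve the resolution of the feature map."""
    h, w = fmap.shape[0] // 2 * 2, fmap.shape[1] // 2 * 2
    f = fmap[:h, :w]
    return f.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

image  = np.random.default_rng(2).standard_normal((8, 8))
kernel = np.array([[1., 0.], [0., -1.]])   # tiny edge-like kernel

fmap   = np.maximum(conv2d_valid(image, kernel), 0.0)  # convolution + ReLU
pooled = max_pool2(fmap)
print(fmap.shape, pooled.shape)   # (7, 7) (3, 3)
```

Note how few parameters are involved: the 2 by 2 kernel replaces what a fully connected node would need, namely one weight per input pixel.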
The machine-learning community uses the term compression to mean something quite different from conventional data compression. In this context, compression comprises a range of techniques that reduce the number and difficulty of computations required to generate an inference.
One such tactic is pruning. As it happens, training of deep learning networks usually results in many zeros or very small values in the weight matrices. In practice, that means there is no need to compute the input that will be multiplied by that weight, and so an entire branch can be pruned out of the data flow graph representing the inference computation. Experience has shown that if a network is pruned and then retrained, the accuracy can actually improve.
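Magnitude-based pruning can be illustrated in a few lines. The threshold value and matrix sizes below are invented for the example; real flows choose the threshold empirically and retrain afterward, as the text notes.

```python
import numpy as np

def prune(weights, threshold=0.05):
    """Zero out near-zero weights; the multiply-accumulate operations
    feeding those weights can then be skipped entirely at inference."""
    mask = np.abs(weights) >= threshold
    return weights * mask

rng = np.random.default_rng(3)
# Synthetic weight matrix with many small values, as training often produces.
w = rng.standard_normal((64, 64)) * rng.random((64, 64))
pruned = prune(w, threshold=0.5)
kept = np.count_nonzero(pruned) / w.size
print(f"{kept:.0%} of weights kept")
```

Every zeroed weight removes a branch from the inference data flow graph, which is where the computational savings come from.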
Another approach to compression is to reduce the number of bits in the weights. While data center servers are likely to keep all values in single-precision floating point, researchers have found that much lower precision for weights—as few as a couple of bits—is sufficient to achieve nearly the same accuracy as 32-bit floating point. Similarly, the outputs of the nodes, after the non-linearity is applied, may require only a few bits. This is of little help if the inference model will be executed on a server. However, it could be very helpful on an MCU, and an FPGA accelerator that can implement 2- or 3-bit multipliers very efficiently can really exploit this form of compression.
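One common form of this idea is uniform quantization: map each float weight onto a small signed integer grid. The scheme below (symmetric, per-tensor scale) is just one of several in use, and the bit width and names are illustrative.

```python
import numpy as np

def quantize(weights, bits=3):
    """Uniformly quantize float weights to a small signed integer grid.
    Returns the integer codes (what an accelerator would store) and the
    scale factor needed to reconstruct approximate float values."""
    levels = 2 ** (bits - 1) - 1                    # 3 bits -> codes in [-3, 3]
    scale = np.abs(weights).max() / levels          # per-tensor scale factor
    q = np.round(weights / scale).astype(np.int8)   # integer codes
    return q, scale

w = np.random.default_rng(4).standard_normal(1000).astype(np.float32)
q, scale = quantize(w, bits=3)
reconstructed = q * scale
max_err = np.abs(w - reconstructed).max()           # bounded by scale / 2
print(q.min(), q.max())
```

The integer codes are what make narrow hardware multipliers possible: a 3-bit multiply is far cheaper, in both silicon and energy, than a 32-bit floating-point one.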
Altogether, pruning, significance reduction, and related techniques have been shown to achieve 20- to 50-fold reductions in inference work in some cases. By themselves they may put a trained network within the reach of some edge computing platforms. When compression is not enough, a designer can turn to hardware acceleration, for which there is a growing portfolio of alternatives.
The computations required for inference are neither very diverse nor very complicated. Mainly, there are lots of sum-of-products—multiply-accumulate, or MAC—operations to multiply inputs by weights and add up the results at each node. There are also non-linear functions, such as the so-called rectified linear unit (ReLU) which simply sets all negative values to zero, hyperbolic tangent or sigmoid functions to instill nonlinearity, and max functions for pooling. Altogether, the job can look a lot like a typical linear algebra workload.
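The small repertoire of operations named above fits in a few lines; the definitions below are standard, though the explicit loop in the MAC is only to make the accumulate structure visible (real hardware and libraries vectorize it).

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)            # set all negative values to zero

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))      # squash values into (0, 1)

def mac(inputs, weights):
    """The dominant inference operation: multiply each input by its
    weight and accumulate the products into a single sum."""
    acc = 0.0
    for i, w in zip(inputs, weights):
        acc += i * w
    return acc

print(mac([1.0, 2.0, 3.0], [0.5, 0.5, 0.5]))   # 3.0
```

An accelerator's job is largely to execute millions of these MACs per inference, with the non-linear functions applied between layers.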
Hardware ideas from the supercomputing world get applied here. The easiest way is to organize the inputs, weights, and outputs as vectors and use the vector SIMD units built into large CPUs. To go even faster, designers employ the massive arrays of shading engines in GPUs. Arranging the input, weight, and output data across the GPU’s hierarchy of memories to avoid thrashing or high miss rates is far from trivial, but this has not prevented GPUs from becoming the most widely used non-CPU hardware for data-center deep learning. Recent generations of GPUs have evolved to improve their fit for the application, adding smaller data types and matrix-math blocks to supplement the floating-point shading units.
These adaptations illustrate the fundamental tactics used by acceleration hardware designers: reduce or eliminate instruction fetches and decodes, reduce data movement, employ as much parallelism as possible, and exploit compression. The trick is in doing all these things without them interfering with each other.
There are several architectural approaches to employing these tactics. Perhaps the simplest is to instantiate a large number of multipliers, adders, and small SRAM blocks on a die, and link them through a network-on-chip. This provides the raw resources for executing inferences, but it leaves the crucial challenge of getting data efficiently to and from the computing elements up to the programmer. Such designs are descendants of the many massively parallel computing chips of the past, all of which foundered on the shoals of inscrutable programming challenges.
Chips such as Google’s Tensor Processing Unit (TPU) take a more application-informed approach by organizing the computing elements in accord with the inherent structure of deep-learning networks. Such architectures view the input weight multiplications of the network as very large matrix multiplications, and create hardware matrix multipliers to carry them out. In the TPU, the multiplication gets done in a systolic array of multipliers, in which operands flow naturally through the array from unit to unit. The array is surrounded by buffers to feed in activation and weight values, and followed by activation-function and pooling hardware.
By organizing the chip to more-or-less automatically do matrix operations, the TPU relieves programmers of the need to schedule the movement of data through the computing elements and SRAMs at a detailed level. Programming becomes almost as simple as grouping the inputs and weights into matrices and pushing the button.
But therein lies a question. As we noted above, pruning can result in very sparse matrices, and simply feeding these into a device like the TPU will result in a lot of meaningless multiplications and additions. It may be necessary for the compression phase of model development to reorder these sparse matrices into much smaller, densely packed matrices in order to exploit the hardware.
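One simple version of this repacking is to drop rows and columns of the weight matrix that pruning has zeroed out entirely. The sketch below is illustrative (names and the tiny matrix are invented); real repacking schemes are more elaborate, since useful sparsity rarely falls neatly into whole rows and columns.

```python
import numpy as np

def pack_dense(w):
    """Drop all-zero rows and columns from a pruned weight matrix so the
    remaining dense block maps onto a matrix-multiply accelerator.
    Also returns the surviving row/column indices, which are needed to
    route inputs and outputs to the right positions."""
    rows = np.flatnonzero(np.abs(w).sum(axis=1))   # rows with any nonzero
    cols = np.flatnonzero(np.abs(w).sum(axis=0))   # columns with any nonzero
    return w[np.ix_(rows, cols)], rows, cols

w = np.array([[0., 2., 0., 0.],
              [0., 0., 0., 0.],
              [0., 1., 0., 3.]])
packed, rows, cols = pack_dense(w)
print(packed.shape)   # (2, 2)
```

The accelerator then multiplies the smaller dense block at full efficiency, and the saved index lists undo the permutation on the way in and out.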
A third approach models the inference task not as a series of matrix multiplications, but as a data flow graph. The accelerator is architected as a data-flow engine with data entering at one side, flowing through a graph-like network of processing elements via configurable links, and emerging to become the outputs. Such accelerators can be configured to perform only the operations necessary to the pruned network.
Once an architecture is chosen, the next question is implementation. Many architectures originate in FPGAs during development, for cost and schedule reasons. Some will stay there—for instance if deep-learning network models are expected to change so much that one accelerator design can’t handle all the changes. But if model changes will be minor—different arrangements of layers and changing weights, for instance—an ASIC or CPU-integrated accelerator may be preferred.
That brings us back to edge computing and its constraints. If the machine-learning network is going to execute in a bank of servers, server CPUs, GPUs, FPGAs, and large ASIC accelerator chips are all viable options. But if the execution has to happen in a more constrained environment—a factory floor machine, a drone, or a camera, for instance—a small FPGA or ASIC will be necessary. Intel, for example, has recently announced an ASIC accelerator chip packaged in a memory-stick format to plug into a PC’s USB port.
For small deep-learning models in extremely constrained environments such as a handset, a low-power ASIC or an accelerator block built into the application processor SoC may be the only option. While so far these constraints have tended to push designers toward simple multiplier arrays, the excellent energy efficiency of neuromorphic designs may make them very important for the next generation of deeply embedded accelerators.
Whatever the case, machine learning is not just for data centers any more. Inferencing is moving to the edge. As researchers move beyond today’s conventional deep learning networks into concepts such as continuous unsupervised learning—potentially able to adapt quickly to novel environments, but also potentially enormously greedy for resources—the problem of machine learning at the edge promises to be at the cutting edge of architectural development.
For Further Reading:
Get an overview of deep learning acceleration on FPGAs.
Explore the Intel Neural Compute Stick 2, a tiny but formidable machine learning accelerator.