Deep-learning networks have won. They have outscored humans in classifying still images—at least sort of. They have defeated world champions at chess and Go. They have become the tool of choice for big-data analysis challenges, from customer service to medical diagnosis. They have shown that, once trained, they can be compact enough to fit in a smartphone. So have we reached the end of history for artificial intelligence (AI)? Or is this just the crest of one wave in a much larger ocean?
One answer to that question might come from the plethora of other approaches to AI now jostling for attention. Granted, there are always alternatives to a successful technology, if only because every PhD candidate has to find something unique to write about, and every patent has to be circumvented. But many of the alternatives to deep-learning networks today have grown over time out of real issues with conventional static networks like AlexNet. Many are solving real problems and showing up in production systems. If you are designing a system that incorporates machine intelligence, look before you leap (Figure 1).
The Broad Range
Before we look at some specific learning systems, let’s take a brief detour into theory. To engineers, a deep learning network might look like a digital emulation of what we think we know about living neurons. But to a mathematician, the networks look like graph representations of classical optimization problems that have centuries of history.
Imagine a 100 percent accurate network for image classification. Actually, it’s easy. Just build a giant look-up table with one location for every possible combination of pixel values in the input image. To train the network, all you have to do is fill each location with the label that describes that particular image. For very constrained tasks like low-resolution optical character recognition, systems get built exactly this way. But for, say, HD camera images, the table would be impossibly large and wasteful.
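To make the look-up-table idea concrete, here is a minimal sketch for 3×3 binary images, where the table is small enough to enumerate. The labels are invented for illustration, not drawn from any real OCR system:

```python
# Sketch: an exhaustive look-up-table "classifier" for 3x3 binary images.
# With 9 binary pixels there are only 2**9 = 512 possible inputs, so the
# whole table fits in memory. The labels are invented for illustration.
import itertools

def image_key(pixels):
    """Pack a tuple of 0/1 pixels into a single table index."""
    key = 0
    for p in pixels:
        key = (key << 1) | p
    return key

# "Training" is just filling the table: here an image is labeled "one"
# if exactly the middle column is set, and "other" otherwise.
table = {}
for pixels in itertools.product((0, 1), repeat=9):
    label = "one" if pixels == (0, 1, 0, 0, 1, 0, 0, 1, 0) else "other"
    table[image_key(pixels)] = label

def classify(pixels):
    # Inference is a single table look-up: 100 percent "accurate" by design.
    return table[image_key(pixels)]
```

For an HD camera frame the same table would need one entry per possible image, an astronomically large number, which is exactly why this construction stops at very constrained tasks.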
So here is the mathematicians’ point. A deep learning network is just a way to reduce the size of the table and the number of connections by adding more hidden layers. How many layers you add and how the locations in each layer connect to those in the previous layer are today matters of art and opinion—nobody knows how to optimize them. So everyone starts out with a network topology familiar to themselves or their team, or included in the platform they are using. But the weights assigned to each connection in each node can be partially optimized mathematically, by propagating the desired answers back through the network and applying an algorithm like gradient descent to find weights that minimize the error in the network output.
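As an illustrative sketch of that last step, here is gradient descent fitting the weights of a single sigmoid neuron on a toy task. This is a minimal stand-in for back-propagation through a full network; the AND-gate task, learning rate, and iteration count are arbitrary choices:

```python
# Sketch: gradient descent on the weights of a single sigmoid neuron,
# a minimal stand-in for back-propagation through a full network.
# The AND-gate task, learning rate, and iteration count are arbitrary.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0.0, 0.0, 0.0, 1.0])        # desired answers (logical AND)

rng = np.random.default_rng(0)
w = rng.normal(size=2)                    # initial guess at the weights
b = 0.0
lr = 0.5

for _ in range(2000):
    out = sigmoid(X @ w + b)              # forward pass
    err = out - y                         # error signal, propagated back
    w -= lr * (X.T @ err)                 # gradient step on each weight
    b -= lr * err.sum()

pred = (sigmoid(X @ w + b) > 0.5).astype(int)
```

In a deep network the same error signal is chained backward through every hidden layer; the recipe per weight is unchanged.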
That process is just one from a huge family of optimization techniques used to minimize the error in an estimate. Others are as familiar as linear regression, or as esoteric as multinomial naïve Bayesian learning. The many techniques have different algorithms, requirements, and capabilities. But they are all essentially ways to map a very large number of possible inputs into a much simpler set of outputs, minimizing the number of connections in a graph of the computation and the error in the result. Which algorithm you choose may depend as much on your training and your surroundings as it depends on the suitability of the algorithm for your particular problem.
Within this huge space, most of the recent excitement has focused on just one small family: deep-learning networks. And much of that has focused on just one task, classification of static data, and just one subset of deep learning: convolutional neural networks (CNNs, Figure 2). But the limitations of CNNs are causing a migration away from this existing practice, toward novel implementations of CNNs, toward other kinds of networks, and toward other kinds of analysis tools. This migration will expand the range of techniques of which system architects should be aware, and it will increase the risk of having a design tied to the wrong hardware.
One example of this migration comes from just exploring the math that goes on inside a conventional CNN. When networks are implemented in software, it is common to represent the weights on the connections as floating-point numbers. This practice carries over to implementations on hardware accelerators such as graphics processing units (GPUs) and ASICs. But a lot of normalization goes on in the hidden layers of deep networks. One might wonder whether one really needs 32-bit floating point for weights that are just going to get added up and normalized to 1.
And in fact, recent work has shown that networks using ternary (-1, 0, +1) or binary weights can perform essentially as well as the same topology using floating-point weights. This discovery dispenses with a lot of multiplies and additions—if the underlying hardware is flexible enough to take advantage of it.
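A minimal sketch of the idea, assuming a simple magnitude threshold for quantization (published ternary-weight schemes choose the threshold and scale factors more carefully):

```python
# Sketch: quantizing trained floating-point weights to ternary {-1, 0, +1}.
# The 0.3 threshold is an arbitrary illustrative choice; published schemes
# pick thresholds and scale factors more carefully.
import numpy as np

def ternarize(w, threshold=0.3):
    q = np.zeros(w.shape, dtype=int)
    q[w > threshold] = 1
    q[w < -threshold] = -1
    return q

def ternary_dot(x, q):
    # With ternary weights the "multiply" disappears: add the inputs whose
    # weight is +1, subtract those whose weight is -1, skip the zeros.
    return x[q == 1].sum() - x[q == -1].sum()

w = np.array([0.9, -0.05, -0.7, 0.2])    # pretend these are trained weights
q = ternarize(w)
x = np.array([2.0, 5.0, 3.0, 4.0])
result = ternary_dot(x, q)
```

The hardware payoff is in `ternary_dot`: the multiply-accumulate units a floating-point network needs collapse into adders and subtractors.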
Another example concerns the topology of the network. Before training, the topology of a deep-learning network—how the nodes in one layer are connected to the nodes in the next layer—starts out as an assumption based on researchers’ experience. For researchers, there is considerably greater risk in having too few connections than there is inconvenience in having too many.
As you train the network, the connections stay fixed, but some weights go to zero, and others are rendered irrelevant by the weights on other nearby connections or downstream from them. Thus training doesn’t remove connections, but it makes some of them unable to influence the network’s output. As training progresses, more and more connections become irrelevant, until the network in effect becomes quite sparse. Research suggests that computing inferences with such sparse networks can benefit from the same techniques used in sparse-matrix computations. The algorithms, and acceleration hardware, may be quite different from those you would use if you naively assumed every weight had to be optimized during training and applied during inference.
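A sketch of the payoff: after pruning the near-zero weights, a layer's matrix-vector product touches only the surviving connections, the basic move of sparse-matrix codes. The 64×64 layer and the pruning cutoff here are illustrative:

```python
# Sketch: computing a layer's output from only the surviving (nonzero)
# weights, the basic move of sparse-matrix codes. The 64x64 layer and
# the pruning cutoff are illustrative.
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(64, 64))
W[np.abs(W) < 1.5] = 0.0                 # "training" left most weights irrelevant

# Store only the surviving connections as (row, column, value) triples.
rows, cols = np.nonzero(W)
vals = W[rows, cols]

def sparse_matvec(rows, cols, vals, x, n_rows):
    # Accumulate each surviving weight's contribution; zeros never touched.
    y = np.zeros(n_rows)
    np.add.at(y, rows, vals * x[cols])
    return y

x = rng.normal(size=64)
dense_out = W @ x
sparse_out = sparse_matvec(rows, cols, vals, x, 64)
```

The result matches the dense product exactly, but the storage and the arithmetic scale with the surviving weights rather than with every possible connection.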
Memory Gets a Role
So far we have discussed tweaks in computing algorithms for deep-learning networks. But there are problems that require a different kind of network altogether. Static networks expect an input to hold still while they analyze it. And they assume that the next input will be unrelated to the current one. That is great for putting labels on a stack of photographs. But what if the task is to extract meaning from recorded speech? Or what if we are processing video streams to decide how to pilot a car?
In these cases, interpretation of one frame of data depends very much on what has gone before. And unsurprisingly, researchers report getting the best results with neural networks that have memory.
The simplest form of a neural network with memory is called a recurrent neural network (RNN, Figure 3). Basically, this is just a simple network—as simple as just one hidden layer—in which some of the output is fed back into the inputs. The RNN conditions its inferences about the current frame of input on its inferences, and perhaps its hidden-layer results, from the previous frame. Since we expect the real world to favor us with a degree of persistence and causality, this is a very reasonable approach for interpreting streams of real-world data.
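A minimal sketch of one recurrent step, with arbitrary sizes and untrained random weights: the new hidden state mixes the current input frame with the state carried over from the previous frame:

```python
# Sketch: one step of a minimal recurrent network. The sizes are arbitrary
# and the weights are untrained random values, for illustration only.
import numpy as np

rng = np.random.default_rng(2)
n_in, n_hidden = 3, 4
W_xh = rng.normal(scale=0.5, size=(n_hidden, n_in))      # input -> hidden
W_hh = rng.normal(scale=0.5, size=(n_hidden, n_hidden))  # the feedback path

def rnn_step(x, h_prev):
    # The new hidden state mixes the current frame with the previous state.
    return np.tanh(W_xh @ x + W_hh @ h_prev)

h = np.zeros(n_hidden)                   # no history before the first frame
for x in [np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])]:
    h = rnn_step(x, h)                   # state persists from frame to frame
```

The feedback matrix `W_hh` is the whole difference from a static network: remove it and each frame is classified in isolation.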
Sometimes, though, you want a longer memory. For video, all the information that could improve your inference about the current frame is probably contained in the immediately previous frame or two. But in conversation or written text, the meaning of a word may be conditioned by a phrase that went by several sentences ago, or by the book’s introduction back in the pages with Roman numerals.
For these situations, researchers developed long short-term memory (LSTM) networks. LSTM networks are a subset of RNNs with a more complex internal structure that allows selected data and inferences to recirculate for long periods of time. Natural-language processing seems to be one promising application, but the addition of slowly evolving context to the sensor fusion for autonomous vehicles may be another.
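A sketch of a single LSTM step, showing the gate structure that lets selected information recirculate. All weights here are random placeholders, not trained values:

```python
# Sketch: a single LSTM step, showing the gates that let selected
# information recirculate. All weights are random placeholders.
import numpy as np

rng = np.random.default_rng(3)
n_in, n_h = 3, 4

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One weight matrix per gate, acting on [input, previous hidden] together.
Wf, Wi, Wo, Wc = (rng.normal(scale=0.5, size=(n_h, n_in + n_h)) for _ in range(4))

def lstm_step(x, h_prev, c_prev):
    z = np.concatenate([x, h_prev])
    f = sigmoid(Wf @ z)                  # forget gate: what to drop
    i = sigmoid(Wi @ z)                  # input gate: what to store
    o = sigmoid(Wo @ z)                  # output gate: what to expose
    c = f * c_prev + i * np.tanh(Wc @ z) # cell state: the long-term memory
    h = o * np.tanh(c)
    return h, c

h, c = np.zeros(n_h), np.zeros(n_h)
h, c = lstm_step(np.array([1.0, 0.0, -1.0]), h, c)
```

The cell state `c` is the long memory: so long as the forget gate stays near 1 for a given element, that element can survive unchanged across many frames.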
An interesting point about RNNs is that, since they are a form of traditional neural networks, they can in principle be trained by back-propagation of the desired output, and gradient-descent calculation of optimal weights. But there are some issues. To account for the feedback loops, some researchers unroll the RNN into a series of connected conventional networks—much like loop unrolling in software optimization. This allows parallelization of a task that still requires a lot of computation. And since you are training the network to respond to sequences rather than static events, the training data must be sequences.
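A sketch of the unrolled view: the forward pass over a sequence becomes a chain of identical layers, one per time step, all sharing one set of weights, and the saved hidden states are what back-propagation-through-time walks backward over. Sizes and weights are illustrative:

```python
# Sketch: the "unrolled" view of a recurrent network. Each time step becomes
# one copy of the same layer, all sharing one set of weights; the saved
# hidden states are what back-propagation-through-time walks backward over.
# Sizes and weights are illustrative.
import numpy as np

rng = np.random.default_rng(4)
W_xh = rng.normal(scale=0.5, size=(4, 3))
W_hh = rng.normal(scale=0.5, size=(4, 4))

def unrolled_forward(xs):
    h = np.zeros(4)
    states = [h]
    for x in xs:                         # one "layer" per element of the sequence
        h = np.tanh(W_xh @ x + W_hh @ h)
        states.append(h)                 # keep every state for the backward pass
    return states

sequence = [rng.normal(size=3) for _ in range(5)]   # training data is a sequence
states = unrolled_forward(sequence)
```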
Hence training RNNs is not easy. Training times can become enormous. Network weights can fail to converge. The trained network may prove unstable. Some researchers have used optimization techniques other than simple gradient descent, such as using higher-order derivatives of the way the weights influence the error signal, or resorting to extended Kalman filters to estimate appropriate weights. None of these techniques is guaranteed, and the literature suggests that RNN training works best when it is carefully guided by humans with long experience. This observation does not bode well for some of the longer-term goals of machine learning, such as unsupervised or continuous learning modes.
Beyond RNNs
Some applications respond well to a different approach. While RNNs retain some selected data from previous cycles, there are other types of networks that can access large pools of RAM or of content-addressable memory. We might try a human analogy here. Most people have selective and rather abstract memories. Asked to recall a scene or event in literal detail, we usually do poorly. But a rare few individuals remember details exactly, down to the arrangement of every item in a room or a long random sequence of large numbers. This gives these individuals the ability to solve problems that most of us can’t, even though they may have no greater reasoning skills or reading comprehension than anyone else. Similarly, a network that can write or read a large memory block or file system can exactly record large data structures and draw upon them as additional input.
These networks may have a hybrid, or even very un-CNN-like structure. Hidden layers may perform explicit read and write operations on the memory, for example. This in effect makes the network into a complicated way of designing a large state machine, or even a Turing machine: universal, but not necessarily efficient or comprehensible. And again, the presence of large, alterable data structures makes training a potential mess.
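As a heavily simplified sketch of the read side of such a network, in the spirit of content-addressable designs such as the Neural Turing Machine: a key emitted by a hidden layer selects memory rows by softmax attention. The memory contents, sizes, and sharpness factor are invented for illustration:

```python
# Sketch: the read half of a network with a large external memory, in the
# spirit of content-addressable designs such as the Neural Turing Machine,
# heavily simplified. Memory contents, sizes, and the sharpness factor are
# invented for illustration.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

memory = np.array([[1.0, 0.0, 0.0],      # each row is one stored record
                   [0.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0]])

def read(key):
    # A hidden layer would emit the key; attention weights score how well
    # it matches each row, and the read-out is the weighted blend of rows.
    weights = softmax(memory @ key * 10.0)
    return weights @ memory

out = read(np.array([0.0, 1.0, 0.0]))    # retrieves (mostly) the second row
```

Because the read is a differentiable blend rather than a hard address, the whole read path can, in principle, be trained by the same gradient methods as the rest of the network—which is also why training such systems can become the mess the text describes.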
At their root, the neural networks we have looked at so far are really pretty simple: from a few to lots of layers, relatively few connections between layers (at least compared to a fully-connected network), all signals flowing in the same direction from input to output, and every node in a layer having the same function. Only the weights on the connections get adjusted in training. RNNs complicate this picture only by allowing connections that go backward, toward the inputs, as well as forward.
These simple structures are not very much like the neurons that physiologists find in mammalian brains. In a mammalian cortex, each neuron typically has around ten thousand connections. The connection scheme appears to change over time, perhaps with learning. And the neurons can have many different functions—dozens of kinds of neuron cells have been observed. Signaling is by pulse codes rather than by voltage levels.
A growing movement is trying to get away from the simplified function-graph picture of neural networks and closer to the biological reality. These neuromorphic networks live in the compromise between the staggering complexity of biology and the limitations of software simulations and microcircuits. Accordingly, they offer rich interconnect, but still far less than ten thousand axons per neuron. They offer many different models of neuron behavior, and some level of programmability in both interconnect routing and functions, as well as weights. They can be simulated in software, but performance and size generally limit the complexity of networks you can explore without hardware acceleration. Hardware implementations have ranged from digital ASICs and FPGAs to mixed-signal and purely analog custom chips.
Many neuromorphic networks exhibit not only far greater connectivity than conventional deep learning networks, but also more variety of interconnect: connections can run not just forward to the next layer, but also within a layer or back into previous layers. In their pure form, neuromorphic networks dispense with the idea of layers altogether, becoming a three-dimensional sea of neurons with near-arbitrary connectivity, at least in the local neighborhood.
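A sketch of the pulse-coded signaling, using a leaky integrate-and-fire neuron, one of the simpler models in the neuromorphic repertoire. All constants are illustrative:

```python
# Sketch: a leaky integrate-and-fire neuron, one of the simpler neuron
# models used in neuromorphic work. It signals with pulses, not levels:
# the membrane potential leaks, integrates its input, and emits a spike
# when it crosses a threshold. All constants are illustrative.
def simulate_lif(input_current, leak=0.9, threshold=1.0, steps=50):
    v = 0.0
    spike_times = []
    for t in range(steps):
        v = leak * v + input_current     # leaky integration of the input
        if v >= threshold:               # fire...
            spike_times.append(t)
            v = 0.0                      # ...and reset
    return spike_times

spikes = simulate_lif(0.3)               # stronger input -> faster spiking
```

Information is carried in the timing and rate of the spikes, not in a voltage level, which is why neuromorphic hardware can look so different from the multiply-accumulate arrays that serve CNNs.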
Neuromorphic networks have demonstrated image- and speech-recognition accuracy similar to that of conventional CNNs like AlexNet. And IBM has claimed that implementations on its TrueNorth neuromorphic ASICs achieve many times the frames per second per watt of AlexNet implemented on an Nvidia Tesla P4.
Given the much richer interconnect, the possibility of feedback, and the wide range of neuron functions available in neuromorphic networks, training may be even more of an art than it is for RNNs. Yet training may also be where the idea’s greatest promise lies. There is some indication that these networks can go well beyond the supervised-learning model used in ordinary back propagation. Typical supervised training today is a separate phase in which the trainer must supply the correct response for each training input. Training, often an exhausting process, must be completed before the network can be used, and the quality of training largely limits the accuracy of the network.
There are other kinds of training, though. Trainers may just give hints, not answers: only reinforcing the network when it moves toward desirable outputs instead of back-propagating the correct answers for every input frame. Or the network may learn on its own by optimizing some statistical function that defines useful behavior. An example of such unsupervised learning would be a network that takes in live video of a robot arm and outputs commands to the arm motors. At the beginning of training the network would be sending random commands to the motors and watching the spasmodic results. But as it learned to correlate its outputs with the observed results, it could converge on a near-optimal control algorithm.
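A toy sketch of hint-driven learning: the trainer supplies only a scalar score, never the correct output, and the learner keeps the random weight tweaks that raise it. This is a crude stand-in for real reinforcement methods; the hidden target and step size are invented for illustration:

```python
# Sketch: learning from a hint. The trainer never supplies correct outputs,
# only a scalar reward; the learner keeps random weight tweaks that raise
# it. A crude stand-in for real reinforcement methods; the hidden target
# and step size are invented for illustration.
import numpy as np

rng = np.random.default_rng(5)
target = np.array([0.5, -0.25, 1.0])     # unknown to the learner except via reward

def reward(w):
    # The hint: higher is better (here, closeness to the hidden target).
    return -np.sum((w - target) ** 2)

w = np.zeros(3)
best = reward(w)
for _ in range(500):
    candidate = w + rng.normal(scale=0.1, size=3)   # try a random tweak
    r = reward(candidate)
    if r > best:                         # reinforce only movements that help
        w, best = candidate, r
```

The robot-arm example in the text works the same way at much larger scale: the observed results of the arm's motion play the role of the reward signal.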
Unsupervised learning offers the obvious promise of freeing human operators from the very complex task of conducting supervised training: assembling a set of training inputs—often tens of thousands of them for image recognition—assigning the correct tags to each input, and then feeding them one by one into the network. But unsupervised learning also holds out other promises: the network may learn algorithms that the humans are unaware of, and may not understand. It may discover classification schemes not used by humans. And with adequate safeguards, the network can learn continuously after it is deployed, continually improving and adapting to novel inputs. This is one of the golden goals of machine learning.
Too Much Choice
From straightforward statistical analyses a human could do with graph paper, to simple artificial networks like CNNs, to networks with internal or external memory, to potentially inscrutable neuromorphic systems, designers with a problem to solve are faced with a wide range of possible techniques. But there are few criteria for picking the right one, beyond what the designer herself did in graduate school. And it is often hard to escape the nagging suspicion that the result of designing and training the network will be no better than what could have been done with a well-known statistical analysis algorithm.
Unfortunately, there seems to be growing evidence that, on a given problem, different techniques will not give more or less equal results. They may have different accuracies for a given level of training, different energy consumption at a given speed, very different hardware requirements if acceleration is required, and very different training needs. It is important at the outset to be sure you are not trying to solve a problem that already has a known, deterministic solution. And especially if the implementation requires hardware acceleration, it is vital to find an approach that leaves as much algorithmic flexibility as possible for as long as possible, and to include in the design team or its partners as wide a swathe of experience as possible. The game is not yet won. It is just beginning.