In late January the computing world passed, mostly without notice, two remarkable milestones. One marked the end of a beginning: Marvin Minsky, a pioneer and guiding light of artificial intelligence (AI), died. The other standing stone marked a beginning, perhaps of a new era. Only days after Minsky’s death, an article in the journal Nature reported that a computer had defeated European Go champion Fan Hui in five of five formal games. AI, left for dead by the roadside in the 1970s, has returned (Figure 1), triggering renewed research, publicity-grabbing demonstrations, waves of fear among the robotophobic, and a major rethinking of some categories of system designs.
Perhaps we should pause for a definition. The canonical description of AI is the Turing Test: roughly, “I can’t define it, but I’ll know it when I don’t see it.” Or, less formally, AI enables a system without human intervention to do tasks we normally associate with living things.
Whichever definition you prefer, AI has had a rollercoaster existence. Excitement peaked in the 1960s, when research teams like Minsky’s at MIT first showed mainframe computer software parsing natural-language text or recognizing objects with cameras and manipulating them with robotic arms. Then progress stalled, and for over a decade little seemed to be happening.
There was another flurry of interest in the 1980s around ideas such as expert systems, fuzzy logic, and neural networks. But this wave also faded, as the systems it produced failed to scale or generalize.
Today we are in yet another wave. New results on fronts as diverse as playing human games, identifying photos of objects, showing awareness of situations, and controlling autonomous vehicles are all showing promise. Will this time be different?
To answer that question we need to look past labels, into algorithms. From this perspective we can see the history of AI as the interplay of three big ideas: rule-based systems, neurobiology, and massively parallel search (Figure 2).
Rule-based systems take the intuitively obvious approach to AI: if you want a system to perform a task, give it a set of rules to follow. The rules are often simple: if X is true, then do Y. From this simple form you can build quite complex contingency trees. And such structures have proved quite effective at solving certain kinds of problems, such as simple games, classification based on pre-defined features, manipulating formal logic, or ensuring that the patterns in an IC design are compatible with the process technology.
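To make the "if X is true, then do Y" form concrete, here is a minimal sketch of a rule-based classifier working on pre-defined features. The rules, feature names, and labels are all invented for illustration and come from no particular system.

```python
# Hypothetical rules for illustration: each rule pairs a condition with an action.
RULES = [
    (lambda f: f.get("legs") == 0 and f.get("has_fins", False), "fish"),
    (lambda f: f.get("legs") == 2 and f.get("has_feathers", False), "bird"),
    (lambda f: f.get("legs") == 4, "mammal"),
]


def classify(features, rules=RULES, default="unknown"):
    """Return the action of the first rule whose condition holds."""
    for condition, action in rules:
        if condition(features):
            return action
    return default


label = classify({"legs": 2, "has_feathers": True})
```

Every behavior of such a system traces back to a rule someone wrote down, which is both its appeal and, as the next paragraphs argue, its limit.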
But these are all problems about which humans think at a conscious level. If asked, we can show our work. There are many tasks, including perception, judgement, awareness, or intuition, for which we are unaware of our thinking processes. “Rules come from smart people,” explains Intel fellow Pradeep Dubey. “But our understanding of our own reasoning is extremely shaky.”
Try to imagine a set of rules that could determine in any context that a group of pixels represented your mother’s face. Intuitively it seems there must be one. Yet the first wave of AI came up against such problems and simply ran out of ideas, and out of computing power. Still, the approach is so intuitive that many believed eventual success was just a matter of more rules, more memory, and more MIPS.
At about the same time—the mid 1960s—that Minsky and others were showing startling early results with rule-based systems, a new big idea emerged from an entirely different source. Neurobiologists began to unravel the fine structure of neuron cells, and they began to model the baffling tangles of neuron bodies, dendrites, and synapses not as living cells or electrochemical exchanges, but as electronic networks.
The idea may have had limited use for biologists, but it triggered an explosion in AI. A network model of a neuron, simplified to be manageable in the mainframes then available, became a topic of intense study and countless thesis projects, studied by, among many others, Minsky. This Perceptron, as the most popular model was called, had far fewer connections than a real brain neuron. It “learned” by adjusting weighting coefficients on inputs to a simple non-linear aggregator, where real neurons appear to adapt by growing new connections and using a complex portfolio of time-dependent functions.
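The learning scheme described here, adjusting input weights on a simple non-linear aggregator, can be sketched in a few lines. This is a textbook-style perceptron update on a toy task (logical AND), not a reconstruction of any historical program. The function names, learning rate, and epoch count are assumptions chosen for illustration.

```python
def predict(weights, bias, inputs):
    """Weighted sum of the inputs passed through a hard threshold."""
    total = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > 0 else 0


def train(samples, epochs=20, rate=1.0):
    """Nudge each weight in proportion to the error on every sample."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)
            weights = [w + rate * error * x for w, x in zip(weights, inputs)]
            bias += rate * error
    return weights, bias


# Toy training set: the logical AND of two binary inputs.
AND_SAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train(AND_SAMPLES)
```

A single unit like this can only separate classes with a straight line, which is one reason researchers turned to networks of many units.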
Despite these simplifications, researchers found that even a small number of Perceptrons could work together and could be trained to perform impressively on simple object-recognition and cognitive tasks. At that point Perceptron fans, like researchers on rule-based systems, ran out of computing power. But they had achieved a strong suspicion that really big networks of Perceptrons (called neural networks, when biologists were out of earshot) could, given enough computing power, exceed rule-based systems on poorly understood AI tasks.
So matters remained until the 1980s, with AI variously ignored or reviled for failing to deliver on over-grand promises. But in the ’80s a new wave of optimism, born in Moore’s Law, washed a surge of venture investment across the industry. And researchers once again began to dream of AI.
Rule-based systems were reborn in the guise of expert systems: frameworks that helped human interviewers to capture how subject-matter experts believed they worked out problems, and to abstract those beliefs into rules. Neural-network researchers constructed larger, more complex networks and confirmed that with a lot more computing power, feats like real-world machine vision might just work. A related hybrid idea, fuzzy logic, showed some promise in control systems. But once again progress reached a plateau, and the industry’s attention drifted away.
The next big idea to impact AI came from an unexpected direction: Internet search engines. The need for effective search tools for the enormous number of Web pages coincided with economics that made massive data centers workable. In this environment evolved a fundamental, three-tiered structure for massively parallel search (Figure 3).
The top tier represents construction of a giant data pool. The pool is built continuously by spiders that explore the Web, capturing searchable data and easily identifiable keys, and loading these into a basically unstructured pool. The second tier filters the massive data set for relevance. When a query arrives, this tier constructs a filter that identifies pages that might have some relevance, based on easily accessible features like metadata and patterns in the text. This filter is necessarily optimized for speed and inclusiveness. The filter is dispatched to a huge number of servers, each of which is assigned a chunk of the page data pool. From these potentially tens of thousands of servers you get probably thousands of potentially relevant pages.
Very few search users would appreciate an unordered dump of a thousand vaguely qualified pages, so there needs to be one more tier: page ranking. Here, code apparently based on a combination of rules (some from search experts and some learned from the user’s previous click behavior) rank-orders the candidate pages, producing the listings you get on your screen. Developers are also applying neural networks to the ranking problem, but the mix of rules vs. networks is a closely guarded secret.
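The three tiers can be sketched on a single machine. The page pool, the word-overlap filter, and the scoring function below are all invented stand-ins. Real engines shard the pool across many servers and use far richer, proprietary ranking.

```python
# Tier 1: a tiny, made-up stand-in for the crawled data pool.
POOL = {
    "page-a": "go is a board game played on a grid",
    "page-b": "chess engines search game trees",
    "page-c": "recipes for board shaped cookies",
}


def tier2_filter(pool, query):
    """Fast, inclusive filter: keep any page that shares a word with the query."""
    terms = set(query.lower().split())
    return {name: text for name, text in pool.items()
            if terms & set(text.split())}


def tier3_rank(candidates, query):
    """Order candidates by how many query terms each one contains."""
    terms = set(query.lower().split())

    def score(item):
        return len(terms & set(item[1].split()))

    ranked = sorted(candidates.items(), key=score, reverse=True)
    return [name for name, _ in ranked]


def search(query):
    return tier3_rank(tier2_filter(POOL, query), query)


results = search("board game")
```

The filter is deliberately generous and cheap, and the ranking step does the more careful work on the much smaller candidate set.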
It didn’t take long for clever people to recognize that with the right filtering and ranking algorithms, this three-tiered structure was capable of very intelligent-looking behavior. And a step further: search algorithms could be very good at playing some types of games.
Consider tic-tac-toe, for example. A simple algorithm could construct a data pool listing every legal game, move by move. That would be tier 1. Then as you play, you can query the pool for all the legal games that include the current state of the board using a tier-2 filter. Finally, a tier-3 ranking engine could select a game that leads to a win for you. Now you know your next move.
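The tic-tac-toe machine described above is small enough to build outright. The sketch below follows the three tiers literally, with two simplifications that are assumptions of this example. The filter matches games by move order rather than by board state, and the ranking naively picks the first stored game the player wins, ignoring how the opponent might actually reply. All function names are hypothetical.

```python
WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
             (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]


def winner(board):
    """Return 'X' or 'O' if a line is complete, else None."""
    for a, b, c in WIN_LINES:
        if board[a] is not None and board[a] == board[b] == board[c]:
            return board[a]
    return None


def build_pool():
    """Tier 1: list every legal game as (moves, winner); winner None is a draw."""
    pool = []

    def play(board, moves):
        won = winner(board)
        if won is not None or len(moves) == 9:
            pool.append((tuple(moves), won))
            return
        player = "XO"[len(moves) % 2]  # X always moves first
        for square in range(9):
            if board[square] is None:
                board[square] = player
                moves.append(square)
                play(board, moves)
                moves.pop()
                board[square] = None

    play([None] * 9, [])
    return pool


def tier2_filter(pool, played):
    """Tier 2: keep only the games that begin with the moves played so far."""
    played = tuple(played)
    return [game for game in pool if game[0][:len(played)] == played]


def tier3_pick(candidates, played, me):
    """Tier 3: return the next move of the first candidate game that `me` wins."""
    for moves, won in candidates:
        if won == me and len(moves) > len(played):
            return moves[len(played)]
    return None


pool = build_pool()
played = [4, 0]  # X took the center, O took a corner
next_move = tier3_pick(tier2_filter(pool, played), played, "X")
```

Squares are numbered 0 to 8, row by row. A stronger ranking tier would weigh every candidate game rather than trusting the first win it finds.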
IBM used this structure, in a more complex way, to create the Jeopardy-champion computing system Watson. Jeopardy is a nearly formal game with many similarities to search. So unsurprisingly, Watson fits our three-tiered model rather well.
At tier 1, human experts selected several categories of web pages (all of Wikipedia, for example) and fed them into Watson to be digested into a massive data pool. They created a second, filtering tier that could pick candidate pages based on the presence of key words from a clue and on a semantic analysis of the clue’s structure. For example, is the clue asking for one particular instance of a category, like the tenth king of France? Or is it presenting a pun? Finally, candidate nuggets of information from the filters were ranked for exact fit to the clue. In a competition under the actual conditions of a televised Jeopardy game (except for the exclusion of a few types of clues for which Watson’s designers could not devise rules) Watson defeated two past human champions. To give an idea of the relatively modest scale, the triumphant Watson used over 2,500 processor cores spread across 90 servers, running Apache Unstructured Information Management Architecture (UIMA) and Hadoop: by no means a large system by today’s standards.
More traditional games present a different sort of challenge. For example, chess can be approached much like tic-tac-toe. But generating all the possible chess games in a data set is infeasible. So IBM’s Deep Blue, the chess-playing system that finally defeated grandmaster Garry Kasparov in a 1997 rematch, followed the same tiered structure as our hypothetical tic-tac-toe machine. But instead of an enormous data set of possible moves, Deep Blue used dedicated hardware to generate possible moves from the current position on the fly. Think of it as an on-demand tier 1.
As it generated moves, Deep Blue evaluated them by playing all legal moves forward. Software on the master CPU of the system generated the thousand or so plausible sequences of the next four moves and evaluated them. The sequences that weren’t obvious blunders were then mapped to the remaining CPUs in the system. (The 1997 version of Deep Blue that defeated Kasparov comprised 30 RS/6000 CPUs, each attaching 16 chess-processing ASICs.) Each CPU started with its assigned sequences and generated continuing play for four more moves, evaluating each new sequence. Deep Blue was now looking at all plausible sequences of the next eight moves.
For these eight moves, the analysis was done in software, allowing IBM’s chess experts to change algorithms even during a match. An early adaptation allowed the software to follow particularly promising sequences all the way to the end of the game. The rest of the eight-move sequences, perhaps a million, were dispatched to the hardware chess chips for another four moves of play and analysis. Finally, in what we are calling tier 3, the scoring of all the 12-move sequences and complete play-throughs was compared, and the main CPU selected its next move from the highest-scoring sequence.
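The core idea here, generating moves on the fly, looking ahead a fixed number of moves, scoring what you find, and choosing the best-scoring line, can be shown on a toy game. The sketch below is a generic depth-limited search on a take-away game (remove one to three stones, and whoever takes the last stone wins). It is not Deep Blue's algorithm, and the game, function names, and depth are assumptions for illustration.

```python
def legal_moves(stones):
    """On-demand move generation: take 1, 2, or 3 stones if available."""
    return [n for n in (1, 2, 3) if n <= stones]


def negamax(stones, depth):
    """Score a position for the player about to move: +1 win, -1 loss, 0 unknown."""
    if stones == 0:
        return -1  # the previous player took the last stone, so the mover has lost
    if depth == 0:
        return 0   # search horizon: a real engine would apply an evaluation function
    return max(-negamax(stones - take, depth - 1) for take in legal_moves(stones))


def best_move(stones, depth=12):
    """Pick the move whose resulting position scores worst for the opponent."""
    return max(legal_moves(stones),
               key=lambda take: -negamax(stones - take, depth - 1))


move = best_move(10)
```

With ten stones the 12-move horizon covers the whole game, so the search is exact. In chess the horizon falls far short of the end, which is why the position-scoring rules matter so much.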
Deep Blue’s triumph in 1997 may have marked the high point for rule-based systems. There has been considerable work on expert systems since then. Indeed, IBM continues to market Watson using current POWER server hardware for applications as diverse as geological exploration and medical diagnosis. But the architectural direction of AI has been altered by another force: the return of neural networks.
This revival sprang from two factors. First was the appearance of massively parallel computing systems. In many respects neural networks, both in use and in their far more demanding training mode, are embarrassingly parallel problems. With ten thousand servers you can really contemplate the sort of huge, deeply layered networks of which ’80s researchers dreamed.
There is a problem, though. Conventional neural networks during their training phase are fully connected: each neuron in a layer gets input from every neuron in the previous layer. Not only does this make computing a next state for an individual neuron tedious, but if you split out neurons to individual servers, you create avalanches of network traffic. It would be great to have an a priori way to reduce the connectivity in the network without reducing generality.
Fortunately, work in the machine vision field addressed this problem, providing the second factor in the revival. Researchers here had been using convolution kernels for years as, among other things, feature detectors. In this application, each little kernel scans only a small fraction (maybe a 16-by-16-pixel square) of the entire input image. Researchers found that not only could you reduce the scale of a neural network by placing a convolution plane on the front end, but you could mix in convolution planes deeper in the network as well, drastically reducing connectivity. Then they could train the convolution filters right along with the neuron input weights. The result is called a convolutional neural network (CNN), shown in Figure 4. CNNs have proved highly successful at identifying and even interpreting 2D images when highly trained. But it would turn out that CNNs could be generalized even further.
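A quick count shows how much connectivity a convolution plane removes. The figures below are assumptions picked for illustration: a 256-by-256 input image, a fully connected layer of the same size, and a bank of 32 kernels of 16 by 16 pixels each. The count also relies on the usual CNN practice of reusing each kernel's weights at every image position.

```python
def dense_weights(in_h, in_w, out_h, out_w):
    """Weights needed when every output neuron sees every input pixel."""
    return (in_h * in_w) * (out_h * out_w)


def conv_weights(kernel_h, kernel_w, num_kernels):
    """Weights needed when each kernel is a small window reused across the image."""
    return kernel_h * kernel_w * num_kernels


dense = dense_weights(256, 256, 256, 256)  # 65,536 inputs times 65,536 outputs
conv = conv_weights(16, 16, 32)            # thirty-two 16x16 kernels
reduction = dense // conv
```

Under these assumptions the fully connected layer needs about 4.3 billion weights and the convolution plane 8,192, a reduction of more than five orders of magnitude in what must be stored, trained, and shipped between servers.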
One branch of the machine vision community picked up CNNs at once. Researchers working on automotive driver assistance systems (ADAS) and autonomous vehicles applied CNNs as a way to reduce images from the batteries of cameras, radars, and lasers sprouting from their vehicles.
More recently, another application has used CNNs on their way to startling results: DeepMind, with their master-defeating Go program. Go has some similarities to chess, and in fact previous Go-playing software has taken approaches similar to those used in chess programs: a combination of playing ahead and rule-based position evaluation to search for the likely best next move. But there is a difference of scale. In chess, playing ahead four moves (hardly above beginner’s skill) generates about a thousand possible positions. In Go, four moves early in the game would open up about three billion possible positions. Clearly an exhaustive search for even just a few moves is infeasible.
Programmers have attacked this obstacle with two strategies. The more familiar is to use a rule-based system to analyze patterns in the current position and propose a next move without trying to look ahead. If you have played Go, you can predict that this approach won’t take you much beyond the level of a promising beginner.
The other strategy is a Monte Carlo method: since you can’t play out all the sequences of moves from a position, select as large a number as you can play, either at random or through a strategy algorithm, play them out for some number of turns, and select the one that leads to the most promising position. Even though it sounds arbitrary—you can’t guarantee that you won’t miss all of the best sequences—the Monte Carlo approach actually converges on optimal play as the number of samples goes up, for many types of games. In Go, it gives intermediate players a reasonable opponent.
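The Monte Carlo idea can be sketched on a toy take-away game (remove one to three stones, and whoever takes the last stone wins) standing in for Go. For each candidate move the code plays many purely random games to the end and keeps the move with the best observed win rate. The game, sample count, seed, and function names are all assumptions for illustration. Practical Go programs use much more refined sampling.

```python
import random


def legal_moves(stones):
    return [n for n in (1, 2, 3) if n <= stones]


def random_playout(stones, rng):
    """Play random moves to the end; True if the player who moves first wins."""
    first_to_move = True
    while True:
        stones -= rng.choice(legal_moves(stones))
        if stones == 0:
            return first_to_move
        first_to_move = not first_to_move


def monte_carlo_move(stones, samples=2000, seed=0):
    """Estimate each move's win rate from random playouts and pick the best."""
    rng = random.Random(seed)
    win_rate = {}
    for take in legal_moves(stones):
        remaining = stones - take
        if remaining == 0:
            win_rate[take] = 1.0  # taking the last stone wins outright
            continue
        # The opponent moves first in each playout, so their wins are our losses.
        losses = sum(random_playout(remaining, rng) for _ in range(samples))
        win_rate[take] = 1 - losses / samples
    return max(win_rate, key=win_rate.get)


move = monte_carlo_move(5)
```

Nothing in the sampler knows the game's strategy. Whatever skill it shows comes entirely from the statistics of the playouts.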
But DeepMind wanted to create a champion, not a sparring partner. The designers decided to blend the Monte Carlo approach with two different CNNs—one to govern strategy and one to evaluate positions. Using, roughly speaking, the strategy CNN to guide exploration of future moves and the evaluation network to place a quantitative value on the resulting positions, DeepMind’s system did indeed defeat a champion.
Employing CNNs creates an immediate problem: how to train them. “There are three fundamental training methods,” Intel’s Dubey explains, “supervised learning, reinforcement learning, and unsupervised learning.” DeepMind employed the first two. The designers supervised their two networks as they walked them through an extensive data set of actual games played by advanced players. They would present a board position and the actual next move taken by the human, over and over through many games, to train the network.
Then to broaden the training, they set the system to play games against randomly selected earlier versions of itself, using the outcome of each game as reinforcement. This not only broadened the CNNs’ experience, but it focused their training on the end result—winning or losing—rather than on mimicking human players. The designers used conventional gradient ascent or descent functions to adjust the networks’ convolution coefficients and neural weights during training.
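Gradient descent itself is simple to show in miniature. The sketch below fits a single weight to toy data by repeatedly stepping against the slope of the squared error. A real network applies the same kind of update to millions of coefficients at once. The data, learning rate, and step count are assumptions for illustration and have nothing to do with DeepMind's actual training setup.

```python
def train_weight(samples, rate=0.01, steps=200):
    """Fit y = w * x by gradient descent on the mean squared error."""
    w = 0.0
    for _ in range(steps):
        # Slope of the mean squared error with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in samples) / len(samples)
        w -= rate * grad  # step downhill; gradient ascent would add instead
    return w


# Toy data generated from y = 3x, so the ideal weight is 3.
samples = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]
w = train_weight(samples)
```

Each step shrinks the remaining error by a constant factor, so the weight closes in on 3 quickly. Whether the loop descends or ascends depends only on whether the quantity measured is an error to be reduced or a reward to be increased.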
The structures of both DeepMind’s networks are conventional: many convolution layers followed by many fully connected layers. Much of the uniqueness comes from the learning process, and especially the reinforcement learning, in which the system played against its earlier selves.
Dubey observes that we are just beginning to exploit such large learning networks. One promising future lies in the opportunity to train networks on massively parallel systems, and then replicate the trained network on a much smaller system. “Once trained, the model can be very compact,” he says. These compact models can then be loaded into smart phones or wearables and distributed to millions of users. Then the models can report back to the cloud when they encounter an unexpected result, allowing the big, trainable model to learn from the experiences of many trained networks in the field.
But the real goal may be to replace reinforcement learning with fully unsupervised, continuous learning. In this mode, a device in the field would learn continually—not from being given an input and a desired output, or even from being given a reward for correct results, but by optimizing complex functions within the network itself. This is leading-edge stuff in the research community, being investigated on massively parallel systems in data centers. But Dubey argues unsupervised systems won’t always be locked away in the dark.
“People have said we are reaching the level where everyone has more computing power than they can possibly use,” Dubey observes. “But the challenge of unsupervised learning, to look not just at slope but at higher-order variations, and to learn rapidly changing functions, that changes everything. You couldn’t ask for a better link between everyday problems and exaFLOPS.”
Perhaps it is stitched together from pieces of earlier work. Perhaps it was only reanimated by a massive jolt of computing power. But today AI is alive and growing. And it is showing a remarkable hunger for power.