The following segments were written one per day, as I watched IBM’s Watson computer play Jeopardy! over three days. I’ll post my notes here now, for the three or four people who haven’t already seen more analyses than they care to think about.

The first day

Interesting.

Watson did very well on its first day. In order to have time to explain things and introduce the concept of Watson, they set it up so that only two games are played over the three days. The first day was for the first round, and the second day (this evening) will have the Double Jeopardy! and Final Jeopardy! rounds.

It wasn’t surprising that there were a few glitches, where Watson didn’t fully “get” the question — for instance, answering “a leg”, rather than “missing a leg”, in describing the anatomical oddity of an Olympic winner. And, as we knew might happen, Watson repeated an incorrect answer from Ken Jennings, because the computer has no way to know what the other contestants have said.

What I found interesting, though, is that Watson does have a very strong advantage with the buzzer — much stronger than I expected, given that they tried to control for it. Despite the attempts to smooth it out by setting up a mechanical system whereby Watson sends a signal to cause a button to be physically pushed, and despite whatever the humans can do through anticipation, it’s clear that people just can’t match the computer’s reactions. Almost every time Watson was highly confident of its answer — a green bar (see below) — it won the buzz. Surely, on things like the names of people in Beatles songs, Mr Jennings and Mr Rutter were as confident of the answer as Watson was, and had the answers ready well before Alex finished reading. Yet Watson won the buzz on every one of those.

It was fun to have a little of Watson’s “thought process” shown: at the bottom of the screen, we saw Watson’s top three answer possibilities, along with its confidence for each, shown as a percentage bar that was coloured red, yellow, or green, depending upon the percentage. That was interesting whether or not Watson chose to buzz in. On a Harry Potter question for which the answer was the villain, Voldemort, Watson’s first answer was “Harry Potter” — it didn’t understand that the question was looking for the bad guy, even though the whole category related to bad guys. But its confidence in the answer was low (red, and well below the “buzz threshold”), it didn’t buzz in, and Mr Rutter gave the correct answer (which had been Watson’s second choice).

Of course, they didn’t use any audio or video clues, according to the agreement — Watson can neither hear nor see — but they didn’t seem to pull any punches on the categories or types of questions. It feels like a normal Jeopardy! game.

Oh, and by the way: the TiVo has it marked as copy-protected, so I can’t put it on a DVD. Damn. I don’t know whether regular Jeopardy! games are that way or not; I’ve never recorded one before.

The second day

On her blog, The Ridger notes this:

I find looking at the second-choice answers quite fascinating. "Porcupine" for what stiffens a hedgehog’s bristles, for instance. There is no way that would be a human’s second choice (after keratin). Watson is clearly getting to the answers by a different route than we do.

That’s one way to look at it, and clearly it’s true that Watson goes about determining answers very differently from the way humans do — Watson can’t “reason”, and it’s all about very sophisticated statistical associations.

Consider that both humans (in addition to this one, at home) got the Final Jeopardy! question with no problem, in seconds... but Watson had no idea (and, unfortunately, we didn’t get to see the top-three analysis that we saw in the first two rounds). My guess is that the question (the “answer”) was worded in a manner that made it very difficult for the computer to pick out the important bits. It also didn’t understand the category, choosing “Toronto” in the category “U.S. Cities”, which I find odd (that doesn’t seem a hard category for Watson to suss).

But another way to look at it is that a human wouldn’t have any second choice for some of these questions, but Watson always does (as well as a third), by definition (well, or by programming). In the case of the hedgehog question that The Ridger mentions, “keratin” had 99% confidence, “porcupine” had 36%, and “fur” had 8%. To call “fur” a real “third choice” is kind of silly, as it was so distant that it only showed up because something had to be third.

But even the second choice was well below the buzz-in threshold. That it was as high as it was, at 36% confidence, does, indeed, show Watson’s different “thought process” — there’s a high correlation between “hedgehog” and “porcupine”, along with the other words in the clue. Nevertheless, Watson’s analysis correctly pushed that well down in the answer bin as it pulled out the correct answer at nearly 100% confidence.
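
For the curious, here’s a toy sketch (in Python) of how I imagine the display-and-buzz logic works, using the three confidences that were shown on screen for the hedgehog clue. This is purely my guess at the idea, not anything IBM has published; the threshold value and the colour cut-offs are invented.

    # Toy sketch: rank candidate answers by confidence, colour-code them
    # like the on-screen bars, and buzz only if the top one clears a
    # threshold. The threshold and colour cut-offs are invented, not IBM's.
    BUZZ_THRESHOLD = 0.70

    def colour(confidence):
        if confidence >= BUZZ_THRESHOLD:
            return "green"
        if confidence >= 0.35:
            return "yellow"
        return "red"

    def decide(candidates):
        """candidates: a list of (answer, confidence) pairs."""
        top3 = sorted(candidates, key=lambda c: c[1], reverse=True)[:3]
        for answer, conf in top3:
            print(f"{answer:10s} {conf:4.0%}  {colour(conf)}")
        best_answer, best_conf = top3[0]
        return best_answer if best_conf >= BUZZ_THRESHOLD else None

    # The hedgehog clue, with the confidences shown on screen:
    decide([("keratin", 0.99), ("porcupine", 0.36), ("fur", 0.08)])

The real system is vastly more sophisticated, of course; the point is just that the decision to buzz, or to sit quietly as it did on the Voldemort clue, falls out of the same confidence numbers that drive the coloured bars.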

In fact, I think most adult humans do run the word “porcupine” through their heads in the process of solving this one. It’s just that they rule it out so quickly that it doesn’t even register as a possibility. That sort of reasoning is beyond what Watson can do. In that sense it’s behaving like a child, who might just leave “porcupine” as a candidate answer, lacking the knowledge and experience to toss it.

No one will be mistaking a computer for a human any time soon, though Watson probably is the closest we’ve come to something that could pass the Turing test. However well it does at Jeopardy! — and from the perspective of points, it’s doing fabulously (and note how skilled it was at pulling all three Daily Doubles) — it would quickly fall on its avatar-face if we actually tried to converse with it.

The third day

The third day — the second game of the two-game tournament — was perhaps even more interesting than the first two.

Watson seemed to have a lot more trouble with the questions this time, sometimes making runs of correct answers, but at other times having confidence levels well below the buzz-in threshold. Also, at many of those times its first answer was not the correct one, and sometimes its second and even its third were not either. Some of the problems seemed to lie in the categories, but others seemed to come from particular clues, regardless of category.

Watson also did not dominate the buzzer this time, even when it had enough confidence to buzz in. I don’t know whether they changed anything — I suspect not, since they didn’t say so. It’s likely that Mr Jennings and Mr Rutter simply were more practiced at anticipating and timing their button-presses by then (remember that the three days’ worth of shows were all recorded at the same time, a month ago).

Those factors combined to mean that Watson did not go into the Final Jeopardy! round as the runaway winner it had been in the first game. In yesterday’s final round (category: 19th-century novelists), all three contestants (and your reporter, at home) came up with the right answer, and Watson pulled far ahead with an aggressive bet that Mr Rutter didn’t have the funds to match. Mr Jennings, meanwhile, chose to be conservative: assuming he would lose to Watson (the first game’s results made that certain), he bet only $1000 to ensure that he would come in second even if he got the answer wrong.

The result, then, was Watson winning the two-game match handily, and earning $1 million for two charities. Other charities will get half of Mr Jennings’s and Mr Rutter’s winnings (whether that’s before or after taxes, I don’t know; I also don’t know whether taxes will reduce Watson’s million-dollar contribution).

One other thing: in a New Scientist article the other day, talking about the second day and the first Final Jeopardy! round, Jim Giles makes a sloppy mistake (but see the update below):

Watson’s one notable error came right at the end, when it was asked to name the city that features two airports with names relating to World War II. Jennings and Rutter bet almost all their money on Chicago, which was the correct answer. Watson went for Toronto.

Even so, the error showed another side to Watson’s intelligence: knowing that it was unsure about the answer, the machine wagered less than $1000 on its answer.

Of course, Watson’s wager had nothing to do with how sure it was about the answer: it had to place the bet before the clue was revealed. Its wager had something to do with the category, but it likely was far more heavily controlled by Watson’s analysis of the game position and winning strategy. In determining its bets, it runs through all the bets that it and its opponents might make, and settles on a value that optimizes its own position. And its strategy in the second game was different from that in the first.
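
To make that concrete, here is a back-of-the-envelope sketch of the sort of search I imagine: enumerate candidate wagers, enumerate the ways the answers could go, and keep the wager that wins most often. It is my guess at the shape of the computation, not Watson’s actual wagering code, and the scores, the 50/50 correctness assumption, and the all-in opponents are all invented.

    from itertools import product

    # Back-of-the-envelope sketch: try every wager, enumerate right/wrong
    # outcomes for all players, and keep the wager that wins most often.
    # The scores, 50/50 correctness, and all-in opponents are all invented.
    def best_wager(my_score, opp_scores, p_correct=0.5):
        best, best_win_prob = 0, -1.0
        for wager in range(0, my_score + 1, 100):
            win_prob = 0.0
            for outcomes in product([True, False], repeat=1 + len(opp_scores)):
                prob = 1.0
                for right in outcomes:
                    prob *= p_correct if right else (1 - p_correct)
                mine = my_score + (wager if outcomes[0] else -wager)
                # Crude assumption: every opponent wagers everything.
                theirs = [s + (s if right else -s)
                          for s, right in zip(opp_scores, outcomes[1:])]
                if all(mine > t for t in theirs):
                    win_prob += prob
            if win_prob > best_win_prob:
                best, best_win_prob = wager, win_prob
        return best

    print(best_wager(23000, [18000, 5000]))   # made-up scores

Presumably the real thing replaces the flat 50/50 with per-category confidence estimates (which is where the category comes in) and models the opponents’ likely wagers far less crudely; the point is only that the bet comes out of a search over game positions, not out of confidence in an answer it hasn’t yet seen.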

Other thoughts

Gerhard Adam presents a cynical view of the value of this experiment. I have a more positive outlook on it, contingent upon IBM’s pursuing the technology and bringing it to a real, useful, marketable product (and not letting it fade away, as they did with Deep Blue after the chess match). No, this isn’t Skynet, and, as Gerhard says, it isn’t even real “intelligence” — Watson answers the questions by collecting associations and correlations, and without understanding the questions at all. But there are many applications in the real world that will benefit from a machine that can do this, picking apart natural-language questions and coming up with natural-language answers (along with an assessment of its confidence in the answer). I look forward to seeing where IBM takes it.


Update: The New Scientist article was updated shortly after it was published. It now says this:

Even so, the error did not hurt Watson too much. Knowing that it was far ahead of Jennings and Rutter, the machine wagered less than $1000 on its answer.