In 1997, IBM's Deep Blue computer beat Garry Kasparov at chess. Kasparov had won their first match in 1996 with 3 wins, 1 loss and 2 draws (4-2), so the team of programmers and chess experts tweaked the program, and in the 1997 rematch it came out ahead 3.5-2.5. It was a big achievement for programming because chess is considered 'creative'.
Another creative task, even more relevant to scientists than chess, is extracting data from scientific publications to build a database cataloging results from tens of thousands of individual studies. It sounds easy, but much of the cataloging would ordinarily be placed in the 'subjective' camp.
Researchers recently used the DeepDive machine reading system and the HTCondor distributed job management system to create PaleoDeepDive, and it squared off against human scientists who had manually entered data into the Paleobiology Database, a repository for data from paleontology studies. The data is fragmented across hundreds of thousands of publications, yet many research questions require what first author Shanan Peters, a professor of geoscience at University of Wisconsin-Madison, calls a "synthetic approach: For example, how many species were on the planet at any given time?"
PaleoDeepDive mimics the human activities needed to assemble the Paleobiology Database. "We extracted the same data from the same documents and put it into the exact same structure as the human researchers, allowing us to rigorously evaluate the quality of our system, and the humans," Peters says.
Computers often have trouble deciphering even simple-sounding statements. Co-author Christopher Ré of Stanford uses as an example a study containing the terms Tyrannosaurus rex and Alberta, Canada. Is Alberta where the fossil was found, or where it is stored? "We take a more relaxed approach: There is some chance that these two are related in this manner, and some chance they are related in that manner," Ré says.
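To make the "some chance" idea concrete, here is a minimal Python sketch of keeping both readings of the Tyrannosaurus rex and Alberta example alive with a probability attached to each, rather than forcing a single answer. It is an illustration only, not PaleoDeepDive's actual inference method: the relation labels `found_in` and `stored_in`, the `CandidateRelation` class, and the evidence weights are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CandidateRelation:
    """One possible reading of how two terms in a document are related."""
    entity: str
    place: str
    relation: str   # hypothetical label, e.g. "found_in" or "stored_in"
    weight: float   # made-up, unnormalized evidence score for this reading


def relation_probabilities(candidates):
    """Turn evidence weights into probabilities that sum to 1."""
    total = sum(c.weight for c in candidates)
    return {c.relation: c.weight / total for c in candidates}


# Two competing readings of the same sentence; the weights are invented.
candidates = [
    CandidateRelation("Tyrannosaurus rex", "Alberta, Canada", "found_in", 3.0),
    CandidateRelation("Tyrannosaurus rex", "Alberta, Canada", "stored_in", 1.0),
]

probs = relation_probabilities(candidates)
for relation, p in probs.items():
    print(f"{relation}: {p:.2f}")
```

With these invented weights, the "found in Alberta" reading gets probability 0.75 and the "stored in Alberta" reading 0.25. Neither reading is discarded, so later evidence could shift the balance.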
In these large-data tasks, PaleoDeepDive has a major advantage, Peters says. "Information that was manually entered into the Paleobiology Database by humans cannot be assessed or enhanced without going back to the library and re-examining original documents. Our machine system, on the other hand, can extend and improve results essentially on the fly as new information is added."
Further advantages can result from improvements in the computer tools. "As we get more feedback and data, it will do a better job across the board," Peters says.