A Semantic Web Way To Analyze Raw Government Data

Want to kill about 8,000 hours of your life? Go to data.gov and start looking around. In the interest of transparency (well, sometimes - see Recovery Accountability and Transparency meeting not open to the public), the Obama Administration has posted over 270,000 sets of raw data from its departments, agencies, and offices on the World Wide Web.

Good luck figuring it out, right? The best place to hide stuff is in plain sight, skeptics will claim.

Nope, some folks at Rensselaer are here to make it simple. They figured out how to find relationships among the literally billions of bits of government data by pulling pieces from different places on the Web and using technology that helps computers and software understand the data. The pieces are then combined in new and imaginative ways as “mash-ups,” which mix data from two or more sources and present it in easy-to-use, visual forms.

It's a semantic web project, a concept which is either just coming into form or already a flop (your call).

Say you want to know...

Which White House staffer has the most visitors?

How do politics influence U.S. Supreme Court decisions?

How many earthquakes occurred worldwide recently?

Which states have the cleanest air and water?

...and then make some weird correlation about Mayans in 2012, the LHC and global warming. Well, they can't help with the correlation, but they can help with at least finding data you can then put together into a conspiracy theory.

James Hendler is the Tetherless World Research Constellation professor of computer and cognitive science at Rensselaer Polytechnic Institute, which might be the coolest title I will see in 2010. He says the fun and informative part can be finding new unexpected relationships. He has been named the “Internet Web Expert” by the White House, and the Web Science team at Rensselaer includes some of the world’s top Web researchers.

Unlike most semantic web work today, theirs means you don't need expertise in semantic programming. As a bonus, it means better sharing of data among government agencies.

“An unfathomable amount of data resides on the Web,” says Hendler. “We want to help people get as much mileage as possible out of that data and put it to work for all mankind.”

The Web site developed by Hendler, Deborah McGuinness and Peter Fox, also professors in the Tetherless World Research Constellation, and their students, provides examples of what this approach can accomplish. It also has video presentations and step-by-step do-it-yourself tutorials for those who want to mine the treasure trove of government data for themselves.

Hendler started Rensselaer’s Data-Gov project in June 2009, one month after the government launched Data.Gov, when he saw the new program as an opportunity to demonstrate the value of Semantic Web languages and tools. Hendler and McGuinness are both leaders in Semantic Web technologies, sometimes called Web 3.0, and were two of the first researchers working in that field.

Using Semantic Web representations, multiple data sets can be linked even when the underlying structure, or format, is different. Once data is converted from its format to use these representations, it becomes accessible to any number of standard web technologies.
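To make that idea concrete, here is a minimal sketch (not the Rensselaer team's code) of two differently formatted sources being converted into one common representation of subject-predicate-object statements, the shape RDF uses. The `http://example.org/` namespace, the column names, the function names, and all the numbers are placeholders invented for illustration.

```python
import csv
import io
import json

# Hypothetical inputs: related facts published in two different formats.
# The values are placeholders, not real statistics.
CSV_SOURCE = "state,population\nNY,100\nVT,10\n"
JSON_SOURCE = '[{"code": "NY", "libraries": 5}, {"code": "VT", "libraries": 2}]'

BASE = "http://example.org/"  # placeholder namespace, not a real vocabulary


def triples_from_csv(text, key):
    """Turn each non-key CSV cell into a (subject, predicate, object) triple."""
    triples = set()
    for row in csv.DictReader(io.StringIO(text)):
        subject = BASE + "state/" + row[key]
        for column, value in row.items():
            if column != key:
                triples.add((subject, BASE + column, value))
    return triples


def triples_from_json(text, key):
    """Turn each non-key JSON field into a (subject, predicate, object) triple."""
    triples = set()
    for record in json.loads(text):
        subject = BASE + "state/" + record[key]
        for field, value in record.items():
            if field != key:
                triples.add((subject, BASE + field, str(value)))
    return triples


def describe(graph, subject):
    """Collect everything the merged graph says about one subject."""
    return {p.rsplit("/", 1)[-1]: o for s, p, o in graph if s == subject}


# Once both sources use the same representation, merging is a set union.
graph = triples_from_csv(CSV_SOURCE, "state") | triples_from_json(JSON_SOURCE, "code")
print(describe(graph, BASE + "state/NY"))
```

The point of the sketch is that the CSV and JSON inputs never need to share a schema. They only need to agree on how a state is identified.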

One of the Rensselaer demonstrations deals with data from CASTNET, the Environmental Protection Agency’s Clean Air Status and Trends Network. CASTNET measures ground-level ozone and other pollutants at stations all over the country, but the data set doesn’t give the location of the monitoring sites, only the readings from the sites.

The Rensselaer team located a different data set that described the location of every site. By linking the two along with historic data from the sites, using RDF, a semantic Web language, the team generated a map that combines data from all the sets and makes them easily visible.
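The linking step might look something like the sketch below, which assumes both data sets have already been expressed as subject-predicate-object statements keyed by a shared site identifier. It is plain Python standing in for a real RDF toolkit. The site IDs, the `http://example.org/castnet/` namespace, the function names, and every number are made up for illustration and are not actual CASTNET readings or coordinates.

```python
# Placeholder namespace and made-up values, not real CASTNET data.
EX = "http://example.org/castnet/"

# Data set 1: readings only, no locations.
readings = [
    (EX + "site/A1", EX + "ozone", 41.0),
    (EX + "site/B2", EX + "ozone", 35.5),
]

# Data set 2: locations only, from a separate source.
locations = [
    (EX + "site/A1", EX + "lat", 42.7),
    (EX + "site/A1", EX + "long", -73.7),
    (EX + "site/B2", EX + "lat", 35.6),
    (EX + "site/B2", EX + "long", -83.5),
]


def objects(graph, subject, predicate):
    """Return every object stated for a given subject and predicate."""
    return [o for s, p, o in graph if s == subject and p == predicate]


def map_points(graph):
    """Pair each ozone reading with its site's coordinates, if both exist."""
    points = []
    for site, predicate, ozone in graph:
        if predicate != EX + "ozone":
            continue
        lat = objects(graph, site, EX + "lat")
        lon = objects(graph, site, EX + "long")
        if lat and lon:
            points.append(
                {"site": site, "lat": lat[0], "long": lon[0], "ozone": ozone}
            )
    return sorted(points, key=lambda point: point["site"])


# Because both sets name sites the same way, linking is just concatenation
# followed by a lookup on the shared identifier.
points = map_points(readings + locations)
for point in points:
    print(point)
```

Each resulting record has what a map marker needs: where the station is and what it measured.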

That mash-up, which pairs raw data on ozone and visibility readings from the EPA site with separate geographic data on where the readings were taken, had never been done before. This demo and several others developed by the Rensselaer team are now available from the official US data.gov site: http://data.gov/semantic.

Other mash-up demos on the http://data-gov.tw.rpi.edu/wiki/Demos site include:

The White House visitors list with biographical information taken from Wikipedia and Google (now also available in a mobile version through iTunes);
U.S. and British information on aid to foreign nations;
National wildfire statistics by year with budget information from the departments of Agriculture and Interior and facts on historic fires;
A state-by-state comparison of smoking prevalence with smoking ban policies, cigarette tax rates, and price;
The number of book volumes available per person per state from all public libraries;
An integration of basic biographical information about Supreme Court Justices with their voting records from 1953 to 2008, with a motion chart that looks at justices’ decisions over the years on issues such as crime and privacy rights.

The aim is not to create an endless procession of mash-ups, but to provide the tools and techniques that allow users to make their own mash-ups from different sources of data, the Rensselaer researchers say. To help make this happen, Rensselaer researchers have taught a short course showing government data providers how to do it themselves, so they can build their own data visualizations to release to the public.

The same Rensselaer techniques can be applied to data from other sources. For example, public safety data can show a user which local areas are safe, where crimes are most likely to occur, which intersections are accident-prone, how close the nearest hospital is, and other information that may inform a decision on where to shop, where to live, even which areas to avoid at night. In an effort McGuinness is leading at Rensselaer along with collaborators at NIH, the team is exploring how to make medical information accessible to both the general public and policy makers to help explore policies and their potential impact on health. For example, one may want to explore how taxation or smoking policies relate to smoking prevalence and related health costs.

The Semantic Web describes techniques that allow computers to understand the meaning, or “semantics,” of information so that they can find and combine information and present it in usable form.

“Computers don’t understand; they just store and retrieve,” explains Hendler. “Our approach makes it possible to do a targeted search and make sense of the data, not just using keywords. This next version of the Web is smarter. We want to be sure electronic information is increasingly useful and available.”

“Also, we want to make the information transparent and accountable,” adds McGuinness. “Users should have access to the metadata (the data describing where the data came from and how and when it was derived) as well as the base information so that end users can make better informed decisions about when to rely on the information.”

The Rensselaer team has also been working to extend the technique beyond U.S. government data. They have recently developed new demos showing how this work can be used to integrate information from the U.S. and the U.K. on crime and foreign aid, to compare U.S. and Chinese financial information, to mash up government information with World Bank data, and to apply the techniques to health information, new media, and other Web resources.
