Wikipedia is a free, online, user-generated encyclopedia of articles on topics and people. Because of its popularity, it is often the first result in a Google search, which means it is frequently read and cited, and that in turn makes it rank as even more authoritative.

There are numerous problems with the veracity of content, deliberate vandalism and incomplete entries. The Science 2.0 entry, for example, has been repeatedly hijacked and reverted by one user, and its first reference is a Wired article written in 2012 even though Science 2.0 came into existence in 2006. There are no links to Science 2.0, and that user even refuses to allow mention that it is a registered trademark in the US and the European Union.

Given the issues that plague Science 2.0's entry, it is difficult to imagine how entries on controversial issues are rated for quality. How will the public know? Writing in the International Journal of Information Quality, computer scientists in China say they have devised a software algorithm that can automatically check a particular entry and rank it according to quality.


Image: Freeman Lab

Jingyu Han and Kejia Chen of Nanjing University of Posts and Telecommunications explain that the quality of data on Wikipedia has long been the focus of user attention. Its detractors suggest that it can never be a valid information source in the way that a proprietary encyclopedia might be, because the contributors and editors are anonymous and have no real interest in quality control.

Its supporters suggest that the social nature of contributions and edits and the online tracking of changes are among Wikipedia's greatest strengths rather than weaknesses. Yet the evidence shows that Wikipedia is primarily managed by white, young American men. Women, minorities and foreign participants are overwhelmingly turned off by the angry, vengeful culture, and that means that the quality is dramatically skewed.

It would quiet the detractors if there were a way to quantify the quality of Wikipedia entries in an objective and automated manner. Han and Chen turned to Bayesian statistics to help them create just such a system. The notion of finding evidence based on an analysis of probabilities was first described by 18th-century mathematician and theologian Thomas Bayes. Bayesian probabilities were then utilized by Pierre-Simon Laplace to pioneer a new statistical method. Today, Bayesian analysis is commonly used to assess the content of emails: it estimates the probability that a message is spam and filters the message out of the user's inbox if that probability is high.
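As a rough illustration of the spam-filtering idea, the sketch below applies Bayes' rule to the words in a message. It is a generic naive Bayes calculation, not anything taken from Han and Chen's paper. The function name, the word probabilities and the 0.9 threshold are all made up for the example.

```python
import math

def spam_probability(words, p_word_spam, p_word_ham, prior_spam=0.5):
    """Posterior probability that a message is spam, via Bayes' rule.

    p_word_spam / p_word_ham map a word to P(word | spam) / P(word | not spam).
    Words missing from either table are ignored (an assumption of this sketch).
    """
    log_spam = math.log(prior_spam)
    log_ham = math.log(1.0 - prior_spam)
    for word in words:
        if word in p_word_spam and word in p_word_ham:
            log_spam += math.log(p_word_spam[word])
            log_ham += math.log(p_word_ham[word])
    # Normalize in log space to avoid underflow on long messages.
    top = max(log_spam, log_ham)
    spam = math.exp(log_spam - top)
    ham = math.exp(log_ham - top)
    return spam / (spam + ham)

# Hypothetical word statistics, invented purely for illustration.
P_WORD_SPAM = {"prize": 0.8, "meeting": 0.05}
P_WORD_HAM = {"prize": 0.1, "meeting": 0.6}

def is_spam(words, threshold=0.9):
    """Filter the message if the posterior exceeds an assumed threshold."""
    return spam_probability(words, P_WORD_SPAM, P_WORD_HAM) > threshold

print(round(spam_probability(["prize"], P_WORD_SPAM, P_WORD_HAM), 3))
```

With these invented numbers, a message containing only "prize" comes out at roughly an 89 percent spam probability, which is just under the assumed cutoff, so it would not be filtered.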

Han and Chen have now used a dynamic Bayesian network (DBN) to analyze the content of Wikipedia entries in a similar manner. They apply multivariate Gaussian distribution modeling to the DBN analysis, which gives them a distribution of the quality of each article so that entries might be ranked. Very low-ranking entries might be flagged for editorial attention to raise the quality. By contrast, high-ranking entries could be marked in some way as the definitive entry so that such an entry is not subsequently overwritten with lower quality information.
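To give a feel for how a Gaussian model can turn article measurements into a quality rank, here is a heavily simplified sketch. It is not the authors' method: it drops the dynamic Bayesian network entirely, assumes the quality dimensions are independent (a diagonal-covariance Gaussian), and uses two hypothetical dimension scores per article with invented training values.

```python
import math

def fit_diagonal_gaussian(rows):
    """Fit per-dimension mean and variance (diagonal covariance assumed)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    # A small constant keeps the variance from collapsing to zero.
    variances = [sum((r[j] - means[j]) ** 2 for r in rows) / n + 1e-6
                 for j in range(d)]
    return means, variances

def log_density(x, means, variances):
    """Log of the Gaussian density at x under independent dimensions."""
    return sum(-0.5 * math.log(2 * math.pi * v) - (xi - m) ** 2 / (2 * v)
               for xi, m, v in zip(x, means, variances))

def classify(x, models):
    """Pick the quality class whose Gaussian gives x the highest density."""
    return max(models, key=lambda label: log_density(x, *models[label]))

# Invented (completeness, sourcing) scores for already-rated articles.
TRAINING = {
    "high": [(0.90, 0.80), (0.80, 0.90), (0.85, 0.85)],
    "low": [(0.20, 0.30), (0.30, 0.20), (0.25, 0.25)],
}
models = {label: fit_diagonal_gaussian(rows) for label, rows in TRAINING.items()}

def rank(articles):
    """Order articles from most to least likely high quality."""
    def score(x):
        return log_density(x, *models["high"]) - log_density(x, *models["low"])
    return sorted(articles, key=lambda name: score(articles[name]), reverse=True)

print(classify((0.82, 0.88), models))
```

Articles at the bottom of such a ranking would be the ones flagged for editorial attention, while those at the top would be candidates for marking as definitive.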

The team has tested its algorithm on sets of several hundred articles, comparing the computer's automated quality assessment with assessment by a human user. Their algorithm outperforms a human user by up to 23 percent in correctly classifying the quality rank of a given article in the set, the team reports.

The use of a computerized system to provide a quality standard for Wikipedia entries would avoid the need to have people subjectively classify each entry. It also has a great deal of value for the future of Science 2.0. Take concern about pesticides, for example. Some studies will say they harm people and the environment, while the EPA may say they do not. How does the EPA determine quality? Its staff are career scientists with a great deal of experience, and Good Laboratory Practice is a mark of quality, but it's still somewhat variable. In a Science 2.0 approach, the public could duplicate that kind of work and rank the veracity of claims.

More on point, it could improve the standard as well as provide a basis for an improved reputation for Wikipedia. It won't help fix quality problems when someone hijacks an entry, however.


Citation: Han, J. and Chen, K. (2014) ‘Ranking Wikipedia article’s data quality by learning dimension distributions’, Int. J. Information Quality, Vol. 3, No. 3, pp.207-227, DOI: 10.1504/IJIQ.2014.064056. Source: Inderscience Publishers