A growing number of scholars are using social media data to write articles about both online and offline human behavior - it's cheap, it's as accurate as surveys if properly controlled, and no one ever has to leave the office.

But surveys are not science, for obvious reasons, and yet in recent years studies built on social media data have claimed the ability to predict everything from summer blockbusters to fluctuations in the stock market. They get mainstream media attention despite clear evidence of flaws in many of these studies.

Some scholars may not know any better, and even for those who do, the statistics can be difficult. But for anyone who wants to be accurate, wariness of the serious pitfalls in working with huge social media data sets is key. The problem is not small: thousands of research papers each year are now based on data gleaned from social media.

An article in Science highlights several issues involved in using social media data sets - along with strategies to address them. Among the challenges:

  • Different social media platforms attract different users - Pinterest, for example, is dominated by females aged 25-34 - yet researchers rarely correct for the distorted picture these skewed populations can produce (a standard correction is sketched further below).

  • Publicly available data feeds used in social media research don't always provide an accurate representation of the platform's overall data - and researchers are generally in the dark about when and how social media providers filter their data streams.

  • The design of social media platforms can dictate how users behave and, therefore, what behavior can be measured. For instance, on Facebook the absence of a "dislike" button makes negative responses to content harder to detect than positive "likes".

  • Large numbers of spammers and bots, which masquerade as normal users on social media, get mistakenly incorporated into many measurements and predictions of human behavior.

  • Researchers often report results for groups of easy-to-classify users, topics, and events, making new methods seem more accurate than they actually are. For instance, efforts to infer the political orientation of Twitter users achieve barely 65% accuracy for typical users - even though studies focusing on politically active users have claimed 90% accuracy (the sketch just after this list shows how that kind of subset reporting inflates the headline number).
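
To see why reporting on an easy subset is misleading, here is a minimal back-of-the-envelope sketch in Python. The 90% and 65% subgroup accuracies echo the figures quoted above; the assumption that politically active users make up 10% of the user base is a hypothetical placeholder, not a number from the Science article.

```python
# Toy illustration: benchmarking only on easy-to-classify (politically active)
# users inflates the reported accuracy of a political-orientation classifier.
# The 10% "politically active" share is an assumed placeholder value.

politically_active_share = 0.10  # assumed fraction of users who are politically active
accuracy_active = 0.90           # accuracy claimed on politically active users
accuracy_typical = 0.65          # accuracy observed on typical users

# What a paper evaluating only on politically active users would report
reported_accuracy = accuracy_active

# What the same method delivers across the whole user base
overall_accuracy = (politically_active_share * accuracy_active
                    + (1 - politically_active_share) * accuracy_typical)

print(f"accuracy reported on the easy subset: {reported_accuracy:.0%}")  # 90%
print(f"accuracy across all users:            {overall_accuracy:.1%}")   # 67.5%
```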

Many of these problems are well known in other fields such as epidemiology, statistics, and machine learning, so they can be worked around - if scholars want to find the truth rather than chase a newspaper headline.
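
One such workaround, borrowed directly from survey statistics, is post-stratification: reweight the sample so that each demographic group counts in proportion to its share of the real population rather than its share of the platform. The sketch below applies the idea to the kind of skew described in the Pinterest example above; all the group shares and response rates are made-up placeholder numbers, chosen only to show the mechanics.

```python
# Minimal post-stratification sketch: correct a platform-skewed estimate by
# reweighting demographic groups to their population shares. All figures below
# are hypothetical placeholders, not real platform or census statistics.

# Share of each demographic group among the platform's users (hypothetical)
platform_share = {"women_25_34": 0.45, "men_25_34": 0.10, "everyone_else": 0.45}

# Share of each group in the population we actually want to describe (hypothetical)
population_share = {"women_25_34": 0.08, "men_25_34": 0.08, "everyone_else": 0.84}

# Observed outcome (say, fraction posting about a topic) in each group (hypothetical)
group_rate = {"women_25_34": 0.60, "men_25_34": 0.40, "everyone_else": 0.30}

# Post-stratification weight for each group: population share / platform share
weights = {g: population_share[g] / platform_share[g] for g in platform_share}

# Naive estimate: treat the platform sample as if it mirrored the population
naive_estimate = sum(platform_share[g] * group_rate[g] for g in group_rate)

# Reweighted estimate: each group contributes in proportion to the population
corrected_estimate = sum(platform_share[g] * weights[g] * group_rate[g] for g in group_rate)

print(f"naive platform estimate:  {naive_estimate:.3f}")      # 0.445
print(f"post-stratified estimate: {corrected_estimate:.3f}")  # 0.332
```

Real corrections are more involved - they require knowing the platform's demographics, which is itself one of the challenges listed above - but the underlying arithmetic is the same kind of weighting pollsters rely on today.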

"The common thread in all these issues is the need for researchers to be more acutely aware of what they're actually analyzing when working with social media data," says Derek Ruths, an assistant professor in McGill University's School of Computer Science.

Surveys have had to tackle these challenges before. "The infamous 'Dewey Defeats Truman' headline of 1948 stemmed from telephone surveys that under-sampled Truman supporters in the general population," Ruths notes.

Compare that to recent elections. Poll averaging has become so accurate that Nate Silver rode it to success at the New York Times, and in 2012 the polls were good enough that Europeans betting on Intrade called every state correctly, even though they knew nothing about American politics.

The technology to get it right exists, if sociologists and psychologists want to embrace it.

"Rather than permanently discrediting the practice of polling, that glaring error led to today's more sophisticated techniques, higher standards, and more accurate polls. Now, we're poised at a similar technological inflection point. By tackling the issues we face, we'll be able to realize the tremendous potential for good promised by social media-based research," said Ruths.