Webometrics faces many challenges, not the least of which is a dearth of tools capable of measuring the Web with any degree of accuracy.  Academic and professional Webometrics analysts alike have had to rely on a mix of search engine downloads and query operators.  Even analysts from organizations with their own crawlers are constrained by the limitations of current methodologies and technologies.

For several years one of the most popular tools for measuring the Web has been Yahoo! Site Explorer.  However, Yahoo!'s business model has evolved over the years, and its alliance with Microsoft's Bing search service has forced Webometrics analysts to re-evaluate their methods.  One proposed alternative is the analysis of URL citations: unlinked page URLs mentioned in articles and papers found on the Web.
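To make the distinction concrete, here is a minimal Python sketch (not a standard tool; the class name and regular expression are illustrative) that separates ordinary hyperlinks from URL citations in a fetched HTML document:

```python
# Minimal sketch: distinguish hyperlinks (<a href="...">) from "URL citations",
# i.e. URLs that appear only as plain text.  Illustrative names and regex.
import re
from html.parser import HTMLParser

URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")

class LinkAndCitationScanner(HTMLParser):
    def __init__(self):
        super().__init__()
        self.hyperlinks = set()   # URLs that are genuine <a href> links
        self.text_urls = set()    # URLs that merely appear in visible text
        self._in_anchor = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_anchor = True
            for name, value in attrs:
                if name == "href" and value and value.startswith("http"):
                    self.hyperlinks.add(value)

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_anchor = False

    def handle_data(self, data):
        if not self._in_anchor:
            self.text_urls.update(URL_PATTERN.findall(data))

def url_citations(html_text):
    """Return URLs mentioned in the text but never hyperlinked."""
    scanner = LinkAndCitationScanner()
    scanner.feed(html_text)
    return scanner.text_urls - scanner.hyperlinks
```

A URL returned by url_citations() is exactly what a URL citation study would count: the address is visible to a human reader, but no hyperlink exists for a crawler to follow.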

In fact, in 2009 Mike Thelwall wrote: "... if Yahoo and its affiliates withdraw their advanced hyperlink queries so that URL citations would have to be used instead of links for many studies, then this would seriously undermine the power of link analysis, by a factor of about ten, except for academic spaces. This would probably mean that only very large scale link analyses and academic link analyses could be sustained."

He does not overstate the magnitude of the problem, because URL citation analysis follows a pattern that has long been criticized in the study of scientific literature; in 2008 the International Mathematical Union argued that "using citation data to assess research ultimately means using citation-based statistics to rank things—journals, papers, people, programs, and disciplines. The statistical tools used to rank these things are often misunderstood and misused."  Citation analysis is as problematic when applied to the Web as to scientific literature, for much the same reasons outlined in that report.

Let's take Webometrics analysts' dependence on Yahoo!'s features, for example.  The search engine limits the amount of data returned by any query to 1,000 listings.  Those listings may all come from a single Website (such as a large blog that includes a link in its blogroll or footer).  Some listings may be included because links were embedded in JavaScript advertising.  Other "ghostly links" in Yahoo! data have defied explanation; investigation failed to reveal their presence on the listed pages.
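Scrubbing can address at least the first of those problems.  The sketch below, which assumes the listings arrive as a plain list of result URLs, shows one basic step: counting how many listings each host contributes and collapsing the capped result set to one listing per host, so a single blog's sitewide footer or blogroll cannot dominate the counts.

```python
# Hedged sketch of one scrubbing step for a capped (<= 1,000) result set:
# count listings per host, then keep only the first listing seen per host.
from collections import Counter
from urllib.parse import urlparse

def listings_per_host(result_urls):
    """How many of the listings does each host contribute?"""
    return Counter(urlparse(u).netloc.lower() for u in result_urls)

def deduplicate_by_host(result_urls):
    """Keep only the first listing seen for each host."""
    seen, kept = set(), []
    for url in result_urls:
        host = urlparse(url).netloc.lower()
        if host not in seen:
            seen.add(host)
            kept.append(url)
    return kept
```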

However, these transient links are not peculiar to Yahoo!.  They may also be found in Bing and Google.  Yet many researchers are either unaware of these issues or do not scrub their data.

If the link data provided by Yahoo! is unreliable, then how reliable is the URL citation data that a search engine like Bing, Google, or Yahoo! might provide?  Again, search results are limited to 1,000 listings.  As with links, some listings contain or represent "ghost impressions": the URL citations are simply not there, either because the content on the pages has changed or because the search engines misinterpreted the data provided to them.
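One way to filter out ghost impressions is to re-verify each listing against the live page.  A rough sketch, assuming the researcher has a target URL to confirm and a list of listing URLs, and leaving out politeness controls such as rate limiting and robots.txt checks:

```python
# Rough ghost-impression filter: fetch each page a search engine claims
# contains a citation of target_url and confirm the URL string is present.
# Error handling and crawl politeness are deliberately omitted.
import urllib.request

def confirm_citation(listing_url, target_url, timeout=10):
    """Return True only if target_url literally appears in the fetched page."""
    try:
        with urllib.request.urlopen(listing_url, timeout=timeout) as resp:
            page = resp.read().decode("utf-8", errors="replace")
    except Exception:
        return False          # an unreachable page cannot confirm the citation
    return target_url in page

def scrub_ghosts(listing_urls, target_url):
    """Keep only listings where the citation can still be found on the page."""
    return [u for u in listing_urls if confirm_citation(u, target_url)]
```

A literal substring test is crude (it misses relative or shortened forms of the URL), but even this level of verification removes listings whose content has changed since the engine crawled them.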

The blogosphere complicates link and URL citation analysis because millions of blogs are configured to scroll their content from the front page (or a category/tag root page) onto secondary pages.  Hence, a link or URL citation found on the root page may scroll down to the second page quickly, leaving a ghost image in the search engines' databases.  Again, if research data drawn from search engines is not scrubbed, it leads to questionable analysis.

Further complicating these types of studies is the inability of search engines and similar tools to determine the purpose of a link or URL citation.  Links may be navigational, referential (with favorable, unfavorable, or neutral sentiment), detrimental (deployed in an attempt to associate a Website with "spam"), conspicuous, inconspicuous, transient, or annotated (such as with the "rel='nofollow'" attribute).  If link/URL citation analysis is being used to determine IMPACT or VALUE or IMPORTANCE, then the formulae researchers rely upon (typically modeled after Google's PageRank algorithm) are inherently flawed.
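A simplified PageRank-style iteration makes the structural point: in this family of formulae every edge carries the same weight, so a footer link, a spam link, and a careful editorial citation all contribute identically.  The toy graph representation below is illustrative only, not Google's implementation.

```python
# Illustrative, simplified PageRank iteration (dangling pages are ignored):
# every outgoing edge gets an equal share, regardless of the link's purpose.
def pagerank(graph, damping=0.85, iterations=50):
    """graph: dict mapping page -> list of pages it links to."""
    pages = set(graph) | {p for targets in graph.values() for p in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, targets in graph.items():
            if not targets:
                continue
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share   # purpose of the link is irrelevant
        rank = new_rank
    return rank
```

Nothing in the computation distinguishes a navigational edge from a referential one, which is precisely the flaw described above when such scores are read as impact or value.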

Another complication arises from the misuse of extra-index resources to analyze search listings.  Web marketers frequently use third-party tools to count links on the assumption that any two indices built from Web crawls will be composed of similar data.  However, the crawlers use undisclosed seed sets that virtually guarantee high degrees of variation between databases.  Additionally, each crawling service defines its own filtering criteria.
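The practical consequence is easy to quantify: when two third-party indices report the sets of pages linking to the same target, their overlap is often small.  A tiny sketch of that comparison, where the two input sets are simply whatever the tools under comparison return:

```python
# Measure agreement between two link indices as the Jaccard similarity
# of the sets of linking URLs they report for the same target.
def index_overlap(index_a, index_b):
    a, b = set(index_a), set(index_b)
    if not a and not b:
        return 1.0              # two empty reports trivially agree
    return len(a & b) / len(a | b)
```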

Numerous researchers have constructed models of searcher behavior from the query data that was released to the Web from AOL's search logs in 2006 and quickly withdrawn.  The AOL index is populated with Google-collected data, but AOL's search results have historically been adjusted according to AOL's own proprietary criteria.  The AOL data has also been used to analyze Website popularity and PageRank distribution.  However, such analyses fail to take into consideration the AOL user demographic of 2006, the relatively small size of the sample (approximately 20 million queries from roughly 650,000 users collected over three months), and the changing trends in search and Web publication in the years since.


If we cannot obtain reliable data from the search engines themselves, then how are we to conduct the large-scale analysis that Thelwall proposes?  One solution is for researchers to retain the services of crawl providers, and in fact many commercial tool providers do just this.  The crawl services may obtain their data by scraping search engine results (typically a violation of search engine guidelines) through anonymous networks; they may crawl the Web with their own crawlers; or they may aggregate data provided by other sources, including search engines willing to perform commissioned crawls.

Not everyone can pay for a crawl of the Web, however, and such crawls will not be complete.  That forces researchers to look at alternative methods, including data sampling.  In fact, data sampling is widely used by analytics tools that collect visitor data for individual Websites.  Web traffic analysis for high-profile sites relies extensively on statistical shortcuts that may hide significant behaviors in user and Website activity.
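A small, synthetic illustration of that risk: estimating the frequency of a rare behavior from a sample of visit records, with a rough normal-approximation margin of error.  The data and sample size below are invented purely to show the arithmetic.

```python
# Synthetic demonstration of how sampling can hide rare behaviors.
import math
import random

def estimate_rate(population, sample_size, predicate):
    """Estimate how often predicate holds, with a ~95% margin of error."""
    sample = random.sample(population, sample_size)
    hits = sum(1 for visit in sample if predicate(visit))
    p = hits / sample_size
    margin = 1.96 * math.sqrt(p * (1 - p) / sample_size)  # normal approximation
    return p, margin

# Invented data: 1 visit in 500 triggers the behavior of interest.
visits = ["rare" if i % 500 == 0 else "common" for i in range(1_000_000)]
rate, moe = estimate_rate(visits, 2_000, lambda v: v == "rare")
print(f"estimated rate: {rate:.4f} +/- {moe:.4f} (true rate is 0.0020)")
```

With a 2,000-visit sample the margin of error is roughly as large as the rate being estimated, so the behavior can plausibly appear anywhere from negligible to twice its true frequency; that is exactly the kind of significant activity a statistical shortcut can obscure.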

There are correlations between activity and value, and hence between referral data and the interconnectivity of Websites.  It is possible to define relationships between Web content in terms of "strong", "weak", "average", and so on.  We can even use social media URL citation analysis to confirm or challenge the relationships deduced from referral data.  But these measurements are all centered on individual Websites, or at least require access to internal data.  How does one measure the Web without access to the "real" data?

Some critics argue that it is impossible to measure the Web (in fact, I am one of them), but there is value in measuring portions of the Web.  However, we cannot allow ourselves to assume that one section of the Web resembles another.  The linking patterns of academia, for example, may not resemble the linking patterns of "mommy bloggers" or of journalists.  By the same token, the publishing habits of different sectors vary considerably.

Sampling Web communities is therefore perfectly fine as long as the samples are used in context.  We must create a panorama of sample-based studies in order to arrive at an approximation of what large portions of the Web may look like.  Using a patchwork of studies of different verticals allows our analyses to follow the trends in behaviors without forfeiting the intrinsic data that is common across all sectors.
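In statistical terms this patchwork resembles stratified estimation: each vertical keeps its own sample, and the per-vertical estimates are combined with weights reflecting how much of the study population each vertical is believed to represent.  The verticals, weights, and metric in the sketch below are hypothetical.

```python
# Hypothetical "patchwork" (stratified) combination of per-vertical estimates.
def combined_estimate(strata):
    """strata: dict of vertical -> (weight, estimate from that vertical's sample)."""
    total_weight = sum(weight for weight, _ in strata.values())
    return sum(weight * estimate for weight, estimate in strata.values()) / total_weight

# Invented per-vertical estimates of, say, mean outbound links per page.
strata = {
    "academic":   (0.10, 14.2),
    "news":       (0.25,  9.8),
    "blogs":      (0.40,  6.1),
    "commercial": (0.25,  4.3),
}
print(combined_estimate(strata))
```

The combination is only as good as the weights, which is why the samples must be used in context rather than treated as interchangeable pictures of "the Web".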

The future of Webometrics may follow a quilted patchwork of research and analysis because only localized research can be managed efficiently given the currently available tools and methods.