Webometrics faces many challenges, not the least of which is a dearth of tools capable of measuring the Web with any degree of accuracy. Academic and professional Webometrics analysts alike have mostly had to rely on a mix of search engine downloads and query operators. Yet even analysts from organizations with their own crawlers are constrained by the limitations of their methodologies and technologies.
For several years one of the most popular tools for measuring the Web has been Yahoo! Site Explorer. However, Yahoo!'s business model has evolved over the years, and its alliance with Microsoft's Bing search service has forced Webometrics analysts to re-evaluate their methods. One proposed alternative is the analysis of URL citations (unlinked page URLs used in articles and papers found on the Web).
In fact, in 2009 Mike Thelwall wrote: "... if Yahoo and its affiliates withdraw their advanced hyperlink queries so that URL citations would have to be used instead of links for many studies, then this would seriously undermine the power of link analysis, by a factor of about ten, except for academic spaces. This would probably mean that only very large scale link analyses and academic link analyses could be sustained."
He does not overstate the magnitude of the problem because URL citation analysis follows a pattern that has been criticized in the study of scientific literature; in 2008 the Math Union argued that "using citation data to assess research ultimately means using citation‐based statistics to rank things—journals, papers, people, programs, and disciplines. The statistical tools used to rank these things are often misunderstood and misused." Citation analysis is as problematic when applied to the Web as to scientific literature, for much the same reasons outlined by the Math Union paper.
However, these transient links are not peculiar to Yahoo!. They may also be found in Bing and Google. Yet many researchers are either unaware of these issues or do not scrub their data.
If the link data provided by Yahoo! was unreliable, then how reliable is the URL citation data that a search engine like Bing, Google, or Yahoo! might provide? Again, search results are limited to 1,000 listings. As with links, some listings contain or represent "ghost impressions". The URL citations are simply not there -- either because the content on the pages has changed or because the search engines misinterpreted the data provided to them.
The blogosphere complicates link and URL citation analysis because millions of blogs are configured to scroll their content from the front page (or a category/tag root page) to secondary pages. Hence, a link or URL citation found on the root page may quickly scroll to the second page, leaving a ghost image in the search engines' databases. Again, if research data drawn from search engines is not scrubbed, it leads to questionable analysis.
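One way to picture the scrubbing step is to re-fetch each listed page and check whether the reference is still there. The following is only a minimal sketch under stated assumptions: the names `classify_reference` and `scrub_listings` are hypothetical, the `fetch_html` callable is supplied by the researcher, and matching is done by simple substring comparison, which a real study would likely need to refine (URL normalization, redirects, paginated archives).

```python
from html.parser import HTMLParser


class _ReferenceScanner(HTMLParser):
    """Collects anchor href values and text content from one HTML page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

    def handle_data(self, data):
        self.text_parts.append(data)


def classify_reference(html, target):
    """Return 'link', 'url_citation', or None for one page.

    Assumption: 'target' is a plain URL fragment such as
    'example.org/paper', matched by substring.
    """
    scanner = _ReferenceScanner()
    scanner.feed(html)
    scanner.close()
    if any(target in href for href in scanner.hrefs):
        return "link"
    if target in "".join(scanner.text_parts):
        return "url_citation"
    return None


def scrub_listings(listings, target, fetch_html):
    """Split search listings into confirmed references and ghosts.

    'fetch_html' is a hypothetical callable that returns the current
    HTML for a listed URL; how pages are retrieved is left open.
    """
    confirmed, ghosts = [], []
    for url in listings:
        kind = classify_reference(fetch_html(url), target)
        (confirmed if kind else ghosts).append(url)
    return confirmed, ghosts
```

A listing whose page no longer contains the link or the unlinked URL lands in the `ghosts` list and can be excluded before any counting is done.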
Further complicating these types of studies is the inability of search engines and similar tools to determine the purpose of a link or URL citation. Links may be navigational, referential (combined with favorable or unfavorable sentiment or in a neutral state), detrimental (deployed in an attempt to associate a Website with "spam"), conspicuous, inconspicuous, transient, or annotated (such as with the "rel='nofollow'" attribute). If link/URL citation analysis is being used to determine IMPACT or VALUE or IMPORTANCE then the formulae researchers rely upon (typically modeled after Google's PageRank algorithm) are inherently flawed.
Another complication arises from the misuse of extra-index resources to analyze search listings. Web marketers frequently use third-party tools to count links on the assumption that any two indices built from Web crawls may be composed of similar data. However, the crawlers use undisclosed seed sets that virtually assure high degrees of variation between databases. Additionally, each crawling service defines its own filtering criteria.
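The degree of variation between two indices can be checked rather than assumed. As a rough sketch, assuming the researcher has exported the linking URLs that two different tools report for the same target, the overlap can be summarized with a Jaccard coefficient. The function name `index_overlap` and the input format are illustrative assumptions, not any provider's API.

```python
def index_overlap(links_a, links_b):
    """Compare the linking URLs reported by two independent indices.

    Both arguments are iterables of URL strings. URLs are compared
    as-is; normalization is left out of this sketch.
    """
    a, b = set(links_a), set(links_b)
    union = a | b
    shared = a & b
    # Jaccard coefficient: shared items divided by all distinct items.
    jaccard = len(shared) / len(union) if union else 1.0
    return {
        "only_a": len(a - b),
        "only_b": len(b - a),
        "shared": len(shared),
        "jaccard": jaccard,
    }
```

A coefficient close to 1 would suggest that the two indices are roughly interchangeable for that target, while a low value would indicate that counts taken from one index say little about the other.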
Numerous researchers have constructed models of searcher behavior from data accidentally leaked to the Web from AOL's search index. The AOL index is populated with Google-collected data but AOL's search results have historically been adjusted according to AOL's proprietary criteria. The AOL data has also been used to analyze Website popularity and PageRank distribution. However, such analyses fail to take into consideration the AOL user demographic of 2006 (the year the data was released), the relatively small size of the sample (approximately 20 million queries from 650,000 users collected over 3 months), and the changing trends in search and Web publication in the years since.
If we cannot obtain reliable data from the search engines themselves, then how are we to conduct the large-scale analysis that Thelwall proposes? One solution is for researchers to retain the services of crawl providers, and in fact many commercial tool providers do just this. The crawler services may obtain their data by scraping search engine results (typically a violation of search engine guidelines) through anonymous networks; or they may crawl the Web with their own crawlers; or they may aggregate data provided by other providers, including search engines willing to perform commissioned crawls.
Not everyone can pay for a crawl of the Web, however, and such crawls will not be complete. That forces researchers to look at alternative methods, including sampling data. In fact, data sampling is widely used by analytics tools that collect visitor data for individual Websites. Web traffic analysis for high-profile sites relies extensively on statistical shortcuts that may hide significant behaviors in user and Website activity.
There are correlations between activity and value and hence between referral data and interconnectivity between Websites. It is possible to define relationships between Web content in terms of "strong", "weak", "average", etc. We can even use social media URL citation analysis to confirm or challenge the relationships deduced from referral data. But these measurements are all centered on individual Websites or at least require access to internal data. How does one measure the Web without access to the "real" data?
Some critics argue that it is impossible to measure the Web (in fact, I am one of them) but there is value in measuring portions of the Web. However, we cannot allow ourselves to assume that one section of the Web resembles another. The linking patterns of academia, for example, may not resemble the linking patterns of "mommie bloggers" or the patterns of journalists. By the same token, the publishing habits of different sectors vary considerably.
Sampling Web communities is therefore perfectly fine as long as the samples are used in context. We must create a panorama of sample-based studies in order to arrive at an approximation of what large portions of the Web may look like. Using a patchwork of studies of different verticals allows our analyses to follow the trends in behaviors without forfeiting the intrinsic data that is common across all sectors.
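Keeping samples "in context" can be sketched as reporting each vertical separately instead of pooling everything into a single Web-wide figure. The example below is illustrative only: the function name `summarize_by_vertical`, the choice of outbound links per page as the metric, and the input format are all assumptions made for the sake of the sketch.

```python
from math import sqrt
from statistics import mean, stdev


def summarize_by_vertical(samples):
    """Summarize a per-page metric separately for each vertical.

    'samples' maps a vertical name (e.g. 'academic') to a list of
    per-page counts, such as outbound links per sampled page.
    """
    summary = {}
    for vertical, counts in samples.items():
        n = len(counts)
        # The standard error needs at least two observations.
        std_error = stdev(counts) / sqrt(n) if n > 1 else None
        summary[vertical] = {
            "n": n,
            "mean": mean(counts),
            "std_error": std_error,
        }
    return summary
```

Each vertical keeps its own sample size and uncertainty, so a small sample from one sector is never silently averaged into a conclusion about another.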
The future of Webometrics may follow a quilted patchwork of research and analysis because only localized research can be managed efficiently given the currently available tools and methods.