Deep Web Interferometry compares trendline curves from multiple synchronized data sources. Interferometric analysis of Web metrics data increases the clarity of meaningful data points by isolating events. For example, given two Websites that track a popular sport, one Website may experience a weekly peak in traffic on Monday while the other may have two smaller peaks, on Wednesday and Saturday. During a major tournament, however, both sites experience sharp peaks during the games. These game-driven spikes appear in both sites' trendlines.
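The comparison described above can be sketched in a few lines. The traffic numbers below are invented for illustration: each site has its own weekly rhythm, but only the tournament day spikes in both series at once.

```python
# Hypothetical daily traffic for two sports sites over two weeks (synthetic data).
# A spike on the same day in BOTH series suggests an external event (e.g. a
# tournament) rather than either site's own weekly rhythm.

def shared_spike_days(series_a, series_b, factor=1.5):
    """Return indices where both series exceed `factor` times their own mean."""
    mean_a = sum(series_a) / len(series_a)
    mean_b = sum(series_b) / len(series_b)
    return [i for i in range(len(series_a))
            if series_a[i] > factor * mean_a and series_b[i] > factor * mean_b]

site_a = [120, 90, 80, 85, 88, 82, 95, 300, 110, 92, 84, 86, 90, 97]  # Monday peaks
site_b = [60, 55, 90, 58, 57, 88, 59, 210, 62, 56, 91, 57, 58, 87]    # Wed/Sat peaks

print(shared_spike_days(site_a, site_b))  # only the shared event day: [7]
```

A fixed multiple of the mean is a crude threshold; in practice one would control for each site's weekly seasonality before comparing.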
We view event trendlines on the deep Web by looking at data sources beyond our control. That is, the Web metrics we use to observe our local Web space constitute an insular pool of data. We have to compare our local data to local data from other sources. Deep Web data sources might include reporting tools made available by third-party services, academic studies published through universities, white papers and case studies published by companies that crawl the Web, and so forth.
Social media services may publish data that can be used to plot trendlines suitable for comparison with typical Web analytics data. However, the more differentiated the type of data is, the more reliable the comparative analysis becomes in identifying significant events. Events are data points. If you click on a link, that's an event. If you view a Web page, that's an event. If you log on to the Internet, that's an event. Your session is an event, the things you do within the session are events, and YOU are an event.
That is, webometry seeks to measure everything, including the number of people using the Web, the number of Web "sites" that exist, the number of queries people use on search engines, the number of sessions a typical surfer initiates within a specific timeframe, the interactions per session per visitor per Website -- everything that can be logged is measurable. These measurements, however, are not very reliable.
For example, how many people are accessing the Internet from the computer on my lap right now? Even my Internet service provider has no way of knowing. I could switch users to allow someone else to briefly search for a recipe and then switch back again. If I have a router I could allow multiple computers to attach to my Internet account, each one performing different tasks. In short, my Internet service provider has no way of knowing how many of me there are at any time.
Lacking any means of determining who is real and who is not real (Mr. Schrödinger might appreciate the irony in this limitation), analytics companies have settled on the convention of tracking "sessions". We can statistically reduce some of the confusion by looking for behaviors that a human is more likely to engage in than a robot or a cat. These behaviors might include revisiting the same site more than once under different conditions, interacting with widgets on a Website, activating alternative software to perform additional functions, and so on. In short, the more a visitor acts with purpose, the less likely the visitor is to be non-human.
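A toy version of this "purposefulness" heuristic might look like the following. The signal names and weights are assumptions invented for illustration, not features of any real analytics product.

```python
# Hypothetical behavioral signals and weights: the more purposeful behaviors a
# session exhibits, the less likely the visitor is non-human.

PURPOSE_SIGNALS = {
    "return_visit": 2,        # revisited the site under different conditions
    "widget_interaction": 3,  # interacted with an on-page widget
    "used_alternate_app": 2,  # activated alternative software
    "varied_navigation": 1,   # did not follow one mechanical click path
}

def purposefulness(session_signals):
    """Sum the weights of the purposeful behaviors observed in a session."""
    return sum(PURPOSE_SIGNALS.get(s, 0) for s in session_signals)

def likely_human(session_signals, threshold=4):
    # Threshold is arbitrary here; a real system would calibrate it.
    return purposefulness(session_signals) >= threshold

print(likely_human(["return_visit", "widget_interaction"]))  # True
print(likely_human(["varied_navigation"]))                   # False
```

As the next paragraph notes, a sufficiently careful robot can fake every one of these signals, so a score like this shifts probabilities rather than proving anything.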
But even robots can act with purpose, so we'll never know just how many people are on the Internet and how many Internet users are not people at all. Estimates of the number of people with access to the Internet are vague and nonspecific. If a three-year-old child beating happily on a keyboard is good enough to be counted as an Internet user, our Web metrics will have us singing "ring around the rosey" all day long. That's too hokey pokey.
Webometry needs something like a Turing Test to "prove" that a visitor is neither a robot nor a cat nor a teething three-year-old playing with the wireless mouse. Such a Turing Test has not yet been devised (it would make an interesting challenge, though). Hence, in the absence of the surety of knowing who or what is visiting a Website at a particular time we have to use interferometric strategies to align our data with other, dissimilar data that confirms what we suspect.
Let's assume we know the traffic curve for a Website for a 60-day period. Around the 40th day the traffic on the site spikes for several hours and then plunges for 3 days. What happened? Isolated metrics cannot explain the events (the spike and the drop in traffic). We might turn to social media to see if there was a corresponding increase in discussion around topics found on the Website. If Twitter experiences a huge increase in traffic around a World Cup event and our sports site (offering related information) experiences such a peak as well, then we can infer that there may be a connection between the two events.
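This corroboration step can be sketched as follows. The 60-day series below are synthetic; real trendlines would come from your own analytics export and a third-party reporting tool.

```python
# Find spike days in the site's 60-day curve, then check whether an external
# series (say, tweet volume on related topics) also spiked within a day of each.

def spike_days(series, factor=2.0):
    """Days on which the series exceeds `factor` times its mean."""
    mean = sum(series) / len(series)
    return [i for i, v in enumerate(series) if v > factor * mean]

def corroborated(site_series, external_series, window=1):
    """Site spike days that have a nearby spike in the external series."""
    ext = set(spike_days(external_series))
    return [d for d in spike_days(site_series)
            if any((d + off) in ext for off in range(-window, window + 1))]

# Flat traffic with a spike around day 40 in both series (synthetic).
site = [100] * 60
tweets = [5000] * 60
site[40] = 900
tweets[40] = 40000

print(corroborated(site, tweets))  # [40]
```

A shared spike still only suggests a connection; as the text goes on to say, the content visitors actually viewed has to be examined before inferring a causal link.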
We would have to analyze the content that visitors actually looked at to get a better understanding of the possible relationship between the World Cup and the Website's content. It could be there is no correlation, in which case we have to look elsewhere for data. Maybe an alternative sports Website went offline. In fact, it often happens that traffic shifts from one Website to another when users cannot access a favorite destination.
Web data centers don't often go offline but they can and do experience strategic failure events. Traffic patterns shift immediately when users cannot access their favorite Websites. Internet Service Providers may also experience strategic failures, in which huge segments of the online population vanish from the grid. These types of correlations may be easily determined from news reports if the outages are widespread enough. It's the smaller outages that reduce the quality of the data we measure.
Other types of data must be equally validated or cleansed. For example, how many real users are there on Facebook or Twitter? Marketers have set up millions of fake accounts on both services (and many others). Incidental accounts also run into large numbers. An incidental account might exist because a user has multiple email addresses. Login sessions within a recent timeframe provide a much better measure of how many people may be using a service, but some robots bypass the APIs and attempt to manipulate interfaces as if they were actual visitors clicking on links.
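The "recent sessions" measure amounts to counting distinct accounts with a login inside a window, rather than counting registered accounts. A minimal sketch, with field names and timestamps (in days) invented for illustration:

```python
# Count distinct accounts with at least one session in the last `window_days`,
# instead of counting registered accounts (which fakes and incidental
# duplicates inflate).

def active_accounts(sessions, now, window_days=30):
    """Distinct account ids seen within the window ending at `now`."""
    return {s["account"] for s in sessions if now - s["day"] <= window_days}

sessions = [
    {"account": "alice",    "day": 95},
    {"account": "alice",    "day": 99},   # multiple sessions, one account
    {"account": "bob",      "day": 40},   # registered but stale
    {"account": "spambot7", "day": 98},   # robots still slip through this measure
]

print(len(active_accounts(sessions, now=100)))  # 2
```

Note that the robot still counts here, which is exactly the limitation the paragraph above describes: session counting trims stale and duplicate accounts but cannot by itself separate people from software.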
It is easy enough to autogenerate content for social media, for blogs, for email. All the major news and information services do this, so the amount of email being passed around the Internet, the number of Tweets and Facebook updates posted per day all provide very noisy, ambiguous data. If we want to know how many people are actually online we have to monitor their activity through multiple data sources. If we want to know how many people are using social media, we have to verify who the real users are through some process other than merely counting sessions and activity.
Of course we can attempt to analyze our data streams and apply traditional statistical measures to them, looking for non-standard behavior. However, if robots constitute a significant minority of a service's users, what is "standard" behavior, especially if many of the robots are careful enough to mimic only minimal real user behavior? A social media robot can be downloaded for less than $100. There are free WordPress plugins that one can use to set up blogs which copy content from other blogs and then post automated notifications to Twitter and Facebook.
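A traditional statistical screen of the kind mentioned above might flag sessions whose activity deviates from the observed norm by more than a few standard deviations. The caveat from the paragraph applies directly: if robots make up much of the baseline, the "norm" is itself contaminated, so a screen like this only catches the careless ones. The click counts below are invented.

```python
# Flag values more than k population standard deviations from the mean.
import statistics

def outliers(values, k=2.0):
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return []  # no variation, nothing to flag
    return [i for i, v in enumerate(values) if abs(v - mean) > k * sd]

# Clicks per session; index 5 is a crude robot hammering links.
clicks = [12, 9, 14, 11, 10, 480, 13, 8]
print(outliers(clicks))  # [5]
```

A robot that throttles itself to 10 or 12 clicks per session would pass this test untouched, which is why such screens are a supplement to, not a substitute for, cross-source comparison.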
Some marketers aggregate information from other Websites; a few marketers aggregate aggregated data. Some marketers "spin" content from source texts, creating hundreds, thousands, or hundreds of thousands of variations of pseudo-articles and comments. The search engines and social media services filter out some of this junk but a lot of it gets through.
All the old myths about using links to determine what is "high quality" or trustworthy have long since been dispelled by the massive growth in spam software. Even real human visitors can fake their own traffic to Websites by downloading toolbars that secretly "visit" other Websites on their behalf. The humans may not even realize they have downloaded the toolbars.
These issues are not new. Web analysts have dealt with them in varying degrees for years. Counter-measures are developed almost as rapidly as vulnerabilities are identified and exploited. But the adoption of counter-measures is neither uniform nor consistent, so data streams are polluted at an unpredictable rate. That complicates the task of interferometric analysis, although the more trendlines from disparate sources you compare in your analysis the more reliable your determination of events will be.
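The closing point above, that more disparate trendlines make event determination more reliable, can be expressed as a simple quorum over sources. The data here is synthetic and the quorum of three is an arbitrary choice for illustration.

```python
# Count, per day, how many independent sources show a spike; keep the days
# that clear a quorum. A spike confirmed by several polluted-but-independent
# streams is more credible than a spike in any single stream.

def spike_days(series, factor=2.0):
    mean = sum(series) / len(series)
    return {i for i, v in enumerate(series) if v > factor * mean}

def confirmed_events(sources, quorum=3):
    """Days on which at least `quorum` sources spike."""
    votes = {}
    for series in sources:
        for day in spike_days(series):
            votes[day] = votes.get(day, 0) + 1
    return sorted(d for d, n in votes.items() if n >= quorum)

base = [10] * 30
site, tweets, searches, rival = list(base), list(base), list(base), list(base)
for s in (site, tweets, searches):
    s[12] = 100            # three sources agree on day 12
rival[25] = 100            # one source spikes alone: likely noise

print(confirmed_events([site, tweets, searches, rival]))  # [12]
```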
These challenges don't invalidate Webometrics. They merely make the task of measuring things more interesting, if somewhat less reliable. However, knowing what we are dealing with, we are better equipped to understand the data we extract from our measurements. We may not know how large the Web is, how many people use it, or whether all of today's activity is "normal", but we can build a pretty good picture of what to expect and what to look for as we analyze more trendlines and look for agreement, contradiction, and dissimilarity in the curves.