This is the virtual universe we live in every day. We call it the Internet. The bubble around you is the collection of Websites you read and comment on. The sea through which your bubble drifts is the Searchable Web. The darkness beyond the Searchable Web is everything else -- everything that you cannot get to, or which has no meaning to you.
Millions of people view the Searchable Web through the lens of Google. Millions of other people use the lens of Yandex (the leading Russian-language search engine). Millions more use the lens of Baidu (the leading Chinese-language search engine).
Through any one of these lenses you may see a virtual universe of Websites, but you won't see the same universe through all three. Yandex and Baidu both index Websites written in languages other than their core languages. Yandex made a very public entrance into the English Web, for example; Baidu has made a less public entrance into that Web.
How large is the Internet? How large is the World Wide Web? How large is the Searchable Web? Several studies have attempted to answer these questions but their methods and estimates are obsolete. The Web is such a dynamic environment that it cannot be measured.
Imagine that the speed of light is not a limit to our ability to observe the universe. Imagine that we can see, in real time, as much of the universe as we think we have mapped. Now imagine that every day whole galaxies vanish and whole galaxies appear in different parts of the universe. Some of these galaxies allow us to look into them; other galaxies obscure their stars from outside observation.
In this chaotic real-time universe, where things pop into and out of existence, other things are morphing. What was a spiral galaxy yesterday may morph into a protogalaxy -- a mass of unshaped gas -- today. A protogalaxy may suddenly morph into a minor young galaxy. And the filaments of gravity and dark matter that connect the galaxies and protogalaxies are constantly shifting, changing.
That is the World Wide Web. It operates according to its own set of laws. There are real principles of cause and effect, true limits to what can be accomplished. The Web is a virtual universe that exists and functions and lives like a universe within the boundaries of the machines that we connect together through the Internet protocol.
I have participated in attempts to estimate the size of the Web, or portions of the Web. It's impossible to know at any one time how many servers and clients are connected to the Internet but there is a finite limit. We know the limit is real but we cannot find it. It is impossible to know at any one time how many Websites are hosted on the connected servers even though there is a real limit to that number, and we cannot find that limit, either.
Using a search engine to estimate the size of the Web is equivalent to using a pie plate to hold all the food in a restaurant. You can do it. You just cannot place all the food there at once. You can see all of the Web through a search engine; you just cannot see it all at once.
Not only do Websites morph, vanish, and burst into existence continually, but search engines are also continually filtering their data, changing their definitions of what constitutes discrete Web content, and limiting the information they share with searchers.
To date, search engines have publicly claimed to have crawled about 1,000,000,000,000 URLs. Many of those URLs were really ghost images -- duplicate URLs served with session IDs, or presented through alternate taxonomies, or otherwise generated dynamically. The 1 trillion number is probably only a small fraction of the whole Web, and perhaps as many as 10-15% of those claimed URLs (100 to 150 billion) have already vanished.
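To make the "ghost image" problem concrete, here is a minimal Python sketch of how a crawler might collapse duplicate URLs that differ only by a session ID before counting them. It is an illustration under stated assumptions, not a description of how any search engine actually does it: the parameter names in SESSION_PARAMS and the helper names canonicalize and count_distinct are hypothetical, and stripping everything after a ";" in the path is a deliberately crude rule.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: a hypothetical list of query parameters treated as session IDs.
SESSION_PARAMS = {"sid", "sessionid", "phpsessid", "jsessionid"}


def canonicalize(url: str) -> str:
    """Reduce a URL to a canonical form that ignores session identifiers."""
    parts = urlsplit(url)
    # Drop session parameters, then sort the rest so ordering does not matter.
    query = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    )
    # Crude assumption: anything after ';' in the path is a session token.
    path = parts.path.split(";", 1)[0] or "/"
    # Scheme and host are case-insensitive; the fragment is never sent to a server.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, urlencode(query), "")
    )


def count_distinct(urls) -> int:
    """Count URLs after collapsing session-ID duplicates."""
    return len({canonicalize(url) for url in urls})


crawled = [
    "http://Example.com/page?sid=abc&id=7",
    "http://example.com/page?id=7&sid=xyz",
    "http://example.com/page;jsessionid=123?id=7",
    "http://example.com/other",
]
distinct = count_distinct(crawled)
```

Here four crawled URLs collapse to two distinct pages. A rule set like this can only catch the duplicates it has been told about; alternate taxonomies and other dynamically generated URLs would need different heuristics, which is part of why a raw URL count says so little about the real size of the Web.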
We cannot find all these URLs but we can almost instantly connect to any URLs we find. Our browsers are even permitted to see URLs that the search engines are not permitted to see. And on some Websites we see content that the search engines are not permitted to see. This "cloaked" content looks like one thing to a search engine and another to us, but it is served over the same URL.
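The idea of one URL showing two faces can be sketched in a few lines of Python. This is only an illustration: the user-agent strings, the looks_cloaked helper, and the fetch callable (which stands in for whatever HTTP client you prefer) are all hypothetical names introduced here, and the comparison assumes the page would otherwise return identical bytes on every request.

```python
import hashlib
from typing import Callable

# Assumption: hypothetical user-agent strings, not those of any real crawler or browser.
CRAWLER_UA = "ExampleBot/1.0"
BROWSER_UA = "Mozilla/5.0 (ExampleBrowser)"


def fingerprint(body: bytes) -> str:
    """Return a stable digest of a response body."""
    return hashlib.sha256(body).hexdigest()


def looks_cloaked(url: str, fetch: Callable[[str, str], bytes]) -> bool:
    """Fetch the same URL as a crawler and as a browser and compare the bodies.

    `fetch(url, user_agent)` is supplied by the caller and returns the raw body.
    """
    as_crawler = fingerprint(fetch(url, CRAWLER_UA))
    as_browser = fingerprint(fetch(url, BROWSER_UA))
    return as_crawler != as_browser


def fake_fetch(url: str, user_agent: str) -> bytes:
    """Stand-in server: cloaks one path, serves the other path honestly."""
    if url.endswith("/cloaked") and user_agent == CRAWLER_UA:
        return b"keyword-rich page for the search engine"
    return b"page for human visitors"


cloaked = looks_cloaked("http://example.com/cloaked", fake_fetch)
honest = looks_cloaked("http://example.com/plain", fake_fetch)
```

A byte-for-byte comparison like this would flag many innocent pages, because dynamic content changes between requests, and it would miss cloaking that keys on something other than the user agent. It shows the shape of the question, not a reliable detector.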
The Web is a living mechanism, evolving, growing, processing and converting new material. We have not yet developed a science capable of documenting this living mechanism beyond the crude measurements that Webometrics has offered. We simplify our view of the Web by thinking in terms of documents, links, and hosts. We limit our interpretation of the available data by borrowing concepts from network theory. We are still struggling to find the right metaphors to help us understand the abstractions we need to interpret what we see.
There are philosophical questions that have received little attention. For example, does a Website exist? If so, where does it exist? Does it exist only on the server that stores its logical patterns? What if that pattern is scattered across multiple servers? Does it exist in the client machines that connect to the Website (mobile phones, PCs, search engine crawl servers, etc.)? Is the cache image your browser creates part of the Web or just an echo of it? There are links in that cache that your browser follows to connect to other Web content.
We can argue that the Web exists only in the cache files our browsers create and maintain -- except there are live streams of data (audio and video files) being served that are not cached. Is a streamed file part of the Web or is it just a filament of information transported by, through, or upon the Web that is separate from it?
If we do not address these philosophical questions then our attempts to study the Web, to identify the laws that govern its existence and its operations, are flawed. We need the philosophy to tell us what we are studying, to define what is and is not the Web, part of the Web, beyond the Web.
This is my universe -- the universe of questions about the Web, what it is, what it does. It is immense. It is personal. It is ever changing. It is fascinating.