    Reflections of the Realized Imagination
    By Michael Martinez | August 10th 2011 11:07 AM | 6 comments

    Imagine you are living within a bubble.  Your bubble stretches in any direction but you are always contained within it.  Your bubble is adrift in a sea that has no bottom, only a surface.  The surface surrounds your bubble, so the sea itself is only a bubble.  Beyond the surface lies darkness.

    This is the virtual universe we live in every day.  We call it the Internet.  The bubble around you is the collection of Websites you read and comment on.  The sea through which your bubble drifts is the Searchable Web.  The darkness beyond the Searchable Web is everything else -- everything that you cannot get to, or which has no meaning to you.

    Millions of people view the Searchable Web through the lens of Google.  Millions of other people use the lens of Yandex (the leading Russian-language search engine).  Millions more use the lens of Baidu (the leading Chinese-language search engine).

    Through any one of these lenses you may see a virtual universe of Websites, but you won't see the same universe through all three.  Yandex and Baidu both index Websites outside their core languages.  Yandex made a very public entrance into the English Web, for example; Baidu has made a less public entrance into that Web.

    How large is the Internet?  How large is the World Wide Web?  How large is the Searchable Web?  Several studies have attempted to answer these questions but their methods and estimates are obsolete. The Web is such a dynamic environment that it cannot be measured.

    Imagine that the speed of light is not a limit to our ability to observe the universe.  Imagine that we can see, in real time, as much of the universe as we think we have mapped.  Now imagine that every day whole galaxies vanish and whole galaxies appear in different parts of the universe.  Some of these galaxies allow us to look into them; other galaxies obscure their stars from outside observation.

    In this chaotic real-time universe where things pop into and out of existence other things are morphing.  What was a spiral galaxy yesterday may morph into a protogalaxy -- a mass of unshaped gas -- today.  A protogalaxy may suddenly morph into a minor young galaxy.  And the filaments of gravity and dark matter that connect the galaxies and protogalaxies are constantly shifting, changing.

    That is the World Wide Web.  It operates according to its own set of laws.  There are real principles of cause and effect, true limits to what can be accomplished.  The Web is a virtual universe that exists and functions and lives like a universe within the boundaries of the machines that we connect together through the Internet protocol.

    I have participated in attempts to estimate the size of the Web, or portions of the Web.  It's impossible to know at any one time how many servers and clients are connected to the Internet but there is a finite limit.  We know the limit is real but we cannot find it.  It is impossible to know at any one time how many Websites are hosted on the connected servers even though there is a real limit to that number, and we cannot find that limit, either.

    Using a search engine to estimate the size of the Web is equivalent to using a pie plate to hold all the food in a restaurant.  You can do it.  You just cannot place all the food there at once.  You can see all of the Web through a search engine; you just cannot see it all at once.

    Not only do Websites morph, vanish, and burst into existence continually; search engines are also continually filtering their data, changing their definitions of what constitutes discrete Web content, and limiting the information they share with searchers.

    To date, search engines have publicly claimed to have crawled about 1,000,000,000,000 URLs.  Many of those URLs were really ghost images -- duplicate URLs served with session IDs, or presented through alternate taxonomies, or otherwise generated dynamically.  The 1 trillion number is probably only a minority fraction of the whole Web and perhaps as much as 10-15% of those claimed URLs have already vanished.
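
    The session-ID problem described above can be sketched in a few lines: a crawler that canonicalizes URLs before counting them collapses many of these ghost duplicates.  This is an illustrative Python sketch only; the parameter names in SESSION_PARAMS are assumed examples, not an authoritative list.

```python
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

# Query parameters commonly used as session identifiers (illustrative list)
SESSION_PARAMS = {"sid", "sessionid", "phpsessid", "jsessionid"}

def canonicalize(url):
    """Collapse 'ghost' URLs by dropping session-ID parameters and
    normalizing host case and trailing slashes, so duplicates compare equal."""
    parts = urlparse(url)
    query = [(k, v) for k, v in parse_qsl(parts.query)
             if k.lower() not in SESSION_PARAMS]
    path = parts.path.rstrip("/") or "/"
    return urlunparse((parts.scheme.lower(), parts.netloc.lower(),
                       path, parts.params, urlencode(sorted(query)), ""))

urls = [
    "http://Example.com/page?sid=abc123",
    "http://example.com/page/",
    "http://example.com/page?PHPSESSID=zzz",
]
print({canonicalize(u) for u in urls})  # one canonical URL, not three
```

    Real crawlers go much further (sorting equivalent hosts, resolving redirects, honoring rel="canonical"), but even this crude normalization shows why a raw count of crawled URLs overstates the size of the Web.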

    We cannot find all these URLs but we can almost instantly connect to any URLs we find.  Our browsers are even permitted to see URLs that the search engines are not permitted to see.  And on some Websites we see content that the search engines are not permitted to see.  This "cloaked" content looks like one thing to a search engine and another to us, but it is served over the same URL.
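
    Server-side, cloaking is often nothing more than a branch on the request's User-Agent header.  A minimal sketch, in which the bot signatures and page bodies are hypothetical stand-ins:

```python
# Minimal sketch of server-side cloaking: the same URL yields different
# content depending on who appears to be asking.  The signatures below
# are illustrative, not an exhaustive or authoritative list.
BOT_SIGNATURES = ("googlebot", "bingbot", "yandexbot", "baiduspider")

def serve(url, user_agent):
    # url is unused in this sketch; a real server would route on it
    if any(bot in user_agent.lower() for bot in BOT_SIGNATURES):
        return "<html>version shown to search engine crawlers</html>"
    return "<html>version shown to human visitors</html>"

crawler_view = serve("/page", "Mozilla/5.0 (compatible; Googlebot/2.1)")
human_view   = serve("/page", "Mozilla/5.0 (Windows NT 6.1)")
print(crawler_view != human_view)  # True: one URL, two documents
```

    Real cloaking schemes also key on IP ranges rather than the easily forged User-Agent string, but the principle is the same: the URL alone does not determine the document.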

    The Web is a living mechanism, evolving, growing, processing and converting new material.  We have not yet developed a science capable of documenting this living mechanism beyond the crude measurements that Webometrics have offered.  We simplify our view of the Web by thinking in terms of documents, links, and hosts.  We limit our interpretation of the available data by borrowing concepts from network theory.  We are still struggling to find the right metaphors to help us understand the abstractions we need to interpret what we see.

    There are philosophical questions that have received little attention.  For example, does a Website exist?  If so, where does it exist?  Does it exist only on the server that stores its logical patterns? What if that pattern is scattered across multiple servers?  Does it exist in the client machines that connect to the Website (mobile phones, PCs, search engine crawl servers, etc.)?  Is the cache image your browser creates part of the Web or just an echo of it?  There are links in that cache that your browser follows to connect to other Web content.

    We can argue that the Web exists only in the cache files our browsers create and maintain -- except there are live streams of data (audio and video files) being served that are not cached.  Is a streamed file part of the Web or is it just a filament of information transported by, through, or upon the Web that is separate from it?

    If we do not address these philosophical questions then our attempts to study the Web, to identify the laws that govern its existence and its operations, are flawed.  We need the philosophy to tell us what we are studying, to define what is and is not the Web, part of the Web, beyond the Web.

    This is my universe -- the universe of questions about the Web, what it is, what it does.  It is immense.  It is personal.  It is ever changing.  It is fascinating.


    Gerhard Adam
    For example, does a Website exist?  If so, where does it exist?  Does it exist only on the server that stores its logical patterns? What if that pattern is scattered across multiple servers?
    Most of these questions seem somewhat trivial and read as minor reflections of the older "brain/mind" issues in philosophy.  In the case of the servers and computers, we know that there are specific data patterns and storage areas that physically exist, so that part is rather straightforward.  The question of replicated data is affected more by the ability to refresh data than by the transfer itself.  Therefore, we have to consider the aging effect of data that may reside on different servers at different levels of currency.

    There is no "pattern scattered across multiple servers".  Data resides in one place only, or copies exist.  It may be retrieved from multiple locations in the same way that you might reference more than one book to look up some piece of information.  The primary element that is unique here is that it represents a diverse set of symbols which can be propagated to systems where they can be interpreted by an individual.  In that sense, the entire process is really a large abstraction.

    However, in the end, I'm not convinced that there's anything more philosophical involved.  We've essentially traded the black/white printed page for the 0s and 1s of an electronic page.
    Mundus vult decipi
    Michael Martinez
    "Data resides in one place only, or copies exist."

    And that IS a pattern scattered across multiple machines (servers or clients).  Search engines deal with this problem all the time.  However, ecommerce servers also have to deal with it.

    In fact, I am logged in to this Website from two different machines in two different physical locations.

    What seems trivial on first glance is in fact far from trivial.  A great deal of programming effort has been expended through the years to manage persistence, duplication, uniqueness, and states.  There is not just a man behind the curtain, there is a multitude of men behind the curtain.

    However, what we can't agree on is what to measure.  The project I mentioned above was informally called "the size of the universe" because it arose from a discussion in which I compared the Web to the universe.  The Web, I pointed out, is constantly expanding (changing would be more accurate).  We were tasked with measuring that constantly expanding universe.  We found that we lacked the appropriate axioms and definitions to compile reasonable measurements.

    It's far from trivial because so many little details get in the way.  I plan to write about this more in the future.

    Thanks for commenting.
    Gerhard Adam
    In fact, I am logged in to this Website from two different machines in two different physical locations.
    I'm not sure why that is significant.  It's still a single location with you simply being treated as two users.
     A great deal of programming effort has been expended through the years to manage persistence, duplication, uniqueness, and states.
    Not really.  Most control occurs in centralized systems after which agents that copy data must take responsibility for their own synchronization. 
    Mundus vult decipi
    Michael Martinez
    "I'm not sure why that is significant.  It's still a single location with you simply being treated as two users."

    If I am logged into Science20.com from 2 computers there are three nodes in the network consisting of the Science20.com server and the 2 clients.  There is nothing trivial about any of the three nodes.  If you're going to measure the Web you have to decide whether to include the clients (after all, while I am typing this comment into the form the form only exists on my client computer) or the proxies and mirrors.  Even proxies and mirrors have unique states and may even create unique data streams.

    "Most control occurs in centralized systems after which agents that copy data must take responsibility for their own synchronization. "

    It's not the process that is important but the instantiation of the data.  For example, a search engine has to decide whether to show a searcher 1 copy or 1,000 copies of a document; however, 100 search engines may make 100 different choices.  Hence, 100 copies of the document become significant even though only one is authentic.  Those 100 copies may not be the only distinct copies, however.  People interact with Websites through other resources besides search engines.
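
    One crude way a search engine (or anyone measuring the Web) can choose a single representative from many copies is to fingerprint normalized document text; techniques like shingling and SimHash refine the same idea.  A minimal sketch, with hypothetical URLs and documents:

```python
import hashlib

def dedupe(documents):
    """Group documents by a fingerprint of their normalized text and keep
    one representative per group -- a crude stand-in for the near-duplicate
    detection that real search engines perform."""
    seen = {}
    for url, text in documents:
        normalized = " ".join(text.split()).lower()
        fp = hashlib.sha256(normalized.encode()).hexdigest()
        seen.setdefault(fp, url)   # first URL encountered wins
    return list(seen.values())

docs = [
    ("http://origin.example/article", "The Web is a living mechanism."),
    ("http://mirror-1.example/copy",  "The Web is a  living mechanism."),
    ("http://mirror-2.example/copy",  "the web is a living mechanism."),
    ("http://other.example/post",     "A completely different article."),
]
print(dedupe(docs))  # two representatives survive out of four URLs
```

    Note that nothing in this scheme identifies which copy is "authentic"; it only picks one, which is precisely the measurement problem described above.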

    If I set out to count the number of Websites that exist, should I include all the copies of Websites or only the authentic original versions?  If I include copies, should I include only the copies which are visited by people or should I include copies for which domain name space and server resources are allocated?  What if a server publishes the same information under 100 domain names but all the data is stored in a single database?  What if a mirror site receives the majority of the comments or visits?

    These are not trivial questions because any attempt to measure the Web must allocate sufficient resources to count all the components that are definable if not necessarily accessible.  Look at the attempts to estimate the size of the "Dark Web".  They have to make assumptions about many things you assume are trivial.  Those trivialities, however, carry immense weight within the calculations.
    Gerhard Adam
    I guess I'm not clear on what you're trying to measure or why.  Most of these questions don't rise to any philosophical significance, but rather are dependent on what you're trying to measure.

    If the point is to consider the number of servers, then that represents one consideration regarding the existence of copies.  If the point is simply to determine which items should be cached, then copies become irrelevant.  If the purpose is to measure traffic then we have a metric that is more appropriate for servers themselves.  Similarly, multiple domain names simply represent server aliases, so that metric is also attainable.

    So, my point remains ... what are you trying to measure regarding the web and why?  More to the point, why should any particular measurement be meaningful in any way?
    Mundus vult decipi
    Michael Martinez
    I measure different things, based on the project, the need, or the goal.
    A project might be to measure the size of the social Web.  That requires definitions.  Those definitions require that choices and decisions be made.  Those choices and decisions require that opinions be sought out (actively and passively through research) and evaluated.  It eventually leads to "meaning of life stuff".

    A need might be to explain to someone how it is that publishing an article on an American Website can have an impact in Europe.  That requires following traces of references, links, copies, etc.  Finding and tracing those trails requires looking at the depth of indexing any particular search tool allocates to a particular Website.

    A goal might be to understand when an event occurs.  An event might be something like a grassroots uprising in Egypt, a flash mob in Miami, a change in Website coding practices, the adoption of new marketing techniques, changes in search engine algorithms, or the rise of discussion around a new marketing campaign, scientific report, or natural disaster.

    Even a simple question like "How many Tweets are there?" doesn't have a simple answer that is demonstrably correct.  Does a Tweet cease to be a Tweet if it is copied to another Website?  Does it cease to exist if it is deleted from Twitter but still referred to?  Does it become something else if copied into a blog or Web forum?  If you try to measure the impact that an "influencer" has on the social Web, or that a natural disaster has on the Internet community, where do your metrics begin and end, and why do you make those choices?

    I can cite examples but providing the definitions is not so easy.

    In the long-term I am trying to measure how people use the Web.  Yes, that's extremely ambitious and it needs more hands than I can bring to the task, but that is a large part of what I do.

    In the intermediate term I need to understand how processes interact with each other on the Web.  There are patterns of use, patterns of traffic, patterns of publication -- all of which leave footprints in the data.

    People ask "Why does this happen?" and "How does this work?"  I want to understand as best I can before I form and express an opinion.