    Seeking Shape and Cardinality in the 0-Dimensional Web
    By Michael Martinez | August 16th 2011 04:32 AM
    About Michael

    Michael Martinez has a Bachelor of Science degree in Computer Science, an Associate of Science degree in Data Processing Technology, and a few certifications...

    What is the shape of a Website?  How does one determine "shape" from a collection of links and electronic files?  Web designers, search engineers, and marketing consultants use geometric shapes like circles, rectangles, pyramids, and network diagrams to visually depict Website shapes, but such illustrations fall short of representing the nature of Websites.

    If you want to measure the World Wide Web you need to be able to measure a Website.  It is not enough to merely count the number of registered domains, the number of domains actually serving content over HTTP or HTTPS, the number of active servers, the number of IP addresses that return HTTP status codes, etc.  But Websites have neither height nor width nor length.  Hence, the Web also lacks height, width, and length.  So is it possible to measure the shape or determine the cardinality of the Web?

    A single domain may consist of one document or it may comprise millions of documents spread across hundreds of thousands of sub-domains, each residing on its own server, all sharing one IP address.  A Website may be served by one or more sub-domains, may include all the sub-domains of a root domain, or may include content served by multiple domains from multiple servers.

    We can neither diagram the shapes of Websites nor map their sizes.  Despite these limitations we attempt to quantify the Web in rational terms.  In essence we try to linearize the Web and all that it touches.  1, 2, 3 ... we start counting here and keep going until we run out of resources.  And our efforts to document Websites are hindered by their depth, which as noted above cannot be measured.

    Hence, what are we counting?  Is it enough to enumerate Uniform Resource Locators (URLs)?  Should we include all the incidental URLs such as style sheets and JavaScript files, or do we only want to count "pages"?  What is a Web page, for that matter?  Is it a physical HTML file (and can an HTML file be "physical")?  What about pages that are constructed on the fly from components stored in databases, created by CGI scripts, and enhanced by JavaScript, Flash, and other widget technologies?

    To all that add the "structure" of the folder metaphor.  Although originally derived from actual Web server file systems, URL folders (folder/folder/page) can be just as virtual and dynamic as the HTML code, images, and text from which pages are composed.  Still, we speak of Websites having "flat" or "hierarchical" structures.  These structures are determined by how many internal links are embedded on each page, and how many pages must be chained together in order to discover all the URLs associated with a specific Website.

    A perfectly flat Website embeds every link to every page on every page.  All the pages link to each other.  A perfectly vertical or linear Website embeds only one link on a page and the entire chain must be followed from root to farthest leaf node.  A perfectly hierarchical Website, therefore, must lean at a 45-degree angle, right?  That seems to be where the pyramid shape comes from.  But a pyramid is no more relevant to a Website's shape than a Swiss cheese model.
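
    The two extremes can be made concrete with a small sketch.  It assumes a Website can be modeled as a dictionary mapping each page to the pages it links to; the function names (flat_site, linear_site, click_depths, max_depth) are hypothetical and chosen only for illustration.  A breadth-first search from the root counts how many links must be chained together to reach each page.

```python
from collections import deque

def flat_site(n):
    """Hypothetical flat site: every page links to every other page."""
    return {p: [q for q in range(n) if q != p] for p in range(n)}

def linear_site(n):
    """Hypothetical linear site: each page links only to the next one."""
    return {p: ([p + 1] if p + 1 < n else []) for p in range(n)}

def click_depths(links, root=0):
    """Breadth-first search: fewest links to follow from root to each page."""
    depth = {root: 0}
    queue = deque([root])
    while queue:
        page = queue.popleft()
        for target in links[page]:
            if target not in depth:
                depth[target] = depth[page] + 1
                queue.append(target)
    return depth

def max_depth(links, root=0):
    """Length of the longest shortest path from the root."""
    return max(click_depths(links, root).values())
```

    For any number of pages greater than one, the flat model has a maximum depth of 1, while the linear model's maximum depth is one less than the number of pages.  Real Websites fall somewhere in between, which is part of why no single geometric figure describes them.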

    When diagramming Websites we may discover that some sections have more content than others, that some leaf nodes drop directly from the root, and that many folders contain mixes of folders and leaf nodes.  The pages branch out like family trees in a genealogical table.  We even speak of parent-child page relationships and siblings.  Pages can be assigned neighbors or cousins, while Websites are grouped together in neighborhoods, micronets, and classes.

    As the Web grows the challenge of counting Websites, Web pages, and Web components becomes increasingly complex and unachievable.  By some estimates Google uses over 1 million servers across all its data centers; by its own admission Google has documented at least 1 trillion URLs.  And yet if we want to study Google's knowledge of the Web we are limited to 1,000 listings for any given query.  The estimates it publishes of the number of "hits" for any keyword are unreliable.

    Not only can Google NOT count all the Websites and pages, it cannot reach them all.  Billions of pages are changed, deleted, blocked, or simply not linked to within the scope of the Google-indexed Web.  The same limitations apply to all other major search engines.  No one has successfully crawled the entire Web in more than 10 years, if anyone ever has.

    There is no first Website and no final Website.  There is no beginning to the Web and no end.  Without such parameters, how do we estimate its size?  Is there a true foundation for cardinality if there is no ordinality in the set of things that comprise the Web?  Even if we imagine the Web as a circle of connected objects (and such a mathematical construct cannot be proven), what is the radius of the circle?  What is its circumference?  And can we define cardinality by either the number of sites or documents contained on the Web?

    When measuring the Web through a crawl the accepted convention is to simply choose a "seed set" and start crawling from there, arbitrarily assigning document IDs.  But unsolvable problems quickly arise from such a process.  The best-known such problem is the Calendar Paradox, which presupposes that an infinite number of crawlable pages may exist, leading search engines to NOT crawl the pages.
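
    A minimal sketch of the seed-set convention follows.  It runs against a made-up, in-memory "Web" (the TOY_WEB dictionary and its example URLs are assumptions for illustration, not real sites) rather than fetching anything over the network, and it is not a description of how any particular search engine crawls.  Document IDs are assigned in discovery order.

```python
from collections import deque

def crawl(web, seeds):
    """Crawl a link graph from a seed set, numbering pages as they are found."""
    doc_ids = {}
    queue = deque(seeds)
    while queue:
        url = queue.popleft()
        if url in doc_ids or url not in web:
            continue  # already numbered, or not a page we can fetch
        doc_ids[url] = len(doc_ids)  # arbitrary ID: order of discovery
        queue.extend(web[url])
    return doc_ids

# Hypothetical four-page Web; each key lists the URLs that page links to.
TOY_WEB = {
    "a.example/": ["a.example/about", "b.example/"],
    "a.example/about": [],
    "b.example/": ["a.example/"],
    "c.example/": ["a.example/"],  # links out, but nothing links to it
}

from_a = crawl(TOY_WEB, ["a.example/"])
from_b = crawl(TOY_WEB, ["b.example/"])
from_c = crawl(TOY_WEB, ["c.example/"])
```

    Seeding from a.example/ or b.example/ finds three pages and never discovers c.example/, while seeding from c.example/ finds all four.  The same page also receives a different document ID depending on the seed.  Both the count and the numbering are artifacts of where the crawl started.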

    A Web calendar application generates pages for as many dates as a user can imagine.  Unless the software arbitrarily refuses to go any further, a calendar application will march into the unfathomable future through link after link.  Search engines may avoid crawling future dates within calendar tools and only crawl expired dates to which non-calendar content links.  However, calendar applications are mostly used to document future events and activities.
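
    The trap can be sketched in a few lines.  The URL pattern /calendar/YYYY-MM-DD and the functions calendar_page and crawl_calendar are hypothetical; the sketch only assumes that every calendar page carries a link to the following day and that the crawler protects itself with an arbitrary page budget.

```python
from datetime import date, timedelta

def calendar_page(day):
    """Hypothetical calendar page: returns the URL of its 'next day' link."""
    nxt = day + timedelta(days=1)
    return f"/calendar/{nxt.isoformat()}"

def crawl_calendar(start, max_pages):
    """Follow 'next day' links until an arbitrary page budget runs out."""
    visited = []
    day = start
    while len(visited) < max_pages:
        visited.append(f"/calendar/{day.isoformat()}")
        next_url = calendar_page(day)
        # Parse the date back out of the link, as a crawler would from a URL.
        day = date.fromisoformat(next_url.rsplit("/", 1)[1])
    return visited

pages = crawl_calendar(date(2011, 8, 16), max_pages=5)
```

    Nothing in the calendar itself stops the crawl; only the max_pages budget does.  Whatever value is chosen for that budget is arbitrary, and every page beyond it goes uncounted.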

    Hence, search engines deliberately avoid crawling and documenting the information that is of most value to searchers while crawling and documenting information that has lost nearly all its significance.  One calendar can extend to more dates than there are pages on the Web, and yet millions of Websites provide calendar tools.

    We must therefore ask if a Web page exists simply because it could be served, should someone follow enough links in a chain to reach the page, or if it only exists when someone visits the page.  But if the page only exists when it is served to a visitor, then what is a visitor?  When pages are copied into proxy caches, are new pages created and, if so, do the Websites from which the pages are made take on another dimension?

    A cached image may only link to the "live" site and the live site rarely links to the cached image, but search engines provide links to cached images as well as to live sites.  The agreement or congruency between a live site and its many cache images may vary considerably.

    So then what are we counting, and do all page and Website enumerations carry equal weight?  These questions, among others, of course assume that the Web simply "is what it is" but there is one dimension to the Web that we share in our physical world: time.

    The impact of time on Web measurement is immense, critical, and inescapable.  But that discussion will have to wait.
