Web Crawling 1.0
Feb 6, 2008 / By Sheeri Cabral
In Stuffing Six Million Pages Down Google’s Throat, Tim O’Reilly brings up a point, and some questions:
. . . just how poorly the big search engines index small sites with large collections of data . . .
But it’s worth thinking about absolute (and temporary) limits to the growth of Web 2.0. What constraints do we take for granted? What constraints are invisible to us?
For me, “Web 2.0” is a paradigm where users provide the bulk of the content. The “2.0” moniker, while overused, is an accurate description of the “new version! completely revamped!” mentality. I think the biggest constraint we take for granted is the fact that we rely mostly on centralized services like Google and Yahoo to provide web crawling.
Part of me is surprised the paradigm was able to make such a big shift. Part of me is surprised at where the paradigm hasn’t shifted. We’re still using “web crawling 1.0”.
Web crawling refers to the process, whereas “stuffing [a search engine] with data” refers to the outcome. In Web 1.0, the outcome of “having accurate and frequently updated/new content” was accomplished by the process of “companies uploading written articles.” In Web 2.0, the same outcome is accomplished by the process of “people providing their own content.” Web 1.0 leans toward the accurate side, whereas Web 2.0 leans toward the frequently updated/new side.
So to me, the question is more, “How can we get decentralized uploading of content, to lessen the need for web crawling?” To expand my mind to the possibilities, I asked myself certain questions.
Question #1: “If a company can assume trustworthiness of uploaded data, in what ways can decentralization occur?” Web browser software could upload data based on cached internet content and a browser’s history. Web server software could upload data based on access logs and cached page output. Webmaster tools run on a web server could provide services to the webmaster with the option to upload the results: for example, a tool that spiders a site checking for broken links could upload information on page links, or a tool that monitors a web site for changes could alert the webmaster when changes imply brokenness, uploading data on what changed and when.
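As a rough illustration of the broken-link spider idea, here is a minimal Python sketch of the first step such a tool would take: extracting and resolving all the links on a page. All names here are hypothetical; a real tool would also fetch each extracted link over the network and record which ones return errors before uploading the results anywhere.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects the href target of every anchor tag it sees,
    resolved against the base URL of the page being parsed."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Relative links are resolved against the page URL,
                    # so "b.html" on http://example.com/a/ becomes
                    # http://example.com/a/b.html
                    self.links.append(urljoin(self.base_url, value))

def extract_links(base_url, html):
    """Return the absolute URLs of all links found in `html`."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links
```

A spider built on this would fetch each returned URL, note any 404s or timeouts, and (in the scheme sketched above) offer to upload the link map alongside the brokenness report.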
Question #2: “Why would I want to upload this information?” Folks use places like www.flickr.com and www.facebook.com to be able to share their content in a free and easy manner, and to be able to look at the content generated by others in a free and easy manner. Free e-mail services store mail on a centralized server, check for viruses and spam, and offer high availability.
Currently, most people want to be able to look at the content generated by others in a free and easy manner (search results!). Most web owners/managers want to share their data, because of the benefits they receive. Very few individuals, however, desire to share their own content.
Question #3: “Why wouldn’t I want to upload the data?”
a) I want to keep the data hidden/unshared for its own sake. I feel I have nothing to gain if I share this data.
I do not use a site such as CookShare to share my recipes. Why not? I have nothing to lose, as I’m not selling cookbooks of my recipe collection, and the recipes are ones I “own”, so it’s not a matter of the legality of posting the content. I just have no reason to use it, whereas I do have a reason to share my photos.
b) I want to keep the data hidden/unshared for privacy/security issues. I feel like I have something big to lose, or at least more to lose than to gain, if I share this data.
I use www.sheeri.com to host my photos, instead of Flickr, because to me loss of control over my photos is a big detractor. However, I use Gmail, Google’s free e-mail offering, because to me the high availability and good interface are worth the loss of control. Losing a year’s worth of e-mails is a huge disaster compared with losing a year’s worth of photos. Having people unable to see my photos because my photo site is down is an annoyance; people being unable to contact me because my e-mail server is down is a very large risk.
The biggest hurdle to overcome lies in this question, I think. I don’t think it would be difficult to implement the software I came up with in my answer to Question #1, and website owners/managers have lots of motivation, such as the answers to Question #2. Regular end-users have less motivation in that regard; some due to apathy (as in “a”) and others due to possible loss (as in “b”).
Things that might motivate end users to upload data:
1) As a requirement for being able to access the shared data of others. A search engine that is better than all the rest, but requires uploading data for usage, would motivate many from the apathy camp. This is similar to www.paperbackswap.com, where before I can request that someone send me a book, I have to send a few of my own books first. I gain the right to request 1 book for each book I send.
2) Reduce the risk or amount of loss. Secure protocols such as HTTPS have made more people feel secure about purchases via the web, because there has been an actual reduction in the risk of a credit card number being transmitted in plain text on the route from the desktop to the ordering server.
3) Increase the benefit or amount of gain — offer a convenient service in exchange for the data. Free blogging sites offer posting rights in exchange for my e-mail address.
This is different from the first method of motivation: the first method exchanges similar data (“I’ll show you mine if you show me yours [first]”), whereas this one adds more benefit to offset the risk.
Question #4: “How can the uploaded data be verified?” Question #1 assumed trustworthiness, but that was just to clear my mind. Many sites use social engineering to verify data — I can post information to the MySQL Forums, and that information is verified by what others think and say about that information. The problem is, of course, that social engineering may “verify” wrong information if nobody posts a refutation, and mark correct information “wrong” because someone refutes it.
“Supply and demand” is a social engineering framework. The “worth” or “value” of an object such as a sweater, car, or painting is an abstract idea meaning “what someone is willing to pay”, and verification occurs when someone buys the object, signifying that they are willing to pay that amount. This may seem out of place; however, I wanted to remind the reader that verification of data does not need to be automatic, and some of the best verification systems are ones that constantly re-verify based on feedback.
That all being said, verification of web statistics and cached content can be done by aggregating results. Cached content from one user can be compared to another’s, and differing results can be compared further. Web server logs claiming that Planet MySQL links to www.sheeri.com, and that many visits were referred by that link, can be verified by cached content (verifying that the link exists) and browser history (verifying that a user visited sheeri.com after visiting Planet MySQL). This isn’t 100% airtight verification, but I believe it is “good enough” for search engine companies to trust it for indexing, ranking and content caching.
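One way to sketch the “compare cached content across users” idea in code: hash each user’s cached copy of a page and treat the majority hash as the verified version, with the fraction of agreeing caches as a crude confidence score. This is purely a hypothetical illustration of the aggregation approach, not a description of how any real search engine verifies uploads.

```python
import hashlib
from collections import Counter

def verify_cached_content(reports):
    """reports: list of (user_id, cached_html) pairs for a single URL.

    Returns (majority_hash, agreement_fraction). The content whose
    hash the most independent caches agree on is taken as "verified";
    the fraction tells us how strong that agreement is."""
    hashes = [hashlib.sha256(html.encode("utf-8")).hexdigest()
              for _, html in reports]
    counts = Counter(hashes)
    top_hash, top_count = counts.most_common(1)[0]
    return top_hash, top_count / len(hashes)
```

A consumer of this could, for instance, only trust a cached page for indexing when the agreement fraction passes some threshold, and constantly re-verify as new uploads arrive — which is the “re-verify based on feedback” property noted above.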