Web Crawling 1.0

Posted in: Technical Track

In Stuffing Six Million Pages Down Google’s Throat, Tim O’Reilly brings up a point, and some questions:

. . . just how poorly the big search engines index small sites with large collections of data . . .

But it’s worth thinking about absolute (and temporary) limits to the growth of Web 2.0. What constraints do we take for granted? What constraints are invisible to us?

For me, “Web 2.0” is a paradigm where users provide the bulk of the content. The “2.0” moniker, while overused, is an accurate description of the “new version! completely revamped!” mentality. I think the biggest constraint we take for granted is the fact that we rely mostly on centralized places like Google and Yahoo to provide web crawling.

Part of me is surprised the paradigm was able to make such a big shift. Part of me is surprised at where the paradigm hasn’t shifted. We’re still using “web crawling 1.0”.

Web crawling refers to the process, whereas “stuffing [a search engine] with data” refers to the outcome. In Web 1.0, the outcome of “having accurate and frequently updated/new content” was accomplished by the process of “companies uploading written articles”. In Web 2.0, the same outcome is accomplished by the process of “people providing their own content.” Web 1.0 leans toward the accurate side, whereas Web 2.0 leans toward the frequently updated/new side.

So to me, the question is more, “How can we get decentralized uploading of content, to lessen the need for web crawling?” To expand my mind to the possibilities, I asked myself certain questions.

Question #1: “If a company can assume trustworthiness of uploaded data, in what ways can decentralization occur?” Web browser software could upload data based on cached internet content and the browser’s history. Web server software could upload data based on access logs and cached page output. Webmaster tools run on a web server could provide services to the webmaster with the option to upload results; for example, a tool that spiders a site checking for broken links and uploads information on page links, or a tool that monitors a web site for changes, alerts the webmaster when changes imply brokenness, and uploads data on what changed and when.
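The broken-link-checking webmaster tool is straightforward to sketch. This is a minimal illustration, not a real product: the status-fetching function is injected so the checker could run against a live site or a test stub, and any actual “upload results” step is left out as hypothetical.

```python
# Sketch of the "webmaster tool" idea from Question #1: spider a page
# for links and report each link's HTTP status, producing results a
# webmaster could choose to upload. Illustrative only.
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

def check_links(page_url, html, fetch_status):
    """Return {absolute_link: status_code} for every link on the page.
    fetch_status is a callable so a test stub can stand in for the network."""
    parser = LinkExtractor(page_url)
    parser.feed(html)
    return {link: fetch_status(link) for link in parser.links}

# Usage with a stubbed fetcher instead of real HTTP requests:
html = '<a href="/ok">good</a> <a href="http://example.com/gone">bad</a>'
statuses = {"http://example.com/ok": 200}
report = check_links("http://example.com/", html,
                     lambda url: statuses.get(url, 404))
```

Injecting the fetcher also mirrors the trust question: the same report could be produced honestly or fabricated, which is exactly why Question #4 matters.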

Question #2: “Why would I want to upload this information?” Folks use places like www.flickr.com and www.facebook.com to be able to share their content in a free and easy manner, and to be able to look at the content generated by others in a free and easy manner. Free e-mail services store mail on a centralized server, check for viruses and spam, and offer high availability.

Currently, most people want to be able to look at the content generated by others in a free and easy manner (search results!). Most web owners/managers want to share their data, because of the benefits they receive. Very few individuals, however, desire to share their own content.

Question #3: “Why wouldn’t I want to upload the data?”

a) I want to keep the data hidden/unshared for its own sake. I feel I have nothing to gain if I share this data.

I do not use a site such as CookShare to share my recipes. Why not? I have nothing to lose, as I’m not selling cookbooks of my recipe collection, and the recipes are my own, so posting them raises no legality issues. I simply have no reason to use it, whereas I do have reason to share my photos.

b) I want to keep the data hidden/unshared for privacy/security issues. I feel like I have something big to lose, or at least more to lose than to gain, if I share this data.

I use www.sheeri.com to host my photos, instead of Flickr, because to me loss of control over my photos is a big detractor. However, I use Gmail, Google’s free e-mail offering, because to me the high availability and good interface are worth the loss of control. Losing a year’s worth of e-mails is a huge disaster compared with losing a year’s worth of photos. Having people not be able to see my photos because my photo site is down is an annoyance; people not being able to contact me because my e-mail server is down is a very large risk.

The biggest hurdle to overcome lies in this question, I think. I don’t think it would be difficult to implement the software I came up with in my answer to Question #1, and website owners/managers have lots of motivation, such as the answers to Question #2. Regular end-users have less motivation in that regard; some due to apathy (as in “a”) and others due to possible loss (as in “b”).

Things that might motivate end users to upload data:
1) Make uploading a requirement for accessing the shared data of others. A search engine that is better than all the rest, but requires uploading data in order to use it, would motivate many in the apathy camp. This is similar to www.paperbackswap.com, where before I can request that someone send me a book, I have to send a few of my own books first. I gain the right to request one book for each book I send.

2) Reduce the risk or amount of loss. Secure protocols such as HTTPS have made more people feel secure about purchases via the web, because there has been an actual reduction in the risk of a credit card number being transmitted in plain text on the route from the desktop to the ordering server.

3) Increase the benefit or amount of gain — offer a convenient service in exchange for the data. Free blogging sites offer posting rights in exchange for my e-mail address.

This is different from the first method of motivation: the first method exchanges similar data (“I’ll show you mine if you show me yours [first]”), whereas this one adds more benefit to offset the risk.

Question #4: “How can the uploaded data be verified?” Question #1 assumed trustworthiness, but that was just to clear my mind. Many sites use social engineering to verify data — I can post information to the MySQL Forums, and that information is verified by what others think and say about it. The problem, of course, is that social engineering may “verify” wrong information if nobody posts a rebuttal, and may mark correct information as “wrong” because someone refutes it.

“Supply and demand” is a social engineering framework. The “worth” or “value” of an object such as a sweater, car, or painting is an abstract idea meaning “what someone is willing to pay”, and verification occurs when someone buys the object, signifying that they are willing to pay that amount. This may seem out of place, but I wanted to remind the reader that verification of data does not need to be automatic, and some of the best verification systems are ones that constantly re-verify based on feedback.
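Constant re-verification based on feedback can be sketched as a running score that is recomputed with every new vote, so “verified” is always provisional, like a market price. This is a toy model, not any site’s actual algorithm.

```python
# Toy model of feedback-based verification: a claim's confidence is
# recomputed from all feedback received so far, so it can be
# "un-verified" later by new rebuttals. Illustrative only.
class Claim:
    def __init__(self):
        self.upvotes = 0
        self.downvotes = 0

    def feedback(self, agrees):
        """Record one piece of feedback for or against the claim."""
        if agrees:
            self.upvotes += 1
        else:
            self.downvotes += 1

    def confidence(self):
        """Fraction of feedback that agrees; 0.5 when nobody has voted,
        mirroring the problem that silence 'verifies' nothing."""
        total = self.upvotes + self.downvotes
        return 0.5 if total == 0 else self.upvotes / total
```

Note that the no-feedback case is deliberately ambiguous rather than “verified” — exactly the failure mode described above, made explicit.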

That all being said, verification of web statistics and cached content can be done by aggregating results. Cached content from one user can be compared to another’s, and differing results can be investigated further. Weblogs claiming that Planet MySQL links to www.sheeri.com, and that there were many visits referred by that link, can be verified by cached content (verifying the link exists) and browser history (verifying that a user visited sheeri.com after visiting Planet MySQL). This isn’t 100% airtight verification, but I believe it is “good enough” for search engine companies to trust it for indexing, ranking and content caching.
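Aggregating cached content across users can be sketched as hashing each uploaded copy and trusting the majority digest. This is a minimal sketch under simple assumptions — a real system would weight reporters by their track record rather than counting them equally.

```python
# Sketch of cross-checking cached page content uploaded by many users:
# hash each copy and take the digest most users agree on. Illustrative.
import hashlib
from collections import Counter

def consensus_digest(cached_copies):
    """cached_copies: list of page bodies (bytes) from different users.
    Returns (majority_sha256_hex, fraction_of_users_agreeing)."""
    digests = [hashlib.sha256(body).hexdigest() for body in cached_copies]
    digest, count = Counter(digests).most_common(1)[0]
    return digest, count / len(digests)

# Three honest copies outvote one tampered one:
copies = [b"<html>real</html>"] * 3 + [b"<html>tampered</html>"]
digest, agreement = consensus_digest(copies)
```

The agreement fraction doubles as a confidence signal: a search engine could index only content above some threshold and flag the rest for re-crawling.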



4 Comments

Yep, reduce loss and increase gain. But with decentralized “web crawling”, who would provide the innovations needed for this reduced loss and increased gain?



Isn’t the best place for this data collection “in the cloud”? If ISPs (or network carriers) could catch the referer / referee for each link traversed, this information could be sold back to the search engines.

The web crawler looks for possible links between sites – and evaluates the popularity of sites based on essentially static information. This information can be enhanced by collecting search results and analysing click-through at the search engine.

A decentralised cloud could actually capture the real inter/intra-site links that are followed; these may be programmatically constructed (so not amenable to static analysis). Pushing the decentralisation right out to the individual web site is probably too far – as you say, it raises issues of trust and competence. Letting the ISP or network carrier take the strain (in many countries they are obliged to capture a lot of this traffic anyway for security purposes) might actually work; it would provide economies of scale and reduce traffic at the edges.

As always the main problems are setting standards, and ensuring fair access to new entrants into the search market. The data has to be suitably anonymised.

And the individual site owner can kick the whole thing off simply by looking at her own site…


Nigel — The problem is that the ISPs could sell the information, as you say. They’d sell it back to the search engines, but would they also sell it to the individual web host owners? What about selling to governments?

Since most web site owners want to be able to characterize the flow of a user on their website, they will want the results anyway. And web site owners would trust their own results a lot more than a search engine’s.

The problem is that while there are individuals that aren’t trustworthy, there are a lot more corporations that aren’t, percentage-wise. Which of the search engines do you trust not to fudge the numbers for their own gain?

Much as there are incorrect articles on Wikipedia but the articles are overall high quality and unbiased, so it would be with the linking information. Most people out there are trustworthy.

And if the individual site owner is going to get the information anyway, why duplicate the work?


Dara — who currently provides innovations for things like WordPress, Blogger, Joomla, and Drupal? The software would be innovated by the entities that produce it: everything from free programs like AWStats to for-pay programs like Summary. Similar to how we get web stats now!

And each search engine would have to provide its own innovations to *use* the data given. If a program updates to add a feature, the search engine will have to update to use that feature.

But that’s not much different from HTML and JavaScript and browsers — HTML innovations are provided and browsers have to update to use those features.

