<?xml version="1.0" encoding="UTF-8"?><!-- generator="wordpress/2.3.2" -->
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	>
<channel>
	<title>Comments on: Web Crawling 1.0</title>
	<link>http://www.pythian.com/blogs/797/web-crawling-10</link>
	<description>News and views from Pythian DBAs</description>
	<pubDate>Sat, 22 Nov 2008 05:31:28 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.3.2</generator>
		<item>
		<title>By: Sheeri Cabral</title>
		<link>http://www.pythian.com/blogs/797/web-crawling-10#comment-163170</link>
		<dc:creator>Sheeri Cabral</dc:creator>
		<pubDate>Mon, 18 Feb 2008 13:16:48 +0000</pubDate>
		<guid>http://www.pythian.com/blogs/797/web-crawling-10#comment-163170</guid>
		<description>Dara -- who currently provides innovations for things like WordPress, Blogger, Joomla, and Drupal?  The software would be improved by the entities that produce it: everything from free programs like AWStats to for-pay programs like Summary.  That is similar to how we get web stats now!

And each search engine would have to provide its own innovations to *use* the data given.  If a program updates to add a feature, the search engine will have to update to use that feature.

But that's not much different from HTML and JavaScript and browsers -- HTML innovations are provided and browsers have to update to use those features.</description>
		<content:encoded><![CDATA[<p>Dara &#8212; who currently provides innovations for things like WordPress, Blogger, Joomla, and Drupal?  The software would be improved by the entities that produce it: everything from free programs like AWStats to for-pay programs like Summary.  That is similar to how we get web stats now!</p>
<p>And each search engine would have to provide its own innovations to *use* the data given.  If a program updates to add a feature, the search engine will have to update to use that feature.</p>
<p>But that&#8217;s not much different from HTML and JavaScript and browsers &#8212; HTML innovations are provided and browsers have to update to use those features.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Sheeri Cabral</title>
		<link>http://www.pythian.com/blogs/797/web-crawling-10#comment-163169</link>
		<dc:creator>Sheeri Cabral</dc:creator>
		<pubDate>Mon, 18 Feb 2008 13:14:10 +0000</pubDate>
		<guid>http://www.pythian.com/blogs/797/web-crawling-10#comment-163169</guid>
		<description>Nigel -- The problem is that the ISPs could sell the information, as you say.  They'd sell it back to the search engines, but would they also sell it to the individual web host owners?  What about selling to governments?

Since most web site owners want to be able to characterize the flow of a user on their website, they will want the results anyway.  And web site owners would trust their own results a lot more than a search engine's.  

The problem is that while some individuals aren't trustworthy, a larger percentage of corporations aren't.  Which of the search engines do you trust not to fudge the numbers for their own gain?

Wikipedia has some incorrect articles, but overall its articles are high quality and unbiased; the linking information would be the same.  Most people out there are trustworthy.

And if the individual site owner is going to get the information anyway, why duplicate the work?</description>
		<content:encoded><![CDATA[<p>Nigel &#8212; The problem is that the ISPs could sell the information, as you say.  They&#8217;d sell it back to the search engines, but would they also sell it to the individual web host owners?  What about selling to governments?</p>
<p>Since most web site owners want to be able to characterize the flow of a user on their website, they will want the results anyway.  And web site owners would trust their own results a lot more than a search engine&#8217;s.  </p>
<p>The problem is that while some individuals aren&#8217;t trustworthy, a larger percentage of corporations aren&#8217;t.  Which of the search engines do you trust not to fudge the numbers for their own gain?</p>
<p>Wikipedia has some incorrect articles, but overall its articles are high quality and unbiased; the linking information would be the same.  Most people out there are trustworthy.</p>
<p>And if the individual site owner is going to get the information anyway, why duplicate the work?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Nigel Thomas</title>
		<link>http://www.pythian.com/blogs/797/web-crawling-10#comment-162804</link>
		<dc:creator>Nigel Thomas</dc:creator>
		<pubDate>Sun, 17 Feb 2008 12:56:57 +0000</pubDate>
		<guid>http://www.pythian.com/blogs/797/web-crawling-10#comment-162804</guid>
		<description>Sheeri

Isn't the best place for this data collection "in the cloud"? If ISPs (or network carriers) could catch the referer / referee for each link traversed, this information could be sold back to the search engines.

The web crawler looks for possible links between sites and evaluates the popularity of sites based on essentially static information. This information can be enhanced by collecting search results and analysing click-through at the search engine.

A decentralised cloud could actually capture the real inter/intra-site links that are followed; these may be programmatically constructed (so not amenable to static analysis). Pushing the decentralisation right out to the individual web site is probably too far - as you say, it raises issues of trust and competence. Letting the ISP or network carrier take the strain (in many countries they are obliged to capture a lot of this traffic anyway for security purposes) might actually work; it would provide economies of scale and reduce traffic at the edges. 

As always the main problems are setting standards, and ensuring fair access to new entrants into the search market. The data has to be suitably anonymised. 

And the individual site owner can kick the whole thing off simply by looking at her own site...</description>
		<content:encoded><![CDATA[<p>Sheeri</p>
<p>Isn&#8217;t the best place for this data collection &#8220;in the cloud&#8221;? If ISPs (or network carriers) could catch the referer / referee for each link traversed, this information could be sold back to the search engines.</p>
<p>The web crawler looks for possible links between sites and evaluates the popularity of sites based on essentially static information. This information can be enhanced by collecting search results and analysing click-through at the search engine.</p>
<p>A decentralised cloud could actually capture the real inter/intra-site links that are followed; these may be programmatically constructed (so not amenable to static analysis). Pushing the decentralisation right out to the individual web site is probably too far - as you say, it raises issues of trust and competence. Letting the ISP or network carrier take the strain (in many countries they are obliged to capture a lot of this traffic anyway for security purposes) might actually work; it would provide economies of scale and reduce traffic at the edges. </p>
<p>As always the main problems are setting standards, and ensuring fair access to new entrants into the search market. The data has to be suitably anonymised. </p>
<p>And the individual site owner can kick the whole thing off simply by looking at her own site&#8230;</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Dara</title>
		<link>http://www.pythian.com/blogs/797/web-crawling-10#comment-158466</link>
		<dc:creator>Dara</dc:creator>
		<pubDate>Wed, 06 Feb 2008 23:17:03 +0000</pubDate>
		<guid>http://www.pythian.com/blogs/797/web-crawling-10#comment-158466</guid>
		<description>Yep, reduce loss and increase gain. But with decentralized "web crawling", who would provide the innovations needed for this reduced loss and increased gain?</description>
		<content:encoded><![CDATA[<p>Yep, reduce loss and increase gain. But with decentralized &#8220;web crawling&#8221;, who would provide the innovations needed for this reduced loss and increased gain?</p>
]]></content:encoded>
	</item>
</channel>
</rss>
