If you want to make search sound bad, call it scraping. If you want to make it sound good, call it crawling.
The interesting nuance has become even more pronounced during the furor over Craigslist not allowing Oodle to crawl its listings.
In the comments of a Silicon Beat post, Dave McClure of Simply Hired articulates a great definition of each, which I agree with unequivocally. But people are using the two terms in a tabloid-news way to further a point of view.
Tom Foremski, a former Financial Times journalist, says that crawlers add little value and take up lots of resources. After analyzing his log files he concludes that “The search-and-scrapers sucked out one-third of my bandwidth and provided just 3.7 percent of the traffic!”
Tom makes the horrible mistake of thinking that one page of crawling equals one page of viewing. Firstly, and most obviously, a human referred by the ‘search-and-scrape’ sites is likely to look at more than one page.
Is it more than 10 pages? That is also a futile argument. The cost to serve a page is minuscule compared to the monetization rate of each page. Even with the crappiest of techniques people should be able to get a $2 CPM, which equates to 0.2 cents per page. It does not cost anywhere near 0.2 cents to serve a page. Bandwidth costs are also declining at a Moore’s-law-like rate.
So even with a 3.7% referral volume and one third bandwidth, Tom still ends up on top by a mile. The fact that he only has 3.7% referral volume from the search engines should also worry Tom, as that is markedly below what you would normally expect.
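To make that arithmetic concrete, here is a rough back-of-the-envelope sketch in Python. The one-third bandwidth share, the 3.7% referral share and the $2 CPM come from the figures above; the function names are made up for illustration, and the sketch assumes that bandwidth is proportional to pages served and that each referred visitor views only a single page. It does not assume any particular serving cost. Instead it computes the break-even cost per crawled page, which you can compare with whatever you actually pay.

```python
def revenue_per_page_cents(cpm_dollars: float) -> float:
    """CPM is dollars per 1,000 page views; convert to cents per page view."""
    return cpm_dollars * 100.0 / 1000.0


def break_even_cost_cents(crawler_bandwidth_share: float,
                          referral_share: float,
                          cpm_dollars: float) -> float:
    """Serving cost per crawled page (in cents) at which crawlers pay for themselves.

    Assumptions (not from the original post): bandwidth is proportional to
    pages served, and a referred visitor views exactly one page.
    """
    # Crawled pages served for every human page view.
    crawler_pages_per_human_view = (
        crawler_bandwidth_share / (1.0 - crawler_bandwidth_share)
    )
    # Ad revenue, per human page view, attributable to crawler referrals.
    referred_revenue_per_human_view = (
        referral_share * revenue_per_page_cents(cpm_dollars)
    )
    return referred_revenue_per_human_view / crawler_pages_per_human_view


if __name__ == "__main__":
    cents = break_even_cost_cents(1.0 / 3.0, 0.037, 2.0)
    print(f"Break-even serving cost: {cents:.4f} cents per crawled page")
```

Under these assumptions the break-even point is about 0.0148 cents per crawled page, or roughly 15 cents per thousand pages. Any serving cost below that leaves Tom ahead, and the figure is conservative because referred visitors usually view more than one page.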
There is also the delicious irony that Tom searches and scrapes his own site. Everyone who has an RSS feed does. What a feed essentially does is create a set of summaries that link back to the main site. Tom even includes the full text in his RSS feed. Don’t get me wrong, I read and enjoy his musings every day.
RSS has the potential to radically change the crawling world. Crawlers can essentially subscribe to feeds (Oodle should do this, since it is not allowed to crawl Craigslist). A better pinging and notification mechanism is needed to cut down on the bandwidth, so that crawlers don’t have to dumbly guess when to come back, but I have faith that will be solved.
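Until such a notification mechanism exists, a polite crawler can at least avoid re-downloading a feed that has not changed by using HTTP conditional requests. The sketch below is one way to do that with Python's standard library; it is not the pinging mechanism hoped for above. The function names are hypothetical, and it assumes the feed's server sends `ETag` or `Last-Modified` headers and honours them on later requests.

```python
import urllib.error
import urllib.request


def conditional_headers(etag=None, last_modified=None):
    """Build request headers that ask the server to reply only if the feed changed."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def fetch_if_changed(url, etag=None, last_modified=None):
    """Fetch a feed, returning (body, etag, last_modified).

    body is None when the server answers 304 Not Modified, in which case
    almost no bandwidth was spent and the old validators are kept.
    """
    request = urllib.request.Request(
        url, headers=conditional_headers(etag, last_modified)
    )
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            body = response.read()
            return (
                body,
                response.headers.get("ETag"),
                response.headers.get("Last-Modified"),
            )
    except urllib.error.HTTPError as err:
        if err.code == 304:  # urllib reports 304 Not Modified as an HTTPError
            return None, etag, last_modified
        raise
```

A crawler would store the returned `etag` and `last_modified` values alongside each feed URL and pass them back on the next visit. It still has to guess when to come back, but a wrong guess then costs a few hundred bytes instead of a full download.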
In the meantime, ignorant arguments over scraping and stealing will continue to be made, however much in jest they may originally have been intended.
