Hey folks, figured I'd drop in and talk about what I've been working on today. I'm working on a separate project that requires extracting records from about 2 million pages on a website with no published API. After agonizing over how I was going to accomplish this using bash or Python, I settled on the Scrapy framework: http://scrapy.org/
Some of the unique challenges I had involved bypassing anti-spidering mechanisms built into the site (I checked the TOS and it did not forbid automated harvesting of site data) and filtering out a ton of cruft I did not need. I managed to reduce my requests by about 500,000 by building the spider using Scrapy. With a 200 ms delay between requests, that saved me over a day's worth of requests.
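For anyone who wants to check the math on that saving, here is a back-of-the-envelope version. The function name is made up for this post, and it only counts the delay between requests, not the download time itself.

```python
def hours_saved(requests_avoided, delay_seconds):
    """Crawl time saved by skipping requests, counting only the per-request delay."""
    return requests_avoided * delay_seconds / 3600.0

# Figures from this post: about 500,000 requests avoided at 200 ms each.
saved = hours_saved(500_000, 0.2)
print(f"{saved:.1f} hours")  # a little over one day
```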
I'm not going to post all my code here, but I wanted to highlight the things I did to overcome the site's defense mechanisms.
(these things go into the x_spider.py file)
DOWNLOAD_DELAY = 0.2 # Added to address rate limiting controls. The value is in seconds and can take decimal entries, so this is 200 ms. I'd recommend reducing this or commenting it out entirely if possible (by default Scrapy randomizes the actual wait to between 0.5 and 1.5 times this value)
COOKIES_ENABLED = False # Added to defeat mechanisms which were detecting the cookie I presented and fingerprinting me based on that. Default value is True
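To make the randomizing part concrete: Scrapy's RANDOMIZE_DOWNLOAD_DELAY setting (on by default) spreads each wait between 0.5 and 1.5 times the configured delay. The little function below is only my own illustration of that range using the standard library, not Scrapy's actual code.

```python
import random

def randomized_delay(download_delay):
    """Pick one wait time between 0.5x and 1.5x the configured delay."""
    return random.uniform(0.5 * download_delay, 1.5 * download_delay)

# With a delay of 0.2 seconds (200 ms), each wait lands between 0.1 s and 0.3 s.
waits = [randomized_delay(0.2) for _ in range(5)]
print([round(w, 3) for w in waits])
```

So a 200 ms setting does not mean a steady 200 ms beat, which is the point: a fixed interval is easier for a site to spot.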
(this went into the settings.py file)
USER_AGENT = 'Googlebot/2.1'
The default user agent is the project name with version 1.0. If you need to, you can also randomize the user agent with code like this:
http://snippets.scrapy.org/snippets/27/
Another gotcha I ran into: when using CrawlSpider rules, if you have multiple rules you need to put the deny entries in each rule.
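To show why the deny entries have to be repeated, here is a toy model in plain Python, no Scrapy needed. It assumes each rule checks a link against its own allow and deny patterns only, which is consistent with the gotcha above. The URL patterns, example URL, and function names are all hypothetical.

```python
import re

# Hypothetical patterns; swap in whatever your site actually uses.
DENY = (r"/print/", r"/login")

def rule_accepts(url, allow, deny):
    """One rule: deny wins, otherwise the URL must match an allow pattern."""
    if any(re.search(pattern, url) for pattern in deny):
        return False
    return any(re.search(pattern, url) for pattern in allow)

def spider_follows(url, rules):
    """A link is followed if any single rule accepts it."""
    return any(rule_accepts(url, allow, deny) for allow, deny in rules)

# Deny list on the first rule only: the second rule still lets cruft through.
leaky_rules = [((r"/records/",), DENY), ((r"/page/",), ())]
# Deny list repeated on every rule.
tight_rules = [((r"/records/",), DENY), ((r"/page/",), DENY)]

cruft = "http://example.com/page/2/print/"
print(spider_follows(cruft, leaky_rules))  # True
print(spider_follows(cruft, tight_rules))  # False
```

In a real spider the same idea would be one shared tuple passed as the deny argument to the link extractor in every rule, so the list only has to be maintained in one place.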
Anyway, check it out. Was super easy. My Python-Fu is really weak and I managed to learn the framework and bust out a script in only a couple hours.