Writing your own webspider with Scrapy

Viewing 3 reply threads
    • #7856
      tturner
      Participant

      Hey folks, figured I’d drop in and talk about what I’ve been working on today. I’m working on a separate project that required extracting records from about 2 million pages of a website with no published API. After agonizing over how I was going to accomplish this with bash or Python, I settled on the Scrapy framework. http://scrapy.org/

      Some of the unique challenges involved bypassing anti-spidering mechanisms built into the site (I checked the TOS, and it did not forbid automated harvesting of site data) and filtering out a ton of cruft I did not need. I managed to reduce my requests by about 500,000 by building the spider with Scrapy. With a 200 ms delay between requests, that saved me over a day's worth of request time.
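      A quick back-of-the-envelope check on that savings claim, using the numbers above (500,000 fewer requests at a 200 ms delay each):

```python
# Time saved by trimming ~500,000 requests at a 200 ms per-request delay.
requests_saved = 500_000
delay_seconds = 0.2  # 200 ms

seconds_saved = requests_saved * delay_seconds
days_saved = seconds_saved / 86_400  # seconds per day

print(f"{seconds_saved:,.0f} s saved, about {days_saved:.2f} days")
```

      That works out to roughly 1.16 days of delay time alone, consistent with "over a day's worth."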

      I’m not going to post all my code here, but I wanted to highlight the things I did to overcome the site’s defense mechanisms.

      (these go into the x_spider.py file)

      DOWNLOAD_DELAY = 0.2 # added to address rate-limiting controls; the value is in seconds and accepts decimals, so this is 200 ms. Reduce this or comment it out entirely if possible (Scrapy will randomize the delay between 0.5x and 1.5x of this value by default)

      COOKIES_ENABLED = False # added to defeat mechanisms that were detecting the cookie I presented and fingerprinting me based on it. The default value is True

      (this went into the settings.py file)

      USER_AGENT = 'Googlebot/2.1'

      The default user agent is the project name, with version 1.0. If you needed to, you could also randomize the user agent with code like:

      http://snippets.scrapy.org/snippets/27/
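      In case that snippet link goes stale: the usual approach is a downloader middleware that overwrites the User-Agent header per request. Here is a minimal sketch; the class name, agent list, and project path are my own placeholders, not from the snippet:

```python
import random

# Pool of user agents to rotate through (placeholder values).
USER_AGENT_LIST = [
    'Googlebot/2.1',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Mozilla/5.0 (X11; Linux x86_64)',
]

class RandomUserAgentMiddleware:
    """Scrapy downloader middleware that picks a random User-Agent
    for each outgoing request."""

    def process_request(self, request, spider):
        # Overwrite the User-Agent header before the request is sent.
        request.headers['User-Agent'] = random.choice(USER_AGENT_LIST)
```

      You would then enable it in settings.py with something like DOWNLOADER_MIDDLEWARES = {'myproject.middlewares.RandomUserAgentMiddleware': 400} (module path is hypothetical).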

      Another gotcha I ran into: when using CrawlSpider Rules with multiple rules, you need to put the deny entries in each rule.

      Anyway, check it out. It was super easy. My Python-fu is really weak, and I still managed to learn the framework and bust out a script in only a couple of hours.

    • #49508
      rance
      Participant

      That’s awesome. I’ve never really used Scrapy, but I saw your tweet earlier and it definitely got me interested in revisiting it.

    • #49509
      Jamie.R
      Participant

      Good post, thanks for the info tturner. I’ve never really used Scrapy before, but I think I’ll have to take a look to understand this better.

    • #49510
      cyber.spirit
      Participant

      Awesome! Thanks


Copyright ©2021 Caendra, Inc.
