Topic: Writing your own webspider with Scrapy
tturner
« on: August 30, 2012, 02:39:24 PM »

Hey folks, figured I'd drop in and talk about what I've been working on today. I'm working on a separate project that required extracting records from about 2 million pages of a website with no published API. After agonizing over how to accomplish this with bash or Python, I settled on the Scrapy framework: http://scrapy.org/

Some of the unique challenges involved bypassing anti-spidering mechanisms built into the site (I checked the TOS and it did not forbid automated harvesting of site data) and filtering out a ton of cruft I did not need. Building the spider with Scrapy let me cut about 500,000 requests. With a 200 ms delay between requests, that saved me over a day's worth of crawling.
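
As a rough check on that estimate, here is the arithmetic as a short sketch. It assumes each skipped request would have cost only the 200 ms delay and ignores download time, so the real saving would be somewhat larger.

```python
# Back-of-the-envelope check of the time saved by skipping requests.
# Assumes each avoided request would have cost only the configured delay.
REQUESTS_AVOIDED = 500_000   # figure quoted in the post
DELAY_SECONDS = 0.2          # 200 ms between requests

seconds_saved = REQUESTS_AVOIDED * DELAY_SECONDS
hours_saved = seconds_saved / 3600

print(f"{seconds_saved:.0f} s, or about {hours_saved:.1f} hours")
```

That works out to roughly 27.8 hours, a little over a day.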

I'm not going to post all my code here, but I wanted to highlight the things I did to overcome the site's defense mechanisms.

(these things go into the x_spider.py file)

DOWNLOAD_DELAY = 0.2 # Added to address rate-limiting controls. The value is in seconds and can be a decimal, so this is 200 ms. I recommend reducing it or commenting it out entirely if possible. (By default Scrapy randomizes the actual wait to between 0.5 and 1.5 times this value.)

COOKIES_ENABLED = False # Added to defeat mechanisms that were detecting the cookie I presented and fingerprinting me based on it. The default is True.
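
A small sketch of what the delay randomization means in practice. `randomized_delay` is a hypothetical helper written for illustration, not a Scrapy function; it only mimics the 0.5x to 1.5x range. Note that the setting is expressed in seconds, so 200 ms is written as 0.2.

```python
import random

DOWNLOAD_DELAY = 0.2  # seconds; 200 ms, matching the delay described in the post


def randomized_delay(base_delay):
    """Pick a wait between 0.5x and 1.5x the base delay (illustrative helper)."""
    return random.uniform(0.5 * base_delay, 1.5 * base_delay)


# Sample a few waits to show the spread around the configured value.
samples = [randomized_delay(DOWNLOAD_DELAY) for _ in range(5)]
print([round(s, 3) for s in samples])
```

Each sampled wait should land somewhere between 100 ms and 300 ms, which makes the request timing look less mechanical than a fixed interval.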

(this went into settings.py file)

USER_AGENT = 'Googlebot/2.1'

The default user agent is the project name, with version 1.0. If you need to, you can also randomize the user agent with code like:

http://snippets.scrapy.org/snippets/27/
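
In case that snippet is unavailable, here is a minimal sketch of the same idea as a downloader middleware. This is not the code from the link; the class name and the user-agent pool are assumptions made up for illustration.

```python
import random

# Hypothetical pool of user-agent strings; substitute whatever suits your crawl.
USER_AGENTS = [
    "Googlebot/2.1",
    "Mozilla/5.0 (X11; Linux x86_64)",
    "Mozilla/5.0 (Windows NT 6.1; WOW64)",
]


class RandomUserAgentMiddleware:
    """Downloader middleware sketch: set a random User-Agent on each request."""

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
        return None  # None tells Scrapy to keep processing the request normally
```

A middleware like this would be switched on through the DOWNLOADER_MIDDLEWARES setting; the module path and ordering value depend on your project layout and Scrapy version.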

Another gotcha I ran into: when using CrawlSpider rules, if you have multiple rules you need to put the deny entries in each rule.
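
To show why, here is a toy model in plain Python (not Scrapy code) of rules that each filter links independently. The URL patterns and the `matching_rule` helper are hypothetical. In a real CrawlSpider the same tuple would be passed as the `deny` argument of each rule's link extractor.

```python
import re

# Hypothetical URL patterns, invented for this illustration.
DENY = (r"/login", r"/print/")

# Each rule carries its own allow and deny patterns.
RULES = [
    {"name": "listing", "allow": (r"/category/",), "deny": DENY},
    {"name": "detail", "allow": (r"/item/",), "deny": DENY},
]


def matching_rule(url, rules):
    """Return the name of the first rule that would follow url, or None."""
    for rule in rules:
        allowed = any(re.search(p, url) for p in rule["allow"])
        denied = any(re.search(p, url) for p in rule["deny"])
        if allowed and not denied:
            return rule["name"]
    return None


# Drop the deny list from the second rule to reproduce the gotcha.
leaky = [RULES[0], {"name": "detail", "allow": (r"/item/",), "deny": ()}]

url = "http://example.com/item/42/print/"
print(matching_rule(url, RULES))  # None
print(matching_rule(url, leaky))  # detail
```

Keeping the deny patterns in one shared tuple makes it harder to forget them when a new rule is added.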

Anyway, check it out. Was super easy. My Python-Fu is really weak and I managed to learn the framework and bust out a script in only a couple hours.
rance
« Reply #1 on: August 30, 2012, 09:00:02 PM »

That's awesome. I've never really used Scrapy, but I saw your tweet earlier and it definitely got me interested in revisiting it.
Jamie.R
« Reply #2 on: August 31, 2012, 03:14:38 AM »

Good post, thanks for the info, tturner. I've never really used Scrapy before, but I think I'll have to take a look to understand this better.
fred
« Reply #3 on: September 02, 2012, 03:35:11 PM »

Awesome! Thanks.