Writing your own webspider with Scrapy

tturner

Sr. Member

Posts: 435

Joined: Thu Jun 26, 2008 4:50 pm

Post Thu Aug 30, 2012 2:39 pm

Writing your own webspider with Scrapy

Hey folks, figured I'd drop in and talk about what I've been working on today. I am working on a separate project that requires me to extract records from about 2 million pages on a website with no published API. After agonizing over how I was going to accomplish this using bash or Python, I settled on the Scrapy framework: http://scrapy.org/

Some of the unique challenges I had involved bypassing anti-spidering mechanisms built into the site (I checked the TOS and it did not forbid automated harvesting of site data) and filtering out a ton of cruft I did not need. I managed to reduce my requests by about 500,000 by building the spider using Scrapy. With a 200 ms delay between requests, that saved me over a day's worth of requests.
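
As a quick sanity check on that estimate, here is a back-of-the-envelope calculation. It assumes every avoided request would have cost exactly one 200 ms delay and ignores download time, so treat it as a rough figure rather than a measurement.

```python
# Rough estimate of the crawl time saved by skipping unneeded pages.
# Assumption: each avoided request costs exactly one 200 ms delay.
REQUESTS_AVOIDED = 500_000
DELAY_MS = 200

saved_ms = REQUESTS_AVOIDED * DELAY_MS
saved_hours = saved_ms / (1000 * 60 * 60)

print(f"{saved_hours:.1f} hours saved")
```

That works out to a bit under 28 hours, which matches the "over a day" figure.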

I'm not going to post all my code here, but I wanted to highlight the things I did to overcome the site's defense mechanisms.

(these things go into the x_spider.py file)

DOWNLOAD_DELAY = 0.2 # added to address rate limiting controls; the value is in seconds and can take decimal entries. This is 200 ms, so I recommend reducing it or commenting it out entirely if possible (by default Scrapy randomizes the actual wait to between 0.5 and 1.5 times this value)

COOKIES_ENABLED = False # added to defeat mechanisms that were detecting the cookie I presented and fingerprinting me based on it. The default value is True
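
To make the delay randomization concrete, here is a small stand-alone sketch of the range involved. It mirrors the documented behavior of Scrapy's RANDOMIZE_DOWNLOAD_DELAY setting (a wait between 0.5 and 1.5 times DOWNLOAD_DELAY), but it is not Scrapy's own code, and the function name effective_delay is made up for illustration.

```python
import random

DOWNLOAD_DELAY = 0.2  # seconds, i.e. 200 ms


def effective_delay(base_delay, randomize=True):
    """Return one wait time in seconds (hypothetical helper, not a Scrapy API)."""
    if not randomize:
        return base_delay
    # Pick a wait uniformly between 0.5x and 1.5x the configured delay.
    return random.uniform(0.5 * base_delay, 1.5 * base_delay)


if __name__ == "__main__":
    samples = [effective_delay(DOWNLOAD_DELAY) for _ in range(5)]
    print([round(s, 3) for s in samples])
```

With a 200 ms base delay, each wait lands somewhere between roughly 100 ms and 300 ms, which makes the request timing look less mechanical.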

(this went into settings.py file)

USER_AGENT = 'Googlebot/2.1'

The default user agent is the project name, with version 1.0. If you need to, you can also randomize the user agent with code like:

http://snippets.scrapy.org/snippets/27/
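
In case that link goes stale, here is a minimal sketch of the same idea as a downloader middleware. It is not the linked snippet. The class name, the user agent strings, and the module path in the closing comment are placeholders chosen for illustration; only the process_request(request, spider) hook comes from Scrapy.

```python
import random

# Placeholder user agent strings; swap in whatever the crawl should present.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) ExampleBrowser/1.0",
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/2.0",
    "Googlebot/2.1",
]


class RandomUserAgentMiddleware:
    """Downloader middleware sketch that picks a user agent per request."""

    def __init__(self, user_agents=USER_AGENTS):
        self.user_agents = list(user_agents)

    def process_request(self, request, spider):
        # Overwrite the header before the request is downloaded.
        request.headers["User-Agent"] = random.choice(self.user_agents)
        # Returning None tells Scrapy to keep processing the request.
        return None


# To enable it, settings.py would need an entry along these lines
# (the module path is hypothetical):
# DOWNLOADER_MIDDLEWARES = {
#     "myproject.middlewares.RandomUserAgentMiddleware": 400,
# }
```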

Another gotcha I ran into: when using CrawlSpider rules, if you have multiple rules you need to put the deny entries in each rule.
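
The sketch below illustrates why. It is a plain-Python stand-in for the rule matching, not Scrapy code: each rule checks a URL against its own allow and deny patterns only, so a deny listed in one rule does nothing for the others. The URL patterns and rule contents are invented for the example.

```python
import re

# Shared deny patterns (illustrative) so every rule can reuse the same list.
DENY = (r"/login", r"/print/")


def rule_matches(rule, url):
    """True if this one rule would follow the URL, judged by its own patterns only."""
    allowed = any(re.search(p, url) for p in rule["allow"])
    denied = any(re.search(p, url) for p in rule.get("deny", ()))
    return allowed and not denied


def would_follow(rules, url):
    """A URL is followed if any single rule accepts it."""
    return any(rule_matches(rule, url) for rule in rules)


# Deny only on the first rule: the second rule still lets the URL through.
leaky_rules = [
    {"allow": (r"/records/",), "deny": DENY},
    {"allow": (r"/page/\d+",)},
]

# Deny repeated in each rule, which is the fix described above.
# In Scrapy terms: Rule(LinkExtractor(allow=..., deny=DENY), ...) for every rule.
fixed_rules = [
    {"allow": (r"/records/",), "deny": DENY},
    {"allow": (r"/page/\d+",), "deny": DENY},
]

url = "http://example.com/records/print/page/3"
print(would_follow(leaky_rules, url))  # True
print(would_follow(fixed_rules, url))  # False
```

Keeping the deny patterns in one shared tuple and passing it to every rule avoids forgetting one when a new rule is added.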

Anyway, check it out. Was super easy. My Python-Fu is really weak and I managed to learn the framework and bust out a script in only a couple hours.
Certifications:
CISSP, CISA, GPEN, GWAPT, GAWN, GCIA, GCIH, GSEC, GSSP-JAVA, OPSE, CSWAE, CSTP, VCP

WIP: Vendor WAF stuff

http://sentinel24.com/blog @tonylturner http://bsidesorlando.org

rance

Full Member

Posts: 212

Joined: Thu Jan 03, 2008 5:24 pm

Location: Earth

Post Thu Aug 30, 2012 9:00 pm

Re: Writing your own webspider with Scrapy

That's awesome. I've never really used Scrapy, but I saw your tweet earlier and it definitely got me interested in revisiting it.
Poking at security since 1986.  +++ATH

Jamie.R

Sr. Member

Posts: 435

Joined: Mon Aug 06, 2012 9:57 am

Location: UK

Post Fri Aug 31, 2012 3:14 am

Re: Writing your own webspider with Scrapy

Good post, thanks for the info, tturner. I've never really used Scrapy before, but I think I'll have to take a look to understand this better.
| OSWP | eCPPT Silver and Gold | eWPT |

I'm an InterN0T'er

cyber.spirit

Sr. Member

Posts: 356

Joined: Sun Feb 26, 2012 8:07 am

Location: in your heart!

Post Sun Sep 02, 2012 3:35 pm

Re: Writing your own webspider with Scrapy

Awesome! Thanks
ICS Academy Network Security Certified
