EH-Net
Topic: [Python] Parsing text from a webpage
jakx (Newbie, Posts: 14)
« on: January 27, 2009, 06:58:55 PM »

I am trying to write a program that fetches a web page, finds every occurrence of a string the user enters, and prints the matches to standard output. As the topic says, I am doing this in Python because I want to learn the language as well as write this program. I have looked for documentation on functions like findall and search but could not find anything good. Here is what I have so far. Any suggestions would be great. Thanks.




   
Code:
import urllib, re

print "Enter The website: "
url = raw_input()

# Download the raw page source (Python 2 urllib)
data = urllib.urlopen(url).read()

print "Please enter a topic for me to find: "
topic = raw_input()

# re.findall takes the pattern first, then the text to search
matches = re.findall(topic, data)
print matches
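
On the findall/search question, here is a minimal sketch of how the two differ. It is written for Python 3 (an assumption; the code above is Python 2), and the sample string `page` is made up for illustration.

```python
import re

# Made-up sample text standing in for a downloaded page
page = "Python is fun. I like python and PYTHON."

# re.findall(pattern, string) returns every non-overlapping match as a list
all_matches = re.findall("Python", page)

# re.search(pattern, string) returns a match object for the first hit, or None
first = re.search("python", page)

# re.IGNORECASE makes the pattern match regardless of case
any_case = re.findall("python", page, re.IGNORECASE)

print(all_matches)
print(first.start())
print(any_case)
```

Note the argument order in both calls: the pattern comes first and the text to search comes second.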

adamj (Newbie, Posts: 17)
« Reply #1 on: January 28, 2009, 12:02:30 AM »

I'm new to Python too, but how about this?
Same as yours, but it should strip out HTML tags.

Code:
import urllib2, re

def remove_html_tags(data):
    # Remove anything that looks like <...>; a rough filter, not a real parser
    p = re.compile(r'<[^<]*?>')
    return p.sub('', data)

print "Enter The website: "
url = raw_input()

# Fetch the page and strip the tags before searching
response = urllib2.urlopen(url)
data = remove_html_tags(response.read())

print "search word"
topic = raw_input()

# Pattern first, then the text to search
matches = re.findall(topic, data)
print matches
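
For readers on Python 3, the same idea might look roughly like the sketch below. The function names `find_topic` and `fetch` and the sample HTML are hypothetical, the UTF-8 decoding is an assumption about the page, and `re.escape` is an addition so that the search term is treated as literal text rather than as a regular expression.

```python
import re
import urllib.request

# Same tag-stripping pattern as in the post above
TAG_RE = re.compile(r"<[^<]*?>")

def remove_html_tags(html):
    """Drop anything that looks like an HTML tag (regex based, not a real parser)."""
    return TAG_RE.sub("", html)

def find_topic(html, topic):
    """Return every literal, case-insensitive occurrence of topic in the tag-free text."""
    text = remove_html_tags(html)
    return re.findall(re.escape(topic), text, re.IGNORECASE)

def fetch(url):
    """Download a page and decode it as UTF-8 (assumed encoding)."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")

# Made-up sample page so the sketch runs without network access
sample = "<html><body><h1>Python tips</h1><p>Learn python today.</p></body></html>"
print(find_topic(sample, "python"))
```

To search a live page, something like `find_topic(fetch(url), topic)` should work, with `url` and `topic` read from the user.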


jakx (Newbie, Posts: 14)
« Reply #2 on: January 30, 2009, 10:45:45 AM »

Awesome! Thanks for the input!

munkeyfreenix .batcat (Newbie, Posts: 11)
« Reply #3 on: March 11, 2009, 06:49:50 PM »

One thing to keep in mind about your script, as far as secure practices go, is how you use raw_input.

In my experience, raw_input() is much safer than plain input(), but you MUST run checks on what it returns and scrub the data. Otherwise your program will be buggy at best and, more likely, insecure.
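
As a rough illustration of the kind of checks meant here, the sketch below validates both inputs before they are used. It is Python 3 (an assumption), and the helper names `clean_url` and `clean_topic`, the allowed schemes, and the 100-character limit are all hypothetical choices, not rules from this thread. The URL check matters because urlopen also accepts file:// URLs, and the escaping matters because the search term is otherwise interpreted as a regular expression.

```python
import re
from urllib.parse import urlparse

# Assumed policy: only plain web URLs are accepted
ALLOWED_SCHEMES = {"http", "https"}

def clean_url(raw):
    """Return the trimmed URL, or raise ValueError if it is not http(s)."""
    url = raw.strip()
    parts = urlparse(url)
    if parts.scheme not in ALLOWED_SCHEMES or not parts.netloc:
        raise ValueError("only http:// and https:// URLs are accepted")
    return url

def clean_topic(raw, max_length=100):
    """Return the search term as a regex-safe literal pattern."""
    topic = raw.strip()
    if not topic or len(topic) > max_length:
        raise ValueError("search term must be 1 to %d characters" % max_length)
    # Escape regex metacharacters so "a.b" matches only the literal text "a.b"
    return re.escape(topic)

print(clean_url("  http://example.com/page  "))
print(re.findall(clean_topic("a.b"), "a.b axb"))
```

Both helpers raise ValueError on bad input, so the calling script can catch it and ask the user again.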


geo (Newbie, Posts: 2)
« Reply #4 on: March 14, 2009, 05:44:31 AM »

I think you would be better off relying on an HTML parser or on XPath rather than regular expressions. I wrote an article last month about web scraping techniques: http://ssscripting.wordpress.com/2009/02/15/web-scraping-techniques/ . Even though the code samples are written in Ruby, you can use BeautifulSoup to do the same type of scraping in Python.
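
To give a feel for the parser approach, here is a minimal Python 3 sketch. It deliberately uses the standard library's html.parser instead of BeautifulSoup so that it runs without installing anything; the class name `TextExtractor`, the helper `extract_text`, and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

def extract_text(html):
    """Return the page's visible text as one space-separated string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)

# Made-up sample page; note the tag-like text inside the script block
sample = ("<html><head><style>p {color: red}</style></head>"
          "<body><p>Hello <b>world</b></p>"
          "<script>var x = '<p>';</script></body></html>")
print(extract_text(sample))
```

Unlike the regex filter, this skips script and style contents entirely, so CSS and JavaScript do not end up in the text that gets searched.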
© 2013 The Ethical Hacker Network