
[Python] Parsing text from a webpage

<<

jakx

Newbie

Posts: 14

Joined: Mon Aug 11, 2008 9:20 am

Post Tue Jan 27, 2009 7:58 pm

[Python] Parsing text from a webpage

I am trying to write a program that takes a web page, finds every occurrence of a string the user enters, and prints the matches to standard output. As the topic says, I am doing this in Python because I want to learn the language as well as write this program. I have looked for documentation on functions like findall and search but could not find anything good. Here is what I have so far. Any suggestions would be great. Thanks.




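Code:
import re
import urllib

# Python 2: ask for the page to fetch
print "Enter The website: "
url = raw_input()

# Download the raw HTML of the page
data = urllib.urlopen(url).read()

print "Please enter a topic for me to find: "
topic = raw_input()

# re.findall takes the pattern first, then the string to search
matches = re.findall(topic, data)
print matches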
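
On the question of how search and findall behave, here is a minimal sketch written for Python 3 (so print is a function). The sample sentence is made up for illustration. Both functions take the pattern first and the text second: re.search returns only the first match as a match object (or None), while re.findall returns a list of every non-overlapping match.

```python
import re

# Made-up sample text, standing in for a downloaded page
text = "Python is fun. I like python, and PYTHON likes me."

# re.search(pattern, string): first match object, or None if nothing matches
first = re.search("python", text)
first_span = first.span() if first else None

# re.findall(pattern, string): list of every non-overlapping match
exact = re.findall("python", text)

# flags=re.IGNORECASE makes the match case-insensitive
any_case = re.findall("python", text, flags=re.IGNORECASE)

print(first_span)
print(exact)
print(any_case)
```

The case-sensitive calls only see the lowercase word, which is usually not what you want when searching a web page, so the IGNORECASE flag is probably worth passing.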
<<

adamj


Newbie

Posts: 17

Joined: Wed Jan 23, 2008 11:49 pm

Location: Maryland

Post Wed Jan 28, 2009 1:02 am

Re: [Python] Parsing text from a webpage

I'm new to Python too, but how about this?
Same as yours, but it should strip out HTML tags.

Code:
import re
import urllib2

def remove_html_tags(data):
  # Match "<", then anything that is not "<", up to the nearest ">"
  p = re.compile(r'<[^<]*?>')
  return p.sub('', data)

print "Enter The website: "
url = raw_input()

# Download the page and strip the tags before searching
response = urllib2.urlopen(url)
data = remove_html_tags(response.read())

print "search word"
topic = raw_input()

# Pattern first, then the text to search
matches = re.findall(topic, data)
print matches
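
A rough Python 3 version of the same idea, as a sketch rather than a drop-in replacement. In Python 3, urllib2 became urllib.request, and the page would need decoding to text before searching. To keep the example self-contained it uses a hypothetical HTML string instead of a live download. The find_topic helper is a name invented here. It also passes the topic through re.escape, on the assumption that the user is typing plain text and not a regular expression.

```python
import re

# Same tag-stripping regex as in the post above
TAG_RE = re.compile(r"<[^<]*?>")

def remove_html_tags(html):
    """Drop anything that looks like an HTML tag."""
    return TAG_RE.sub("", html)

def find_topic(html, topic):
    """Return every case-insensitive occurrence of the literal text `topic`."""
    text = remove_html_tags(html)
    # re.escape keeps characters such as "." or "+" from being read as regex syntax
    return re.findall(re.escape(topic), text, flags=re.IGNORECASE)

# Hypothetical page content, used here instead of a live download
sample = "<html><body><h1>C++ tips</h1><p>Learn c++ today.</p></body></html>"
print(find_topic(sample, "C++"))
```

Without re.escape, a topic like "C++" would raise an error, because "+" is a regex operator.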
<<

jakx

Newbie

Posts: 14

Joined: Mon Aug 11, 2008 9:20 am

Post Fri Jan 30, 2009 11:45 am

Re: [Python] Parsing text from a webpage

Awesome! Thanks for the input!
<<

munkeyfreenix.batcat


Newbie

Posts: 11

Joined: Mon Mar 09, 2009 10:09 pm

Post Wed Mar 11, 2009 6:49 pm

Re: [Python] Parsing text from a webpage

One thing to keep in mind about your script, as far as secure practices go, is how you use raw_input().

In my experience, raw_input() is much safer than plain input(), which in Python 2 evaluates whatever the user types as Python code. You still MUST run checks on the result and scrub the data; otherwise your program will be buggy at best and most likely insecure.
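
What "checks" means depends on the program, but as one possible sketch in Python 3: only accept http or https URLs (in Python 2, urllib.urlopen will also open local files if handed a file path), and limit and escape the search term before it reaches the regex engine. The helper names clean_url and clean_topic and the 100-character limit are assumptions made up for this example.

```python
import re
from urllib.parse import urlparse

def clean_url(raw):
    """Return a stripped http(s) URL, or raise ValueError."""
    url = raw.strip()
    parts = urlparse(url)
    # Reject file://, ftp://, bare paths and anything without a host
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError("expected an http:// or https:// URL")
    return url

def clean_topic(raw, max_len=100):
    """Return an escaped, length-limited search pattern, or raise ValueError."""
    topic = raw.strip()
    if not topic or len(topic) > max_len:
        raise ValueError("topic must be 1 to %d characters" % max_len)
    # Escape regex metacharacters so the input is matched literally
    return re.escape(topic)

print(clean_url("  http://example.com/page "))
print(re.findall(clean_topic("a.b"), "a.b axb"))
```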
<<

geo

Newbie

Posts: 2

Joined: Fri Mar 13, 2009 5:45 pm

Post Sat Mar 14, 2009 5:44 am

Re: [Python] Parsing text from a webpage

I think you should rely on an HTML parser or on XPath instead. I wrote an article last month about web scraping techniques: http://ssscripting.wordpress.com/2009/0 ... echniques/ . Even though the code samples are written in Ruby, you can use BeautifulSoup to do the same type of scraping in Python.
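
To give a feel for the parser approach without installing anything, here is a small sketch that swaps in Python 3's standard-library html.parser in place of BeautifulSoup. It is not taken from the linked article. The TextExtractor class and the sample page are invented for illustration. Unlike a tag-stripping regex, it can skip the contents of script and style blocks, which would otherwise show up as false matches.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())

def visible_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)

# Hypothetical page, used here instead of a live download
sample = ("<html><head><style>p {color: red}</style></head>"
          "<body><p>Hello <b>world</b></p>"
          "<script>var x = '<p>';</script></body></html>")
print(visible_text(sample))
```

The returned text can then be searched with re.findall exactly as in the earlier posts.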
