I recently had to write a script that takes a link to an article and returns a title and brief excerpt or description of that article. Ideally, the excerpt should be the first few sentences from the body of the article.
The first thing I struggled with was something I thought would be trivial: fetching the contents of the webpage.
>>> import httplib2
>>> http = httplib2.Http()
>>> status, response = http.request("http://www.mercurynews.com/karendsouza/ci_12510394")
>>> print response
''
>>> print status['status']
'200'
Ugh. Why is the response empty when the status code is 200? For some reason, even though we're not logging in, this page requires that you accept a cookie and then redirects you, but it doesn't return a standard 3xx redirect status that httplib2 knows to follow. So the simplest solution was to ditch httplib2 and use urllib2 and cookielib (by the way, what's with having urllib, urllib2, httplib, and httplib2?)
>>> import urllib2
>>> import cookielib
>>> cj = cookielib.CookieJar()
>>> opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
>>> doc = opener.open(url).read()
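As an aside on that naming mess: in Python 3 this whole family was reorganized, with urllib2 becoming urllib.request and cookielib becoming http.cookiejar. A rough Python 3 equivalent of the snippet above would be (here `url` is a placeholder for whatever page you're fetching):

```python
# Python 3 equivalent of the urllib2/cookielib setup above.
# urllib2 -> urllib.request, cookielib -> http.cookiejar
import urllib.request
import http.cookiejar

cj = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
# doc = opener.open(url).read()  # same usage as before
```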
Once we have the contents of the page, we load everything into BeautifulSoup and make sure that we have some valid HTML. cleanSoup is just a helper function to filter out HTML that I’m not interested in or could munge up my results.
soup = cleanSoup(BeautifulSoup(doc, parseOnlyThese=SoupStrainer('head')))
if not soup.get_starttag_text():
    print "Invalid input"
    return None
try:
    title = soup.head.title.string
except:
    title = None
description = ''
for meta in soup.findAll('meta'):
    if 'description' == meta.get('name', '').lower():
        description = meta['content']
        break
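If you'd rather not pull in BeautifulSoup just for the head section, the same title and meta-description logic can be sketched with the standard library's HTML parser (this is a Python 3 sketch with made-up sample HTML, not the code from this project):

```python
# Dependency-free sketch of the <title> / meta-description extraction
# above, using Python 3's stdlib parser instead of BeautifulSoup.
from html.parser import HTMLParser

class HeadParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.title = None
        self.description = ''
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'title':
            self._in_title = True
        elif tag == 'meta' and attrs.get('name', '').lower() == 'description':
            self.description = attrs.get('content', '')

    def handle_endtag(self, tag):
        if tag == 'title':
            self._in_title = False

    def handle_data(self, data):
        # Only text inside <title>...</title> is captured
        if self._in_title:
            self.title = data

parser = HeadParser()
parser.feed('<html><head><title>Example</title>'
            '<meta name="description" content="A made-up page."></head></html>')
```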
Getting the title is easy, and it seems to be a pretty safe assumption that every page has a title defined in head. Sometimes you'll get lucky and a description will exist in the meta tag, and bam, you're done. More often than not, though, you'll have to come up with a method to parse the HTML and figure out which part of it is the article you're interested in. I tried a couple of different approaches, and this one seemed to produce the best results for its relative simplicity. I considered some natural language parsing and machine learning methods, but I really don't have time to build something that complicated for this project.
def removeHeaders(soup):
    [[tree.extract() for tree in soup(elem)] for elem in ('h1','h2','h3','h4','h5','h6')]
    return soup
if not description:
    soup = removeHeaders(cleanSoup(BeautifulSoup(doc, parseOnlyThese=SoupStrainer('body'))))
    text = ''.join(soup.findAll(text=True)).split('\n')
    # max() returns the (length, string) tuple; [1] recovers the string itself
    description = max((len(i.strip()), i) for i in text)[1].strip()[0:255]
return (title, description)
First I parse out all the text from <body> and remove h1, h2, etc. headers, because they're likely to contain information like the title, author, and date that isn't part of the body of the article. Thankfully, BeautifulSoup does most of the heavy lifting here. I then try to merge adjacent paragraphs into one long string of text. You have to be careful when joining bodies of text together, because if you branch too far out, you end up merging in junk text. The first 255 characters of the longest resulting string of text are then returned as the article excerpt. In most cases I found that this does a pretty good job of finding the first couple of sentences from the article, or at least a reasonable excerpt.
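The heart of the heuristic, taking the longest newline-separated run of text and truncating it to 255 characters, works on plain strings, so it can be shown in isolation on a made-up page dump (this is a Python 3 sketch, and the sample text is invented):

```python
# The excerpt heuristic in isolation: among newline-separated chunks of
# page text, the longest one is most likely the article body.
def best_excerpt(page_text, limit=255):
    chunks = page_text.split('\n')
    # max() compares (length, chunk) tuples; [1] recovers the chunk itself
    longest = max((len(c.strip()), c) for c in chunks)[1]
    return longest.strip()[:limit]

sample = ('Menu\n'
          'Home | News | Sports\n'
          'The actual article text is usually the longest run on the page.\n'
          'Copyright 2009')
print(best_excerpt(sample))
```

Navigation links, copyright lines, and other chrome tend to be short, which is why this simple length comparison holds up surprisingly well in practice.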
Some examples are shown below. Facebook has a widget that does this too, so I included the output of Facebook’s widget as a comparison. If you have any ideas on how to improve this, I’d love to hear them.
Update: Now on github http://github.com/dziegler/excerpt_extractor/tree/master
Thanks for the DRY suggestions to make the code prettier.