I recently had to write a script that takes a link to an article and returns a title and brief excerpt or description of that article. Ideally, the excerpt should be the first few sentences from the body of the article.
The first thing I struggled with was something I thought would be trivial: fetching the contents of the webpage.
>>> import httplib2
>>> status, response = httplib2.Http().request(url)
>>> print response
>>> print status['status']
Ugh. Why is the response empty when the status code is 200? Well, for some reason, even though we’re not logging in, this page requires that you accept a cookie and then redirects you, but it doesn’t send a 3xx redirect status (such as 302) that httplib2 would know to follow. So the simplest solution was to ditch httplib2 and use urllib2 and cookielib. (By the way, what’s with having urllib, urllib2, httplib, and httplib2?)
>>> import cookielib, urllib2
>>> cj = cookielib.CookieJar()
>>> opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
>>> doc = opener.open(url).read()
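For anyone on Python 3, where urllib2 and cookielib became urllib.request and http.cookiejar, the same cookie-aware fetch might look roughly like the sketch below. The helper names `build_cookie_opener` and `fetch` and the 10-second timeout are my own assumptions for illustration; they are not part of the original script.

```python
import http.cookiejar
import urllib.request


def build_cookie_opener():
    """Return (opener, jar); the opener stores cookies and resends them on redirects."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def fetch(url, timeout=10):
    """Fetch url with a fresh cookie jar and return the raw response bytes."""
    opener, _jar = build_cookie_opener()
    with opener.open(url, timeout=timeout) as resp:
        return resp.read()
```

Returning the jar alongside the opener makes it easy to inspect which cookies a site set when a fetch misbehaves.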
Once we have the contents of the page, we load everything into BeautifulSoup and make sure that we have some valid HTML. cleanSoup is just a helper function to filter out HTML that I’m not interested in or could munge up my results.
import re
from BeautifulSoup import BeautifulSoup, Comment, SoupStrainer

def cleanSoup(soup):
    # get rid of javascript, noscript and css
    for tree in soup(['script', 'noscript', 'style']):
        tree.extract()
    # get rid of doctype
    for tree in soup.findAll(text=re.compile("DOCTYPE")):
        tree.extract()
    # get rid of comments
    for comment in soup.findAll(text=lambda text: isinstance(text, Comment)):
        comment.extract()
    return soup
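def get_summary(url):  # the post omits the def line, so this name is an assumption
    doc = opener.open(url).read()  # cookie-aware opener built earlier
    soup = cleanSoup(BeautifulSoup(doc, parseOnlyThese=SoupStrainer('head')))
    if not soup.get_starttag_text():
        print "Invalid input"
        return None
    try:
        title = soup.head.title.string
    except AttributeError:
        # no <head> or no <title> was found
        title = None
    description = ''
    for meta in soup.findAll('meta'):
        if 'description' == meta.get('name', '').lower():
            description = meta.get('content', '')
            break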
Getting the title is easy, and it seems to be a pretty safe assumption that every page has a title defined in head. Sometimes you’ll get lucky and a description will exist in the meta tag, and bam, you’re done. More often than not though, you’ll have to come up with a method to parse the HTML and figure out which part of that is the article you’re interested in. I tried a couple of different approaches and this one seemed to produce the best results for its relative simplicity. I considered some natural language parsing and machine learning methods, but I really don’t have time to build something that complicated for this project.
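As a rough illustration of the title and meta-description lookup, here is a sketch that uses only the Python 3 standard library's html.parser in place of BeautifulSoup. The class `HeadInfo` and the function `title_and_description` are hypothetical names made up for this example.

```python
from html.parser import HTMLParser


class HeadInfo(HTMLParser):
    """Collect the first <title> text and the first <meta name="description"> content."""

    def __init__(self):
        super().__init__()
        self.title = None
        self.description = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title" and self.title is None:
            self._in_title = True
            self.title = ""
        elif tag == "meta" and not self.description:
            attr = dict(attrs)
            # Attribute names arrive lowercased; values keep their original case.
            if (attr.get("name") or "").lower() == "description":
                self.description = attr.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def title_and_description(html):
    """Return (title or None, description or '') for an HTML string."""
    parser = HeadInfo()
    parser.feed(html)
    parser.close()
    title = parser.title.strip() if parser.title else None
    return title, parser.description
```

An empty description is the signal to fall back to scanning the body text.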
def removeHeaders(soup):
    for tree in soup(['h1', 'h2', 'h3', 'h4', 'h5', 'h6']):
        tree.extract()
    return soup

# ...continuing the function from the previous snippet:
    if not description:
        soup = removeHeaders(cleanSoup(BeautifulSoup(doc, parseOnlyThese=SoupStrainer('body'))))
        text = ''.join(soup.findAll(text=True)).split('\n')
        # max() yields a (length, line) tuple; [1] selects the line itself
        description = max((len(i.strip()), i) for i in text)[1].strip()[0:255]
    return (title, description)
First I parse out all the text from <body> and remove the h1, h2, etc. headers, because they’re likely to contain information like the title, author, and date that is not part of the body of the article. Thankfully, BeautifulSoup does most of the heavy lifting here. I then join all the text nodes into one long string and split it on newlines, which in effect merges adjacent inline fragments into paragraph-sized runs of text. You have to be careful when joining bodies of text together, because if you branch too far out, you end up merging in junk text. The first 255 characters of the longest resulting run are returned as the article excerpt. In most cases I found that this does a pretty good job of finding the first couple of sentences from the article, or at least a reasonable excerpt.
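The longest-run heuristic can be sketched without BeautifulSoup as well. The version below uses the Python 3 standard library's html.parser, so it approximates the approach rather than copying the original code; `BodyText`, `SKIPPED`, and `longest_line_excerpt` are hypothetical names chosen for the example.

```python
from html.parser import HTMLParser

# Elements whose text should never become part of the excerpt.
SKIPPED = {"script", "noscript", "style", "h1", "h2", "h3", "h4", "h5", "h6"}


class BodyText(HTMLParser):
    """Collect text nodes, ignoring anything inside a SKIPPED element."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Comments and the doctype never reach handle_data, so they drop out for free.
        if not self._skip_depth:
            self.chunks.append(data)


def longest_line_excerpt(html, limit=255):
    """Return the first `limit` characters of the longest newline-delimited text run."""
    parser = BodyText()
    parser.feed(html)
    parser.close()
    # Join every text node, then treat each newline-delimited run as a candidate.
    lines = "".join(parser.chunks).split("\n")
    best = max((line.strip() for line in lines), key=len, default="")
    return best[:limit]
```

Using `key=len` avoids building (length, line) tuples, which sidesteps having to index the result of `max()`.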
Some examples are shown below. Facebook has a widget that does this too, so I included its output as a comparison: in each pair, the first excerpt is from my script and the second is from Facebook’s widget. If you have any ideas on how to improve this, I’d love to hear them.
The blood-and-thunder tale of two idealistic but naive artists caught up in the repressive political machinery of Rome during the Napoleonic wars, “Tosca” comes around like clockwork in San Francisco (as it does to most opera companies). It always thrills
Puccini’s “Tosca” includes a number of surefire arias, and you can hear them sung vividly and well in the San Francisco Opera’s revival, which opened the company’s summer season at the War Memorial Opera House on Tuesday night.
The diversity of the Bay Area can be witnessed in many different ways, from the variety of the cuisine offered in its restaurants to the multitudinous kinds of topography. One less obvious way to explore the radical differences that coexist in this part o
Facebook actually doesn’t return a description for this page.
How do you spell L-O-C-A-L P-R-I-D-E? Ramya Auroprem, of 2009 Scripps Spelling Bee fame, will get another shot in the spotlight on June 6. The San Jose schoolgirl will join the cast of San Jose Repertory Theatre’s “The 25th Annual Spelling Bee”
How do you spell L-O-C-A-L P-R-I-D-E? Ramya Auroprem, of 2009 Scripps Spelling Bee fame, will get another shot in the spotlight on June 6. The San Jose schoolgirl will join the cast of San Jose Repertory Theatre’s “The 25th Annual Spelling Bee” for the 3 p.m. performance. You go girl!
President Obama and his critics have a major disagreement. He says his accelerated economic stimulus efforts will create 600,000 jobs by the end of the summer. Senate Republican Leader Mitch McConnell of Kentucky, however, doubts “the spending binge
President Obama and his critics have a major disagreement. He says his accelerated economic stimulus efforts will create 600,000 jobs by the end of the summer. Senate Republican Leader Mitch McConnell
Update: Now on GitHub: http://github.com/dziegler/excerpt_extractor/tree/master
Thanks for the DRY suggestions to make the code prettier.