Getting up and running with NLTK

A while ago I tweeted a note to Digital Humanities Questions and Answers about putting together a Python or R script for getting a word frequency distribution for a text. The short explanation for why I want to do this is that it is one way to develop drop lists, also known as stop lists, in order to tweak network analysis of texts or visualization of those texts using techniques like a word cloud. I am interested in a Python or R script in particular because I want my solution to be platform independent, so that students in my digital humanities seminar can use the scripts no matter what platform they use. (I had come across some useful `bash` scripts, but that limits their use to *nix platforms like Mac OS X or Linux.)

Handily enough, a word frequency distribution function is available as part of the Python Natural Language Toolkit (NLTK). The same functionality is also baked into R, as John Anderson demonstrated, but for now I am focusing on developing my scripting skills in Python.
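To give a sense of where this is headed, here is a minimal sketch of a word frequency distribution. It assumes a current Python 3 and NLTK 3, where `FreqDist` offers a `most_common` method, and it falls back to the standard library's `Counter` if NLTK is not installed yet. The function name `word_frequencies`, the sample sentence, and the simple regular-expression tokenizer are my own placeholders, not part of NLTK.

```python
import re
from collections import Counter

try:
    # NLTK's frequency distribution class (assumes NLTK 3 is installed)
    from nltk import FreqDist
except ImportError:
    # Stand-in with the same counting behavior if NLTK is missing
    FreqDist = Counter


def word_frequencies(text):
    """Count how often each lowercased word occurs in text."""
    # Crude tokenizer for illustration: runs of letters and apostrophes
    tokens = re.findall(r"[a-z']+", text.lower())
    return FreqDist(tokens)


sample = "The cat sat on the mat. The dog sat by the door."
freqs = word_frequencies(sample)

# The most frequent words are the natural candidates for a stop list
top_words = [word for word, count in freqs.most_common(2)]
print(top_words)
```

The words at the top of the distribution are the ones you would consider adding to a stop list before building a word cloud or network.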

### Getting up and running with NLTK

To get up and running with NLTK in Python, you first need a fairly recent version of Python: 2.4 or better. (My MacBook is running 2.6.1, which is acceptable, and I’m not good enough, yet, to update.)
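If you are not sure which version you have, a short script can tell you. This is only a sketch: the helper name `meets_minimum` is made up for illustration, and the `(2, 4)` threshold is simply the minimum named above.

```python
import sys

# Minimum version named above; adjust if your NLTK release asks for more
MINIMUM = (2, 4)


def meets_minimum(version_info, minimum=MINIMUM):
    """Return True if the major.minor version is at least the minimum."""
    return tuple(version_info[:2]) >= minimum


print("Python %d.%d.%d" % tuple(sys.version_info[:3]))
print("Recent enough for NLTK: %s" % meets_minimum(sys.version_info))
```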

In addition to a recent version of Python, and in addition to the NLTK (more on that in a moment), you also need PyYAML. All the downloads for PyYAML are available from the PyYAML project's download page. (Please note that from here on out I am describing the installation process for Mac OS X: the Windows routine uses different flavors of these resources. There is a PyYAML executable installer, for example.)

Download the tarballed and gzipped package and unpack it some place convenient. (You are going to delete it when you are done, and so the place doesn’t matter.) I put my copy on the desktop, and so, having unpacked it, I navigated to its location in a terminal window:

`% cd /Users/me/Desktop/PyYAML-3.09`

(Please note that the presence of the `%` sign is simply to indicate that we are using the command line.) Once there, you run the setup module:

`% sudo python setup.py install`

From there, a whole lot of making and installing takes place as your terminal window scrolls quickly. It’s done within seconds. Now you need to download the appropriate NLTK file for your platform from the NLTK site.


This time it’s a GUI-based installer package. Follow the instructions, click on things, and you are done.

To make sure everything installed correctly, return to your terminal window and invoke the Python interpreter:

`% python`

At the Python interpreter prompt (`>>>`), type:

`>>> import nltk`

If everything went well, all you will get is a momentary pause, if any, and another interpreter prompt. Congratulations!
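The same check can be scripted, which is handy if you want students to confirm their own setup. This is a rough sketch: `can_import` is a hypothetical helper name, and `yaml` is the module name that PyYAML installs under.

```python
def can_import(name):
    """Return True if the named module can be imported."""
    try:
        __import__(name)
    except ImportError:
        return False
    return True


# PyYAML is imported as "yaml"; NLTK is imported as "nltk"
for module_name in ("yaml", "nltk"):
    status = "OK" if can_import(module_name) else "missing"
    print("%s: %s" % (module_name, status))
```

A line reading `missing` means that package's installation step needs another look.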
