The Complete Python for Text Analysis

The following set of commands assumes that you begin with a Mac running OS X that does not have any of the necessities already installed. You can, thus, skip anything you have already done, e.g., if you have already installed Xcode, skip to Step 2.

Step 1: Install the Xcode development and command line tool environment. You’ll have to get Xcode from the Mac App Store. Supposedly, you can avoid this by simply installing the command line tools (see command below), but I have come across at least one instance where it seemed like I needed to go inside Xcode itself and download and install things from within preferences. (This was the old way of doing it.) Here’s the terminal command to install the Command Line Tools (a bit redundant, isn’t it?):

xcode-select --install

Nota bene: I continue to see warnings when installing Python and its modules when I have not installed the complete Xcode from the App Store. They look like this:

Warning: xcodebuild exists but failed to execute
Warning: Xcode does not appear to be installed; most ports will likely fail to build.

I am installing the complete setup now on another machine; I will update this post if anything is borked.

Step 2: Install MacPorts.

If, like me, you have recently upgraded your operating system and things are borked, then you need to clean out the old installation(s). This means downloading the installer and running it like you did when you were young. It’s still fast and easy. The uninstallation is also fast and easy. Cleaning, however, takes some time. The steps below first document what you have installed before you clean everything out:

port -qv installed > myports.txt  
sudo port -f uninstall installed  
sudo port clean all  

You can use the myports document as your list. (The migration page at MacPorts does have a way to automate the re-installation process using this document. Try it, if you like.)
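
If you would rather see what the re-installation would involve before handing things over to the MacPorts migration script, here is a minimal Python sketch that reads the saved list and prints one install command per active port. It assumes that each line written by port -qv installed looks roughly like "name @version+variant (active)", which is what I see on my machine. The sample text and the function names (parse_installed, reinstall_commands) are mine, for illustration only, and are not part of MacPorts.

```python
import re

# Assumed line shape from `port -qv installed`, for example:
#   py27-numpy @1.8.0_0+gcc48 (active) platform='darwin 13' archs='x86_64'
PORT_LINE = re.compile(
    r"^\s*(?P<name>\S+)\s+@(?P<version>[^\s+]+)"
    r"(?P<variants>(?:\+[^\s+]+)*)(?P<active>\s+\(active\))?"
)


def parse_installed(text):
    """Turn the saved port list into a list of dicts, skipping odd lines."""
    entries = []
    for line in text.splitlines():
        match = PORT_LINE.match(line)
        if match is None:
            continue  # header or blank line
        variants = [v for v in match.group("variants").split("+") if v]
        entries.append({
            "name": match.group("name"),
            "version": match.group("version"),
            "variants": variants,
            "active": match.group("active") is not None,
        })
    return entries


def reinstall_commands(entries):
    """Build one `sudo port install` command per active port."""
    commands = []
    for entry in entries:
        if not entry["active"]:
            continue  # inactive (older) versions are left out
        parts = ["sudo", "port", "install", entry["name"]]
        parts.extend("+" + variant for variant in entry["variants"])
        commands.append(" ".join(parts))
    return commands


# Made-up sample standing in for the contents of myports.txt.
SAMPLE = (
    "  python27 @2.7.6_0 (active)\n"
    "  py27-numpy @1.8.0_0+gcc48 (active)\n"
    "  py27-numpy @1.7.1_0+gcc47\n"
)

for command in reinstall_commands(parse_installed(SAMPLE)):
    print(command)
```

To run it against the real list, swap the sample for open("myports.txt").read(). It only prints the commands; nothing gets installed until you paste them into a terminal yourself.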

At any rate, once you have MacPorts installed, pretty much everything else you need is going to be found and then installed via port search and then port install.

Step 3: Now you can start installing the stuff you want to install, like [Python 2.7][python]:

sudo port selfupdate  
sudo port install python27  
sudo port install python_select  
sudo port select --set python python27  
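
To confirm that the select step took, you can ask Python itself where it lives. The sketch below is only a quick check, and it assumes MacPorts is using its default /opt/local prefix; the helper name is my own.

```python
import sys

# Assumption: MacPorts was installed to its default prefix.
MACPORTS_PREFIX = "/opt/local/"


def looks_like_macports(executable):
    """Return True if the interpreter path sits under the MacPorts prefix."""
    return executable.startswith(MACPORTS_PREFIX)


executable = sys.executable or ""
print("Python %d.%d at %s" % (sys.version_info[0], sys.version_info[1], executable))
print("MacPorts build: %s" % looks_like_macports(executable))
```

Save it to a file and run it with python from a fresh terminal window. If the second line says False, the shell is still picking up some other Python.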

Step 4: Install the NLTK along with numpy, scipy, and matplotlib:

sudo port install py27-numpy  
sudo port install py27-scipy  
sudo port install py27-matplotlib  
sudo port install py27-nltk  

At this point, if you are only interested in NLP (natural language processing), you are done.
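
Before moving on, it is worth checking that the modules actually import. Here is a small sketch that tries each one and reports its version; the function name check_modules is mine, and not every module is guaranteed to expose a __version__ attribute, so the code falls back to "unknown" when it is missing.

```python
import importlib


def check_modules(names):
    """Map each module name to its version string, or None if it will not import."""
    report = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            report[name] = None
            continue
        report[name] = getattr(module, "__version__", "unknown")
    return report


report = check_modules(["numpy", "scipy", "matplotlib", "nltk"])
for name in sorted(report):
    status = report[name] if report[name] is not None else "NOT INSTALLED"
    print("%-12s %s" % (name, status))
```

Anything reported as NOT INSTALLED points back to the matching port install line above.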

Optional: If you are going to pull anything from websites, then you can make your life easier by getting Beautiful Soup, which parses HTML for you:

sudo port install py27-beautifulsoup4

(Check the version number, as it may have been incremented since I wrote this.)

Step 5: If, however, you are also interested in network analysis, topic modeling, and other forms of “big” data analysis, you can install three Python modules built for that kind of work (NetworkX, Gensim, and pandas):

sudo port install py27-networkx
sudo port install py27-gensim
sudo port install py27-pandas

Step 6: You have a pretty powerful analytical toolkit now at your disposal. If you would like to make the user interface a bit more “friendly,” let me suggest that you also install iPython, an interactive Python interpreter, and, the best thing since someone sliced something in order to serve it, the iPython notebook.

First, iPython:

sudo port install py27-ipython  
sudo port select --set ipython ipython27  

Then, the iPython notebook components:

sudo port install py27-jinja2  
sudo port install py27-sphinx  
sudo port install py27-zmq  
sudo port install py27-pygments  
sudo port install py27-tornado  
sudo port install py27-nose  
sudo port install py27-readline  

I can’t tell you what a joy iPython notebooks are to use: you can copy complete scripts into a code cell and get results by simply hitting SHIFT + ENTER. And everything is captured for you in a space where you can also make notes on what you are doing, or, in my case, trying to do, in Markdown. Everything is saved to a modified JSON file with the extension ipynb. Even better, you can transform the file, using the nbconvert utility, into HTML or LaTeX or PDF. It is very, very nice.
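
Because the notebook file is JSON underneath, you can poke at it with nothing more than the standard library. The sketch below counts the cells of each type. It assumes the two layouts I know of: older notebooks keep their cells inside a "worksheets" list, while newer ones keep a top-level "cells" list. The sample notebook is made up, and the function names are mine.

```python
import json


def iter_cells(notebook):
    """Yield every cell from a notebook dict, whichever layout it uses."""
    for cell in notebook.get("cells", []):
        yield cell
    for worksheet in notebook.get("worksheets", []):
        for cell in worksheet.get("cells", []):
            yield cell


def count_cell_types(notebook):
    """Return a dict such as {'code': 2, 'markdown': 1}."""
    counts = {}
    for cell in iter_cells(notebook):
        kind = cell.get("cell_type", "unknown")
        counts[kind] = counts.get(kind, 0) + 1
    return counts


# A made-up notebook, serialized and parsed the way a real file would be.
SAMPLE = json.dumps({
    "nbformat": 4,
    "cells": [
        {"cell_type": "markdown", "source": ["Notes on what I am trying to do."]},
        {"cell_type": "code", "source": ["import nltk"]},
        {"cell_type": "code", "source": ["print('hello')"]},
    ],
})

notebook = json.loads(SAMPLE)
print(count_cell_types(notebook))
```

For a real notebook, replace the sample with json.load(open("analysis.ipynb")), where analysis.ipynb is a stand-in for whatever your file is called.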

Optional: If you want that LaTeX option for nbconvert to work, you are going to need a functional TeX installation:

sudo port install texlive-latex

Nota bene: In my experience, any TeX installation is big, so if you are in a hurry, either open up another terminal window (or tab), do something in the GUI, or go fix yourself a cup of coffee. It’s going to take a while, and unless staring at the installation log as it scrolls by is your thing, and, hey, it could be, I suggest you let the code take its course and get some other things done.

And, if you need to convert scanned documents into text, the open source OCR application Tesseract is available:

sudo port install tesseract

You’ll need to install your preferred languages, in my case:

sudo port install tesseract-eng

See this search for tesseract for all the languages available (running port search tesseract in the terminal gets you the same list).
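
If you end up with a folder of scans, calling Tesseract from Python saves a lot of typing. What follows is a minimal sketch, assuming the tesseract binary is on your PATH and using the usual command line form: the image, an output base name, then -l and a language code. The function names and file names are hypothetical.

```python
import subprocess


def tesseract_command(image_path, output_base, language="eng"):
    """Build the argument list for one OCR run."""
    return ["tesseract", image_path, output_base, "-l", language]


def run_ocr(image_path, output_base, language="eng"):
    """Run Tesseract and return the name of the text file it writes."""
    subprocess.check_call(tesseract_command(image_path, output_base, language))
    # Tesseract adds the .txt extension to the output base itself.
    return output_base + ".txt"


# Show the command without running it.
print(" ".join(tesseract_command("scan.png", "scan")))
```

Calling run_ocr("scan.png", "scan") would then leave the recognized text in scan.txt, provided the matching language port is installed.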

Afterword: There is also, sigh!, a machine learning module for Python called SciKit that does all kinds of things that, at this moment in time, both excite me and make my head hurt.