Python Text Processing for Freshmen

In my freshman introduction to academic writing, we do some reading, because, after all, you need something about which to write. I focus on a small group of texts because students can hold the evidence in their hands and because thinking and working with texts is what I teach when I am not introducing people to academic writing. That is, I assume that a biologist teaching an introduction to academic writing would use biological data as the basis for her course. Whether English professors are uniquely situated to teach academic writing, broadly construed, is a question for another conversation. Or perhaps it is an imperial move on the English department’s part, to claim all of academic writing when what we know how to do, and thus can claim to teach, is writing about texts.

So we read a small number of texts, two of which are short stories and two of which are screenplays. All four texts are available to the students as both plain text and PDFs. I have, in the past, used a collection of Java apps (applets) that allow students to do things like create word frequency lists, create word clouds, or examine word collocations. (For the latter I am entirely indebted to James Danowski for his excellent WORDij.) While running these apps does introduce students to the command line, it does not do much beyond that, and I would like, no matter how silly this seems, at least to introduce them to the idea that they can use a scripting language to do things in various dimensions of their lives. Plus I hope that, like me, they discover that learning to code is also a way to learn another way to think.
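As a taste of what a scripting language can do here, the adjacent-word counting at the heart of collocation analysis fits in a few lines of Python. The sample sentence below is a stand-in; in class we would read one of our four texts from a file instead:

```python
import re
from collections import Counter

def bigrams(text):
    """Lowercase the text, keep only word characters, and count
    adjacent word pairs (the simplest form of collocation)."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(zip(words, words[1:]))

sample = "the most dangerous game is the most dangerous story"
for pair, count in bigrams(sample).most_common(3):
    print(pair, count)
```

A fuller treatment, like WORDij’s, would count co-occurrences within a sliding window rather than only adjacent pairs, but the shape of the script stays the same.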

And so I have begun a hunt for a collection of Python scripts that do some of the things we already do in class and perhaps some scripts that take us new places.

*Please note that as of December 21 — Happy Mayan Apocalypse Day! — this post is still in process and this material is not yet curated. Plus, I’m really looking for feedback from readers on what kinds of text analysis they would want students to do. Keep in mind that this is not “big data” but single texts or a very small collection of texts.*

Okay, the first thing we already do is generate a word frequency list, which we visualize both as bar charts and as word clouds. What good does this do? Well, first, it introduces the idea of *function words*: words which must be present in discourse for it to work but to which we, apparently, attribute very little meaning. Just as important is the idea that, beyond function words, a text contains other words which do not significantly affect its meaning and which can be ignored. Stopword lists are great for this, because students get to make the filtering happen, quite mechanically, and then see the results in their much more focused, and interesting, word clouds.
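A minimal version of that workflow, using only the standard library, might look like the sketch below. The stopword list here is a tiny stand-in; the lists students actually use run to hundreds of words:

```python
import re
from collections import Counter

# A toy stopword list; real lists (NLTK's, for example) are much longer.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it", "all"}

def word_frequencies(text, stopwords=STOPWORDS):
    """Count the words in text, ignoring case, punctuation, and stopwords."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in stopwords)

sample = "The hunter and the hunted: the game is the most dangerous game of all."
print(word_frequencies(sample).most_common(5))
```

Running the same function with an empty stopword set makes the pedagogical point immediately: the top of the list fills up with *the*, *and*, and *of*.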

One thing that might be useful to add here is a script that lemmatizes the words in a text, or its resulting list of words.
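The heavy machinery for lemmatization is something like NLTK’s WordNet lemmatizer, but even a crude, rule-based sketch conveys the idea to students. The suffix rules below are purely illustrative, a toy I am making up for demonstration, and they will mangle irregular forms:

```python
def crude_lemma(word):
    """Strip a few common English suffixes. A toy approximation of
    lemmatization; irregular forms ('ran', 'mice') pass through untouched."""
    for suffix, replacement in (("ies", "y"), ("ing", ""), ("ed", ""), ("s", "")):
        # The length check keeps short words like 'is' and 'red' intact.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)] + replacement
    return word

print([crude_lemma(w) for w in ["stories", "hunting", "hunted", "games", "arena"]])
```

Mapping a real lemmatizer over the word list before counting frequencies would collapse *hunt*, *hunting*, and *hunted* into a single, more honest count.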

Someone asked a question on StackExchange about [how Wordle creates its word clouds][], and answers came in from a number of people, including Wordle’s own Jonathan Feinberg. In particular, Reto Aebersold posted a link to his [PyTagCloud on GitHub][]. There is also a link to someone creating a word cloud with Processing, but that’s for another time. (I am thinking, for a technical writing course, about how we could take some of these outputs, feed them into Processing, and then have some sort of real-world output, using something like an Arduino. Oh yeah, I’m ready to have some fun.)
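One piece of the Wordle recipe that the StackExchange answers discuss is mapping word frequencies to font sizes before any layout happens. A sketch of that scaling step, independent of any particular library, might be:

```python
def font_sizes(counts, min_size=12, max_size=72):
    """Linearly scale word counts into a font-size range for a word cloud."""
    lo, hi = min(counts.values()), max(counts.values())
    span = (hi - lo) or 1  # avoid division by zero when all counts are equal
    return {
        word: round(min_size + (count - lo) * (max_size - min_size) / span)
        for word, count in counts.items()
    }

print(font_sizes({"game": 10, "hunter": 5, "island": 1}))
```

The hard part of Wordle, the spiral placement of words so they nestle without overlapping, is what libraries like PyTagCloud take care of; the sizing above is just the entry ticket.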

And then there’s this interesting bit of code, [Story Statistics on DaniWeb][].

### About the Texts

For those who are curious, the texts are:

* Richard Connell’s “The Most Dangerous Game”
* Fredric Brown’s “Arena”
* Star Trek (The Original Series) “Arena”
* Star Trek: The Next Generation “Darmok”

[how Wordle creates its word clouds]:
[PyTagCloud on GitHub]:
[Story Statistics on DaniWeb]: