Nice to see that the text analytics stuff is picking up traction. This is from my WordPress stats for yesterday:
Not huge numbers, but steady over the past few days.
Marketing ethnography (not selling it, but doing it in order to sell things) is profiled in _The Atlantic_’s [“Anthropology, Inc.”] The adaptation of ethnographic methods, “a movement to deploy social scientists on field research for corporate clients,” to market research seems pretty straightforward:
> The vodka giant Absolut had contracted with ReD to infiltrate American drinking cultures and report back on the elusive phenomenon known as the “home party.” This [party] was the latest in a series of home parties that Lieskovsky and her colleagues had joined in order to write an extended ethnographic survey of drinking practices, attempting to figure out the rules and rituals—spoken and unspoken—that govern Americans’ drinking lives, and by extension their vodka-buying habits.
The journalist somewhat tips his hand in his use of *infiltrate* — and later passages describe his “horror” at the intimacy that ethnographic methods can inculcate — but it’s otherwise an interesting read. For a time during my management development days, and occasionally since then, I toyed with the idea of trying to build a business around doing ethnographic research within organizations. For a time, I was going to do some of that kind of research on an open source project, and I may yet, but the problem with any kind of contracted research, different, I think, from sponsored research, is that ultimately you confront two possibilities:
1. You may not be able to do much with the fruits of your labor apart from handing them over to your employer, which reveals, I guess, how tied I am to the idea that my production should always remain within my control — as well as my good fortune that I remain in a position to do so (though the University of Louisiana System seems hellbent on reversing that fortune).
2. You will, in all probability, not have much control over the outcomes of your research. For me, ethnography is a kind of Socratic literature: good people trying to do their best given their circumstances. If there is a failing, then it is a larger, systemic failing, and few organizations or managers ever really want to take responsibility for that.
Those two things have always kept me from pursuing any kind of applied ethnography in the private sector, but, I confess, as the state and my university continue to try to squeeze faculty at every turn — through salaries, through removal of resources (travel budgets, library budgets, increasingly inane travel restrictions) — I find myself both revisiting the basic idea and reconsidering the concerns above.
Here is the fact of the matter: if we stay here, we will shortly not be able to afford to send our daughter to the private school she loves so much. We are not, in sending her to a private school, able to afford to save for retirement, or for her college fund, the way we should. (Which is to admit: not much at all.) This means one of two things:
1. find another job where the pay and benefits make it possible
to do those things, or
2. take on additional jobs.
Applied anthropology looks better and better in this light, and so like the quants who took over the stock market starting in the seventies when there were more physicists with advanced degrees than there were jobs, I wonder if we won’t see some similar transformations as we see more humanists and human scientists with advanced degrees than there are jobs.
I occasionally read sociobiological or evolutionary psychological scholarship. I find it regularly thought-provoking, and also regularly a little too quick to draw grand conclusions from samples, but I had no idea that there was [so much contention around the fields]. Horgan’s recent piece for _Scientific American_ captures some of the contention, but the contention actually plays out in the comments. Fascinating! I read Chagnon long ago, while in graduate school. Reading that Pinker’s work, and others’, owes a debt to Chagnon put a couple of pieces of my own intellectual history together for me.
I woke from a dream this morning in which I had gotten a nice note from my editor, Craig Gill, outlining ways to be productive — he wanted to make sure I delivered the book on time. One nice bit of advice oneiric Gill gave me was: “don’t think of writing as ‘due by’ but rather week by week.” It was a longer sentence, and better put, but I remember the play on the word *by*. Given that the email had something like five or more bullet points in it, it’s interesting that the one that stayed with me is the one with the more poetic dimension to it. We are discussing the second chapter of David Rubin’s *Memory in Oral Traditions* today in class: this would seem to support the assertion that formulas create stability in memory.
Tor.com’s discussion of the first Star Wars sequel, [_Splinter of the Mind’s Eye_], is a case study in how story worlds work in genre fiction. Issues of canonicity abound. Does a story fit within the “known universe,” or is it consigned to an “expanded universe”? If it fits, is it made to fit by being *retconned*, that is, retroactively fitted into the continuity of a universe? Retconning has a slight negative connotation, somewhat akin to “explaining away” something. It’s the *away* that marks the difference.
[_Splinter of the Mind’s Eye_]: http://www.tor.com/blogs/2013/02/the-star-wars-sequel-that-never-quite-was-splinter-of-the-minds-eye
I have, for the past several years now, introduced my undergraduate students to some elements of textual analysis using computational methods. I use *text analytics* here only tentatively: many readers will perhaps be more familiar with, and indeed prefer, the older term *text mining*, but for me that term sits too close to *data mining*, which usually refers to working with a great many more texts than I am going to discuss here — and that, I suppose, is why *data mining* has morphed into *big data*. (More on this anon.)
The number of texts here is one. That’s right, one text, and, in fact, the one text is a short story. Keeping the text small is one way I have found to keep the psychological barriers to entry low when I introduce students to text analytics. The particular short story I have chosen in the past, and which I use here, is Richard Connell’s “The Most Dangerous Game.” The 1924 story has two advantages when working with students: first, it is in the public domain, and, second, its story has been so widely adapted that students are already familiar with it and have probably seen an adaptation in some fashion within their own recent memory. For example, only a year or so ago, the FX Network’s adult cartoon series, Archer, featured a version of the story entitled “El Contador” (The Accountant).
If we are to use a computer to make possible certain kinds of analysis of texts, what are the kinds of things we might like to know?
With that list in mind, I would like to introduce Python for Text Analytics, or PyTA — pronounced more like the genre of painting than the flatbread. The repository contains the text of the story as well as the scripts that will produce the results outlined above. Please note that, at this stage, the scripts are designed to be run with the target text inside the same folder (directory) as they are. If you want to use a different text, simply copy and paste it into the folder and change the filename in the script. The ReadMe file explains how to save the output of the scripts to a file, which will come in handy for Step 3 above.
For those readers already familiar with Python, and by familiar I mean you already have it installed and know how to access it, you can skip the next bit. For everyone else, a bit of review won’t hurt. Some people are going to want to know why I am doing all this in Python and not using off-the-shelf solutions. This is not the moment to engage in a recapitulation of all the usual arguments in favor of open source software, how it not only parallels the academy’s ideals but how it practically makes possible the spread of ideas in a world sometimes hostile to such spread. What matters here is:
The scripts are, in fact, released into the public domain, using the Creative Commons dedication for doing so. The links for the scripts take users to a GitHub page from which they can be downloaded, or, if you have a GitHub account yourself, you can fork the repo. Please feel free to do either.
Finally, please note that a basic working installation of Python will let you perform Steps 1-3 above. If you are interested in looking at words in context and in examining other kinds of relationships between words, Step 4, then you are going to need to have the Natural Language Toolkit installed. It’s not difficult to do so. If you are using a Mac, I posted instructions last year. Instructions for getting Python installed on a Windows PC are available, with further instructions for the NLTK also available. I assume everyone else is running Linux or BSD and, really, you don’t need my help. (Please note that there are a variety of suggested ways of getting the NLTK installed on a Mac, but the MacPorts route is really the way to go. Trust me: I’ve gone some of the other ways.)
Now we can start working with an actual text and looking at some actual numbers. Every time I do this with students I find it helpful to have a conversation about how these numbers won’t in and of themselves tell us much about the text, but the various features they reveal or the questions they lead us to ask are useful. These numbers can’t draw conclusions, that’s the job of the human analyst, but they can provoke inquiry. And, sometimes, they reveal dimensions of a text that maybe we would not have thought about without seeing it quantified.
With that noted, let’s plunge into some rough stats for “The Most Dangerous Game,” which we get, simply enough, by running the first of the scripts.
And it prints out the following, which we can copy and paste anywhere:
```
COUNTS
Paragraphs     : 205
Section Breaks : 0
Sentences      : 717
Words          : 7959

AVERAGES
Sentences per paragraph : 3
Words per paragraph     : 38
```
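For readers curious how numbers like these might be produced, here is a minimal sketch of the counting logic. To be clear, this is not the repository’s script: the function name and the paragraph- and sentence-splitting heuristics are my assumptions, and real texts will want more careful sentence detection.

```python
import re

def text_stats(text):
    """Rough counts for a plain-text story: paragraphs, sentences, words."""
    # Paragraphs: runs of text separated by blank lines.
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    # Sentences: a crude split on terminal punctuation.
    sentences = [s for s in re.split(r"[.!?]+(?:\s+|$)", text) if s.strip()]
    # Words: whitespace-delimited tokens.
    words = text.split()
    return {
        "paragraphs": len(paragraphs),
        "sentences": len(sentences),
        "words": len(words),
        "sentences_per_paragraph": round(len(sentences) / len(paragraphs)),
        "words_per_paragraph": round(len(words) / len(paragraphs)),
    }

sample = "It was a dark night. Rainsford slept.\n\nThen he woke. The sea was calm."
for name, value in text_stats(sample).items():
    print(f"{name}: {value}")
```

The only real decisions here are what counts as a paragraph (a blank line between blocks of text) and what counts as a sentence (a run of terminal punctuation followed by whitespace), and both are worth discussing with students.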
So: an 8,000-word story told in two hundred paragraphs and seven hundred sentences. (According to Lee Masterson, this puts MDG in the territory of the novelette, which seems odd to me, or an index of how things have changed. If you want to know the other counts, see this Askville Answer.)
The other counts, as I called them in this script — feel free to change that — are for paragraphs and sentences. These numbers in and of themselves aren’t terribly interesting until you play with them a bit, as I did to get the two averages, neither of which is something we typically discuss when examining texts. Probably the only reason most of us, and especially our students, are familiar with word counts at all is that we have had to deal with either minimums or maximums; we rarely think of them as having any kind of significant descriptive power. And yet when we combine some of these counts, we end up with some interesting averages.
The first one, sentences per paragraph, is striking. Three? That seems like a terribly small number, which drives most readers to look at the story more closely. What they discover, as they skim through the pages of the PDF version of the story is that a great deal of the story is told in dialogue. There is so much dialogue that you have to scour the story for the moments of non-talking action. There are, in fact, two passages of extended narration of action: the first occurs when the ship on which we first meet Rainsford sinks and he has to make his way to the island and the second is the famous game itself.
These moments of action help to delimit the principal sections of the text:
A fun thing to do is to copy and paste the text of these three sections into an image so students can “see” the story in its entirety:
For those familiar with the Hollywood Formula, the story meets the idealized ratio of 1-2-1 of content pretty closely. Closer to our topic at hand, the quick visualization also lets us see that the first and second sections have a lot of thin lines, representing paragraphs that are made up mostly of dialogue, and that the third section has some fatter lines. If we take our new insight into the text and do a little counting of paragraphs and words in these sections, we get the following results:
Words per paragraph is an odd measure, but one that reveals that the “action” parts of the story actually take place in longer paragraphs.
Let’s find out a bit more about the words themselves by running the next script, words.py. This script produces quite a bit of output, and so my best advice is to capture it by entering the following at the command line:
```
python words.py > mdgwords.txt
```
This tells the command shell to send the output of the script to the file mdgwords.txt. You can name the file anything you want. Or you can skip the file altogether: just watch the output fly by and then copy and paste it into a file. I am asking you to handle output like this, instead of having the script write it to a file for you, because I am working on making it possible for you to work with texts of your own choosing without editing the script. (I’m getting there, I’m getting there.)
At the top of the file you’ll see a line of redundant information and a line of new information:
```
Words in text: 7959
Unique words: 1987
```
This is some new data. For fun, I like to give students the task of figuring out the mean frequency of the words in the story: here it’s something like four occurrences per word (4.0055, to be exact). But that’s obviously not what really happens. Below a dashed line in the file, students will see a list of words that begins like this:
```
Sorted by highest frequency first:
the,505
he,248
a,248
of,172
and,162
i,154
to,148
was,140
his,137
rainsford,117
```
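A frequency list like this one is straightforward to produce with Python’s standard library. Here is a minimal sketch — not the repository’s words.py; the function name and the tokenizing regular expression are my assumptions:

```python
import re
from collections import Counter

def word_frequencies(text):
    """Lowercase the text and count alphabetic tokens
    (keeping internal apostrophes, so "don't" stays one word)."""
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return Counter(tokens)

freqs = word_frequencies("The hunter and the hunted: the game begins.")
print(f"Words in text: {sum(freqs.values())}")
print(f"Unique words: {len(freqs)}")
print("Sorted by highest frequency first:")
for word, count in freqs.most_common():
    print(f"{word},{count}")
```

Note that the choice of tokenizer matters: deciding whether hyphenated compounds or contractions count as one word or two will change both counts.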
*The* occurs five hundred times in this short story? (It’s every sixteenth word, by my count.) At this point, I like to talk about the list in its raw form above, or I will ask students to trim off the first few lines of the file and save the document as a comma-separated value (.csv) file. Once the file consists of only the word,count pairings seen above, it can be imported into Excel, where it can easily be turned into a bar chart. The first eight words in the list dominate any visualization — the same thing can be seen in a word cloud when no common words are dropped (or stopped). (A built-in word cloud script is in the works, but is not currently available.)
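For those who would rather skip the manual trimming, the word,count pairs can also be written straight to a .csv file from Python. A sketch, using a toy frequency table and a hypothetical output filename of my own choosing:

```python
import csv
from collections import Counter

# A toy frequency table standing in for the story's word counts.
freqs = Counter("the quick brown fox jumps over the lazy dog the end".split())

# Write word,count pairs with a header row so the file opens cleanly in Excel.
with open("mdgwords.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["word", "count"])
    for word, count in freqs.most_common():
        writer.writerow([word, count])
```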
It may sound strange, but I have found it very effective to work my way through the list of words with a class, with an Excel spreadsheet projected at the front of the room: we highlight the words we think are interesting. Regularly what we find is that we have to scroll past the first screen of words before the first words that seem interesting to us, based on our reading of the story, begin to turn up:
We have to go even further down before we begin to see a large percentage of words that seem significant. (Please note that the current version of the script sorts words first by their frequency and then alphabetically.)
This interesting middle range of words continues for a while until words begin to drop in usage: I regularly find that three occurrences is about the threshold for short stories for most readers.
Again, all of this simply prompts readers to ask more and better questions. As someone for whom the language of a text is terribly important, I find it terribly ironic that it’s numbers that make that point best with some readers.
With this list of words in hand, it’s time to re-engage the text. I find it useful to assign each student a high-frequency word, a middle-value word, and a low-value word. In the past, I have asked them simply to use the Find feature in their PDF viewer, but, if you have installed the NLTK, you can try out the next script, concordance.py. The script relies on several functions available through the NLTK that make it possible to work with texts. If you open the script, you can see for yourself, but if you simply run it, you will see:
```
% python concordance.py
Enter the word you would like to see in context:
```
If you enter the word hunter, the program will print out:
```
Building index...
Displaying 10 of 10 matches:
ld , " agreed Rainsford . " For the hunter , " amended Whitney . " Not for
ainsford . " You're a big-game hunter , not a philosopher . Who cares
been a fairly large animal too . The hunter had his nerve with him to tac
st three shots I heard was when the hunter flushed his quarry and wounded
. Sanger Rainsford , the celebrated hunter , to my home . " Automatically
. " They were no match at all for a hunter with his wits about him , and
nsford . " " Thank you , I'm a hunter , not a murderer . " " Dear me
ling of security . Even so zealous a hunter as General Zaroff could not
a small automatic pistol . The hunter shook his head several times ,
a spring . But the sharp eyes of the hunter stopped before they reached the
```
Such an output offers a very quick and easy way to see all the uses of hunter piled on top of each other. The first two lines in our results raise another question, though, since they turn up the use of the word hunter in the dialogue between Rainsford and his first interlocutor, Whitney, which contains the line that foreshadows much of the rest of the story: The world is made up of two classes—the hunters and the huntees. That’s an important line, but it doesn’t show up in our search for hunter, because from a computer’s point of view hunter and hunters are not the same word, or *token*, to use the more precise term from linguistics. (Is there a way to see hunter, hunters, and huntee all in the same list? There is, but it involves a bit more scripting than will reward us at this moment. You can see the draft script in the UPST directory:
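In the meantime, a rough keyword-in-context display can be had in plain Python, without the NLTK. This sketch is my own, not the repository’s concordance.py, and the match-by-prefix behavior is an assumption of mine; it shows how matching on the prefix hunt would pull hunter, hunters, and huntees into one list:

```python
def kwic(text, keyword, width=30):
    """A rough keyword-in-context display: match `keyword` as a word
    prefix (so 'hunt' also finds 'hunters' and 'huntees') and show
    `width` characters of context on either side."""
    lines = []
    pos = 0
    for tok in text.split():
        start = text.index(tok, pos)   # locate this token in the raw text
        pos = start + len(tok)
        if tok.lower().strip('.,;:!?"\'').startswith(keyword.lower()):
            left = text[max(0, start - width):start].rjust(width)
            right = text[pos:pos + width]
            lines.append(f"{left} {tok} {right}")
    return lines

sample = ("The world is made up of two classes, the hunters and the "
          "huntees. Luckily you and I are hunters.")
for line in kwic(sample, "hunt"):
    print(line)
```

Prefix matching is the crudest possible stand-in for stemming, but it is often enough to make the hunter/hunters point vivid in class.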
In the next iteration of Text Analytics, and the Useful Python Scripts for Texts, we will take a look at things like collocations, bigrams, synonyms within a text, and other relations.
Thanks for reading. I hope this helps convince you of how easy this kind of analysis is and, at the same time, how rewarding it can be. I have found all of these techniques especially powerful when working with undergraduates. Word clouds have become extremely popular, and they are a powerful visualization tool, but they really should represent the middle or end of an analytical process, after analysts have given some thought to the particular words involved.
I was working on a post that outlines my own version of “Text Analytics 101,” which I have been using in freshman writing classes for the past three years, and I found myself considering, momentarily, the uses of “text mining” versus “text analytics” and “data mining” versus “big data.” I’m sure there are distinctions to be made between the terms in each pair, but it’s also the case that terms map onto various disciplines/domains and/or historical moments. A quick ngram search in Google, which is based on Google Books, produced the following graph:
A similar search for the first pair produced the following:
The only thing the two graphs suggest to me is that, possibly, the latter terms appear later and thus haven’t made it into print. I would like to do a similar search of ngrams on the web, but I haven’t found the same simple interface for doing this kind of quick survey.
Computer scientists have taken data and methods from historical linguists and, using a program based on the Markov chain Monte Carlo sampler, have calculated what the origins of the Austronesian language family looked like. [Leslie Katz reports for CNet](http://news.cnet.com/8301-17938_105-57569039-1/language-time-machine-a-rosetta-stone-for-lost-tongues/).