Why Count Words?

“Why count words?” It was a simple question[^cf1]. The person asking the question did not ask it in an overly skeptical, or hostile, fashion. He was honestly taken aback by a series of numbers I had rattled off that corresponded to a collection of texts, of legends, that I had assembled as my first step in my exploration of computational approaches to narrative. The illustration in front of the room had been a bar chart of sixteen legend texts, each collected by an established folklorist (and so the original oral texts were, I felt, reliably represented). The longest text in the collection was a little over one thousand words (1025); the shortest, only 150.

A multiplier of seven is not an order of magnitude in difference, but it is still enough of a spread that it bears further investigation. Mount Everest is, for example, seven times taller than Ben Nevis, the highest mountain in the British Isles. Climbing the former is considerably more prestigious than climbing the latter. The Gross Domestic Product of the U.S. is seven times greater than that of Brazil. The distance from New York to London is seven times greater than the distance from New York to Washington, D.C. The difference in the latter amounts to a change in continent and a trans-oceanic passage.

My initial answer to the question was simple: I counted words because I wanted to know if it is possible to create a story world using 150 words, and, if so, I wanted to understand how that can happen. Given the size of a great number of literary forms, one thousand words is already amazingly concise, but 150 words? Each word must pack an incredible amount of power: something made even more amazing when one realizes that only half that number of words are unique in their usage in this little text. That is, one word alone, he, gets used twelve times. The next nine words that get used most often in this little legend are also fairly uninteresting: and, a, was, the, it, his, said, to, they. So a list of the text’s top ten words doesn’t reveal anything about the story itself, except that, perhaps, there is a singular figure, he, who is counterposed against a group of some kind, they. (It is only when we get to the next ten most often used words, all of which appear only two or three times in the text, that we begin to get a sense of what the story might be about: man, dog, with, when, went, there, saw, off, horse, controller.)
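The counting itself is easy to reproduce. What follows is only a minimal sketch in Python, not the procedure I actually used: the sample sentence is a made-up stand-in for the legend, and the tokenizer (lowercase everything, keep runs of letters and straight apostrophes) is an assumption that a real project would want to revisit.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and pull out runs of letters (straight apostrophes allowed)."""
    return re.findall(r"[a-z']+", text.lower())


def word_profile(text, top_n=10):
    """Return (total words, unique words, the top_n most frequent words with counts)."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    return len(tokens), len(counts), counts.most_common(top_n)


# Hypothetical stand-in text, NOT the 150-word legend discussed above.
sample = "He saw the dog and he ran. The dog ran and he ran too."

total, unique, top = word_profile(sample, top_n=3)
print(total, unique, top)
```

Run against a real text, the same three numbers (total words, unique words, most frequent words) are the ones reported above for the shortest legend.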

How is this possible? How can such a small subset of words from an already small text make a story go? That is, I think, the real question. Counting words is but one step along the way, but an important one, and one that we, as folklorists, have failed to undertake. Think for a minute of all the texts that are indexed in the great collection projects of the twentieth century. Add to them all the texts we have collected under the auspices of the ethnography of speaking. It’s an impressive amount of work, and while we have made some synthetic gestures, we have, by and large, mostly focused on differences. All of those differences are, of course, quite compelling, but in focusing on differences, we have also missed an opportunity to make attempts at larger kinds of claims about human nature and culture.

The impulse to count words, for me, is but one step towards a larger understanding of how humans think their way through the world through things of their own making. In the case of texts, they quite literally string one word after another, usually within the flow of a larger program of discourse that itself may or may not be conducive to text-making. Despite all the complexities, people in a variety of speech act contexts somehow decide to initiate a text, place one word upon another in a sequence they both anticipate and, at the same time, manipulate, until they are satisfied, in some fashion, with the result and, like a discursive Atropos, end the life of the string.

Counting words, then, is but one step towards a larger understanding of not only how many words, but which words, and in what order. Why these words and not others? And what is the relationship not only of the words used here to the story world they instantiate, but also of the actions within the story world to the human world within which they are embedded? In short, what can 150 words tell us about the relationship between words, ideas, and actions?

The great indices of the previous era of folklore scholarship took one step in this direction by attempting to map, mostly in bibliographic terms but indirectly in cartographic, the various texts that had been collected in the initial wave of the philological project. At the same time as Stith Thompson turned his great carousel to compile the Motif Index three-by-five card by three-by-five card, however, a few scholars and scientists were beginning to play with the idea of using computers, as slow and expensive as they were then, to compile statistics about texts[^cf2].

Statistics remains, for most humanists, either an enigma or an enemy. It represents, for many (and with good reason), a regime of mathematics, itself something of a mystery, which has been used too often to summarize a situation or a group of people when a more subtle form of analysis was needed. I will not, in this essay, defend its use in such contexts. Nor am I interested in defending, or capable of discussing, the larger statistical turn that so many forms of knowledge production have undertaken. I have only this, a reworking of a dite from my own childhood and perhaps yours too: just because others are doing it is not a reason for us to do it, too.

I understand very well the humanistic impulse to draw a line in the discursive sand and to cry out “the crunching of us into numbers ends here.” My suggestion here, at this metaphorical line lying before us, is that the crunching will go on and on, and it can do so either without us or with our efforts not only to humanize the crunching but also to stuff it so full of the human that it might very well turn into a new kind of science, a new kind of scholarship that will be interesting not only to others but also to us.

One of the central requirements of statistics is that you must convert information (perhaps a simple little story about a treasure buried somewhere, perhaps a few dozen such stories, or perhaps several thousand) into data. But such a transformation amounts simply to assigning values, most often numbers though they need not be, to the objects that are central to the problem. The analyst defines the problem, and the analyst assigns the values. Folklore studies has already done this in the form of tale type numbers and motif numbers, and even when we describe the process of contextualization of a particular text.

So why count words? Well, clearly one reason to do so is simply to explore texts and textuality, to satisfy our curiosity about the fundamental dimensions of human expressivity: the number of words in a text, the word clusters (or collocations) that occur within a text as well as the words that always appear in conjunction with others in particular kinds of texts (co-occurrences). A second reason to proceed in this fashion is to make it possible to discover relationships between texts that we have not yet discovered by more traditional means of study. Discovery, indeed the notion of indexing itself, is the chief reason behind so much of the effort in natural language processing, as we will discuss in a moment. The final reason is that seeing folklore texts in a new light, and seeing relationships between texts that we have not gleaned before, leads to new forms of knowledge, forms that need not displace but rather refine and extend current ways of knowing.
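To make the distinction between collocations and co-occurrences a little more concrete, here is a rough sketch in Python. It rests on assumptions of my own: adjacent word pairs stand in for collocations, and "appearing in the same text" stands in for co-occurrence. The example texts are invented, and more careful work would use statistical association measures rather than raw counts.

```python
import re
from collections import Counter
from itertools import combinations


def tokenize(text):
    """Lowercase the text and pull out runs of letters (straight apostrophes allowed)."""
    return re.findall(r"[a-z']+", text.lower())


def bigram_counts(text):
    """Count adjacent word pairs within one text: a crude stand-in for collocations."""
    tokens = tokenize(text)
    return Counter(zip(tokens, tokens[1:]))


def cooccurrence_counts(texts):
    """For each unordered pair of words, count how many texts contain both."""
    pairs = Counter()
    for text in texts:
        vocabulary = sorted(set(tokenize(text)))
        pairs.update(combinations(vocabulary, 2))
    return pairs


# Hypothetical miniature "texts", invented for illustration only.
texts = ["old man treasure", "old dog", "man treasure"]

print(bigram_counts("the old man and the old dog").most_common(1))
print(cooccurrence_counts(texts)[("man", "treasure")])
```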

[^cf1]: The first public presentation of this research project was at the 2013 meeting of the International Society for Contemporary Legend Research. I would like to thank that group for their incredible generosity and hospitality.

[^cf2]: The image of Stith Thompson sitting in a building dedicated to housing a carousel forty feet in diameter is one that I owe entirely to Henry Glassie.

How We Spent

Bloomberg’s new Data View feature is quite compelling. A recent interactive visualization focused on [changes in consumer spending][bb] over the past 30 years. The most eyebrow-raising thing I noticed while looking over the various graphs was that almost all of them reveal a change in consumer spending about three years before the collapse of the mortgage bubble: some time between late 2005 and early 2006 consumer spending dips significantly and continues to dip until it bottoms out in 2008. I am not enough of an economist to know if this dip helped break the bubble or if it reveals that consumers were, in some way, aware of the bubble and already anticipating the trouble to come.

[bb]: http://www.bloomberg.com/dataview/2013-12-20/how-we-spend.html

MLA Trying

Kudos to Rosemary Feal for leading efforts within the MLA to re-imagine what it means for PhDs in the humanities *not* to get jobs in the academy: that is, what does it mean when we don’t *clone* ourselves? Even in my small field of folklore studies, I have been astonished by the assumptions that we make about what is “proper” work for our graduates and what is not. As Feal points out, the problem remains of what counts for graduate programs: when our graduate students land research and teaching jobs, that’s good. When they land teaching jobs, that’s acceptable. When they find ways to use their training in other kinds of jobs in ways that speak to their passions, their experience, or their economic needs … that doesn’t count at all. (To be sure, this is probably something that applies more to R1 programs and departments than to R2s, or whatever it is that my university is.)

Opening Scholarship

I think Caleb McDaniel has it [right][], when he considers what it might mean for scholars to work “in the open”: publishing their notes as they make them. He raises all the right opportunities and the right dangers, and I like the idea of using version control for a backend. I would like to compare his use of GitHub and Gitit with what Graham, Milligan, and Weingart are using for [The Historian’s Macroscope][].

[right]: http://wcm1.web.rice.edu/open-notebook-history.html
[The Historian’s Macroscope]: http://www.themacroscope.org

Buying Liquor

[Find the Best][] has a [comparison][] of various liquors by price and quality of taste. We’ll leave aside whatever the latter measure is: what I like are the graphs, the dot graphs for each of the categories of liquor (gin, vodka, whiskey, rum, etc.). Price increases along the vertical and taste improves along the horizontal. There is a trend line, and, more importantly, if you hover over a dot you are given its name, its price point, and its evaluation.

[Find the Best]: http://findthebest.com/
[comparison]: http://blog.findthebest.com/lifestyle/why-liquor-prices-mean-nothing/

Towards an Expanded Disciplinary History

Jonathan Goodwin and I had the chance to team up again at the [Texas Digital Humanities Conference][txdhc]. While our work began with the chronological topic models built last year, Goodwin has recently been experimenting with [co-citational network graphs][jg] based on data drawn from the Web of Science. (We had to depend upon the Web of Science data because the citational data from JSTOR is currently unavailable.)

While we contemplate how to integrate the co-citational data with the topic models, I found myself recalling that the American Folklore Society also has a collection of abstracts submitted for the annual meetings for at least the last few years. I wondered if that material was available through [Open Folklore][]. It isn’t, but the [program brochures and books][afs-programs] produced for AFS annual meetings from 1949 are.

[txdhc]: http://txdhc.org/
[jg]: http://www.jgoodwin.net/folklore-network/slider/highlight.html
[Open Folklore]: http://openfolklore.org/
[afs-programs]: https://scholarworks.iu.edu/dspace/handle/2022/13071

DHSI

The [list of courses][] for the Digital Humanities Summer Institute is amazing. I would love to be able to go to just one of these courses — and even better to go to more. Unfortunately, with salaries frozen for 8 years and absolutely no sense of professional development for faculty at my university, I will mostly have to make do with MOOCs and catching what training I can. Still, one can dream… and, perhaps more importantly, encourage others to *seek out and seize these opportunities!*

[list of courses]: http://www.dhsi.org/courses.php

TEI for Folklore

As Elisa walked me through her TEI-encoded documents, and showed me the XSLT she uses to transform the TEI encoding into network files, I realized that I needed to start working on my own use of TEI. A quick search of *ye olde web* for “TEI folklore” turned up … not much.

Two things occur to me: First, this represents an opportunity to be involved in getting TEI up and running in folklore studies, and, second, I need to start collecting useful links:

* So far, it looks like [oral history][] is leading the way.
* The [MLA][] recently received a grant from the NEH “to begin development of Humanities Commons Open Repository Exchange, or Humanities CORE. Humanities CORE will connect a library-quality repository for sharing, discovering, retrieving, and archiving digital work with Humanities Commons, a developing platform for collaboration among scholarly societies and other humanities organizations.”
* There are [seminars][] on TEI encoding.

**Please note**: if you know of already extant implementations of TEI in folklore studies, please let me know! I don’t want to re-invent the wheel. Drop me a note, if you can, and I’ll add links here, with credits for contributors. (Or we can do this somewhere else, if you like. G+?)

[oral history]: http://www.cdlib.org/groups/stwg/OH_BPG.html
[MLA]: http://news.commons.mla.org/2014/03/27/grant-awarded-for-the-development-of-humanities-core/
[seminars]: http://www.wwp.brown.edu/outreach/seminars/

Fragmented Memory

Geoffrey Rockwell linked to a [project][] that seeks to take maps of computer memory and turn those “images” into textiles. The idea is, of course, that by doing so one brings the information revolution full circle: if the first punch cards were for Jacquard looms, then computerized looms now re-create images drawn from hardware states. Fun, but what I found more interesting was learning that there are such looms that can, in essence, weave images on demand. Wouldn’t it be more fun to encode a message, a la a barcode, and have that woven in a tapestry or blanket, perhaps even subtly modifying it to make it more aesthetic?

[project]: http://phillipstearns.wordpress.com/fragmented-memory/

The First Texas Digital Humanities Conference

I’m just back from the premier offering of the Texas Digital Humanities Conference, and I can’t tell you what a pleasure it was to have such a superb event held so close to home, especially since I won’t be able to make the big Digital Humanities meeting this summer (or next summer, for that matter, since things are unlikely to get better here any time soon). There’s more to write about than what I am posting here, but I wanted to post my notes and links both for my future reference and as part of the conference’s wider historical record: interested readers should also check out the conference’s Twitter stream, [#txdhc][], and Geoffrey Rockwell’s [notes][].

In addition to the notes below, I also want to particularly thank [Elisa Beshero-Bondar][] and Max for walking me through loading networks into Cytoscape.

### Geoffrey Rockwell

Parallel between art critical process: *integritas* (apprehending that thing according to its form), *consonantia* (synthesis which is logically and aesthetically permissible), *claritas* (see the thing as it is and no other thing). In text analysis: demarcation, analysis, synthesis.

Tufte suggested the usefulness of sparklines?

http://huco.artsrn.ualberta.ca/~recharti/viz/dendro.html

CHUM: _Computers and the Humanities_ (was important journal until 2005).

An interpretive thing, a *hermeneutica*, is like an architectural folly from the nineteenth century: there to prompt our own thinking. Not simulacra.

Text analysis works on surrogates, not the text itself, not as text as conventionally understood. Text as string.

Stéfan Sinclair is his collaborator.

Predecessor: HyperPo.

Relationship to bricolage? Embroidery. Contribution by framing. Things for others to think through.

Voyant (http://voyant-tools.org) is downloadable and can be run locally.

Beta Voyant tools are all R for analysis and D3 for visualization.

→ Ask GR about Smith PDF.

→ Contact John Smith about code and about being interviewed.

### Andrew Higgins

Philosophy has an ArXiv? http://philpapers.org/

Co-categorization of articles.

Modularity measure.

Philpapers –> Google Scholar (to scrape citation data).

Bowling Green has an index.

### Anne Chao

Chen Duxiu was the founder of the Chinese Communist Party. Begins with a social network created in Gephi: threshold was 3 interactions with Chen. [These kinds of faux network visualizations make me realize that having a logic for the layout is terribly important: why are nodes located where they are in the graph? what do the edges represent?]

Later connection with an individual influenced, or trained, by John Dewey.

### Minute Madness

CADOH: Corpus of American Discourses on Health

### Cameron Bruckner

Mike Jones at IU is doing work with topic modeling (of a kind) that takes into account the position of a word in an n-gram.

### Elisa Beshero-Bondar

EBB is interested in mapping poetic structures and ideas in network visualization in the work of Robert Southey. When she coded meta-places and places, she discovered that the meta-places are necessary for the network to hold together. If they drop out, the network falls apart. Not the case for actual places. Makes sense: you need a cosmology in an epic poem. In-betweenness measures.

Need to know more about measures of centrality. Cf. Alexander Maida (on computer scientists and computational linguistics). Closeness centrality reveals the places that Southey talks about most often.

Startled by the difference between the eccentricity graph.

Shortest path lengths.

→ KML vs ArcGIS mapping. Cytoscape does mapping.

EBB has students mapping Cook’s voyages. See http://pacific.pitt.edu.

### Kathryn Beebe

Medieval historians grapple not with big data but with small, even tiny, data.

Social networks in texts are very popular.

→ Tim Evans.

GR: “There are ways to metastasize your data, build it up quickly: look at what people say about these texts, at reception.”

### Tanya Clement

ARLO displays spectrograms that represent the amount of energy in each frequency band. Some genre detection. Code switching. Genre switching. (This is more information than wave forms, but it strikes me as an evolutionary improvement, not a revolutionary improvement: a comparison would reveal how different performances, different speakers intertwine frequency and dynamics; this must be the “energy” she was talking about.)

Cf. Shannon and Weaver.

→ Cf. Donald MacKay. 1969. _Information, Mechanics, ???_.

### Elijah Meeks

EM: “This is the first conference I’ve seen that specifically focuses on networks in the humanities.”

Working on a book about programming D3.js.

* kindred.stanford.edu
* orbis.stanford.edu

EM feels, like many in DH, like he is an impostor. But maybe the better term is interloper.

Interloper *par excellence*: Jared Diamond.

*Neotopology* refers to …

Mike Bostock (mbostock).

→ Anne Knowles. 2002. _Past Time, Past Place: GIS for History_. No volume yet for network visualization.

→ Willard McCarty. 2002. “Humanities Computing: Essential Problems, Experimental Practice.”

The *network turn* is taking place after the *spatial turn*: _Envisioning Landscape, Making World_; _Placing History_; _Spatial Humanities_.

Networks are really simple: it’s the annotation of a connection. E.g., a person is connected to another person, a person is connected to a document. N-partite networks.
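A minimal sketch of that idea in Python (my own illustration, not something shown in the talk; the people, letters, and relation labels are all made up). Each edge is just an annotated connection between a person and a document, which makes this a two-mode (bipartite) network. Projecting it links two people whenever they share a document.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical edges: (person, document, annotation of the connection).
edges = [
    ("Alice", "Letter 1", "wrote"),
    ("Bob", "Letter 1", "received"),
    ("Bob", "Letter 2", "wrote"),
    ("Carol", "Letter 2", "received"),
]


def project_people(edges):
    """Link two people whenever they are connected to the same document."""
    people_by_document = defaultdict(set)
    for person, document, _relation in edges:
        people_by_document[document].add(person)

    links = set()
    for people in people_by_document.values():
        # Sort so each pair is stored once, in a predictable order.
        links.update(combinations(sorted(people), 2))
    return links


print(sorted(project_people(edges)))
```

Note that the projection throws away the annotations ("wrote", "received"), which is exactly the kind of constraint that would need declaring.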

A network is a view into your work as a view of the structure and not the components, a part of the process of operationalizing your understanding of the system.

Structure is important.

EM: “We need standards for interactivity.”

EM: “Any network is good as long as you declare the constraints that affected it.”

All of these can visualize a network dataset:

* Arc diagram
* Adjacency matrix (?).
* Force-directed layout.
* Radial layout.
* Donut charts.

You need to know what a random walk is, you need to know what centrality is; you need to understand how modularity detection works, that it returns a value and what that number means. → Learn network statistics.

Invent your own centrality measures. Authorial acts, not authoritative.

→ Understand topology: cool visualization of topoJSON.

See McCarty’s description of a *trading zone* (2002).

A sloppy way of bundling together socio-physics, traffic analysis, etc.

→ Arts, Humanities, and Networks. (Conference organized by Max. Ebook out from MIT Press.)

→ _Book of Trees_.

### Yannick Rochat

*Character-space* is that particular and charged encounter between an individual human personality and a determined space and position within the narrative as a whole, and *character-system* is the arrangement of multiple and differentiated character-spaces, differentiated configurations and manipulations of the human figure, into a unified narrative structure (Woloch 2003: 14).

First graph: occurrences of characters per page with chapter breaks and part breaks indicated.

Second graph: occurrences totaled for each of 12 chapters.

Centrality measures: *degree* rank, *betweenness* rank, *harmonic* rank, *eigenvector* rank.
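To remind myself what two of these measures actually compute, a small sketch in plain Python. The four-node graph is invented, not the speaker's character network, and this is not what Gephi or any particular library does internally. Degree counts a node's neighbors. Harmonic centrality sums 1/distance to every other reachable node, with distances found by breadth-first search. Betweenness and eigenvector centrality are left out.

```python
from collections import deque

# Hypothetical undirected character network: who appears with whom.
graph = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}


def degree(graph, node):
    """Number of direct neighbors."""
    return len(graph[node])


def distances_from(graph, source):
    """Breadth-first search: shortest path length from source to each reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for neighbor in graph[current]:
            if neighbor not in dist:
                dist[neighbor] = dist[current] + 1
                queue.append(neighbor)
    return dist


def harmonic(graph, node):
    """Sum of 1/distance to every other reachable node."""
    dist = distances_from(graph, node)
    return sum(1 / d for other, d in dist.items() if other != node)


def rank(graph, score):
    """Nodes ordered from highest to lowest score."""
    return sorted(graph, key=lambda n: score(graph, n), reverse=True)


print(rank(graph, degree))
print(rank(graph, harmonic))
```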

Louvain [?] clustering in Gephi. (Modularity based.)

### Ayse Gursoy

Game criticism as it happens on-line: how discourse happens. (She’s using Google’s slideshow — and maybe EM was too?)

Critics are identifiable personae with many roles: critic, curator, and advocate.

The game _Dear Esther_, an interactive experience, led to debates about *game-ness*: “many discussions of”, “doing the rounds”, and “much has been written about.”

### Neal Audenaert

Collaborated with Natalie Houston.

Started by calling attention to the difference between pages of prose and pages of poetry, and to three different kinds of features that shape such things: *bibliographic* features (paper, binding), *visual* features, and *linguistic* features.

Their research questions: How to extract visual features? What are the research questions? How to present/interact with this information? How to analyze this information algorithmically?

Work bubbled out of a THATCamp at Rice a few years ago. Then an NEH Start-Up grant. And now a HathiTrust grant.

Used Tesseract to extract page layout.

Natalie’s questions:

* How long are the lines?
* What’s the spacing between the lines?
* How much text on a page?

Text per page.

Nice use of R with a trend line — like what JG set up.

[#txdhc]: https://twitter.com/search?q=%23txdhc
[notes]: http://www.philosophi.ca/pmwiki.php/Main/1stInauguralTexasDigitalHumanitiesConference
[Elisa Beshero-Bondar]: http://digitalromanticist.wordpress.com/