What a terrific idea: Lincoln Mullen has uploaded [sample data sets for historians learning R][cran]. His note states that “they include population, institutional, religious, military, and prosopographical data suitable for mapping, quantitative analysis, and network analysis.” I would love to see something similar done for folklore studies, and I’ll see what I can do to make that happen.

In the meantime, many thanks to Lincoln for doing this. One of the crossroads at which many individuals find themselves when they begin the journey toward computation is not having any material with which to work. Quite often, writers describing their work assume that everyone already has a corpus in hand. Or we act as though anyone can simply pull material off [Project Gutenberg][pg]. A controlled data set gives new users a chance to try things out and get predictable results.

[cran]: http://cran.r-project.org/web/packages/historydata/index.html
[pg]: http://www.gutenberg.org


With any luck, one day in the not too distant future, something like [Dat][] will be interesting to humanists. What is Dat? Dat is “an open source project that provides a streaming interface between every file format and data storage backend.”

[Dat]: http://dat-data.com

Middling Data

I’ve been enjoying working through Matthew Jockers’ [Text Analysis with R for Students of Literature](http://www.matthewjockers.net/text-analysis-with-r-for-students-of-literature/) and following the various discussions about topic modeling and other approaches to “big data” in the humanities on Twitter (and elsewhere — and I really do wish there were more of the elsewhere — more on this in a moment). At the same time, I am, some would argue desperately, trying to teach myself not only the Python language and the basic terms of computer science but also to get a basic grasp of the statistics that lie behind so much of this work.

I do so not only because these realms fascinate me and, I think, hold real possibilities for studying the kinds of texts that I like to study, but also because I would like to be part of the larger conversation about which dimensions of statistics are useful, and which are not, that the digital humanities will eventually have to have as the “digital” falls away. We will at some point get past the initial, and very exciting, phase of experimentation and grabbing at all the shiny toys, and begin to synthesize these experiments into the ongoing development of the continuum of work that stretches from the humanities to the human sciences.

Folklore and anthropology have long been the kissing cousins on either side of the perceived divide between those two orders, and, in watching the adoption and adaptation of corpus linguistics methods (often linked with information science and various forms of artificial intelligence), I am fascinated by the jump from sentences, or huge gatherings of sentences collected into corpora, to novels.

There is, I think, a middle ground. It’s not the “small data” of the old humanities, nor yet the “big data” which is our current fascination, but something more like middling amounts of data. *Medium data*? (That sounds better than *middling*, but it does suggest a statistical process, no?)

*Middling data* for now, I think.

I am using it to describe the 50 some-odd legend texts I have that range in size from around 100 words to over 1000 words. This size of text is, in itself, a kind of middle ground between short texts like proverbs and longer texts like myths. (Some oral histories I have collected tend to fall on the shorter end of this range, as do a number of personal anecdotes, which only means that we have a lot of counting to do in folklore studies to begin to establish things like this. Easy peasy work and still terribly interesting — how many words does a given context require either to reinforce the current reality or to conjure up an alternate one?)

50 texts of 500 words each doesn’t seem like too much, does it? (I’m going to go with the middle number of 500 here, just for the sake of argument.) Why, that’s only 25,000 words, a long-ish short story from a literary scholar’s point of view. But 50 distinct texts begins to stretch the boundaries of working memory for most human analysts, and certainly as that number grows, one begins to require alternative means of “holding” the texts in some sort of analytical space.
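This kind of back-of-the-envelope counting is also the natural first computational step for a collection like this. A minimal sketch in Python, using invented word counts rather than my actual legend texts:

```python
# Hypothetical word counts for a handful of legend texts.
# These numbers are placeholders for illustration, not real data.
word_counts = [120, 480, 950, 510, 330]  # one entry per text

total_words = sum(word_counts)
average = total_words / len(word_counts)

print(f"{len(word_counts)} texts, {total_words} words total")
print(f"average length: {average:.0f} words")

# The back-of-the-envelope figure from above: 50 texts at ~500 words each.
print(f"50 texts x 500 words = {50 * 500} words")
```

Even this trivial bit of counting scales past what working memory can hold, which is the whole point: the machine holds the numbers so the analyst can hold the question.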

Of course, as the number grows, one needs to effect some kind of compression somewhere in the process. Where and how is why we need statistical reasoning to better inform how we proceed. (Sorry for the surfeit of adverbs there.) And I do love the kinds of things that topic modeling can do, as well as other forms of statistical analyses. Certainly achieving semi-accurate results with a minimum of failures and making effective use of available computational resources is of interest to computer scientists, but I don’t, at this point, particularly care about such things. Rather, I am interested in those forms of manipulation which let me explore a collection of material(s) — perhaps formally organized enough to begin to be something like a corpus but perhaps not.
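Topic modeling itself requires a library like gensim or scikit-learn, but its first step — turning texts into word-count vectors — is exactly the kind of compression at issue, and can be sketched with nothing but the standard library. The texts below are invented stand-ins for legend transcripts:

```python
from collections import Counter

# Toy texts standing in for legend transcripts (invented, for illustration).
texts = [
    "the old house on the hill was said to be haunted",
    "a traveler stopped at the old house and heard a voice",
    "the voice on the hill warned the traveler away",
]

# Bag-of-words counts: the usual starting point for topic modeling
# and other statistical analyses of a collection.
vectors = [Counter(text.split()) for text in texts]

# A crude measure of shared vocabulary between two texts:
# the word types they have in common.
shared = set(vectors[0]) & set(vectors[1])
print(sorted(shared))  # → ['house', 'old', 'the']
```

The compression is visible immediately: word order, and with it narrative, is thrown away. Deciding whether that loss is tolerable for a given question is precisely where the statistical reasoning has to come in.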

This middle ground is the ground I want to work for the foreseeable future. It will let me explore the computational and statistical possibilities from within a territory that I can still attempt to grok using old-fashioned, dare I say “analog,” methods. It’s this kind of middle-ground work that made Moretti’s _Graphs, Maps, and Trees_ so compelling. (And he seems to have a distinct preference for working with middling data, if I read other essays and understand other talks he has given correctly.)

*Middling* data is a terrible name, to be sure. But like the “middling” domains of folklore studies and cultural anthropology, domains often viewed somewhat askance by practitioners in fields more central to either side of the divide between the humanities and the human sciences, I think there are some terribly productive tensions here waiting to be more clearly articulated and discussed.

Then again, I would think that, wouldn’t I?

Uncertainty Quantification

[Uncertainty quantification](http://www.hpcwire.com/hpcwire/2013-09-13/the_masters_of_uncertainty.html) has a nice ring to it. Of course, human beings have been doing this kind of thing for a long time. It’s not quite clear to me what threshold we crossed, but at least a few scientists think we have crossed one, and that we need a name for it.