John Laudun

What is Culture Analytics?

Gottfried Wilhelm Leibniz was very fond of paradoxes as a way to call attention to the too categorical, and thus too rigid, thinking to be found in conventional wisdom and common sense. In his typically somewhat poetic and thus also somewhat obscure fashion, he argued that rest could be construed as “infinitely small motion” and coincidence as “infinitely small distance.” In his own time, Leibniz was mocked by the mathematician d’Alembert, who noted that “a quantity is something or it is nothing: if it is something, it has not yet disappeared; if it is nothing, it has literally disappeared. The supposition that there is an intermediate state between these two states is chimerical” (d’Alembert (1763), 249–250). And yet now, fully immersed in the abilities of calculus to describe the nature of such chimera, we depend upon Leibniz’s notions of the infinitesimally small and the infinitesimally close.

In our own time, Gilles Deleuze was fascinated by Leibniz’s willingness to find very practicable truths through apparent paradoxes. For Deleuze, it was simply a matter of always being on the lookout for fixed identities, as he termed them (LoS). Deleuze, we can be sure, would have enjoyed the furor that continues over the apparent paradox between close and far reading, between the microscopic and the macroscopic, between the necessarily human and the necessarily inhuman. As Deleuze noted about Leibniz’s view of reality: “Not everything is fish, but fish are teeming everywhere” (1993: 10). At the start of this workshop in culture analytics, which precedes the annual meeting of the Digital Humanities, we do not need in any way to re-invigorate a debate over antipodes that exist only in idealized rhetorical realms and not in the practical reality of those of us gathered in this room. We do need, however, here at the start, to pull apart the threads of the various approaches to what we might crudely call “culture studies + data science.” We do so because this workshop is described neither as cultural analytics nor as culturomics but as culture analytics. And that is a confusing thing.

So, first, a brief history delineating how these approaches arose and how they diverge and converge. After that, a focus on culture analytics as a particular program of inquiry. With such definitionistics out of the way, we will take up the matter of time series. Throughout we will examine various works that instantiate the many, many ways this work can be done. (And we, the organizers of this workshop, wish to emphasize the sheer number of possibilities, many of which have yet to be imagined and thus depend on you here today to imagine them.) What is at stake, I think we all agree, is the successful marriage of developments in quantitative analyses to the qualitative tasks that arise when trying to understand humans, especially when those humans are engaged in the creation and maintenance of social worlds which seem little more than a diaphanous matrix of objects that spring, like Athena from Zeus’ brow, out of the weird combination of electrified meat that is the human brain. We are like Clifford Geertz’ figurative anthropologists, straining to read over the shoulders of those around us “the culture of a people [as] an ensemble of texts,” the texts themselves being nothing more than ensembles of words.

Vergences

Both culturomics and cultural analytics, as names for a certain set of approaches and practices, emerged around the same time, towards the end of the first decade of the new millennium (circa 2009–2010).1 In a strict chronology, cultural analytics arises first, in the guise of a Wikipedia entry in February 2009:

Cultural analytics refers to a range of quantitative and analytical methodologies drawn from the natural and social sciences for the study of aesthetics, cultural artifacts and cultural change. The methods include data visualization techniques, the statistical analysis of large data sets, the use of image processing software to extract data from still and moving video, and so forth. Despite its use of empirical methodologies, the goals of cultural analytics generally align with those of the humanities. An influential text for cultural analytics was Franco Moretti’s Graphs, maps, trees: abstract models for a literary history.

Later revisions to the page focused more on visual analytics and visual data analysis, quoting from a 2010 Horizon Report that emphasized that “new research is now beginning to apply [highly advanced computational methods] to the social sciences and humanities … and the techniques offer considerable promise in helping us understand complex social processes like learning, political and organizational change, and the diffusion of knowledge.”2

This notion of a novelty that can both extend and at the same time pair with the humanities as conventionally understood is also present in John Bohannon’s coverage of culturomics in a “News of the Week” piece in a December 2010 issue of Science. Bohannon describes the work of Michel et al as being “a wake-up call to the humanities that there is a new style of research that can complement the traditional styles.”3 Michel et al’s “quantitative analysis of culture” focused principally on historical trends as glimpsed through n-grams derived from the millions of books said to be in the Google corpus. The frame they offered for this work argued that:

Reading small collections of carefully chosen works enables scholars to make powerful inferences about trends in human thought. However, this approach rarely enables precise measurement of the underlying phenomena. Attempts to introduce quantitative methods into the study of culture (1–6) have been hampered by the lack of suitable data. We report the creation of a corpus of 5,195,769 digitized books containing ~4% of all books ever published. Computational analysis of this corpus enables us to observe cultural trends and subject them to quantitative investigation. ‘Culturomics’ extends the boundaries of scientific inquiry to a wide array of new phenomena.

The millions of books caught everyone’s attention, and it began a trend in the digital humanities of equating literature with culture and vice versa. As culturomics developed, however, other data were brought into focus, typically those dealing with news media, perhaps an outcome of the kinds of corpora computational linguists had already assembled.

At any rate, culturomics as a term did not enjoy wide acceptance. A reflection of its inventors’ origins in biology, it was designed to highlight the idea that texts and culture were like genes and genomics, which was perhaps a bridge too far for some in the humanities, who were probably a little skeptical about the possible suggestion that biology is destiny. Cultural analytics became the dominant term, and it’s worth sketching out its history a bit before turning to a discussion of the dimensions that culture analytics seeks to add.

In a later consideration of the early days of cultural analytics, Lev Manovich claims to have first developed the concept in 2005 and to have established a research lab in 2007—though it should be noted that the lab was titled the Software Studies Initiative. He notes that the work of the lab was framed as a series of questions: What does it mean to represent “culture” by “data”? What are the unique possibilities offered by computational analysis of large cultural data in contrast to qualitative methods used in humanities and social science? How to use quantitative techniques to study the key cultural form of our era – interactive media? How can we combine computational analysis and visualization of large cultural data with qualitative methods, including “close reading”? (In other words, how to combine analysis of larger patterns with the analysis of individual artifacts and their details?) How can computational analysis do justice to variability and diversity of cultural artifacts and processes, rather than focusing on the “typical” and “most popular”?

As Manovich himself observes, those questions were no longer unique to cultural analytics but were now being asked in adjacent domains like the computational social sciences and the digital humanities. If the traditional domains have added computational methods to their repertoire of accepted methods and objects, then what role remained for cultural analytics as an intellectual project or program? His response is simple and straightforward: “Digital Humanities and Social Computing carve their own domains in relation to the types of cultural data they study, but Cultural Analytics does not have these limitations.” And he concludes:

We are also not interested in choosing between humanistic vs. scientific goals and methodology, or subordinating one to another. Instead, we are interested combining both in the studies of cultures - focus on the particular, interpretation, and the past from the humanities and the focus on the general, formal models, and predicting the future from the sciences.

The idea of the model, especially one that is both descriptive and, in its accuracy, potentially predictive, is largely foreign to the humanities and is perhaps the central tension driving those studies that feel the need to escape the bounds of the usual disciplines.

Manovich’s division of objects between the computational social sciences and the digital humanities is not unconventional, though his belief that the latter focus solely on the smaller number of objects produced by professionals—e.g., novels—seems to ignore entirely work in history. Still, his concern that humanities scholars too often focus on the output of the few and construe it as representative of a larger whole is not amiss. (I have cast the same aspersion myself.) Meanwhile, he notes, the computational social sciences and computer science are focused on the ever-growing stream of born-digital material, much of which is to be found in social media but some of which can be found elsewhere. Just as importantly, these fields also publish at an impressive rate, and in being unconstrained by, or seemingly unaware of, hermeneutical traditions, they also experiment freely and sometimes end up making imaginative leaps and drawing interesting conclusions. By scraping thousands of images on Instagram or Flickr, tweets on Twitter, or videos on YouTube, scientists could tangibly establish how tens or hundreds of thousands of people understood the world in which they found themselves.

In the current moment, a tension remains within cultural analytics as it occurs in named entities like the Cultural Analytics Lab and the Journal of Cultural Analytics. The lab seeks to “combine data visualization, design, machine learning, and statistics with concepts from humanities, social sciences, and media studies” in order to address (cope with) the “billions of new digital artifacts [created] every day.” Driven largely by the focus of media studies, the lab has worked on “collections of films, animations, comics, magazines, books, newspapers, paintings, photos, [and] video game recordings.”

Across the figurative aisle is the Journal of Cultural Analytics (JCA), an open-access journal “dedicated to the computational study of culture,” which has, until recently, largely fallen victim to the conflation of computational literary studies with cultural analytics. So does Dan Sinykin in his introduction to a special issue of Post45: “Call it cultural analytics or distant reading or data-rich literary studies.” (JCA claims a linkage to Post45: apart from using similar publishing software, or at least a similar UI, the linkage is largely left unspecified.) That noted, the journal has steadily included scholarship on other kinds of objects, and the percentage has increased up to the present—though those familiar with the recent contretemps, initiated by Nan Da, who steadfastly referred to JCA as focused on computational literary studies, will note that we went back to square one.

Culture Analytics

Even as culturomics and cultural analytics were making their debut, Tim Tangherlini, a folklorist and professor of Scandinavian studies at UCLA, had harnessed his time at the Institute for Pure and Applied Mathematics (IPAM) to propose a summer program to the NEH focused on “network analysis in the humanities.” Hosted by IPAM and with an organizing committee made up of mathematicians, computer scientists, and humanities scholars, the program offered a glimpse into what a more algorithmically-driven approach to humanities problems might look like.

The program’s use of networks was intentionally broad, with topics including “the science of networks and networks in Humanistic inquiry, preparing and cleaning Humanities data for network analysis, internal networks in Humanistic data[—]networks of characters, networks of texts, networks of language, [and] external networks in Humanistic data[—]networks of influence, networks of production, networks of reception.”4 Such a broad focus made it possible for the organizers to include not only hands-on tutorials on particular software applications but also a wide range of presentations on distant reading, computational sociology, and computer science itself. Presenters included Tina Eliassi-Rad, who was then working with Google to understand the virality of YouTube videos; Franco Moretti, who was creating social networks for Shakespeare and Dickens in order to understand how they revealed narrative structure; Fil Menczer, who was then in the process of making Stephen Colbert’s notion of truthiness an algorithmic reality; Katie Borner, who was working with scientometrics in hopes of predicting future research streams; as well as James Danowski on collocation networks. Lev Manovich was there as a participant in the program.

After the NEH-sponsored summer school at IPAM was complete, the conversation continued about encouraging more exploration of the unsolved mathematical opportunities emerging in the many cultural information spaces being created. While the eventual organizers of IPAM’s long program on Culture Analytics—Tangherlini, Eliassi-Rad, Manovich, as well as Mauro Maggioni and Vwani Roychowdhury—acknowledged that many successful approaches to the analysis of cultural content and activities had been developed, they also felt there was still a great deal of work to be done. Their goal in the long program was “to promote a greater collaboration across disciplines and devise new approaches and novel mathematics to address the problems of culture analytics,” and to do so by bringing together social scientists and humanities scholars along with applied mathematicians, engineers, and computer scientists.5

From the beginning the organizers and the participants recognized that the creation and maintenance of a domain or discipline requires attention to its field, the people. For culture analytics, they imagined a diverse group committed to collaboration as well as a group of sufficient size as to have some stability as the requisite number of original members winnowed down to those interested in remaining active. The organizers, along with the original participants, also recognized that such a group must seek above all to be inclusive and open, both in terms of membership and in terms of ideas, methods, topics, and objects.

Given the diversity of theories, methodologies, topics, and objects, the consensus on the basic objects of study, and how to measure them, is still in flux. (This workshop is but another step in that effort, adding more people and perspectives to the mix.) The first step in finding a useful articulation occurred three years ago when the NSF’s Institute for Pure and Applied Mathematics sponsored a long program on Culture Analytics. The program was 14 weeks long, with week-long workshops of presentations and tutorials interspersed with weeks of residency at IPAM, where participants not only pursued their own work but invented new projects, often with new collaborators—a few of which will be glimpsed in today’s workshop.

At the end of that initial sequence of workshops and work, the organizers and the participants convened to assess what had taken place and what they imagined might be on the horizon. Everyone recognized that there was a core set of problems in the study of culture when attempting to account for the differences that matters of scale bring into play. In order to be able to move from the microscopic focus on a passage—or even a sentence or a clause or a phrase—to the macroscopic focus on a corpus determined by population or genre or region or period (or some combination of those or transcending them), the culture analytics group recognized that they regularly encountered problems with defining objects, developing measures, arriving at appropriate models, unpacking interdependencies, accounting for change, building well-adapted algorithms, and accomplishing all this in an ethical framework which treated the human beings on the other side of the objects as, well, human beings. This led to the development of the Arrowhead Problems in Culture Analytics, which were eventually enumerated as follows:

What are the basic objects of study and concepts subject to formalization? Stable data-driven mathematical formalization would allow for increased reproducibility and comparison across methods, datasets, and studies. Alternative formalizations may provide fundamentally different, but complementary perspectives.

What are the essential measures for cultural analysis? Consistent, validated, and useful mathematical measures would help characterize the fundamental objects of study and concepts, as well as their interactions. Deeper discussion of common measurements would also allow for consistent identification of distortions, manipulations, and biases in and across cultural data.

What constitutes a successful mathematics of culture? A mathematics of culture would be flexible enough to accommodate the context-dependent nature of culture. Identifying flexible “axioms of culture” and their mathematical treatment would foster the development of analytical tools and algorithms while helping to delineate the limits of quantitative and predictive cultural analysis.

What are the fundamental structures of cultural interdependence? A comprehensive understanding of cultural systems would make sense of cultural units and their subunits, groupings, connection types, and topologies, at multiple scales in space, time, and conceptual dimensions. It would further take into account systematic interdependence and overlap between subsystems of various types.

What are the fundamental dynamics of cultural change? A deeper understanding of cultural dynamics and change would include appropriately sophisticated mathematical models of complex phenomena including cultural emergence, growth, percolation, diffusion, spreading, and evolution.

What are the algorithms that can detect the structures and dynamics of culture? Scalable algorithms detecting structures and transitions in heterogeneous data would address low coverage in sparse data from historical sources or resource-poor areas, as well as massive real-time data streams as acquired through sensors, mobile platforms, or the Internet.

What are the ethical challenges in culture analytics? Culture analytics is not without its risks. Algorithms, tools, and methodologies must be carefully vetted and tested so that the results of this work are not only methodologically sound but also ethically sound. Further ethical challenges may arise from biases in dataset collection and preservation, which need to be carefully characterized.

Time Series

The notion of time series, and of time series analysis, has its origins in industrial processes, business metrics, and the natural sciences. As the National Institute of Standards and Technology notes: “Time series analysis accounts for the fact that data points taken over time may have an internal structure (such as autocorrelation, trend or seasonal variation) that should be accounted for.”6 Time series can be drawn from the values of stocks, the frequency of weather events, sunspots, or even EEGs. The goal of treating such events as a series is to understand the nature of the pattern and perhaps even to predict the next sequence of events. (And thus the centrality of time series to the finance sector becomes readily evident: so much money, so little time.)

Bending the use of time series to unconventional subjects allows for a certain loosening of expectations and practices, but since we have raised the matter of convention, it might help to begin with a few simple observations about time series:

A time series {Yt}, or {y1, y2, ⋯, yT}, is a discrete-time, continuous-state process, where the times t = 1, 2, ⋯, T are discrete time points spaced at uniform intervals. Usually time is taken at more or less equally spaced intervals such as the hour, day, month, quarter, or year. More specifically, a time series is a set of data in which observations are arranged in chronological order (a set of repeated observations of the same variable).

Rather than dealing with individuals as units, the unit of interest is time: the value of Y at time t is Yt. The unit of time can be anything from days to election years.

While we are in the business of defining things, we should also note that statisticians distinguish between continuous and discrete time series. A time series is said to be continuous when observations are made continuously in time. The term continuous is used for series of this type even when the measured variable can only take a discrete set of values: that is, the values themselves are not continuous. Take, for example, a sentiment analysis of a text, which assigns +1 to a positively valued word and -1 to a negatively valued word. As a stream of discourse, the words are continuous, but the values assigned to them are not. A time series is said to be discrete, by contrast, when observations are taken only at specific times, usually equally spaced. The term discrete is used for series of this type even when the measured variable is continuous. The simplest example of a discrete time series is time-lapse photography.
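As a minimal sketch of that sentiment example, assuming a hypothetical two-list lexicon and the illustrative choice of scoring neutral words as 0, the continuous stream of words becomes a series of discrete values indexed by word position:

```python
# A minimal sketch of sentiment as a time series: each word in the stream
# receives a discrete value from a (hypothetical) valence lexicon, so the
# continuous stream of discourse yields a series of discrete values.
positive = {"joy", "bright", "calm"}    # toy lexicon, assumed for illustration
negative = {"dark", "fear", "cold"}

def sentiment_series(words):
    """Return one value per word: +1 positive, -1 negative, 0 neutral (an assumed convention)."""
    return [1 if w in positive else -1 if w in negative else 0 for w in words]

text = "the bright morning gave way to a dark and cold fear".split()
print(sentiment_series(text))
# [0, 1, 0, 0, 0, 0, 0, -1, 0, -1, -1]
```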

Finally, the obvious: time series have a necessary ordering thanks to time itself being unidirectional, with the assumption being that values from one period can affect values in later periods but not the contrary. There are other dimensions to time series that may be discussed as the day unfolds, but, for now, we can leave matters here. That is, time series are interesting because they focus on a fundamental dimension of human experience, time itself. How much we may be misled by the seeming “naturalness” of this dimension is something to be discovered: do not, for example, make the mistake when among mathematicians of confusing causation with correlation.

Example

Escaping for a moment the way we imagine we experience time and returning to the way that statisticians think about time series: statisticians treat the complexity of a series as a function of its possessing both systematic and nonsystematic components. Systematic components have a consistent, and often recurring, nature and thus can be described and modeled in the same way a literary scholar might describe, and categorize, the rhyme scheme of a poetic passage or a folklorist might note the recursion of particular phrases or rhythms in the oral composition of a particular singing tradition. Nonsystematic components are all those pieces or dimensions that can be neither described nor modeled. Statisticians might describe these as random; a historian, as history.

Any given time series is thus a function of its systematic components, of which there are three kinds—level, trend, seasonality—and one nonsystematic component, noise. If you imagine a time series graph as being similar to a wave-form, then it is easy to understand level as the average value in a series. A trend is whether values increase or decrease across a series. And, finally, seasonality is a way to describe short-term cycles in a series. From an analytical perspective, all series have a level and noise. Many have trends or seasonalities, but these are not necessary. Like other series, time series can be additive or multiplicative in nature, with additive series being discerned in the linear nature of their trends and multiplicative series having nonlinear trends shaped in a curve.

Decomposition provides a useful abstract model for thinking about time series generally and for better understanding problems during time series analysis.

Inherent in the collection of data taken over time is some form of random variation. There exist methods for reducing or canceling the effect due to random variation. An often-used technique in industry is “smoothing”. This technique, when properly applied, reveals more clearly the underlying trend, seasonal and cyclic components. (Engineering Statistics Handbook).
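Read as a formula, the additive model described above is y(t) = level + trend(t) + seasonality(t) + noise(t). What follows is a minimal sketch, using a synthetic monthly series rather than real data, of constructing such a series and then applying the kind of smoothing the Handbook describes (a centered moving average); the particular level, slope, and seasonal amplitude are illustrative assumptions.

```python
# A minimal sketch of an additive series, y_t = level + trend_t + seasonal_t + noise_t,
# followed by a centered moving average that smooths away seasonality and noise.
# The series is synthetic; its level, slope, and amplitude are assumed for illustration.
import math
import random

random.seed(0)
T = 48                                          # four "years" of monthly observations
series = [
    10.0                                        # level: the average value of the series
    + 0.25 * t                                  # trend: a steady (linear, additive) rise
    + 3.0 * math.sin(2 * math.pi * t / 12)      # seasonality: a twelve-month cycle
    + random.gauss(0, 1)                        # noise: the nonsystematic component
    for t in range(T)
]

def moving_average(y, window=12):
    """Centered moving average over `window` observations."""
    half = window // 2
    return [sum(y[t - half : t + half]) / window for t in range(half, len(y) - half)]

trend_estimate = moving_average(series)
# trend_estimate[i] corresponds to series[i + 6]; the seasonal swing is averaged out
print(round(series[24], 2), round(trend_estimate[18], 2))
```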

Discourse as a time series: one assumption we make is that discourse unfolds chronologically, but what about description?

Work(s)

As the discussion above makes clear, there is as much room in culture analytics as in the humanities in general for exploration, not only in terms of theories and methods but also topics and objects. In the survey of work that follows, the notion of a time series both expands and contracts: some studies take up historical change itself while others focus on the temporal unfolding of human texts in order to elicit something like style. But always the text itself is the central concern, whether it be in large corpora, subcorpora, or in what some have called an intratextual subcorpus.

Wevers et al

In “Coca-Cola: An Icon of the American Way of Life,” Melvin Wevers and Jesper Verhoef demonstrate how to combine computational and traditional methods in an effort to “shed light on the ways in which the Coca-Cola Company tried to shape the Dutch perception of an American way of life, and by extension provided the discursive building blocks for the construction of a mental map of America.” Their work focused on how n-gram analysis and full-text searching, when paired with close reading of particular documents, are apt methods to construct a sub-corpus from a larger corpus of textual data. They then use AntConc, a corpus analysis toolkit, to examine the words that appear around their key terms.

What’s fascinating about their approach, and built into the warp and weft of culture analytics in general, is how it encourages analysts to arrive at clearer descriptions of the phenomena being examined. In Wevers and Verhoef’s case, they first felt it necessary to answer two questions: “Did the Coca-Cola Company advertise Coca-Cola in Dutch newspapers? If so, when and how frequently?” Using the n-gram viewer focused on the Dutch corpus of newspapers, they were able to identify not only the first advertisement for Coca-Cola, in 1928, but also the relative lack of interest before the Second World War and the drink’s steady post-war rise. But were Coke ads an isolated phenomenon or part of a larger trend of consumable commodity ads? As it turns out, Coke occupied a smaller role than other consumables. Just as importantly for their particular research topic, Coca-Cola was often advertised as part of a larger set of items by a store, and so there was a lot of tweaking and refining of their query as they sought to focus on ads by the Coca-Cola Company for Coke, and not, for example, for its other soft drinks or for jobs, etc.
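The frequency question can be imagined with a minimal sketch along the following lines, assuming a hypothetical list of (year, text) newspaper items rather than their actual digitized corpus:

```python
# A minimal sketch of the per-year frequency counting behind an n-gram style
# query; the (year, text) items below are placeholders, not the Dutch corpus
# Wevers and Verhoef actually queried.
from collections import Counter

items = [
    (1928, "advertentie coca-cola verfrissend"),
    (1935, "coca-cola en andere dranken"),
    (1950, "coca-cola coca-cola heerlijk koel"),
]

def yearly_counts(records, term):
    """Count occurrences of `term` per year across all items."""
    counts = Counter()
    for year, text in records:
        counts[year] += text.split().count(term)
    return dict(sorted(counts.items()))

print(yearly_counts(items, "coca-cola"))
# {1928: 1, 1935: 1, 1950: 2}
```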

Having refined their corpus, Wevers and Verhoef set about understanding the relationship between Coca-Cola and some larger idea of America by focusing on context words: words that appear within a set distance, a context horizon, of the key word. As with all such forms of computation, they determined the proper distance based both on what corpus linguistic analysis suggested and on what their own experiments revealed as probably right: ten words before and after. Initial results revealed that Coca-Cola explicitly avoided the emergent trend of referring to itself as an “American cola,” a phrase being used by reporters and by other brands. As they worked through collocations, they discovered that Coca-Cola was, in fact, representing itself as a “glocal” beverage: a beverage of the world but also one bottled locally. They noted:

The link to the United States was thus not so much an explicit geographical reference, but rather a more implicit connection to symbolisms and affects associated with the United States. The glocalizing ability of prominent American brands such as Coca-Cola also explains how consumers equated the process of Americanization with globalization.
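A minimal sketch of the kind of context-window (collocation) counting Wevers and Verhoef describe might look like the following; the toy ad texts are placeholders rather than their newspaper data, and the ten-word horizon follows their choice:

```python
# A minimal sketch of context-word (collocation) counting within a ten-word
# horizon on either side of a key term; the "ads" below are invented examples.
from collections import Counter

def context_words(tokens, keyword, horizon=10):
    """Count words appearing within `horizon` tokens of each occurrence of `keyword`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == keyword:
            window = tokens[max(0, i - horizon):i] + tokens[i + 1:i + 1 + horizon]
            counts.update(window)
    return counts

ads = [
    "drink coca-cola ice cold at your local bottler",
    "coca-cola refreshes the busy man and the busy housewife alike",
]

totals = Counter()
for ad in ads:                          # windows are kept within a single ad
    totals.update(context_words(ad.lower().split(), "coca-cola"))
print(totals.most_common(5))
```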

As they continued to work through their corpus, Wevers and Verhoef slowly came to realize that Coca-Cola worked its magic by offering Dutch consumers seeming contradictions: it was a drink that both stimulated you and relaxed you. It was both for the businessman and the housewife. It was, for the Dutch audience of these ads, the very spirit of a democratic modernity that America itself seemed to instantiate, an impression reinforced by other American brands whose advertisements sometimes appeared right alongside those for Coca-Cola. They concluded:

The expression of an American way of life in Coca-Cola advertisements manifested on two different levels. First, the advertisements depicted a modern, urban lifestyle — a simulacrum of the burgeoning American consumer society at the time. The second expression of an American way of life took shape in the way advertisers addressed consumers. Consumers in Coca-Cola advertisements could be either male or female, and they were depicted being engaged in activities of leisure or work. Coca-Cola represented elements of an American way of life while at the same time its global spread disassociated the product from its actual origins. This fits within what Rob Kroes calls a resemanticization of reality in which American life is turned into an “imaginary realm to be experienced by those who bought a product.”

Arnold et al

Somewhere along the spectrum of cultural histories less focused on ideological concerns and more on stylistic ones is Taylor Arnold, Lauren Tilton, and Annie Berke’s examination of “Visual Style in Two Network Era Sitcoms.” Applying computer vision techniques to capture visual elements within moving images, Arnold et al create a corpus of images for two Network Era sitcoms: Bewitched (1964-1972) and I Dream of Jeannie (1965-1970). Their goal is to establish “ways that visual style constructs character centrality, intimacy between characters, and the formation of narrative structure and continuity,” with the possibility that some of the analyses might, in fact, “contradict what might be deduced from plot summaries or script analyses, begging the question of what tensions, contradictions, and paradoxes are buried in these seemingly facile programs.”

Using facial recognition algorithms to locate and identify characters within a shot, they track shot blocking in relation to narrative structure, taking advantage of the dialogue-driven nature of American sitcoms. They ultimately arrive at a classification schema of five types of shots: the close-up, the two shot, the group shot, the over-shoulder shot—which most of us will recognize as the conventional shot-reverse shot by which most dialogue is represented in a great deal of television and cinema—and the long shot.
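In the spirit of that pipeline (though emphatically not Arnold et al's actual classifier), a face-count-based shot typing might be sketched as follows; the detector is OpenCV's stock Haar cascade, the thresholds and frame file name are assumptions, and the over-shoulder category is omitted because it cannot be inferred from face counts alone:

```python
# A minimal sketch of face-count-based shot typing: OpenCV's stock Haar cascade
# finds faces in a frame, and simple (assumed) thresholds on face count and size
# map the frame onto shot categories.
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)

def classify_shot(frame):
    """Label a single frame as close-up, two shot, group shot, or long shot."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return "long shot"                       # no detectable faces
    largest = max(w * h for (x, y, w, h) in faces)
    frame_area = frame.shape[0] * frame.shape[1]
    if len(faces) == 1:
        # an assumed threshold: a face filling >5% of the frame reads as a close-up
        return "close-up" if largest / frame_area > 0.05 else "long shot"
    if len(faces) == 2:
        return "two shot"
    return "group shot"

frame = cv2.imread("frame_0001.png")             # hypothetical extracted frame
if frame is not None:
    print(classify_shot(frame))
```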

Compiling the types of shots and their contents reveals some interesting differences between the two shows. As Arnold et al note, both shows feature magical women who somehow only manage to serve the professional concerns of their husbands—after all, who else is Bewitched except the witch Samantha’s husband Darrin, and who else dreams of Jeannie except her master, and later husband, Tony?—yet in the latter show Jeannie’s onscreen presence ranks fourth, after the three male characters, while Samantha’s presence is equal to that of her onscreen husband. In addition to establishing the dominant, or at least co-dominant, face on the screen, Arnold et al also reveal that the two women are most often the anchors for a particular episode, being the first face the audience sees, in the first scene—usually the pre-title “teaser” or “cold open.”

The use of shots versus scripts also enables Arnold et al to come to a richer understanding of the relationships between characters, revealing not only the centrality of the romantic pairs that are at the heart of the shows but also the satellite relationships upon which the episodes depend. One of the most fascinating conclusions is that the heart of I Dream of Jeannie is a bromance between the two astronauts.

The types of shots used over the course of a given episode also reveal something about how the narratives are structured: they note that as episodes build, the shots move from those focused on an individual to wider shots:

These wider shots help to convey the complexity of the conflict. In particular, Act 2 has a significantly larger proportion of wide shots in which three or more characters are present. These shot changes serve to visually reflect the increasingly complicated relationships unfolding in the episode. … The increase in wide shots serves to visually represent and contain these increasingly involved plot lines.

Most importantly, they arrive at these conclusions, among others, “through a computational analysis of every episode of Bewitched and I Dream of Jeannie, totalling 393 episodes and over 150 hours of material.”

Oiva et al

In their examination of the way that news of the 1904 assassination of Nikolay Bobrikov spread through extant, and already global, communication networks, Mila Oiva and her co-authors “explore the tempo and the routes through which news was disseminated in … newspaper networks [of the time] … concentrat[ing] particularly on the viral[ity] of the news during the first week following the assassination.”7 Their work is part of a larger, still growing domain within the digital humanities, culture analytics, and data science focused on making the most of efforts to digitize corpora of newspapers and taking advantage of the fact that news flows were, in some ways, the first “born-digital” objects and that newspapers were quite content to copy and paste news wire stories into local editions. Such historical practices make it possible for analysts like Oiva et al, and like Wevers and Verhoef, to trace where information went, how fast it got there, and how much change, if any, it underwent as it got localized—and as a folklorist I can assure you that more than news spread through these conduits: legends (fake news) and poems also circulated on these networks, often as small filler items and curiosities.

Oiva et al’s goal is to “identif[y] the routes and the pace of the news, as well as the temporal rhythms of the evolving [reporting and commentary].” Bobrikov was assassinated on the stairs of the Finnish Senate on 16 June 1904, and news of his murder spread quickly, reported “in hundreds of European newspapers and as far away as Mexico City, Honolulu and San Francisco” by the next day, June 17—and, for those with a literary bent, making it into the pages of Joyce’s Ulysses, published a decade later. (So, far, wide, and deep?) One of the first things they observe is that, thanks to the dense but not distributed nature of the telegram network, the news readily reached large population centers around the world but in fact took several days to reach smaller cities in Finland, the country of origin, where local newspapers did not have the same kind of access to the network and so relied on getting their news from other newspapers better served by it. As they note:

If the news did not spread at an even pace in all directions, but through information channels, which enabled news to travel 10,000 km from Helsinki to México City faster than 150 km from Helsinki to Hamina, what kinds of channels did the message about a terrorist attack in a remote corner of Europe use and activate?

To answer that question, Oiva et al focused on half of their corpus of 1400 items, taking advantage of the convention at the time for items to feature the origin of a news story—with the caveat that sometimes these origins were faked in order to give certain items added weight or meaning. In addition, their research emphasized that the arbitrary nature of the news networks also had an impact on the news: under Bobrikov, Finland had been allowed only one news agency, and it was tied to the Russian telegram agencies, so much of the news was channeled through Saint Petersburg. Reuters had an office in the Russian capital and sent the message to the London office, from which the information spread widely and quickly. Then as now, news agencies gathered news either indirectly, as was the case here, or directly through their own correspondents, and then transmitted it to national, regional, and local newspapers, which either re-transmitted it directly or did so indirectly through clippings from adjacent newspapers.
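The tempo side of such an analysis can be imagined with a minimal sketch that recovers the earliest reprint date per city from (city, date) records; the records below are placeholders loosely echoing the dates mentioned above, not Oiva et al's actual newspaper metadata:

```python
# A minimal sketch of tracing how quickly a news item reached each city.
# The reprint records are invented for illustration only.
from datetime import date

reprints = [
    ("Helsinki", date(1904, 6, 16)),
    ("St Petersburg", date(1904, 6, 16)),
    ("London", date(1904, 6, 17)),
    ("Mexico City", date(1904, 6, 17)),
    ("London", date(1904, 6, 18)),       # later reprints do not change the arrival date
    ("Hamina", date(1904, 6, 20)),
]

def earliest_arrival(records):
    """Earliest reprint date per city, sorted chronologically."""
    first = {}
    for city, d in records:
        if city not in first or d < first[city]:
            first[city] = d
    return sorted(first.items(), key=lambda kv: kv[1])

for city, d in earliest_arrival(reprints):
    print(f"{d.isoformat()}  {city}")
```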

The efficacy of the network, however, in no way guarantees that a local incident will become global news, nor how long it will endure in the news cycle. Bobrikov’s assassination was interesting in that not only was the event itself reported, and transmitted, but so too were his later death at a hospital, his assassin Eugen Schauman’s letter, condolences sent to Bobrikov’s widow by persons of note, as well as the news itself being featured in other newspapers. The overall effect was to create, as they note, “waves of information.” (The Russians appear to have been particularly interested in how they were being represented elsewhere. Just as interestingly, the Finnish group of which Schauman was a part was just as interested: the project begins with their archive of news clippings, demonstrating how keen they were to track how widely the news circulated and from what perspective it was reported.)

Bardiot et al

For the Leonardo group, of which our own Clarisse Bardiot is a part, the time series involved is focused on theatrical productions and their development through rehearsals and performances. In order to aid not only researchers but also the artists themselves, who, they note, are now “the first curators of their works,” the group developed Rekall, “an environment to document, analyze creative processes and simplify the recovery of works.” The goal is for Rekall to be used during rehearsals—to annotate documents or review the history of a plot sheet—and during production in order to provide an “overview of the creation process and identif[y] the most important documents (which will then have to be the subject of specific preservation measures).”

As the Leonardo group notes, the objectives that guided Rekall’s design were, first, to help artists to document their creations in order to ensure their recovery and overcome the obsolescence of digital technologies, and, second, to help researchers study the genetics of works. The group argues that these two objectives can be satisfied by the same software because “collection takes place for the artists during the creative process itself, as the work stages progress” and analysis takes place for researchers once the process is completed, after the documents have stabilized, frozen, and become traces of what happened.

Recognizing that they needed to overcome any difficulties performing artists might have in working with the software, the Leonardo group realized they had “to develop a multimodal environment that included both digitized documents and natively digital documents: the digital traces of the performing arts include texts, images, sounds, videos, computer programs.” Given this gamut of file types, and the fact that we tend to sort these things out and not in, the group sought to find ways to obtain an overview of all these seemingly disparate pieces as well as a way to thread them together in order to trace “the evolution of an idea through different documents, from image to text, from text to table, from table to video recording.”

Attempting to enact the kinds of spans we now see as central to culture analytics—the ability to navigate between the micro and the macro, between close reading and distant reading, between diachronic and synchronic representation—the Leonardo group has tasked itself with developing software that can represent a process without erasing its complexity. One of their avenues for doing so is attempting to make the file itself the data, so that there is no “break between the source file and the data from which it is extracted.” As they note:

Maintaining this link is fundamental for several reasons. The data are always reductions, fragments and it is important to be able to recontextualize them in order to interpret them better. … In a very concrete way, in a corpus of 10,000 documents, the process of [analysis] must make it possible to identify the 50 documents that the researcher must read closely in order to analyze the work or that the artistic team must migrate or emulate to continue performing his show.

Bardiot will be demonstrating the software later during the conference, and for those interested in these possibilities, I highly recommend attending.

Broadwell et al

Peter Broadwell, Tim Tangherlini, and Hyun Kyong Hannah Chang used network analysis to discover the production networks in Korean popular music. As they describe the project:

We applied network analysis techniques to records in online data archives describing ~4,800 individuals, groups, and companies associated with recent Korean popular music, especially K-pop. Network analyses focusing on specific time intervals can reveal prominent individuals, groups, and larger sub-networks in the Korean music scene during the global rise of K-pop over the past 20 years, shedding light on the comparative structures and scales of these production networks and their changes over time.

The most noteworthy features we observed in the derived production networks of Korean popular music over time were its explosive growth in scale and interconnectedness from approximately 2005 to the present, and the extreme prominence of K-pop organizations—particularly S.M. Entertainment and its premiere idol groups—during this time. The high rankings of S.M.’s groups, especially in eigenvector centrality, were largely a result of the company’s practice of orchestrating collaborations and fabricating subgroups and “supergroups” among its artists, which could be characterized as a form of deliberate “network engineering” to boost its members’ prominence.
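A minimal sketch of the kind of time-sliced centrality ranking described here, using networkx on a placeholder list of collaborations rather than Broadwell et al's archive data (the names, years, and edges are invented for illustration):

```python
# A minimal sketch of time-sliced network analysis: collaborations within a
# chosen interval become edges, and eigenvector centrality ranks the most
# prominent nodes. The edge list below is a placeholder, not archival data.
import networkx as nx

# (node_a, node_b, year) collaboration records -- hypothetical
collaborations = [
    ("Group A", "Group B", 2006),
    ("Group A", "Label X", 2007),
    ("Group B", "Label X", 2009),
    ("Group C", "Label X", 2010),
    ("Group C", "Group A", 2011),
]

def centrality_for_interval(records, start, end):
    """Eigenvector centrality for collaborations falling in [start, end]."""
    G = nx.Graph()
    G.add_edges_from((a, b) for a, b, year in records if start <= year <= end)
    return nx.eigenvector_centrality(G, max_iter=1000)

ranking = centrality_for_interval(collaborations, 2005, 2011)
for node, score in sorted(ranking.items(), key=lambda kv: -kv[1]):
    print(f"{node:8s} {score:.3f}")
```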

Leonard et al

Peter Leonard has headed any number of compelling projects as director of Yale’s Digital Humanities Lab, and the project I want to highlight here, in advance of his coming to talk to you in just a moment, is the collaboration with the Getty Research Institute to analyze Ed Ruscha’s Every Building on the Sunset Strip, published in 1966. In some ways, what he and his team are doing is reverse engineering Ruscha’s work, in which, as I understand it, Ruscha drove a truck down Sunset Boulevard with a camera that took a photograph at a specified time interval, a kind of time series. Ruscha’s series of photographs was in service of imagining the Boulevard in total, and what Leonard et al are trying to do is to re-imagine Ruscha’s project and bring the Sunset of that moment back to life, using advanced machine learning techniques to piece together Ruscha’s photographs into three-dimensional views of the Sunset Strip as it was then.

What lies behind some of Leonard’s genius is his facility with neural networks, which is surely not a phrase many of us get to use very often. But how else to explain his creation of a long-lost work of nineteenth-century Danish literary criticism, the seventh in a six-volume series? More seriously, his work with Lindsay King on “Robots Reading Vogue” allows anyone with access to the web to explore all 2,700 covers and 400,000 pages of Vogue. And you can do it at both the micro and the macro level. Using a refined version of direct visualization, for example, Leonard and King demonstrate those moments when Vogue covers reveal a frightening consistency. In addition to this exploration, the site he and King have built allows users to explore n-grams, topics, histograms, colormetrics, advertisements, and even word embeddings that reveal hierarchical clusters of fabric types. Instead of trying to summarize it, I will simply point you to the site with the advice that it is worth exploring when you have some time and are willing to have your mind blown.

Concluding Remarks

Whether their original publication is in print, on television, in a theatrical production, on the web, or in oral tradition, all these materials deserve archival and analytical attention. As the Leonardo group suggests, in focusing on newspapers and novels, we have taken as iconic or indexical a rather small subset of a much larger network of forms and treated it as culture itself. As archives grow, our abilities to address not only the larger piles of data but also the diversity of the forms of data will also need to grow. We see culture analytics as an open invitation to explore the possibilities. If you have glimpsed nothing else in this talk of mine, then you have surely seen how all of this work is done in collaboration. It is in the spirit of conversation that we proposed this workshop to the conference, and we hope the conversations that take place here are just the beginning.

  1. The current Wikipedia entry for cultural analytics asserts that “the term ‘cultural analytics’ was coined by Lev Manovich in 2007,” but this sentence was added one year after the creation of the page and has no source to substantiate the claim. Similarly, the culturomics website includes among its list of publications one from 2007, but that publication never mentions the term culturomics. It does, however, deploy computational methodologies that would become part of the standard portfolio of practices in the realm of computational studies of culture. 

  2. The only extant citation for this is: https://web.archive.org/web/20110810143644/http://wp.nmc.org/horizon2010/chapters/visual-data-analysis/ 

  3. This quote is attributed to Jon Orwant, described as “computer scientist and director of digital humanities initiatives at Google.” 

  4. http://www.ipam.ucla.edu/programs/summer-schools/networks-and-network-analysis-for-the-humanities-an-neh-institute-for-advanced-topics-in-digital-humanities/ 

  5. http://www.ipam.ucla.edu/programs/long-programs/culture-analytics/ 

  6. Engineering Statistics Handbook 6.4 (https://www.itl.nist.gov/div898/handbook/pmc/section4/pmc4.htm). 

  7. With my apologies for not listing all of the co-authors of “Spreading News in 1904: The Media Coverage of Nikolay Bobrikov’s Shooting” in the paper itself, they are: Mila Oiva (University of Turku), Asko Nivala (University of Turku), Hannu Salmi (University of Turku), Otto Latva (University of Turku), Marja Jalava (University of Turku), Jana Keck (University of Stuttgart), Laura Martínez Domínguez (Universidad Nacional Autónoma de México), James Parker (Northeastern University).