Word Machines

Prof. John Laudun / HLG 356 / laudun@louisiana.edu

ENGL 370 TR 11:00–12:15 (Oliver 202)

Description

The analysis of texts can be a daunting prospect, whether the texts in question are millions of small messages or thousands of long documents, each made up of thousands of words. Texts are messy: they are the products of humans who have more than one way of apprehending the world (and who usually apprehend it quite differently) and who are, in fact, negotiating reality through texts. However messy they appear, though, texts produce their effects predictably and are more structured than they seem. While some may shudder at the memory of vocabulary quizzes, sentence diagrams, and lectures on lyric poetry or the sonnet, the fact is that almost all texts regularly and reliably string words together in sequencing patterns that cause them to be received in particular ways. Maybe we can just imagine them as machines made of words?

Objectives

Of the many dimensions that computation can address, perhaps one of the most difficult is dealing with how humans actually use language. While a great deal can be gleaned from a variety of bounded events and actions (for example, by limiting the possible responses to a question), much more can be learned when we allow humans to produce the texts they wish to produce and we, as analysts, set about finding better ways to understand them. This course is an introduction to text analytics, also known as text mining, with particular emphasis on texts and analytical outcomes. It is designed to familiarize students with the many dimensions of addressing text as data, from text processing to vector semantics to recent developments in AI. The objectives are straightforward: to understand the basics of natural language processing, to develop approaches to text processing with a special focus on a corpus developed by participants (either individually or collaboratively), and to communicate those efforts effectively both in language and through visualizations.

Expectations

The course assumes that participants are familiar with the fundamental features of Python, the programming language we will use to examine natural language. To be clear, most of our examples will be drawn from English, but some of our discussion will focus on what it means to work across languages. We will also give some consideration to how a focus on English has shaped the development of NLP, which in turn has shaped the functionality of many NLP operations available in Python and other programming languages.

The course sequence begins with a quick review of Python and basic text processing operations. From there we spend time with regular expressions, parsers, and state machines. After that, we take up the necessity of collecting data and then working with our data set to answer questions that arise from the data itself.
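
To give a sense of where the sequence begins, here is a minimal sketch of the kind of basic text processing the review covers: tokenizing with a regular expression and counting word frequencies. (The function name and sample sentence here are illustrative only, not part of the course materials.)

    import re
    from collections import Counter

    def tokenize(text):
        """Lowercase a text and split it into word tokens with a regular expression."""
        return re.findall(r"[a-z']+", text.lower())

    sample = "Texts are messy, but texts are also more structured than they appear."
    tokens = tokenize(sample)

    # Counter gives us the word-frequency table that so much text analytics builds on.
    print(Counter(tokens).most_common(3))
    # [('texts', 2), ('are', 2), ('messy', 1)]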

As participants work through the course, they are expected to:

  1. understand the basics of text processing in Python

  2. build upon those basics to develop approaches to text processing

  3. apply those approaches to a set of texts (a corpus)

  4. develop a corpus of their own and articulate which approaches best suit that corpus

We will use a variety of data sets throughout the course (in many instances these are known as corpora), and those data sets are expected to serve as models and examples for building your own.

Finally, it should be noted that this is not a course in programming, and it is not taught by a computer or information scientist. In many instances, participants will be better programmers than the instructor: my role is to offer the insights of someone who has examined texts both by hand and by algorithm for over twenty years and to facilitate your engagement with the known issues and dimensions as well as those you discover.

Requirements

Texts

Required reading in this class comes in two forms: the textbooks listed below and a series of readings from published studies that approach the question of text analytics from a variety of perspectives, from the humanities to the human sciences to information and computer science.

We will be using one of the following texts:

Karsdorp, Folgert, Mike Kestemont, and Allen Riddell. 2021. Humanities Data Analysis: Case Studies with Python. Princeton University Press.

Bengfort, Benjamin, Rebecca Bilbro, and Tony Ojeda. 2018. Applied Text Analysis with Python: Enabling Language-Aware Data Products with Machine Learning. O’Reilly Media.

Grimmer, Justin, Margaret E. Roberts, and Brandon M. Stewart. 2022. Text as Data: A New Framework for Machine Learning and the Social Sciences. Princeton University Press.

We will also be using the following book, which is available for free as a website; the URL is included in the citation below.

Bird, Steven, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python. O’Reilly Media. https://www.nltk.org/book/.

Jacob Perkins, author of Python 3 Text Processing with NLTK 3 Cookbook, has demos on stemming and lemmatization, sentiment analysis, tagging and chunk extraction, and phrase extraction and named entity recognition. URL.

Please note: we read broadly from corpus linguistics, corpus stylistics, and information and data science. Successful participation requires that you keep up with the reading.

Assignments

More than anything, this is a course where participants do things: from start to finish, the course is a series of activities designed to give you a sense of how to think about texts when you are doing your work. While we begin with some toy corpora to expose you to ideas and issues associated with text analytics, the goal of the course is to have you develop a corpus of your own, either individually or collaboratively, to perform some analysis, and to report your findings in a clear and cogent fashion. (See section on criteria below for details on how assignments are graded.)

Exams. Three times during the semester, participants are examined on topics covered in the preceding weeks. The possible exam formats are: oral, where participants walk through a problem in conversation with the instructor; in-class written; and out-of-class written, which is obviously more like a writing assignment. Both written formats may include pseudo-code and/or code as well as comments that explain not only the code but also the concepts and methods at stake. For example, if a step in the overall workflow removes stop words, participants are expected to explain how and why that happens, as in the sketch below.
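
The sketch assumes NLTK is installed and its stopwords corpus has been downloaded (via nltk.download('stopwords')); the token list is illustrative.

    from nltk.corpus import stopwords

    # Build a lookup set of English stop words from NLTK's list.
    stops = set(stopwords.words("english"))
    tokens = ["the", "machine", "is", "made", "of", "words"]

    # Why remove stop words? They are so frequent that they can swamp the
    # counts of the words that distinguish one text from another.
    content = [t for t in tokens if t not in stops]
    print(content)  # ['machine', 'made', 'words']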

Project(s). This course is designed to produce a final project made up of various parts/phases. The plural in the assignment title reflects a multi-stage process in which sub-assemblies are put together in sequence and feedback is provided by both the instructor and other participants. The final version of the project, which may very well not be finished, gathers all the pieces into an assembly, and the producer(s) of the project explain not only the pieces and the assembly but also their response to feedback.

Evaluation Criteria (same for all projects)

Code/Approach (50%): Does the program produce useful results? Is it properly documented with comments and docstrings? Does it use appropriate data structures? (A sketch of what such documentation looks like follows these criteria.)

Write Up (50%): Is it clear what was done and why? Does it adequately cite sources, references, and any code adapted? Does it include some evaluation? Is there any analysis of errors? Does it point out any remaining issues? Is it formatted correctly?
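
By way of illustration, a properly documented function might look something like the sketch below: a docstring stating purpose and output, comments at the non-obvious steps, and a data structure suited to the task. (The measure computed here is only an example, not a requirement of any project.)

    from collections import Counter

    def type_token_ratio(tokens):
        """Return the type-token ratio of a list of word tokens.

        The ratio of distinct word types to total tokens is a rough
        measure of lexical variety in a text.
        """
        counts = Counter(tokens)          # distinct types and their frequencies
        return len(counts) / len(tokens)  # types divided by tokens

    print(type_token_ratio(["a", "rose", "is", "a", "rose"]))  # 0.6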

Assessment problems are generally open-ended: participants are not expected to solve them fully. The goal is to see how they approach and understand the problem.

Schedule

The schedule for this course can be found in the course’s dedicated repository, which also includes other course materials.