csplit < awk

I regularly need to split larger text files into smaller text files, or chunks, in order to do some kind of text analysis/mining. I know I could write a Python script that would do this, but that often involves a lot more scripting than I want, and I’m lazy, and there’s also this thing called csplit which should do the trick. I’ve just never mastered it. Until now.

Okay, so I want to split a text file I’ll call excession.txt (because I like me some Banks). Let’s start building the csplit line:

csplit -f excession excession.txt 'Culture 5' '{*}'

… Apparently I still haven’t mastered it. But this bit of awk worked right away:

awk '/Culture 5 - Excession/{filename=NR"excession"}; {print >filename}' excession.txt
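For comparison, here is roughly what the Python version I was avoiding might look like. It is only a sketch: the function name split_on_marker and the out_dir argument are made up for illustration, and it matches the marker as a plain substring rather than a regular expression. It also assumes that any lines before the first marker should go to a file starting with 0, since the awk one-liner has no filename for those lines.

```python
from pathlib import Path


def split_on_marker(source, marker, suffix, out_dir="."):
    """Start a new chunk at every line that contains `marker`.

    Chunks are named "<line number><suffix>", mirroring NR"excession"
    in the awk one-liner. Lines before the first marker are written to
    "0<suffix>" (an assumption of this sketch, not awk behaviour).
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []  # paths of the chunks, in the order they were created
    out = None    # currently open chunk, if any
    try:
        with open(source, encoding="utf-8") as handle:
            for number, line in enumerate(handle, start=1):
                starts_chunk = marker in line
                if starts_chunk or out is None:
                    if out is not None:
                        out.close()  # close each chunk before opening the next
                    name = f"{number if starts_chunk else 0}{suffix}"
                    target = out_dir / name
                    out = open(target, "w", encoding="utf-8")
                    written.append(target)
                out.write(line)
    finally:
        if out is not None:
            out.close()
    return written
```

Calling split_on_marker("excession.txt", "Culture 5 - Excession", "excession") should leave the same kind of numbered files in the current directory as the awk line does. That is a lot of typing next to one line of awk, which is rather the point.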

For the record, I’m interested in working with the Culture novels of Iain M. Banks. I am converting MOBI files into EPUBs using Calibre, and then into plain text files. No, I cannot make these available to anyone, so please don’t ask.

The Culture series:

  1. Consider Phlebas (1987)
  2. The Player of Games (1988)
  3. Use of Weapons (1990)
  4. The State of the Art (1991)
  5. Excession (1996)
  6. Inversions (1998)
  7. Look to Windward (2000)
  8. Matter (2008)
  9. Surface Detail (2010)
  10. The Hydrogen Sonata (2012)

Assessing the Corpora Available to Me

When I go to the Culture Analytics program at UCLA’s IPAM in a little over a month, I want to arrive with at least one interesting corpus with which to work. I have the following options:

  • Louisiana treasure legends:
  • Hook legend:
  • Oil industry interviews: 480 texts

The oil industry interviews come as a collection of mostly DOC files with an RTF file or two mixed in. They are a mixed bag in terms of content, but perhaps doing some distant reading might turn up something interesting. To do that, I need to get them into a form with which I can work:

textutil -convert txt ~/Desktop/transcripts/*.docx

And, just after, the same command as above except with *.rtf at the end. Now I’ve got 480 plain text files. It would be nice, for the sake of using filenames later, to get rid of some part of the file names:

Lastname, Firstname 08-09-2006 final.txt
Lastname, Firstname and Firstname 01-23-02 final.txt

I created two Automator workflows. The first makes all the letters in the file names lowercase (a personal preference) and replaces spaces with underscores; the second trims all occurrences of final or transcript from the end of the file names. (This could just as easily have been one workflow, but I created two, since I am guessing I will reuse them in the future.) Now file names look like this:


Still somewhat ungainly, but it will do for now.
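For anyone without Automator, the same cleanup can be approximated in a few lines of Python. This is a sketch under assumptions, not a copy of my workflows: the names tidy_name and rename_all are invented here, the trailing words are only removed when a space or underscore precedes them, and Automator may treat edge cases differently.

```python
import re
from pathlib import Path

# A trailing "final" or "transcript", preceded by at least one space or
# underscore so that a word like "semifinal" is left alone.
TRAILING_WORD = re.compile(r"[ _]+(?:final|transcript)$")


def tidy_name(filename):
    """Lowercase the name, trim trailing final/transcript, use underscores."""
    path = Path(filename)
    stem = path.stem.lower().strip()
    # Loop in case a name ends with both words ("... transcript final").
    while TRAILING_WORD.search(stem):
        stem = TRAILING_WORD.sub("", stem)
    stem = re.sub(r"\s+", "_", stem.strip())
    return stem + path.suffix.lower()


def rename_all(folder):
    """Rename every .txt file in `folder`, never overwriting another file."""
    for path in sorted(Path(folder).glob("*.txt")):
        target = path.with_name(tidy_name(path.name))
        if target == path:
            continue  # already tidy
        if target.exists() and not target.samefile(path):
            continue  # a different file already has that name
        path.rename(target)
```

Under these assumptions, tidy_name("Lastname, Firstname 08-09-2006 final.txt") returns lastname,_firstname_08-09-2006.txt. The comma survives, since nothing in the sketch removes punctuation.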

The Future of Jobs

Occasionally I try to think about what the future will look like for my students: how best can I prepare them to do whatever it is they want to do? And so I find myself reading things like the World Economic Forum’s “The Future of Jobs”, which opens with this rather stunning claim:

65% of children entering primary school today will ultimately end up working in completely new job types that don’t yet exist.

Hmmm. I don’t quite know how to feel about statements like this. I get the extraordinary changes taking place in the world’s economy, but I think such statements also ignore the extraordinary changes not taking place: we still have bodies. We need to feed those bodies. We need to house those bodies. We need to move them about the landscape. (If the singularity comes sooner than expected, then all bets are off.)

The Perils of “Folklore”

“That’s All Folks!” is a piece from a 1997 issue of Lingua Franca, for those who remember it fondly, about the perils of the name “folklore”. The name of the field was enjoying a moment, in a larger cycle of such moments, of being debated. Should we switch to folkloristics to sound more, well, ic-ky, like linguistics or physics, or to ethnology to sound more like all the ologies (biology, psychology, sociology, etc.)? I’m shortly headed to UCLA to join the Culture Analytics program, so I thought it was worth remembering that this nominal nom-nom, as the kids might say, has a history.