spaCy Model Features

spaCy currently offers four models / pipelines, which are labeled, less than clearly, small, medium, large, and transformer. If you have wondered about the differences, the documentation offers individual tables, which I have compiled into one table here:

(Sorry this is an image of a table and not a proper table: WordPress’ block editing is not helpful — I will switch back to markdown as soon as I have a moment.)

And here is a look at the accuracy statistics as provided by the documentation. Again, all I am doing here is offering a synoptic view:

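| Metric | Description | small | medium | large | transformer |
| --- | --- | --- | --- | --- | --- |
| TAG_ACC | Part-of-speech tags (fine grained tags, Token.tag) | 0.97 | 0.97 | 0.97 | 0.98 |
| SENTS_P | Sentence segmentation (precision) | 0.92 | 0.92 | 0.92 | 0.96 |
| SENTS_R | Sentence segmentation (recall) | 0.89 | 0.90 | 0.89 | 0.87 |
| SENTS_F | Sentence segmentation (F-score) | 0.90 | 0.91 | 0.90 | 0.91 |
| DEP_UAS | Unlabeled dependencies | 0.92 | 0.92 | 0.92 | 0.95 |
| DEP_LAS | Labeled dependencies | 0.90 | 0.90 | 0.90 | 0.94 |
| ENTS_P | Named entities (precision) | 0.84 | 0.85 | 0.85 | 0.90 |
| ENTS_R | Named entities (recall) | 0.84 | 0.86 | 0.86 | 0.90 |
| ENTS_F | Named entities (F-score) | 0.84 | 0.85 | 0.86 | 0.90 |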
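
As a sanity check on these figures, each F-score row should be, up to rounding, the harmonic mean of the matching precision and recall rows. Here is a minimal sketch, assuming the documentation uses that standard F1 definition. The function name `f_score` is my own, and the inputs are the already-rounded sentence segmentation figures, so the last digit may occasionally differ from the published value:

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (the usual F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Sentence segmentation (precision, recall) as reported for each pipeline
sents = {
    "small": (0.92, 0.89),
    "medium": (0.92, 0.90),
    "large": (0.92, 0.89),
    "transformer": (0.96, 0.87),
}

for name, (p, r) in sents.items():
    print(f"{name}: {f_score(p, r):.2f}")
```

This also makes it easier to see why the transformer pipeline's sentence segmentation F-score is no better than the medium pipeline's: its higher precision is offset by lower recall.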

A Structuralist Mini-Reader

As structuralism and post-structuralism, aka grand theory, re-emerge in the context of the digital humanities and quantitative approaches, I remembered that years ago I had compiled a small reader focused on Lévi-Strauss.

The original reader contained the following items, which I have since scanned and OCRed to PDF, if anyone is interested. (Please contact me if you are: I want to abide by fair use provisions.)

Boon, James. 1985. Claude Lévi-Strauss. In The Return of Grand Theory in the Human Sciences, 159-176. Ed. Quentin Skinner. Cambridge University Press.

Lévi-Strauss, Claude. 1995. Myth and Meaning. Schocken Books.

Lévi-Strauss, Claude. 1971. The Deduction of Crane. In Structural Analysis of Oral Tradition, 3-21. Ed. Pierre Maranda and Elli Köngäs Maranda. University of Pennsylvania Press.

Lévi-Strauss, Claude. 1996. The Story of Lynx. Tr. Catherine Tihanyi. University of Chicago Press.

In addition to these essays, I have also begun to collect essays that comment either on the nature of structuralist (and post-structuralist) thinking and rhetoric or on connections between that moment and the current one:

Ruegg, Maria. 1979. Metaphor and Metonymy: The Logic of Structuralist Rhetoric. Glyph 6: 141-57.

Geoghegan, Bernard Dionysius. 2011. From Information Theory to French Theory: Jakobson, Lévi-Strauss, and the Cybernetic Apparatus. Critical Inquiry 38: 96-126.

If anyone else is interested in compiling more of such a bibliography, I would be interested in the conversation and/or collaboration.

Nonfiction Books

A list of books I loaned out years ago, and apparently never got back, reminded me of some beloved non-fiction books that I am considering re-purchasing:

Trevor Corson’s Secret Life of Lobsters: How Fishermen and Scientists Are Unraveling the Mysteries of Our Favorite Crustacean

Hayden Carruth’s Sitting in: Selected Writings on Jazz, Blues, and Related Topics

And one book is listed simply as Stonework, and while I remember the small paperback well, I do not remember more. Perhaps it is Charles McRaven’s Stonework: Techniques and Projects.

Jakobson’s Response to Saussure’s Cours

From Ladislav Matejka’s “Jakobson’s Response to Saussure’s Cours”:

In the parlance of the octogenarian Jakobson, the decomposition of the phoneme into concurrent distinctive features rejected Saussure’s “linearité du signifiant” and, thereby, one of the general principles of his Cours. In spite of this rejection, it is clear, however, that in the gradual development of distinctive feature theory Jakobson’s decades-long duel with Saussure’s concept of the phoneme had played a crucial role. In fact, it is perhaps not far from the truth to claim that without Jakobson’s life-long duel with Saussure’s Cours, there would not be Jakobson’s distinctive features theory as we know it.

Matejka goes on to note that Jakobson early on rejected the absoluteness of Saussure’s antinomy between synchrony and diachrony: “every system necessarily exists as an evolution, whereas, on the other hand, evolution is inescapably of a systemic nature” (Jakobson 1928).

Matejka, Ladislav. 1997. Jakobson’s Response to Saussure’s Cours. Cahiers de l’ILSL 9: 169–176.

Blogging’s Dimming Future

As part of a larger effort of getting rid of things I don’t need, which includes materials and links and notes that I have stashed all over my computer’s hard drive, I am spending Saturday night going through Safari’s Reading List. What’s worth keeping, I am saving to Pocket, and then I am deleting the rest.

Along the way, I came across Ben Thompson’s “Blogging’s Bright Future” from 2 February 2015. The essay begins with what was then breaking news: that many pundits née bloggers were lamenting the demise of the blog. Thompson’s analysis is smart as always, and he observes that many were lamenting the demise of the single blogger as blogs sought to get bigger and deliver more readers to advertisers. Thompson’s model would be the one that was eventually adopted by others and led to the rise of Substack and Matter, among others.

There’s plenty to discuss there, but what I was struck by was the following passage:

The truth, though, is that blogging has evolved. It is absolutely true that the old Sullivan-style — tens of posts a day, mostly excerpts and links, with regular essays in immediate response to ongoing news — is mostly over.

What I liked about my blog, this blog, when I first started it was how it was simply that, a web log, a place where I kept notes that were also public, so if someone asked me something and I had already written about it, I could simply point them to the blog.

And then the blog got attention, and people were looking at it, and it was getting linked to by Ivy League libraries and national research centers, and I got too nervous to post all the things that in fact made the blog a blog for me.

And along the way WordPress went from being blogging software to a publishing platform.

And all the fun went out of it, and all the utility, too.

Chunks of what people describe as this second brain phenomenon strike me as what the blog, my blog, used to be. I don’t know if this will ever get back to that. There are some downsides to keeping things in public, but it does make me wonder about simply creating an internal blog.

Jupyter Notebook Colored Boxes

Many thanks to Daniel Kotik for the following HTML that can be dropped into a Jupyter notebook markdown cell:

<div class="alert alert-block alert-info"> <b>NOTE</b> Use blue boxes for Tips and notes. </div>

<div class="alert alert-block alert-success"> Use green boxes sparingly, and only for some specific purpose that the other boxes can't cover. For example, if you have a lot of related content to link to, maybe you decide to use green boxes for related links from each section of a notebook. </div>

<div class="alert alert-block alert-warning"> Use yellow boxes for examples that are not inside code cells, or use for mathematical formulas if needed. </div>

<div class="alert alert-block alert-danger"> In general, just avoid the red boxes. </div>
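
If you would rather generate these boxes from a code cell than paste the HTML by hand, something like the following might work. This is only a sketch: `alert_box` and `ALERT_KINDS` are hypothetical names of my own, and it assumes the notebook front end styles the same four alert classes used in the snippets above.

```python
from html import escape

# The four alert styles used in the snippets above
ALERT_KINDS = {"info", "success", "warning", "danger"}


def alert_box(kind: str, text: str, title: str = "") -> str:
    """Build the HTML for one colored box (hypothetical helper)."""
    if kind not in ALERT_KINDS:
        raise ValueError(f"unknown alert kind: {kind!r}")
    # Escape the text so stray < or & characters do not break the markup
    heading = f"<b>{escape(title)}</b> " if title else ""
    return (
        f'<div class="alert alert-block alert-{kind}"> '
        f"{heading}{escape(text)} </div>"
    )


print(alert_box("info", "Use blue boxes for tips and notes.", title="NOTE"))
```

Inside a notebook, the returned string can be wrapped in `IPython.display.HTML` and displayed from a code cell, or simply pasted into a markdown cell as before.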

Research Mindset

In a recent article in Inc, Maria Haggerty concludes that the most important qualities to look for in individuals who may be, or are, high performers are:

  • long-term commitment to a specific domain: This describes a person who is committed to making an increasing difference to one domain over a sustained period of time.
  • questing disposition: When confronted with a challenge, this person becomes excited and wants to pursue that challenge, seeing it as an opportunity to reach the next level of performance.
  • connecting disposition: A person whose instinct, when confronted with a challenge, is to actively reach out and connect with others who can help address it together.

I look at that list and think: that sounds like you are describing a researcher, or at least a research mindset.

Installing Git on macOS

It’s good to be reminded that things are not easy when it comes to things like analytics. For those already deep into it with well-established setups, it’s easy to forget how hard-won that setup might have been.

I was handed just such a reminder today when I decided, as part of my effort to re-build my website using GitHub Pages (after a decade and a half on WordPress), that instead of waiting for the web infrastructure to build the site so I could check changes, I would run it locally. This may strike many GH Pages users as obvious, but I was genuinely trying to develop a code-free website that I could then share with colleagues and students to get them started using a text editor and Git. What’s more gratifying than a website? Instant publishing.

GitHub Pages run on Ruby and use the Jekyll gem. A quick web search revealed that the best way to install Ruby on macOS was Homebrew, which is itself written in Ruby. Fine. I use conda to maintain my Python stack. It makes sense. And as an added bonus, people love Homebrew and I know it does a whole lot more.

The convention for installing Homebrew is:

/bin/bash -c "$(curl -fsSL"

Only that reveals that I have forgotten to install the Xcode Command Line Tools. That’s not so strange to me: when I used to use MacPorts for package management, it was the first step to install MacPorts, which became an almost annual task as Apple increased the frequency with which it released major versions of macOS.

xcode-select --install

Installation done. I re-ran the bash command above. Oops! I forgot to sign the license.

sudo xcodebuild -license

License agreed to, I re-ran the Homebrew installation command again. You agree to a few things … and fail. Something about git. I do a few casual web searches that do not solve the problem until I remember to do the obvious: paste the error message into the search: xcode-select: Failed to locate 'git'. At some point I stumbled upon the solution:

xcodebuild -runFirstLaunch

I also ended up clearing out the previous failed Homebrew installs and starting somewhat from scratch:

sudo rm -rf /usr/local/Homebrew

That’s a bit of a winding path for me, and I have a reasonable amount of patience and an almost reasonable amount of awareness, if not actual knowledge. This would be overwhelming for a lot of new users. I understand now why so many courses in which the basics are taught feature such installations as part of a class meeting. There are just so many weird things that can go wrong, and it helps to have someone who can help you troubleshoot and who can also re-assure you that eventually you will have a working installation and you will not need to worry about any of this … until next time.
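
For anyone walking a class through this, a quick preflight check might shorten the path. The sketch below only reports whether each tool is on the PATH; the list of tools is my own reading of what the steps above ended up depending on, and `find_tools` is a hypothetical helper name.

```python
import shutil

# Command-line tools the walkthrough above ended up depending on
TOOLS = ["xcode-select", "git", "brew", "ruby"]


def find_tools(names):
    """Map each tool name to its location on PATH, or None if it is missing."""
    return {name: shutil.which(name) for name in names}


for name, path in find_tools(TOOLS).items():
    print(f"{name}: {path or 'not found'}")
```

One caveat: on macOS, finding git on the PATH does not guarantee that it works, since the system copy can be a placeholder that only prompts you to install the Command Line Tools. Treat a "found" result as necessary, not sufficient.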

Textacy SVOs

If you have landed here, then you have been intrigued by the possibility of having Textacy do the work of delivering subject-verb-object triples out of your text data. It can be done, but there are some nuances to making things happen.

One of the first issues I encountered was getting ValueErrors about nlp.max_length when using Textacy’s built-in function, but if I used spaCy to create a spaCy doc everything was fine:

import spacy

# Load the spaCy pipeline to be used
nlp = spacy.load('en_core_web_lg')

# Use the pipe method to feed documents (texts_f is a list of strings)
docs = list(nlp.pipe(texts_f))

# Checking to see if things worked:
print(len(docs), type(docs[0]))

Please note that Textacy does have a corpus object. I have not used it yet, but it looks like you could simply feed it the list of spaCy docs. It allows you to bundle metadata with the texts — I would like to see examples of how people are using it.

import textacy
corpus = textacy.Corpus(nlp, data=docs)  # reuse the pipeline that made docs so the vocabularies match

spaCy has built-in PoS tagging; accessing it looks like this:

for token in docs[0][0:5]:
    print(token, token.tag_, token.pos_)  # spacy.explain(token.tag_) describes a tag

# SVOs is assumed here to be a list of triples for the test document,
# e.g. list(textacy.extract.subject_verb_object_triples(docs[0]))

# If we want to see all the nouns used
# as subjects in the test document:
subjects = [str(item[0]) for item in SVOs]
subjects_set = set(subjects)

print(f"There are {len(subjects_set)} unique subjects out of {len(subjects)}.")

# Get out just the first person singular triples:
for item in SVOs:
    if str(item[0]) == '[i]':
        print(item)

It looks like the verb “contents” (the verb phrase) contains more material than we want. If all we want is the verb itself, we will need to target the last item in the verb list.

for item in SVOs:
    if str(item[0]) == '[i]':
        # item[1] is the list of verb tokens; keep only the last one
        print(item[0], item[1][-1], item[2])
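
The verb-trimming step can also be tried in isolation, without loading a pipeline. The sketch below uses stand-in data: `Triple` is a hypothetical namedtuple that mimics the subject / verb / object shape of the triples, with plain strings in place of spaCy tokens, and `main_verb_only` is my own helper name. It assumes, as above, that the last item in the verb list is the verb we want, with any auxiliaries coming before it.

```python
from collections import namedtuple

# Stand-in for the triples: three lists of tokens.
# Plain strings are used here in place of spaCy tokens.
Triple = namedtuple("Triple", ["subject", "verb", "object"])


def main_verb_only(triples, subject="i"):
    """Keep triples with the given one-word subject; trim the verb to its last token."""
    trimmed = []
    for triple in triples:
        # Compare token texts rather than the string form of the whole list
        if [str(tok) for tok in triple.subject] == [subject]:
            trimmed.append((subject, str(triple.verb[-1]), triple.object))
    return trimmed


sample = [
    Triple(["i"], ["have", "been", "thinking"], ["it"]),
    Triple(["she"], ["wrote"], ["letters"]),
]
print(main_verb_only(sample))
```

Comparing the token texts directly avoids depending on how a list of tokens happens to print, which is what the `'[i]'` comparison relies on.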