Working through Textacy's SVO extraction

14 Jul 2022

If you have landed here, then you have been intrigued by the possibility of having Textacy do the work of delivering subject-verb-object (SVO) triples out of your text data. It can be done, but there are some nuances to making things happen.

One of the first issues I encountered was getting a ValueError about nlp.max_length when using Textacy's built-in functions, but if I used spaCy directly to create the spaCy docs, everything was fine:

# Load the spaCy pipeline to be used
import spacy

nlp = spacy.load('en_core_web_lg')

# Use the pipe method to feed documents
docs = list(nlp.pipe(texts_f))

# Checking to see if things worked:
print(len(docs), docs[0][:10])

Please note that Textacy does have a Corpus object. I have not used it yet, but it looks like you could simply feed it the list of spaCy docs. It also allows you to bundle metadata with the texts – I would like to see examples of how people are using it.

corpus = textacy.Corpus("en_core_web_lg", data=docs)

spaCy has built-in part-of-speech (PoS) tagging; accessing it looks like this:

for token in docs[0][0:5]:
    print(token, token.tag_, token.pos_)  # spacy.explain(token.tag_) decodes the tag
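The snippets below work over an SVOs list. A minimal sketch of how such a list might be built with textacy.extract.subject_verb_object_triples, which yields triples whose subject, verb, and object slots are lists of tokens (the sample sentence is invented, and I use the small model here to keep the sketch light):

```python
# Sketch: building the SVOs list used in the rest of the post.
# Each extracted triple holds the subject, verb, and object.
import spacy
import textacy.extract

nlp = spacy.load("en_core_web_sm")
docs = [nlp("I wrote the report. She reviewed the draft.")]

SVOs = []
for doc in docs:
    SVOs.extend(textacy.extract.subject_verb_object_triples(doc))

print(SVOs)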
# If we want to see all the nouns used
# as subjects in the test document:
subjects = [str(item[0]) for item in SVOs]  # item[0] is the subject slot
subjects_set = set(subjects)

print(f"There are {len(subjects_set)} unique subjects out of {len(subjects)}.")
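Beyond counting uniques, the same list can be fed to collections.Counter to see which subjects dominate. A self-contained sketch – the subject strings here are invented stand-ins for the str(item[0]) values produced above:

```python
# Sketch: ranking subjects by frequency with a Counter.
# The subject strings are made-up examples.
from collections import Counter

subjects = ["[i]", "[i]", "[she]", "[report]", "[i]"]
counts = Counter(subjects)

print(counts.most_common(2))  # the two most frequent subjects
```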
# Get out just the first person singular triples:
for item in SVOs:
    if str(item[0]) == '[i]':
        print(item)

It looks like the verb slot's contents – the verb phrase – contain more material than we want. If all we want is the verb itself, we will need to target the last item in the verb list.

for item in SVOs:
    if str(item[0]) == '[i]':
        print(item[1][-1])  # item[1] is the verb list; its last token is the verb itself

You can go back to the logbook or dive into the archive. Choose your own adventure!