Open Source Tools for NLP

A recent correspondence featured on the Corpora-List began with a list of lists for doing natural language processing (NLP). I am collecting and compiling the various references and links here with the hope of sorting them at some point.

The first reference is to a StackOverflow thread asking which language, Java or Python, is better for NLP. The question is unanswerable, but in the course of the deliberation a number of libraries for each language were discussed. Link.

Apache has its own NLP functionality in Stanbol.

The University of Illinois’ Cognitive Computation Group has developed a number of NLP libraries for Java available on GitHub.

DKPro Core is a collection of software components for natural language processing (NLP) based on the Apache UIMA framework. It’s available on GitHub.

Corpus Tools is a portal dedicated to a number of software tools for corpus processing.

LIMA is developed in C++, runs on Windows and Linux (it is possible to build it on macOS), and supports tokenization, morphological analysis, POS tagging, parsing, SRL, NER, etc. The free version supports English and French; the closed-source version adds support for Arabic, German, and Spanish, and experiments have been made with Portuguese, Chinese, Japanese, and Russian. The developer promises that more languages will be added to the free version.

Nikola Milosevic noted that he was developing two tools aimed at processing tables in scientific literature: “They are a bit specific, since they take as input XMLs, currently from PMC and DailyMed; soon an HTML reader will be implemented.” The tools are TableAnnotator, which disentangles tabular structure into a structured database by labeling functional areas (headers, stubs, super-rows, data cells), finding inter-cell relationships, and making annotations (with various vocabularies, such as UMLS, WordNet, or any vocabulary in SKOS format), and TabInOut, a wizard, built on TableAnnotator, for making information extraction rules. He also notes that he has a couple of other open source tools: a stemmer for Serbian and Marvin, a flexible annotation tool that can use UMLS/MetaMap, WordNet, or a SKOS vocabulary as its annotation source for text.

There is also the SAFAR framework, dedicated to Arabic Natural Language Processing (ANLP). It is free, cross-platform, and modular. It includes: resources needed for different ANLP tasks, such as lexicons, corpora, and ontologies; basic language-analysis modules, especially for Arabic, namely morphology, syntax, and semantics; ANLP applications; and utilities such as sentence splitting, tokenization, transliteration, etc.

The Centre for Language and Speech Technology at Radboud University Nijmegen has a suite of tools, all GPLv3, available from their LaMachine distribution for easy installation. Many are also in the upcoming Debian 9 as part of debian-science, in the Arch User Repository, and on the Python Package Index where appropriate.

  • FoLiA: Format for Linguistic Annotation is an extensive and practical format for linguistically annotated resources. Programming libraries are available for Python and C++.
  • FLAT: FoLiA Linguistic Annotation Tool, a comprehensive web-based linguistic annotation tool.
  • Ucto is a regular-expression-based tokeniser with rules for various languages. Written in C++, with a Python binding available as well. Supports the FoLiA format.
  • Frog is a suite of NLP tools for Dutch (POS tagging, lemmatisation, NER, dependency parsing, shallow parsing, morphological analysis). Written in C++, with a Python binding available; supports the FoLiA format.
  • Timbl is a memory-based machine learning suite (k-NN, IB1, IGTree).
  • Colibri Core is an NLP tool as well as a C++ and Python library for working with basic linguistic constructions such as n-grams and skipgrams (i.e. patterns with one or more gaps, either of fixed or dynamic size) in a quick and memory-efficient way.
  • Gecco (Generic Environment for Context-Aware Correction of Orthography) is a context-aware spelling correction system.
  • CLAM can quickly turn command-line applications into RESTful webservices with web-application front-end.
  • LuigiNLP is a new and still-experimental NLP pipeline system built on top of SciLuigi, which is in turn built on Spotify’s Luigi.

Batch Converting DOCX Files

My students live in a Microsoft universe, for the most part. I don’t blame them: it’s what their parents and teachers know. And I blame those same adults in their lives for not teaching them how to do anything more powerful with that software, turning Word into nothing more than a typewriter with the ability to format things in an ad hoc fashion. Style sheets! Style sheets! Style sheets! As a university professor, I duly collect their Word documents, much as I would collect their printed documents, and I read them, mark them up, and hand them back. Yawn.1

Sometimes, just to play with them, I take all their papers and I mine them for patterns: words and phrases and topics that occur across a number of papers. You can’t do that with Word documents, so you need to convert them into something more useful. (And, honestly, much of what my students turn in could be done in plain text and we would all be better off.)

On a Mac, textutil does the trick nicely:

textutil -convert txt ./MyDocxFiles/*.docx

I generally then select all the text files and move them to their own directory, where, for some forms of mining I simply lump them into one big file:

cat ./texts/*.txt > alltexts.txt

(I should probably figure out how to do the “convert to text” and “place in another directory” in one command line.)
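One way to do both at once, sketched here under the assumption that textutil’s -output flag (which names the output file for a single conversion) works as the man page describes — it becomes a short loop rather than a single invocation:

```shell
# Convert each .docx and write the resulting .txt directly into ./texts,
# using textutil's -output flag to name each output file.
mkdir -p ./texts
for f in ./MyDocxFiles/*.docx; do
  textutil -convert txt -output "./texts/$(basename "${f%.docx}").txt" "$f"
done
```

The `${f%.docx}` parameter expansion strips the extension, and basename drops the source directory, so each output lands in ./texts with a matching name.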

pandoc can also do this, and I need to figure that syntax out.
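For what it’s worth, a sketch of the pandoc version, assuming pandoc is installed (`-t plain` requests plain-text output; `-o` names the output file):

```shell
# Convert each .docx to plain text with pandoc, one output file per input.
mkdir -p ./texts
for f in ./MyDocxFiles/*.docx; do
  pandoc "$f" -t plain -o "./texts/$(basename "${f%.docx}").txt"
done
```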

  1. I also sit through their prettily formatted but also fairly substance-less PowerPoints — I’m not just picking on them here: I also work with them on making such presentations more meaningful. 

Of Types, Motifs, Tropes

For our next class, we are going to go a-hunting, tale-type hunting. I am going to bring an assortment of texts, some folktales and some not, that I will give you to track down. Your means of determining the nature of the texts will be the Tale-Type Index and the Motif Index. You will, I think, fairly quickly figure out how to use those two instruments to your best advantage.

It might also be a good moment to think about the nature of such cataloging efforts. One place to begin, as a kind of quick review of the origins and development of the indices, is the Wikipedia entry on the Aarne–Thompson classification systems. (There is a separate entry on motif worth reading.) Once there, you will see a reference to a rather recent consideration, in terms of the indices themselves, by Alan Dundes, “The Motif-Index and the Tale Type Index: A Critique.” (There is also Hans-Jörg Uther’s assessment in “Classifying Folktales.”)

The two indices work together to catalogue those tales within their pages by their constituent parts, motifs. As a number of observers have remarked, this is no small matter and has led some to regard the entire enterprise as hopeless, given the seemingly endless variability of the human imagination.

And yet, as seemingly old-fashioned as the tale-type and motif indices would seem to be, we have re-created them in TV Tropes. And so, it would seem, some of you have already played a drinking game to tale types. Congratulations.