Automating Text Cleaning

I am fundamentally ambivalent about automating text cleaning: spending time with the data, getting unexpected results from your attempts at normalization, strikes me as one way to get to know the data and to put yourself in a position to do better analysis. That noted, a number of interesting text-cleaning libraries, or text-cleaning features built into analytic libraries, have caught my attention over the past year or so. The most recent of these is clean-text. Installation is simple:

pip install clean-text

And then (note that the package installs as clean-text but imports as cleantext):

from cleantext import clean

The clean(the_string, **parameters) function takes a number of interesting keyword parameters, each targeting a particular class of difficulty: fixing broken Unicode, transliterating to ASCII, lowercasing, and replacing URLs, e-mail addresses, and numbers with placeholder tokens.
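
To get a feel for what this kind of normalization involves, here is a minimal stdlib-only sketch of the same idea. The function and flag names echo clean-text's style but this is my own illustration, not the library's API:

```python
import re
import unicodedata

def clean_text(s, fix_unicode=True, lower=True, no_urls=True):
    """A rough, stdlib-only imitation of the kinds of fixes a
    text-cleaning library applies."""
    if fix_unicode:
        # Collapse compatibility characters (ligatures, fullwidth forms, etc.)
        s = unicodedata.normalize("NFKC", s)
    if lower:
        s = s.lower()
    if no_urls:
        # Replace anything that looks like a URL with a placeholder token
        s = re.sub(r"https?://\S+", "<URL>", s)
    # Squeeze runs of whitespace into single spaces
    return re.sub(r"\s+", " ", s).strip()

print(clean_text("Visit  https://example.com  for MORE \ufb01les"))
# → visit <URL> for more files
```

Even this toy version shows why I am ambivalent: every flag encodes a decision about the data that you might rather make by hand.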

Text Analytics APIs 2018

Text Analytics APIs 2018: A Consumer Guide is $895 for a single-user license. At 299 pages, that’s about $3 per page. The blurb notes that:

Robert Dale is an internationally-recognized expert in Natural Language Processing, with three decades of experience in academia and industry. With a PhD from the University of Edinburgh, he’s worked for Microsoft and Nuance, and he’s driven the development of SaaS-based NLP software for a startup. He has taught at the University of Edinburgh in the UK and at Macquarie University in Sydney, and presented tutorials and summer school courses around the world. He has over 150 peer-reviewed publications, including a comprehensive Handbook of Natural Language Processing, and the de facto textbook Building Natural Language Generation Systems.

Open Source Tools for NLP

A recent correspondence featured on the Corpora-List began with a list of lists for doing natural language processing (NLP). I am collecting/compiling the various references and links here with the hope of sorting them at some point.

The first reference is to a StackOverflow thread asking which language, Java or Python, is better for NLP. The question is unanswerable, but in the course of deliberating, contributors discussed a number of libraries for each language. Link.

Apache has its own NLP functionality in Stanbol.

The University of Illinois’ Cognitive Computation Group has developed a number of NLP libraries for Java available on GitHub.

DKPro Core is a collection of software components for natural language processing (NLP) based on the Apache UIMA framework. It’s available on GitHub.

Corpus Tools is a portal dedicated to a number of software tools for corpus processing.

LIMA is developed in C++, runs under Windows and Linux (it can also be built on macOS), and supports tokenization, morphological analysis, POS tagging, parsing, SRL, NER, and more. The free version supports English and French; the closed-source version adds Arabic, German, and Spanish, and experiments have been made with Portuguese, Chinese, Japanese, and Russian. The developer promises that more languages will be added to the free version.

Nikola Milosevic noted that he is developing two tools aimed at processing tables in the scientific literature: “They are a bit specific, since they take XML as input, currently from PMC and DailyMed; an HTML reader will be implemented soon.” The tools are TableAnnotator, which disentangles tabular structure into a structured database, labeling functional areas (headers, stubs, super-rows, data cells) and finding inter-cell relationships and annotations (which can be made with various vocabularies, such as UMLS, WordNet, or any vocabulary in SKOS format), and TabInOut, a wizard, built on TableAnnotator, for making information-extraction rules. He also notes that he has a couple of other open-source tools: a stemmer for Serbian and Marvin, a flexible annotation tool that can use UMLS/MetaMap, WordNet, or a SKOS vocabulary as its annotation source for text.

There is also the SAFAR framework, dedicated to Arabic natural language processing (ANLP). It is free, cross-platform, and modular. It includes: resources needed for different ANLP tasks, such as lexicons, corpora, and ontologies; modules for the basic levels of language, especially for Arabic, namely morphology, syntax, and semantics; ANLP applications; and utilities such as sentence splitting, tokenization, and transliteration.
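
Utilities like sentence splitting and tokenization are the easiest of these to illustrate. A crude stdlib sketch of what such utilities do (this is a generic illustration, not SAFAR's API, and real Arabic tokenization is considerably harder):

```python
import re

def split_sentences(text):
    """Naive sentence splitter: break after ., !, or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(sentence):
    """Naive tokenizer: runs of word characters and individual punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)

text = "SAFAR is modular. It is also free!"
for sent in split_sentences(text):
    print(tokenize(sent))
# → ['SAFAR', 'is', 'modular', '.']
# → ['It', 'is', 'also', 'free', '!']
```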

The Centre for Language and Speech Technology at Radboud University Nijmegen has a suite of tools, all GPLv3, available from their LaMachine distribution for easy installation. Many are also in the upcoming Debian 9 as part of debian-science, in the Arch User Repository, and on the Python Package Index where appropriate.

  • FoLiA: Format for Linguistic Annotation, an extensive and practical format for linguistically annotated resources. Programming libraries are available for Python and C++.
  • FLAT: FoLiA Linguistic Annotation Tool, a comprehensive web-based linguistic annotation tool.
  • Ucto, a regular-expression-based tokeniser with rules for various languages. Written in C++, with a Python binding available. Supports the FoLiA format.
  • Frog, a suite of NLP tools for Dutch (POS tagging, lemmatisation, NER, dependency parsing, shallow parsing, morphological analysis). C++, with a Python binding; supports the FoLiA format.
  • Timbl, a memory-based machine-learning package (k-NN, IB1, IGTree).
  • Colibri Core, an NLP tool as well as a C++ and Python library for working with basic linguistic constructions such as n-grams and skipgrams (i.e. patterns with one or more gaps, of either fixed or dynamic size) in a quick and memory-efficient way.
  • Gecco (Generic Environment for Context-Aware Correction of Orthography).
  • CLAM, which can quickly turn command-line applications into RESTful web services with a web-application front-end.
  • LuigiNLP, a new and still-experimental NLP pipeline system built on top of SciLuigi, itself built on Spotify’s Luigi.
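
The n-grams and skipgrams that Colibri Core is built around are easy to illustrate in miniature. A pure-Python sketch, nothing like Colibri Core's actual memory-efficient implementation, using `{*}` as a gap marker for the simplest fixed-gap case:

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def skipgrams(tokens, n=3):
    """Trigrams with the middle token replaced by a gap marker --
    the simplest case of a skipgram with one fixed-size gap."""
    return [(a, "{*}", c) for a, _, c in ngrams(tokens, n)]

tokens = "to be or not to be".split()
print(ngrams(tokens, 2))
# → [('to', 'be'), ('be', 'or'), ('or', 'not'), ('not', 'to'), ('to', 'be')]
print(skipgrams(tokens))
# → [('to', '{*}', 'or'), ('be', '{*}', 'not'), ('or', '{*}', 'to'), ('not', '{*}', 'be')]
```

Counting these patterns over a large corpus, which Colibri Core does without ever materializing the strings, is the basis for its pattern models.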

Stanford Dependencies in Python

David McClosky wrote to the Corpora List with the following news:

I’m happy to announce two new Python packages for parsing to Stanford Dependencies. The first is PyStanfordDependencies which is a Python interface for converting Penn Treebank trees to Stanford Dependencies. It is designed to be easy to install and run (by default, it will download and use the latest version of Stanford Dependencies for you):

[code lang=text]
import StanfordDependencies
sd = StanfordDependencies.get_instance(version='3.4.1')
sent = sd.convert_tree('(S1 (NP (DT some) (JJ blue) (NN moose)))')
print sent.as_asciitree()
 moose [root]
  +-- some [det]
  +-- blue [amod]
[/code]

PyStanfordDependencies also includes a basic library for reading, manipulating, and producing CoNLL-X style dependency trees.
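
CoNLL-X itself is a simple tab-separated format, one token per line, ten columns per token. A minimal hedged reader sketch (my own illustration, not PyStanfordDependencies' API), using the "some blue moose" sentence from above:

```python
def read_conllx(lines):
    """Parse CoNLL-X lines into (index, form, head, deprel) tuples.
    The ten CoNLL-X columns are: ID, FORM, LEMMA, CPOSTAG, POSTAG,
    FEATS, HEAD, DEPREL, PHEAD, PDEPREL; head 0 marks the root."""
    tokens = []
    for line in lines:
        line = line.strip()
        if not line:  # blank lines separate sentences
            continue
        cols = line.split("\t")
        tokens.append((int(cols[0]), cols[1], int(cols[6]), cols[7]))
    return tokens

sent = [
    "1\tsome\t_\tDT\tDT\t_\t3\tdet\t_\t_",
    "2\tblue\t_\tJJ\tJJ\t_\t3\tamod\t_\t_",
    "3\tmoose\t_\tNN\tNN\t_\t0\troot\t_\t_",
]
for idx, form, head, rel in read_conllx(sent):
    print(idx, form, "->", head, rel)
```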

The second package is an updated version of bllipparser (better known as the Charniak-Johnson reranking parser). bllipparser gives you access to the parser and reranker from Python. The most recent update integrates bllipparser with PyStanfordDependencies, allowing you to parse straight from text to Stanford Dependencies. It also adds tools for reading and manipulating Penn Treebank trees.

More information is available in the READMEs. Feedback, bug reports, and feature requests are welcome (please use the GitHub issue trackers for the latter two).


[OpeNER][] is a language-analysis toolchain that helps (academic) researchers and companies make sense of natural language. It consists of easy-to-install, easy-to-improve, and easy-to-configure components to:

* Detect the language of a text
* Tokenize texts
* Determine the polarity of texts (sentiment analysis) and detect what topics a text includes
* Detect entities named in the texts and link them together (e.g. President Obama or The Hilton Hotel)

The supported language set currently consists of English, Spanish, Italian, German, and Dutch. Besides the individual components, guidelines exist on how to add languages and how to adjust components for specific situations and topics.


The Saffron Research Browser

I’m still trying to figure out what all I can do with the [Saffron][] browser/visualizer. It claims to analyze the research communities of natural language processing, information retrieval, and the semantic web through “text mining and linked data principles.”

The list of research domains is rather short and under-explained for the uninitiated:

Saffron's List of Research Domains

I clicked on [ANLP][], which is *applied natural language processing*, and got both a list of hot topics:

Hot Topics in ANLP

As well as a taxonomy network/tree that offers labels when you hover over nodes, which are themselves clickable links:

Taxonomy Network for ANLP

Clicking on one of the “hot topics,” in this case [natural language text][], gives you a bar chart of the frequency of the topic in documents for the past thirty years:

Frequency of Natural Language Text as a Topic over 30 Years

A list of similar topics:

Topics Similar to "Natural Language Text"

A list of experts:

Saffron's List of Experts Associated with "Natural Language Text"

And a list of publications:

The Top 5 Publications for "Natural Language Text"

Like a lot of browsers, this kind of static presentation of results impoverishes the exploration it otherwise encourages. I also haven’t explored what its inputs are: I wonder how complete its historical record is.

[natural language text]:

The War for Our Texts

[Josh Constine at TechCrunch has an article][] about what he is calling the “message war” that Google, Apple, and Facebook are either already waging or are about to wage. While I rolled my eyes over the somewhat hyperbolic nature of the piece — it is TechCrunch and the world is always about to end or be revolutionized (sometimes at the same time) — I did find the following bit fascinating:

> People love content, but people need direct communication. Who you communicate with on a daily basis and via what medium are vital signals regarding where people sit in your social graph. Whichever company owns the most of this data will have better ways to refine the relevance of their content streams, showing you updates by the people you care about aka communicate with most, and showing ads nearby. Through natural language processing and analysis, whoever controls messages will also get to machine-read all of them and target you with ads based on what you’re talking about.

The social graph has become a cliché, at least among the technorati, but it is still powerful information that companies would like to have in order to market to us better, perhaps on an individual basis. The nature of our relationships, as realized in actual messages, has always been, so most of us have felt, somewhat sacrosanct, off-limits, for us alone to know.

Well, that isn’t necessarily the case, since Google has always made a point of saying that the ads shown through the web interface for its Gmail service are based, in some fashion, on the content of those e-mails. Like a lot of people, I have a Gmail account, but it is strictly a channel for people I don’t know or who need pro forma contact information (site registrations, software licenses, and the like). Thus, what Google gleans about me from reading my Gmail account is rather one-dimensional.

But I do text, and when I do text, it is with those closest to me, which is why I assume everybody wants access to that data. More interestingly, the way they are going to access that data is through a technology that I myself am interested in, *natural language processing*.

The world just keeps getting more and more interesting.

Working with Python’s NLTK, Working on my Python Fu

I started [a thread on Stackoverflow][1] as I try to determine how to write a Python script using the Natural Language Toolkit that will write the concordance for a term out to a file. Here’s the script as it stands:

#! /usr/bin/env python

import nltk
import sys

# First we have to open and read the file:

thefile = open('all_no_id.txt')
raw = thefile.read()

# Second we have to process it with nltk functions to do what we want

tokens = nltk.wordpunct_tokenize(raw)
text = nltk.Text(tokens)

# Now we can actually do stuff with it:

concord = text.concordance("cultural", 75, sys.maxint)

# Now to save this to a file

fileconcord = open('ccord-cultural.txt', 'w')

Eventually I hope to have a script that will ask me for the `source text` and the `term` to be put in context and that will then generate a `text` file with the name of the term in it.

I should note that one of the respondents has already pointed me to a thread on the [NLTK discussion group][2], which I knew existed but had somehow managed not to find.

If you’re interested in the discussion group, here’s its [home page][3] in the new Google Groups format. (It’s an ugly URL, to be sure.)

**Update**: [NLTK is now on GitHub][4]. Some of the [documentation][5], from what I can tell, is in TeX. The NLTK book, which I own as an O’Reilly codex and epub, is also on GitHub, as is [an NLTK repository][6], which appears to be empty for now.

If you’re interested in the book: [visit O’Reilly’s site][site], where you can purchase it in a variety of formats, codex or electronic. The great thing about the e-versions is that you can pick and choose from PDF, epub, or mobi, which means I can have the PDF on my iPad and the epub on my phone and the mobi on my Kindle. If you really only want to deal with Amazon, then if you follow [this link][amz], I will get a small commission.

Natural Language Processing with Python