Automating Text Cleaning

I am fundamentally ambivalent about the automation of text cleaning: spending time with the data, and getting unexpected results from your attempts at normalization, strikes me as one way to get to know the data and to be in a position to do better analysis. That said, a number of interesting text-cleaning libraries, and text-cleaning functionality built into analytic libraries, have caught my attention over the past year or so. The most recent of these is clean-text. Installation is simple:

pip install clean-text

And then:

from cleantext import clean

The clean(the_string, *parameters*) function takes a number of interesting parameters, each aimed at a particular kind of difficulty.
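To give a sense of the kinds of difficulties such a function bundles together, here is a minimal sketch that uses only the standard library. It is not clean-text's implementation, and I am not reproducing its parameter list here: the name simple_clean and every keyword argument below are hypothetical, chosen only to illustrate the idea of switching individual normalization steps on and off.

```python
import re
import unicodedata

# Deliberately simple patterns; real-world URLs and emails are messier.
URL_RE = re.compile(r"https?://\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
WHITESPACE_RE = re.compile(r"\s+")


def simple_clean(text, lowercase=True, ascii_only=True, mask_urls=True,
                 mask_emails=True, collapse_whitespace=True):
    """Hypothetical cleaner: each flag toggles one normalization step."""
    # Fold compatibility characters (e.g. full-width letters) to a common form.
    text = unicodedata.normalize("NFKC", text)
    if ascii_only:
        # Decompose accented characters, then drop anything that is not ASCII.
        text = unicodedata.normalize("NFKD", text)
        text = text.encode("ascii", "ignore").decode("ascii")
    if lowercase:
        text = text.lower()
    if mask_urls:
        text = URL_RE.sub("<URL>", text)
    if mask_emails:
        text = EMAIL_RE.sub("<EMAIL>", text)
    if collapse_whitespace:
        # Runs of spaces, tabs, and line breaks become a single space.
        text = WHITESPACE_RE.sub(" ", text).strip()
    return text


if __name__ == "__main__":
    print(simple_clean("Café  visit\nhttps://example.com NOW"))
```

Even a toy like this shows why I remain ambivalent: the order of the steps matters (lowercasing after masking would also lowercase the placeholder tokens), and every one of those decisions is something you would otherwise discover by looking at the data.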
