I am fundamentally ambivalent about automating text-cleaning: spending time with the data, and getting unexpected results from your attempts at normalization, strikes me as one way to get to know the data and to be in a position to do better analysis. That noted, a number of interesting text-cleaning libraries, and text-cleaning features built into analytic libraries, have caught my attention over the past year or so. The most recent of these is clean-text. Installation is simple:
pip install clean-text
And then:
from cleantext import clean
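(Note that although the package installs as clean-text, hyphens are not legal in Python module names, so the import uses cleantext.) To give a feel for the kind of normalization the library performs, here is a rough, standard-library-only sketch of the same idea. This is not clean-text's implementation, and the function name naive_clean and its keyword arguments are my own invention for illustration; the library's actual parameters are covered below.

```python
import re
import unicodedata

def naive_clean(text, lower=True, no_urls=True, replace_with_url="<URL>"):
    """A toy approximation of the kind of cleaning clean-text automates.

    Not the library's implementation -- just a sketch of the same
    idea using only the standard library.
    """
    # Normalize Unicode, folding full-width characters, ligatures, etc.
    text = unicodedata.normalize("NFKC", text)
    # Replace URLs with a placeholder token
    if no_urls:
        text = re.sub(r"https?://\S+", replace_with_url, text)
    # Case-fold so later matching is consistent
    if lower:
        text = text.lower()
    # Collapse runs of whitespace into single spaces
    return " ".join(text.split())

print(naive_clean("Ｖisit https://example.com  NOW"))
# → visit <url> now
```

Even this crude version shows why such parameters matter: each switch (lowercasing, URL replacement) changes what downstream tokenization and counting will see.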
The function clean(the_string, *parameters*)
accepts a number of interesting parameters, each focused on a particular array of difficulties: