Text sanitizing means cleaning up entered text according to previously established parameters or rules. Those rules usually involve text parsing and handling.
A whole set of tools can be used to clean up such input. For example, basic tools like the string method str.replace() or the regex function re.sub() can remove or replace specific character sequences.
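As a quick illustration (a minimal sketch; the sample string is made up for this example):
import re
s='first-name:  John!!'
print(s.replace('!',''))     # 'first-name:  John' - removes a literal character
print(re.sub(r'[-:]',' ',s)) # 'first name   John!!' - replaces a set of characters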
Additionally, there is Unicode normalization. Consider the following strings:
s1='Menue\u00f1o' # Menueño, with the precomposed ñ (U+00F1)
s2='Menuen\u0303o' # Menueño, with n followed by a combining tilde (U+0303)
They represent the same text, but they are represented by different code point sequences; therefore the strings themselves are different:
print(s1==s2)
Output:
False
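A quick way to see why (a small sketch; ascii() shows the escaped code points):
print(len(s1), ascii(s1)) # 7 'Menue\xf1o'
print(len(s2), ascii(s2)) # 8 'Menuen\u0303o'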
Obviously, comparing those strings is a problem, so this needs to be fixed. For Unicode, Python sports the unicodedata module, whose normalize() function can do the trick:
import unicodedata
fix_s1=unicodedata.normalize('NFC', s1)
fix_s2=unicodedata.normalize('NFC', s2)
print(fix_s1==fix_s2) # True
The 'NFC' argument means full composition of the characters, so if we apply it to both strings, they become comparable. There is another argument, 'NFD', which denotes decomposition: it fully decomposes characters into base characters plus combining marks:
fix_s1=unicodedata.normalize('NFD', s1)
fix_s2=unicodedata.normalize('NFD', s2)
print(fix_s1==fix_s2) # True
To see the difference between NFC and NFD, let's print the ascii() representation of the first string after applying each form:
print(ascii(unicodedata.normalize('NFC', s1))) # NFC outputs 'Menue\xf1o'
print(ascii(unicodedata.normalize('NFD', s1))) # NFD outputs 'Menuen\u0303o'
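To inspect exactly what the decomposed form contains, the unicodedata.name() function can label each code point (a small sketch using only the standard library):
import unicodedata
for ch in unicodedata.normalize('NFD', '\u00f1'):
    print(ascii(ch), unicodedata.name(ch))
# 'n' LATIN SMALL LETTER N
# '\u0303' COMBINING TILDE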
Furthermore, when cleaning up text, the various strip() methods (strip(), lstrip(), rstrip()) can be used. They get rid of whitespace or given characters on the left and/or right side of a string, but not in the middle:
s=' python is open source '
s=s.strip()
print(s)
Output:
python is open source
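The one-sided variants and the character-argument form work the same way (a quick sketch; the sample strings are arbitrary):
s=' python is open source '
print(repr(s.lstrip()))         # 'python is open source '
print(repr(s.rstrip()))         # ' python is open source'
print('---hello---'.strip('-')) # 'hello'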
As seen above, strip() trims leading and trailing whitespace, but it can't remove the inner occurrences. To do that, we can use the replace() method, as follows:
s=s.replace(' ','')
Which outputs:
pythonisopensource
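If the goal is to collapse inner whitespace runs into single spaces rather than delete them, the re.sub() function mentioned earlier is a better fit (a minimal sketch):
import re
s=' python   is\topen source '
print(re.sub(r'\s+', ' ', s).strip()) # 'python is open source'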
Going further, there are other, more advanced options for text clean-up. Suppose a chaotic string like this:
"ṗŕögrặmṃìng\fis\t great\r".
Whole ranges of characters can be processed, and diacritical marks stripped, by using the str.translate() method. Let's first fix the whitespace:
s='ṗŕögrặmṃìng\fis\t great\r'
process={
    ord('\t'): None, # deleted
    ord('\f'): ' ',  # replaced with a space
    ord('\r'): None  # deleted
}
s=s.translate(process)
print(s)
Output:
ṗŕögrặmṃìng is great
Basically, translate() uses a translation table which is then applied to the given string. One can build a larger table and include all the fixes inside it. Adding this code to the previous block:
import sys
import unicodedata
# map every combining code point to None, i.e. mark it for deletion
combine=dict.fromkeys(c for c in range(sys.maxunicode) if unicodedata.combining(chr(c)))
b=unicodedata.normalize('NFD', s) # decompose into base characters + combining marks
b=b.translate(combine)            # drop the combining marks
print(b)
Output:
programming is great
In this example, dict.fromkeys() builds a dictionary (through a generator expression) that maps every combining Unicode code point to None. The original input is then normalized into decomposed form using normalize() with the 'NFD' parameter. Afterwards, the translate() method deletes all the combining marks, leaving only the unaccented base characters.
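Putting the pieces together, the steps can be bundled into a single helper; this is just a sketch under the assumptions above, and the clean_text name is hypothetical:
import sys
import unicodedata

COMBINING=dict.fromkeys(c for c in range(sys.maxunicode) if unicodedata.combining(chr(c)))
WHITESPACE={ord('\t'): None, ord('\f'): ' ', ord('\r'): None}

def clean_text(text): # hypothetical helper bundling the steps above
    text=text.translate(WHITESPACE) # fix the odd whitespace first
    text=unicodedata.normalize('NFD', text) # decompose accented characters
    text=text.translate(COMBINING) # strip the combining marks
    return unicodedata.normalize('NFC', text) # recompose what is left

print(clean_text('ṗŕögrặmṃìng\fis\t great\r')) # programming is great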
A closing note: the larger the data set, the more likely complex code is to run into performance issues. Performance-wise, it is therefore sometimes better to repeat simpler operations than to forge a one-solution-fits-all. In any case, testing will reveal the most acceptable solution in that regard.
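For example, a quick comparison with the standard timeit module (a sketch; the absolute numbers are machine-dependent):
import timeit
s=' python is open source '*100
t_replace=timeit.timeit(lambda: s.replace(' ',''), number=10000)
t_translate=timeit.timeit(lambda: s.translate({ord(' '): None}), number=10000)
print('replace():  ', round(t_replace, 3), 'seconds')
print('translate():', round(t_translate, 3), 'seconds')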