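Text analysis algorithms need normalized text as input. The most basic preprocessing steps are:
1. Convert all text to lowercase
2. Stem each word
Note: stemming may be imperfect (for example, "famous" is reduced to "famou"), but that is fine as long as the result is consistent across all documents.
3. Remove punctuation
Example (the code removes punctuation before stemming, so the stemmer only sees clean tokens):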
# Text normalization
text = 'Having shown how this small “document” would be represented, let’s use it for something.'
# Convert to lowercase
text = text.lower()
# Remove punctuation
import string
# Build a translation table x -> y; with two string arguments, len(x) must equal len(y)
punct = string.punctuation + '”“‘’'
transtab = str.maketrans(punct, ' ' * len(punct))
text = text.translate(transtab)
# Stemming
from nltk.stem.porter import PorterStemmer
# Note: PorterStemmer is a class, so create an instance with PorterStemmer() before calling stem()
stemmer = PorterStemmer()
text = [stemmer.stem(i) for i in text.split()]
The output is a list of stemmed tokens, i.e. a feature vector whose features are words:
text
Out[112]:
['have',
'shown',
'how',
'thi',
'small',
'document',
'would',
'be',
'repres',
'let',
's',
'use',
'it',
'for',
'someth']
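The steps above can be wrapped into a reusable function, and the resulting tokens can be counted to get one number per word. The sketch below is only an illustration under stated assumptions: the names `PUNCT`, `TRANSTAB`, `normalize` and `term_counts` are hypothetical, the counting step is not part of the original example, and stemming is left as an optional callable so the snippet runs with the standard library alone.

```python
import string
from collections import Counter

# Characters treated as punctuation: ASCII punctuation plus the four
# curly quotes used in the sample sentence (an assumption; extend as needed).
PUNCT = string.punctuation + '”“‘’'
# Dict form of str.maketrans: every punctuation character maps to a space,
# so no length bookkeeping is required.
TRANSTAB = str.maketrans({ch: ' ' for ch in PUNCT})

def normalize(text, stem=None):
    """Lowercase, replace punctuation with spaces, split on whitespace.

    `stem` is an optional callable applied to each token,
    for example the `stem` method of a stemmer instance.
    """
    tokens = text.lower().translate(TRANSTAB).split()
    if stem is not None:
        tokens = [stem(tok) for tok in tokens]
    return tokens

def term_counts(tokens):
    """Count how often each token occurs (hypothetical helper)."""
    return Counter(tokens)

tokens = normalize('The cat saw the other cat.')
counts = term_counts(tokens)
print(tokens)
print(counts['the'], counts['cat'], counts['saw'])
```

To reproduce the stemmed output shown earlier, the same function could be called as `normalize(text, stem=PorterStemmer().stem)`, assuming NLTK is installed.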