Basic Techniques for Normalizing Text Data

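Text analysis algorithms need text in a normalized form. The most basic processing steps are:

1. Convert all text to lowercase

2. Stem each word

Note: stemming is not perfect. For example, it reduces "famous" to "famou". This is fine as long as the result is the same across all documents.

3. Remove punctuation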
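To see why these steps matter, the sketch below compares the vocabulary of two tiny documents before and after lowercasing and punctuation removal. The sample documents and the helper names `raw_tokens` and `normalized_tokens` are made up for illustration, and stemming is left out so that the snippet needs only the standard library.

```python
import string

# Hypothetical sample documents, chosen only to show surface variants of one word
docs = ["Famous, famous... FAMOUS!", "The famous 'document'."]

def raw_tokens(doc):
    # Split on whitespace only, with no normalization
    return doc.split()

def normalized_tokens(doc):
    # Lowercase, then replace every ASCII punctuation mark with a space
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return doc.lower().translate(table).split()

raw_vocab = {tok for doc in docs for tok in raw_tokens(doc)}
norm_vocab = {tok for doc in docs for tok in normalized_tokens(doc)}

print(len(raw_vocab), sorted(raw_vocab))
print(len(norm_vocab), sorted(norm_vocab))
```

Without normalization, "Famous,", "famous..." and "FAMOUS!" count as three different words; after normalization they all become the single token "famous".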

Example:


# Text normalization
import string

from nltk.stem.porter import PorterStemmer

text = 'Having shown how this small “document” would be represented, let’s use it for something.'

# Convert to lowercase
text = text.lower()

# Remove punctuation: map every ASCII punctuation mark and the four curly quotes to a space.
# str.maketrans(x, y) requires len(x) == len(y).
punctuation = string.punctuation + '”“‘’'
transtab = str.maketrans(punctuation, ' ' * len(punctuation))
text = text.translate(transtab)

# Stemming: PorterStemmer is a class, so create one instance and reuse it for every token
stemmer = PorterStemmer()
text = [stemmer.stem(token) for token in text.split()]

The result is a list of stemmed tokens, in which each word serves as a feature:

text
Out[112]: 
['have',
 'shown',
 'how',
 'thi',
 'small',
 'document',
 'would',
 'be',
 'repres',
 'let',
 's',
 'use',
 'it',
 'for',
 'someth']
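A token list like this is usually turned into numbers before it is passed to an algorithm. The sketch below shows one common way to do that, a simple word-count vector over a shared vocabulary. It is an illustration rather than part of the original example: the second token list and the helper `to_count_vector` are hypothetical.

```python
from collections import Counter

# First list: the first six tokens from the output above.
# Second list: a hypothetical second document, already normalized the same way.
doc_tokens = [
    ['have', 'shown', 'how', 'thi', 'small', 'document'],
    ['thi', 'document', 'use', 'thi', 'small', 'someth'],
]

# Shared vocabulary: every distinct token, sorted so the column order is stable
vocabulary = sorted({tok for tokens in doc_tokens for tok in tokens})

def to_count_vector(tokens, vocabulary):
    # Count how often each vocabulary word occurs in one document
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

vectors = [to_count_vector(tokens, vocabulary) for tokens in doc_tokens]

print(vocabulary)
for vec in vectors:
    print(vec)
```

Because both documents were stemmed the same way, "thi" and "document" land in the same columns for each of them, which is why consistent stemming matters more than linguistically perfect stems.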

Reposted from blog.csdn.net/zs15321583801/article/details/84061198