Common tools for text cleaning

Original article: https://zhuanlan.zhihu.com/p/53286270 (study notes on an article by Li Wenzhe of 贪心科技)

1. Remove punctuation

import string
s = ''.join(c for c in word if c not in string.punctuation)

2. Convert English text to lowercase

s = s.lower()

3. Normalize numbers

s = '#number' if s.isdigit() else s
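
Steps 1 to 3 can be chained on a whole sentence. The following is a minimal sketch, assuming the text is tokenized by whitespace; the helper names normalize_token and normalize_sentence are hypothetical and not part of the original article.

```python
import string

def normalize_token(word):
    """Apply steps 1-3 to a single token."""
    # Step 1: drop ASCII punctuation characters
    s = ''.join(c for c in word if c not in string.punctuation)
    # Step 2: lowercase
    s = s.lower()
    # Step 3: replace any all-digit token with one placeholder
    return '#number' if s.isdigit() else s

def normalize_sentence(text):
    # Assumption: whitespace tokenization is good enough for this text
    tokens = (normalize_token(w) for w in text.split())
    # Tokens that were pure punctuation become empty strings; drop them
    return [t for t in tokens if t]

print(normalize_sentence("Hello, World! I paid 42 dollars in 2019 ."))
# ['hello', 'world', 'i', 'paid', '#number', 'dollars', 'in', '#number']
```

Note that the placeholder is added after punctuation removal, so the '#' in '#number' survives. A decimal such as "3.14" loses its dot in step 1 and is then also mapped to '#number'.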

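4. Stop-word list / low-frequency word list

Stop-word list: searching for "停用词库" (stop-word list) or "english stop words list" in a search engine turns up many ready-made lists. For example: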

stop_words = ["a","able","about","across","after","all","almost","also","am","among","an","and","any","are","as","at","be","because","been","but","by","can","cannot","could","dear","did","do","does","either","else","ever","every","for","from","get","got","had","has","have","he","her","hers","him","his","how","however","i","if","in","into","is","it","its","just","least","let","like","likely","may","me","might","most","must","my","neither","no","nor","not","of","off","often","on","only","or","other","our","own","rather","said","say","says","she","should","since","so","some","than","that","the","their","them","then","there","these","they","this","tis","to","too","twas","us","wants","was","we","were","what","when","where","which","while","who","whom","why","will","with","would","yet","you","your","ain't","aren't","can't","could've","couldn't","didn't","doesn't","don't","hasn't","he'd","he'll","he's","how'd","how'll","how's","i'd","i'll","i'm","i've","isn't","it's","might've","mightn't","must've","mustn't","shan't","she'd","she'll","she's","should've","shouldn't","that'll","that's","there's","they'd","they'll","they're","they've","wasn't","we'd","we'll","we're","weren't","what'd","what's","when'd","when'll","when's","where'd","where'll","where's","who'd","who'll","who's","why'd","why'll","why's","won't","would've","wouldn't","you'd","you'll","you're","you've"]

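Low-frequency word list: we can use a tool such as Counter to count how often each word occurs across all sentences, then filter by frequency to obtain the low-frequency words. For example:

from collections import Counter
# Build the word-frequency dictionary
# (this counts words only if sentence_list is a flat list of words)
word_dict = Counter(sentence_list)
# Build the low-frequency word list: words that occur fewer than 2 times
low_frequency_words = [k for (k, v) in word_dict.items() if v < 2]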
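
When the corpus is stored as one token list per sentence, the tokens have to be flattened before counting; otherwise Counter would be handed whole sentences instead of words. The following is a minimal sketch under that assumption; the toy corpus and the name min_count are made up for illustration.

```python
from collections import Counter

# Hypothetical toy corpus: each sentence is already a list of tokens
sentences = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "bird", "flew"],
]

# Flatten so the Counter counts words, not whole sentences
word_dict = Counter(w for sent in sentences for w in sent)

min_count = 2  # assumed threshold, the same cutoff as v < 2
# A set makes the later "word in low_frequency_words" checks fast
low_frequency_words = {w for w, c in word_dict.items() if c < min_count}

print(sorted(low_frequency_words))
# ['bird', 'cat', 'dog', 'flew']
```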

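Once both the stop-word list and the low-frequency word list are available, drop every word that appears in either of them: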

if s not in stop_words and s not in low_frequency_words:
    sentence += s
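
The check above handles one word s at a time. A self-contained version for a whole token list might look like the sketch below; the function name remove_words and the sample lists are hypothetical, and joining the kept words with a space is an assumption that suits English text.

```python
def remove_words(tokens, stop_words, low_frequency_words):
    # Merge both lists into one set for fast membership tests
    drop = set(stop_words) | set(low_frequency_words)
    # Keep the remaining words in order, separated by single spaces
    return ' '.join(t for t in tokens if t not in drop)

# Shortened sample lists, for illustration only
sample_stop_words = ["the", "a", "is"]
sample_low_frequency_words = ["zyzzyva"]

tokens = ["the", "model", "is", "a", "zyzzyva", "classifier"]
print(remove_words(tokens, sample_stop_words, sample_low_frequency_words))
# model classifier
```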

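5. Remove unnecessary tags

This step has to be adapted to the data at hand. For example, use the re library to delete or replace text with regular expressions, use the json library to parse JSON data, or apply hand-written rules to process the text.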
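
As one possible illustration of combining json and re, the sketch below parses a JSON record and strips HTML-like tags from one field. The record layout and the field name "content" are invented for this example, and the regular expression is a rough heuristic rather than a real HTML parser, so it may mishandle malformed markup.

```python
import json
import re

# Matches anything that looks like <...>; a heuristic, not an HTML parser
TAG_RE = re.compile(r'<[^>]+>')

def strip_tags(text):
    # Delete the tags, then collapse runs of whitespace
    no_tags = TAG_RE.sub('', text)
    return re.sub(r'\s+', ' ', no_tags).strip()

# Hypothetical input record
raw = '{"id": 1, "content": "<p>Hello <b>world</b>!</p>"}'
record = json.loads(raw)

print(strip_tags(record["content"]))
# Hello world!
```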


Reposted from www.cnblogs.com/duoba/p/12307738.html