Term Frequency-Inverse Document Frequency (TF-IDF)
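Term frequency-inverse document frequency (TF-IDF) has two parts: term frequency (TF) and inverse document frequency (IDF). Given a corpus $D$ of $N$ documents, let $n_{t,d}$ be the number of times term $t$ occurs in document $d$.
- Term frequency (TF): the relative frequency of term $t$ in document $d$,
  $$\mathrm{tf}(t, d) = \frac{n_{t,d}}{\sum_{t'} n_{t',d}}$$
- Inverse document frequency (IDF): the logarithm of the total number of documents divided by the number of documents that contain term $t$; it measures how informative the term is,
  $$\mathrm{idf}(t, D) = \log \frac{N}{|\{d \in D : t \in d\}|}$$
  With smoothing (adding 1 to the document count, which avoids division by zero for a term that occurs in no document):
  $$\mathrm{idf}(t, D) = \log \frac{N}{1 + |\{d \in D : t \in d\}|}$$
- Term frequency-inverse document frequency (TF-IDF):
  $$\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D)$$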
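To make the definitions concrete, the minimal sketch below evaluates them by hand for a single term. The toy corpus and the helper name `tf_idf_single` are illustrative assumptions, and the smoothed IDF form log(N / (1 + df)) is the one used.

```python
import math

# Toy corpus (hypothetical), already tokenized and lower-cased.
docs = [
    ["i", "come", "to", "china", "to", "travel"],
    ["this", "is", "a", "car"],
    ["i", "love", "tea"],
    ["the", "work", "is", "to", "write"],
]


def tf_idf_single(term, doc, docs):
    """Return (tf, smoothed idf, tf-idf) for one term in one document."""
    tf = doc.count(term) / len(doc)          # occurrences / document length
    df = sum(1 for d in docs if term in d)   # number of documents containing the term
    idf = math.log(len(docs) / (1 + df))     # smoothed: log(N / (1 + df))
    return tf, idf, tf * idf


tf, idf, score = tf_idf_single("to", docs[0], docs)
print(f"tf={tf:.4f}, idf={idf:.4f}, tf-idf={score:.4f}")
```

One consequence of this particular smoothing is that a term occurring in every document gets a slightly negative IDF, since log(N / (N + 1)) < 0.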
import numpy as np
import pandas as pd
import re
def tf_idf(term_list, doc_list):
    """
    compute term frequency (documents x terms) and smoothed
    inverse document frequency (one value per term)
    """
    num_docs = len(doc_list)
    num_terms = len(term_list)
    # term frequency
    tf = pd.DataFrame(
        data=np.zeros(shape=(num_docs, num_terms)), columns=term_list
    )
    for idx_doc, doc in enumerate(doc_list):
        len_doc = len(doc)
        # count occurrences of each term in this document
        for term in term_list:
            for term_doc in doc:
                if term_doc == term:
                    tf.loc[idx_doc, term] += 1
        # normalizing by document length
        tf.loc[idx_doc, :] /= len_doc
    # inverse document frequency
    idf = pd.Series(
        data=np.zeros(shape=(num_terms, )), index=term_list
    )
    # document frequency: number of documents containing each term
    for term in term_list:
        for doc in doc_list:
            if term in doc:
                idf.loc[term] += 1
    # smoothing: idf = -log((1 + df) / N) = log(N / (1 + df))
    idf += 1
    idf /= num_docs
    idf = - np.log(idf)
    return tf, idf
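def segmentation(doc_list):
    """
    lower-case, remove punctuation, split on spaces;
    note: the entries of doc_list are replaced in place by token lists
    """
    term_list = {}
    for i, q in enumerate(doc_list):
        # drop every character that is neither a word character nor whitespace
        q = re.sub(pattern=r"[^\w\s]", repl="", string=q)
        terms = q.lower().strip().split(" ")
        doc_list[i] = terms
        # count occurrences of each term over the whole corpus
        for term in terms:
            if term_list.get(term):
                term_list[term] += 1
            else:
                term_list[term] = 1
    term_list = pd.Series(term_list)
    return term_list, doc_list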
doc_list = [
    "I come to China to travel",
    "This is a car polupar in China",
    "I love tea and Apple ",
    "The work is to write some papers in science"
]
term_list, doc_list = segmentation(doc_list)
print("terms:")
print(term_list)
print("documents:")
print(doc_list)
terms:
i 2
come 1
to 3
china 2
travel 1
this 1
is 2
a 1
car 1
polupar 1
in 2
love 1
tea 1
and 1
apple 1
the 1
work 1
write 1
some 1
papers 1
science 1
dtype: int64
documents:
[['i', 'come', 'to', 'china', 'to', 'travel'], ['this', 'is', 'a', 'car', 'polupar', 'in', 'china'], ['i', 'love', 'tea', 'and', 'apple'], ['the', 'work', 'is', 'to', 'write', 'some', 'papers', 'in', 'science']]
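Note that `segmentation` overwrites the entries of `doc_list` in place, and that `split(" ")` produces empty tokens when a document contains consecutive spaces. A non-mutating variant could look like the sketch below; `tokenize` and `build_vocabulary` are hypothetical names, and `collections.Counter` stands in for the manual dictionary counting.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case, drop punctuation, and split on any run of whitespace."""
    cleaned = re.sub(r"[^\w\s]", "", text)
    return cleaned.lower().split()


def build_vocabulary(raw_docs):
    """Return (term counts, tokenized documents) without modifying raw_docs."""
    tokenized = [tokenize(doc) for doc in raw_docs]
    counts = Counter(term for doc in tokenized for term in doc)
    return counts, tokenized


raw_docs = ["I come to China to travel", "I love tea and Apple "]
counts, tokenized = build_vocabulary(raw_docs)
print(counts.most_common(3))
print(tokenized)
```

Calling `split()` without an argument also discards leading and trailing whitespace, so the explicit `strip()` is no longer needed.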
tf, idf = tf_idf(term_list.index, doc_list)
print("term frequency:")
print(tf)
print("inverse document frequency:")
print(idf)
term frequency:
          i      come        to     china    travel      this        is  \
0  0.166667  0.166667  0.333333  0.166667  0.166667  0.000000  0.000000
1  0.000000  0.000000  0.000000  0.142857  0.000000  0.142857  0.142857
2  0.200000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000
3  0.000000  0.000000  0.111111  0.000000  0.000000  0.000000  0.111111

          a       car   polupar  ...  love  tea  and  apple       the  \
0  0.000000  0.000000  0.000000  ...   0.0  0.0  0.0    0.0  0.000000
1  0.142857  0.142857  0.142857  ...   0.0  0.0  0.0    0.0  0.000000
2  0.000000  0.000000  0.000000  ...   0.2  0.2  0.2    0.2  0.000000
3  0.000000  0.000000  0.000000  ...   0.0  0.0  0.0    0.0  0.111111

       work     write      some    papers   science
0  0.000000  0.000000  0.000000  0.000000  0.000000
1  0.000000  0.000000  0.000000  0.000000  0.000000
2  0.000000  0.000000  0.000000  0.000000  0.000000
3  0.111111  0.111111  0.111111  0.111111  0.111111
[4 rows x 21 columns]
inverse document frequency:
i 0.287682
come 0.693147
to 0.287682
china 0.287682
travel 0.693147
this 0.693147
is 0.287682
a 0.693147
car 0.693147
polupar 0.693147
in 0.287682
love 0.693147
tea 0.693147
and 0.693147
apple 0.693147
the 0.693147
work 0.693147
write 0.693147
some 0.693147
papers 0.693147
science 0.693147
dtype: float64
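The code above returns `tf` and `idf` separately and never forms their product. As a rough sketch of that last step, the block below rebuilds both quantities with vectorized pandas operations and multiplies them into a TF-IDF table. It re-declares the tokenized documents so that it runs on its own; the names `vocab`, `counts`, `df`, and `tfidf` are illustrative assumptions, and the smoothed IDF log(N / (1 + df)) matches the loop-based version.

```python
from collections import Counter

import numpy as np
import pandas as pd

# Tokenized documents, copied from the printed output above.
docs = [
    ["i", "come", "to", "china", "to", "travel"],
    ["this", "is", "a", "car", "polupar", "in", "china"],
    ["i", "love", "tea", "and", "apple"],
    ["the", "work", "is", "to", "write", "some", "papers", "in", "science"],
]

# Vocabulary in first-seen order (dict keys preserve insertion order).
vocab = list(dict.fromkeys(term for doc in docs for term in doc))

# Count matrix: one row per document, one column per term.
rows = []
for doc in docs:
    doc_counts = Counter(doc)
    rows.append([doc_counts[term] for term in vocab])
counts = pd.DataFrame(rows, columns=vocab, dtype=float)

# Term frequency: divide each row by its document length.
tf = counts.div(counts.sum(axis=1), axis=0)

# Document frequency and smoothed IDF: log(N / (1 + df)).
df = (counts > 0).sum(axis=0)
idf = np.log(len(docs) / (1 + df))

# TF-IDF: scale every column of tf by the IDF of its term.
tfidf = tf.mul(idf, axis=1)
print(tfidf.round(4))
```

The `axis=1` argument aligns the index of `idf` with the columns of `tf`, so each term column is scaled by its own IDF value.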