Term Frequency - Inverse Document Frequency (TF-IDF)

Term frequency - inverse document frequency (TF-IDF) is composed of two parts: term frequency (TF) and inverse document frequency (IDF). Given a corpus $\mathcal{D} = \{ d_{j} \}$, let $n_{i, j}$ denote the number of times term $t_{i}$ occurs in document $d_{j}$.

  • Term frequency (TF): the frequency with which term $t_{i}$ occurs in document $d_{j}$

$$\text{TF}_{i, j} = \frac{n_{i, j}}{\sum_{k} n_{k, j}}$$

  • Inverse document frequency (IDF): the total number of documents divided by the number of documents that contain term $t_{i}$, on a logarithmic scale; it measures how important (informative) a term is

$$\text{IDF}_{i} = \log \frac{|\mathcal{D}|}{|\{ j : t_{i} \in d_{j} \}|}$$

With smoothing (adding 1 to the denominator avoids division by zero for a term that appears in no document):

$$\text{IDF}_{i} = \log \frac{|\mathcal{D}|}{|\{ j : t_{i} \in d_{j} \}| + 1}$$

  • Term frequency - inverse document frequency (TF-IDF):

$$\text{TF-IDF}_{i, j} = \text{TF}_{i, j} \times \text{IDF}_{i}$$
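
For a quick sanity check of these definitions, take a small worked case (numbers chosen for illustration): a corpus of $|\mathcal{D}| = 4$ documents in which term $t_{i}$ occurs once in a 6-term document $d_{j}$ and appears in 2 of the 4 documents. Using the smoothed IDF (natural logarithm, as in the code below):

$$\text{TF}_{i, j} = \frac{1}{6} \approx 0.167, \qquad \text{IDF}_{i} = \log \frac{4}{2 + 1} \approx 0.288, \qquad \text{TF-IDF}_{i, j} \approx 0.167 \times 0.288 \approx 0.048$$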

import numpy as np
import pandas as pd
import re
def tf_idf(term_list, doc_list):
    """Compute per-document term frequencies (tf) and smoothed inverse
    document frequencies (idf) for a vocabulary and tokenized documents."""
    num_docs = len(doc_list)
    num_terms = len(term_list)
    
    # term frequency
    tf = pd.DataFrame(
        data=np.zeros(shape=(num_docs, num_terms)), columns=term_list
    )
    for idx_doc, doc in enumerate(doc_list):
        len_doc = len(doc)
        for term in term_list:
            for term_doc in doc:
                if term_doc == term:
                    tf.loc[idx_doc, term] += 1
                    
        # normalizing
        tf.loc[idx_doc, :] /= len_doc
    
    # inverse document frequency
    idf = pd.Series(
        data=np.zeros(shape=(num_terms, )), index=term_list
    )
    for term in term_list:
        for doc in doc_list:
            if term in doc:
                idf.loc[term] += 1
    # smoothed IDF: log(num_docs / (document frequency + 1))
    idf += 1
    idf /= num_docs
    idf = -np.log(idf)
    
    return tf, idf
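
The nested loops above are easy to follow but slow for large corpora. A more compact variant of the same computation is sketched below (an addition for illustration, not part of the original post; the name tf_idf_compact is made up here). It relies only on the imports already used plus collections.Counter and should produce the same tf and idf values.

from collections import Counter

def tf_idf_compact(term_list, doc_list):
    """Same result as tf_idf above, without the explicit nested loops."""
    num_docs = len(doc_list)
    # term counts per document: one row per document, one column per term
    tf = pd.DataFrame(
        data=[Counter(doc) for doc in doc_list], columns=term_list
    ).fillna(0)
    # normalize each row by the length of the corresponding document
    tf = tf.div([len(doc) for doc in doc_list], axis=0)
    # document frequency of each term, then smoothed IDF: log(|D| / (df + 1))
    doc_freq = pd.Series(
        {term: sum(term in doc for doc in doc_list) for term in term_list}
    )
    idf = np.log(num_docs / (doc_freq + 1))
    return tf, idf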
                
    
def segmentation(doc_list):
    """
    split, remove non-alphanumeric chars
    """
    term_list = {}
    for i, q in enumerate(doc_list):
        q = re.sub(pattern=r"[^\w\s]", repl="", string=q)
        terms = q.lower().strip().split(" ")
        doc_list[i] = terms
        for term in terms:
            if term_list.get(term):
                term_list[term] += 1
            else:
                term_list[term] = 1
    term_list = pd.Series(term_list)
    
    return term_list, doc_list
doc_list = [
    "I come to China to travel",
    "This is a car polupar in China",
    "I love tea and Apple ",
    "The work is to write some papers in science"
]

term_list, doc_list = segmentation(doc_list)

print("terms:")
print(term_list)
print("documents:")
print(doc_list)
terms:
i          2
come       1
to         3
china      2
travel     1
this       1
is         2
a          1
car        1
polupar    1
in         2
love       1
tea        1
and        1
apple      1
the        1
work       1
write      1
some       1
papers     1
science    1
dtype: int64
documents:
[['i', 'come', 'to', 'china', 'to', 'travel'], ['this', 'is', 'a', 'car', 'polupar', 'in', 'china'], ['i', 'love', 'tea', 'and', 'apple'], ['the', 'work', 'is', 'to', 'write', 'some', 'papers', 'in', 'science']]
tf, idf = tf_idf(term_list.index, doc_list)
print("term frequency:")
print(tf)
print("inverse document frequency:")
print(idf)
term frequency:
          i      come        to     china    travel      this        is  \
0  0.166667  0.166667  0.333333  0.166667  0.166667  0.000000  0.000000   
1  0.000000  0.000000  0.000000  0.142857  0.000000  0.142857  0.142857   
2  0.200000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   
3  0.000000  0.000000  0.111111  0.000000  0.000000  0.000000  0.111111   

          a       car   polupar  ...  love  tea  and  apple       the  \
0  0.000000  0.000000  0.000000  ...   0.0  0.0  0.0    0.0  0.000000   
1  0.142857  0.142857  0.142857  ...   0.0  0.0  0.0    0.0  0.000000   
2  0.000000  0.000000  0.000000  ...   0.2  0.2  0.2    0.2  0.000000   
3  0.000000  0.000000  0.000000  ...   0.0  0.0  0.0    0.0  0.111111   

       work     write      some    papers   science  
0  0.000000  0.000000  0.000000  0.000000  0.000000  
1  0.000000  0.000000  0.000000  0.000000  0.000000  
2  0.000000  0.000000  0.000000  0.000000  0.000000  
3  0.111111  0.111111  0.111111  0.111111  0.111111  

[4 rows x 21 columns]
inverse document frequency:
i          0.287682
come       0.693147
to         0.287682
china      0.287682
travel     0.693147
this       0.693147
is         0.287682
a          0.693147
car        0.693147
polupar    0.693147
in         0.287682
love       0.693147
tea        0.693147
and        0.693147
apple      0.693147
the        0.693147
work       0.693147
write      0.693147
some       0.693147
papers     0.693147
science    0.693147
dtype: float64
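
The code above returns tf and idf separately. Following the definition $\text{TF-IDF}_{i, j} = \text{TF}_{i, j} \times \text{IDF}_{i}$, the final weights are obtained by multiplying each document row of tf by idf; a short follow-up sketch (the name tf_idf_matrix is introduced here):

# multiply each document's term frequencies by the smoothed IDF;
# pandas aligns the Series idf with the columns of tf by term name
tf_idf_matrix = tf * idf
print("TF-IDF:")
print(tf_idf_matrix)

For comparison, scikit-learn's TfidfVectorizer implements a variant of this weighting: by default it adds one to the document counts and to the final IDF value and L2-normalizes each document vector, so its numbers will not match the ones above exactly.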

Reposted from blog.csdn.net/zhaoyin214/article/details/103302573