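1 Terminology
- TF: Term Frequency; measures how often a given word occurs in a single document.
- IDF: Inverse Document Frequency; a measure of a word's general importance across the whole document collection. A word that appears in many documents gets a low IDF.
2 TF-IDF
- Classic TF-IDF
- Term frequency (TF) of a word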
\[ \text{TF Score} = \frac{ \text{number of occurrences of word in document } documents[j] }{ \text{length of } documents[j] } \]
- Inverse document frequency (IDF) of a word
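\[ \text{IDF Score} = \log\left( \frac{ \text{total number of documents in } documents }{ \text{number of documents in } documents \text{ that contain word} } \right) \]
- Relevance score (TF-IDF) of a word with respect to a document documents[j]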
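\[ \text{TF-IDF}(word \mid documents[j]) = \text{Similarity}(word \mid documents[j]) \]
\[ \text{Similarity}(word \mid documents[j]) = \text{TF Score} \times \text{IDF Score} \]
- Relevance score (TF-IDF) of a phrase sentence with respect to a document documents[j]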
\[ sentence = [word_1, word_2, \ldots, word_i, \ldots, word_n] \]
\[ \text{TF-IDF}(sentence \mid documents[j]) = \sum_{i=1}^{n} \text{TF-IDF}(word_i \mid documents[j]) \]
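The classic formulas translate almost directly into code. The following is a minimal Python sketch, assuming each document is already tokenized into a list of words. The function names, the sample documents, and the choice to return 0.0 for a word that appears in no document (where the IDF formula would divide by zero) are illustrative assumptions, not part of the original definitions.

```python
import math

def tf_score(word, doc):
    """Occurrences of word in doc divided by the document length (in tokens)."""
    return doc.count(word) / len(doc) if doc else 0.0

def idf_score(word, documents):
    """log(total number of documents / number of documents containing word)."""
    containing = sum(1 for doc in documents if word in doc)
    # Assumption: a word found in no document gets 0.0 instead of a division by zero.
    return math.log(len(documents) / containing) if containing else 0.0

def tf_idf(word, doc, documents):
    """Relevance of a single word to one document: TF Score * IDF Score."""
    return tf_score(word, doc) * idf_score(word, documents)

def sentence_tf_idf(sentence, doc, documents):
    """Relevance of a phrase: the sum of the TF-IDF scores of its words."""
    return sum(tf_idf(word, doc, documents) for word in sentence)

# Hypothetical toy corpus, already tokenized.
documents = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "and", "the", "dog"],
]

score = sentence_tf_idf(["the", "cat"], documents[0], documents)
```

Because "the" occurs in every document, its IDF is log(3/3) = 0, so it contributes nothing to the phrase score; only "cat" does.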
- TF-IDF in early versions of Lucene
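\[ \text{TF-IDF}(word \mid documents[j]) = \text{Similarity}(word \mid documents[j]) \]
\[ \text{Similarity}(word \mid documents[j]) = \log\left( \frac{ \text{total number of documents in } documents }{ \text{number of documents in } documents \text{ that contain word} + 1 } \right) \cdot \sqrt{\text{TF Score}} \cdot \frac{1}{\sqrt{\text{length of } documents[j]}} \]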
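In shorthand, with numDocs the total number of documents, docFreq the number of documents containing the word, tf the TF score, and length the length of documents[j]:
\[ \log\left( \frac{numDocs}{docFreq + 1} \right) \cdot \sqrt{tf} \cdot \frac{1}{\sqrt{length}} \]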
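To make the difference from the classic score concrete, here is a small Python sketch of the Lucene-style formula exactly as written in these notes. It is not Lucene's actual code: the function name and sample documents are hypothetical, and tf is taken to be the TF score defined earlier (occurrences divided by document length), which is an assumption about how the shorthand should be read.

```python
import math

def lucene_style_score(word, doc, documents):
    """log(numDocs / (docFreq + 1)) * sqrt(tf) * (1 / sqrt(length))."""
    length = len(doc)
    if length == 0:
        return 0.0
    num_docs = len(documents)
    doc_freq = sum(1 for d in documents if word in d)
    # Assumption: tf is the TF score from the classic definition.
    tf = doc.count(word) / length
    idf = math.log(num_docs / (doc_freq + 1))
    return idf * math.sqrt(tf) * (1 / math.sqrt(length))

# Hypothetical toy corpus, already tokenized.
documents = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "and", "the", "dog"],
]

rare = lucene_style_score("barked", documents[1], documents)
common = lucene_style_score("the", documents[0], documents)
```

The +1 in the denominator keeps the logarithm defined even when no document contains the word. One side effect of the formula as written is that a word found in every document gets a negative score, since log(numDocs / (numDocs + 1)) is below zero.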