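Suppose there are three articles, file1, file2 and file3, with the following contents:
file1 (word1, word2, word3, word4....)
file2 (wordA, wordB, wordC, wordD....)
file3 (word1, wordA, word3, wordD....)
The inverted index built from them then looks like this:
word1 (file1, file3)
word2 (file1)
word3 (file1, file3)
wordA (file2, file3)
....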
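The idea behind an inverted index is simple: take the words in the files as keys, then build a mapping from each word to the files that contain it. You can also store extra information, such as how many times each word occurs in a file. The inverted index is a very basic concept in search engines, and almost every search engine uses one.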
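The word-to-files mapping for the three example files can be sketched in a few lines of Java. This is only an illustration, not the implementation discussed in this post: the class name `FileIndexSketch` and the method `build` are hypothetical, each file is represented as an in-memory list holding only the four words shown above (the elided words are left out), and `TreeMap`/`TreeSet` are used simply to keep the printed order stable.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class FileIndexSketch {

    // Hypothetical helper: maps each word to the set of file names containing it.
    public static Map<String, Set<String>> build(Map<String, List<String>> files) {
        Map<String, Set<String>> index = new TreeMap<>();
        for (Map.Entry<String, List<String>> file : files.entrySet()) {
            for (String word : file.getValue()) {
                // Create the posting set on first sight of the word, then record the file.
                index.computeIfAbsent(word, k -> new TreeSet<>()).add(file.getKey());
            }
        }
        return index;
    }

    public static void main(String[] args) {
        // Assumed in-memory stand-ins for file1, file2 and file3.
        Map<String, List<String>> files = new LinkedHashMap<>();
        files.put("file1", Arrays.asList("word1", "word2", "word3", "word4"));
        files.put("file2", Arrays.asList("wordA", "wordB", "wordC", "wordD"));
        files.put("file3", Arrays.asList("word1", "wordA", "word3", "wordD"));

        Map<String, Set<String>> index = build(files);
        System.out.println("word1 " + index.get("word1")); // word1 [file1, file3]
        System.out.println("word2 " + index.get("word2")); // word2 [file1]
        System.out.println("wordA " + index.get("wordA")); // wordA [file2, file3]
    }
}
```

Storing a count next to each file name, instead of a plain set, would give the frequency information mentioned above.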
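Below is my simple implementation of an inverted index. It indexes a text file line by line; then, for each word in a piece of input text, it reports the line numbers on which that word occurs and how many times it occurs on each of those lines.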
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class InvertedIndex {

    // word -> (line number -> number of occurrences on that line)
    private final Map<String, Map<Integer, Integer>> index = new HashMap<>();

    // Reads the file line by line and records every word together with its line number.
    public void createIndex(String filePath) {
        index.clear();
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader reader = new BufferedReader(new FileReader(filePath))) {
            String text;
            int line = 1;
            while ((text = reader.readLine()) != null) {
                // Split on runs of whitespace so repeated spaces do not produce empty words
                for (String word : text.trim().split("\\s+")) {
                    if (word.isEmpty()) {
                        continue; // blank line
                    }
                    // Create the per-word map on first use, then increment the count for this line
                    index.computeIfAbsent(word, k -> new HashMap<>())
                         .merge(line, 1, Integer::sum);
                }
                line++;
            }
        } catch (IOException e) {
            System.out.println("error reading file: " + e.getMessage());
        }
    }

    // Prints, for each word in str, the lines it occurs on and the count per line.
    public void find(String str) {
        for (String word : str.trim().split("\\s+")) {
            StringBuilder sb = new StringBuilder();
            Map<Integer, Integer> lines = index.get(word);
            if (lines != null) {
                sb.append("word: ").append(word).append(" in ");
                for (Map.Entry<Integer, Integer> e : lines.entrySet()) {
                    sb.append("line ").append(e.getKey())
                      .append(" [").append(e.getValue()).append("] , ");
                }
            } else {
                sb.append("word: ").append(word).append(" not found");
            }
            System.out.println(sb);
        }
    }

    public static void main(String[] args) {
        InvertedIndex index = new InvertedIndex();
        index.createIndex("news.txt");
        index.find("I love Shanghai today");
    }
}
Here, the input file news.txt contains:
I am eriol
I live in Shanghai and I love Shanghai
I also love travelling
life in Shanghai is beautiful
The output is:
word: I in line 1 [1] , line 2 [2] , line 3 [1] ,
word: love in line 2 [1] , line 3 [1] ,
word: Shanghai in line 2 [2] , line 4 [1] ,
word: today not found