Getting Started with Elasticsearch: A Study of Data Storage: How Lucene Indexing Works (Code Example)

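The previous article covered the shortcomings of using a traditional relational database as an index and how Lucene addresses them. This article takes a unit test from the official Lucene documentation, modifies it slightly, and adds comments based on my own understanding.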

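import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.IOUtils;
import org.junit.Test;

// JUnit 4 is assumed for @Test; the class name is arbitrary.
public class LuceneDemoTest {

  @Test
  public void testDemo() throws IOException {
    String longTerm = "longtermlongtermlongtermlongtermlongtermlongtermlongtermlongtermlongtermlongtermlongtermlongtermlongtermlongtermlongtermlongtermlongtermlongterm";
    String text = "This is the text to be indexed. " + longTerm;
    // Create a temporary directory to hold the index
    Path indexPath = Files.createTempDirectory("tempIndex");
    System.out.println(indexPath.toString());
    // Open the index storage location
    try (Directory dir = FSDirectory.open(indexPath)) {
      // Use StandardAnalyzer as the tokenizer; the empty stop-word set keeps words such as "to" and "be"
      Analyzer analyzer = new StandardAnalyzer(CharArraySet.EMPTY_SET);
      try (IndexWriter iw = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
        // Create a document
        Document doc = new Document();
        // TextField runs the value through the analyzer, so every word becomes a term.
        // (StringField would index the whole value as one single token, and the
        // single-word queries below would then match nothing.)
        // Field.Store controls whether the original field value is stored in the index:
        // Field.Store.YES: the value can be returned with search results (e.g. an article title)
        // Field.Store.NO: the value is searchable but not returned (e.g. an article body)
        doc.add(new TextField("fieldname", text, Field.Store.YES));
        // Write the document into the index
        iw.addDocument(doc);
      }

      // Search
      try (IndexReader reader = DirectoryReader.open(dir)) {
        IndexSearcher searcher = new IndexSearcher(reader);
        // Count the documents that contain the long term; exactly one is expected
        assert 1 == searcher.count(new TermQuery(new Term("fieldname", longTerm)));

        // Query by a single term (StandardAnalyzer lowercases terms at index time)
        Query query = new TermQuery(new Term("fieldname", "text"));
        // Fetch the top hit for this term through IndexSearcher
        TopDocs hits = searcher.search(query, 1);
        assert 1 == hits.totalHits; // totalHits is a long in Lucene 7.x

        // Iterate over the hits and check the stored field value
        for (int i = 0; i < hits.scoreDocs.length; i++) {
          Document hitDoc = searcher.doc(hits.scoreDocs[i].doc);
          assert text.equals(hitDoc.get("fieldname"));
        }

        // Phrase query: matches documents in which "to" is immediately followed by "be"
        PhraseQuery phraseQuery = new PhraseQuery("fieldname", "to", "be");
        assert 1 == searcher.count(phraseQuery);
      }
    }

    // Delete the temporary index directory
    IOUtils.rm(indexPath);
  }
}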

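The test above runs once the following two dependencies are added (the @Test annotation additionally needs a JUnit dependency):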

<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-core</artifactId>
    <version>7.4.0</version>
</dependency>

<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-queryparser</artifactId>
    <version>7.4.0</version>
</dependency>

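From the test code, Lucene's hierarchy can be read off roughly as follows: Index -> Segment -> Document -> Field -> Term. An index consists of segments, a segment holds documents, a document is made up of fields, and the analyzer breaks each field value into terms.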
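As a rough illustration of this hierarchy, the sketch below models it in plain Java. The names `ToyIndexLayout`, `Index`, `Segment`, `Doc` and `termKey`, as well as the whitespace tokenizer, are made up for this example; they are not Lucene's API and say nothing about its on-disk format. The one behaviour borrowed from Lucene is that each segment carries its own term dictionary, so a search has to visit every segment and combine the results.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Toy model of Index -> Segment -> Document -> Field -> Term (hypothetical, not Lucene code). */
final class ToyIndexLayout {

  /** A term is identified by its field name plus the token text. */
  static String termKey(String field, String text) {
    return field + ":" + text;
  }

  /** A document is a collection of named fields. */
  static final class Doc {
    final Map<String, String> fields = new HashMap<>();

    Doc field(String name, String value) {
      fields.put(name, value);
      return this;
    }
  }

  /** A segment is a self-contained mini index over some of the documents. */
  static final class Segment {
    final List<Doc> docs = new ArrayList<>();
    // term -> ids of the documents (local to this segment) that contain it
    final Map<String, Set<Integer>> postings = new HashMap<>();

    void add(Doc doc) {
      int localId = docs.size();
      docs.add(doc);
      for (Map.Entry<String, String> f : doc.fields.entrySet()) {
        // Simplified analysis: lowercase, then split on whitespace
        for (String token : f.getValue().toLowerCase(Locale.ROOT).split("\\s+")) {
          if (token.isEmpty()) continue;
          postings.computeIfAbsent(termKey(f.getKey(), token), k -> new HashSet<>()).add(localId);
        }
      }
    }

    int count(String field, String text) {
      return postings.getOrDefault(termKey(field, text), Collections.emptySet()).size();
    }
  }

  /** The index is a list of segments; a query is answered segment by segment. */
  static final class Index {
    final List<Segment> segments = new ArrayList<>();

    int count(String field, String text) {
      int total = 0;
      for (Segment s : segments) total += s.count(field, text);
      return total;
    }
  }

  public static void main(String[] args) {
    Segment first = new Segment();
    first.add(new Doc().field("title", "Lucene in Action"));
    Segment second = new Segment();
    second.add(new Doc().field("title", "lucene index internals"));
    second.add(new Doc().field("title", "Elasticsearch guide"));

    Index index = new Index();
    index.segments.add(first);
    index.segments.add(second);

    System.out.println(index.count("title", "lucene")); // 2: one hit in each segment
    System.out.println(index.count("body", "lucene"));  // 0: same text, different field
  }
}
```

Note that the same text under a different field name is a different term, which is why the second lookup finds nothing.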

To make the whole indexing and search flow clearer, here is a flowchart borrowed from the web:

[Figure: flowchart of the indexing and search process]
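The same indexing and search flow can also be sketched without Lucene. The toy class below is only an illustration under simplifying assumptions: `ToyInvertedIndex`, `tokenize`, `countTerm` and `countPhrase` are hypothetical names, the tokenizer just lowercases and splits on non-alphanumeric characters, and everything lives in memory. It is not how Lucene is implemented, but it shows why a term lookup is a plain dictionary access and why a phrase query needs the positions of each term.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Toy in-memory inverted index for a single field (hypothetical, not Lucene code). */
final class ToyInvertedIndex {
  // term -> (document id -> positions of the term inside that document)
  private final Map<String, Map<Integer, List<Integer>>> postings = new HashMap<>();
  private int nextDocId = 0;

  /** Simplified analysis step: lowercase, then split on non-alphanumeric characters. */
  static List<String> tokenize(String text) {
    List<String> tokens = new ArrayList<>();
    for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
      if (!t.isEmpty()) tokens.add(t);
    }
    return tokens;
  }

  /** Indexing: record every (term, document, position) triple. */
  int addDocument(String text) {
    int docId = nextDocId++;
    List<String> tokens = tokenize(text);
    for (int pos = 0; pos < tokens.size(); pos++) {
      postings.computeIfAbsent(tokens.get(pos), k -> new TreeMap<>())
          .computeIfAbsent(docId, k -> new ArrayList<>())
          .add(pos);
    }
    return docId;
  }

  /** Term query: number of documents that contain exactly this term. */
  int countTerm(String term) {
    return postings.getOrDefault(term, Collections.emptyMap()).size();
  }

  /** Phrase query: number of documents in which the terms appear at consecutive positions. */
  int countPhrase(String... terms) {
    if (terms.length == 0) return 0;
    int matches = 0;
    Map<Integer, List<Integer>> first = postings.getOrDefault(terms[0], Collections.emptyMap());
    for (Map.Entry<Integer, List<Integer>> e : first.entrySet()) {
      if (hasPhrase(e.getKey(), e.getValue(), terms)) matches++;
    }
    return matches;
  }

  private boolean hasPhrase(int docId, List<Integer> starts, String[] terms) {
    for (int start : starts) {
      boolean ok = true;
      for (int i = 1; i < terms.length && ok; i++) {
        List<Integer> positions =
            postings.getOrDefault(terms[i], Collections.emptyMap()).get(docId);
        ok = positions != null && positions.contains(start + i);
      }
      if (ok) return true;
    }
    return false;
  }

  public static void main(String[] args) {
    ToyInvertedIndex index = new ToyInvertedIndex();
    index.addDocument("This is the text to be indexed.");
    System.out.println(index.countTerm("text"));        // 1
    System.out.println(index.countPhrase("to", "be"));  // 1
    System.out.println(index.countTerm("This"));        // 0: terms were lowercased at index time
  }
}
```

The last lookup returns 0 because the query term is compared as-is against the lowercased terms in the dictionary, which mirrors why the Lucene test queries for "text" rather than for the original casing.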
