First, import the jar packages the project depends on:
All of these jars can be downloaded from the web.
Code walkthrough:
First, create a class that extends Analyzer to implement a custom analyzer, overriding its createComponents method. The code is as follows:
package com.szy.arvin.demo;

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;

import com.chenlb.mmseg4j.Dictionary;
import com.chenlb.mmseg4j.MaxWordSeg;
import com.chenlb.mmseg4j.analysis.MMSegTokenizer;

/**
 * @author arvin
 * @date 2013-7-3 11:18:38 AM
 */
public class MySameAnlyzer extends Analyzer {

    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        Dictionary dic = Dictionary.getInstance();
        // The mmseg4j max-word tokenizer is the source of the chain;
        // the custom filter must wrap that same source, so that
        // TokenStreamComponents links the tokenizer and filter correctly.
        Tokenizer source = new MMSegTokenizer(new MaxWordSeg(dic), reader);
        TokenStream result = new MysameTokenFilter(source);
        return new TokenStreamComponents(source, result);
    }
}
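A Tokenizer sits at the head of the analysis chain: it consumes the raw Reader and emits one term at a time. As a rough, library-free sketch of that idea (the class and method names below are hypothetical, not part of Lucene or mmseg4j), a whitespace tokenizer can be written like this:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// A toy stand-in for a Tokenizer: reads characters from a Reader and
// splits them into terms on whitespace, roughly what Lucene's
// WhitespaceTokenizer does. All names here are hypothetical.
public class ToyTokenizerDemo {

    static List<String> tokenize(Reader reader) {
        List<String> tokens = new ArrayList<>();
        StringBuilder term = new StringBuilder();
        try {
            int c;
            while ((c = reader.read()) != -1) {
                if (Character.isWhitespace(c)) {
                    if (term.length() > 0) {        // end of the current term
                        tokens.add(term.toString());
                        term.setLength(0);
                    }
                } else {
                    term.append((char) c);
                }
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        if (term.length() > 0) {
            tokens.add(term.toString());             // trailing term, if any
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize(new StringReader("my home town")));
        // prints [my, home, town]
    }
}
```

A real Chinese tokenizer such as MMSegTokenizer cannot split on whitespace, of course; it consults a dictionary to find word boundaries, but the streaming shape of the code is the same.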
Next, create a filter class that extends TokenFilter. The code is as follows:
package com.szy.arvin.demo;
import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
/**
 * @author arvin
 * @date 2013-7-3 11:40:20 AM
 */
public class MysameTokenFilter extends TokenFilter {

    // holds the term text of the current token
    private CharTermAttribute cta = null;

    /**
     * @param input the upstream token stream to wrap
     */
    protected MysameTokenFilter(TokenStream input) {
        super(input);
        cta = this.addAttribute(CharTermAttribute.class);
    }

    @Override
    public boolean incrementToken() throws IOException {
        // check whether the underlying stream still has tokens
        if (!input.incrementToken()) {
            return false;
        }
        /* Variant 1: append "连城县" after the token "市".
         * The result would be: [我的][家乡][在][福建][建省][龙][岩][市连城县]
        if (cta.toString().equals("市")) {
            cta.append("连城县");
        }
        */
        /* Variant 2: replace "家乡" (hometown) with "故乡".
         * The result would be: [我的][故乡][在][福建][建省][龙][岩][市]
        if (cta.toString().equals("家乡")) {
            cta.setEmpty(); // on a match, clear "家乡" and write "故乡" instead
            cta.append("故乡");
        }
        */
        System.out.println(cta);
        return true;
    }
}
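TokenFilter is essentially the decorator pattern over a token stream: each filter wraps an upstream stream and rewrites tokens as they pass through, which is what the "家乡" → "故乡" variant above does via CharTermAttribute. A library-free sketch of the same idea (all names below are hypothetical, and plain Strings stand in for Lucene's attribute objects):

```java
import java.util.Arrays;
import java.util.Iterator;

// A toy stand-in for a TokenFilter: SwapFilter wraps an underlying token
// iterator and rewrites one term in place, just as MysameTokenFilter
// swaps "家乡" for "故乡". All names here are hypothetical.
public class SwapFilterDemo {

    // Decorator over Iterator<String>, analogous to TokenFilter over TokenStream
    static class SwapFilter implements Iterator<String> {
        private final Iterator<String> input;
        private final String match;
        private final String replacement;

        SwapFilter(Iterator<String> input, String match, String replacement) {
            this.input = input;
            this.match = match;
            this.replacement = replacement;
        }

        @Override
        public boolean hasNext() {
            // like incrementToken() returning false when the stream is exhausted
            return input.hasNext();
        }

        @Override
        public String next() {
            String token = input.next();
            // on a match, replace the token, like cta.setEmpty() + cta.append(...)
            return token.equals(match) ? replacement : token;
        }
    }

    // Runs the filter over a fixed token list and formats the result
    // the same way displayToken does: [token][token]...
    static String analyze(String[] tokens, String match, String replacement) {
        StringBuilder sb = new StringBuilder();
        Iterator<String> stream =
                new SwapFilter(Arrays.asList(tokens).iterator(), match, replacement);
        while (stream.hasNext()) {
            sb.append('[').append(stream.next()).append(']');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] tokens = {"我的", "家乡", "在", "福建"};
        System.out.println(analyze(tokens, "家乡", "故乡"));
        // prints [我的][故乡][在][福建]
    }
}
```

Because each filter only sees the stream it wraps, filters can be stacked freely; Lucene builds its built-in analyzers out of exactly this kind of chain.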
Then create a utility class that runs an analyzer over Chinese text and prints the tokens. The code is as follows:
package com.szy.arvin.demo;
import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
/**
 * @author arvin
 * @date 2013-7-2 2:55:40 PM
 */
public class AnalyzerUtils {

    /**
     * Prints the tokenization of the given string.
     * @param str the text to analyze
     * @param a the analyzer to use
     */
    public static void displayToken(String str, Analyzer a) {
        try {
            TokenStream stream = a.tokenStream("content", new StringReader(str));
            // The term attribute is attached to the stream and is updated
            // on every call to incrementToken()
            CharTermAttribute cta = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                System.out.print("[" + cta + "]");
            }
            System.out.println();
            stream.end();
            stream.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Finally, write a test class to check the result:
package com.szy.arvin.demo;

import org.apache.lucene.analysis.Analyzer;
import org.junit.Test;

/**
 * @author arvin
 * @date 2013-7-2 3:20:12 PM
 */
public class AnalyzerTest {

    @Test
    public void testAnalyzer() {
        String str = "我的家乡在福建省龙岩市";
        Analyzer a5 = new MySameAnlyzer();
        AnalyzerUtils.displayToken(str, a5);
    }
}
The console output is shown below. Note how the lines interleave: the System.out.println(cta) inside MysameTokenFilter prints each token on its own line, while displayToken prints the same tokens in brackets on one line, so the two outputs mix together:
我的
[我的]家乡
[家乡]在
[在]福建
[福建]建省
[建省]龙
[龙]岩
[岩]市
[市]