Copyright notice: this is an original article by the blog author and may not be reproduced without the author's permission. https://blog.csdn.net/jiyang_1/article/details/79042933
1. Downloading the toolkits
Word segmenter: https://nlp.stanford.edu/software/segmenter.shtml
Named entity recognition (NER): https://nlp.stanford.edu/software/CRF-NER.shtml
Note: you need to download stanford-ner-2012-11-11-chinese.zip and stanford-corenlp-full-2017-06-09.zip.
2. Project setup
Place the following files in the project root directory and add the jar files to the project's classpath.
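Since the code in this post loads its models through relative paths, a quick check that those files are in place can save a confusing startup failure. The following is a minimal sketch using only the JDK; the three paths are the ones that appear in the code later in this post, while the class name `ModelFileCheck` and the method `missingFiles` are hypothetical names introduced for illustration.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class ModelFileCheck {
    // Relative paths taken from the code in this post; adjust them if your layout differs.
    static final String[] REQUIRED = {
        "data/dict-chris6.ser.gz",
        "data/ctb.gz",
        "classifiers/chinese.misc.distsim.crf.ser.gz"
    };

    /** Returns the required relative paths that do not exist under projectRoot. */
    public static List<String> missingFiles(Path projectRoot) {
        List<String> missing = new ArrayList<>();
        for (String rel : REQUIRED) {
            if (!Files.isRegularFile(projectRoot.resolve(rel))) {
                missing.add(rel);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Check relative to the current working directory (assumed to be the project root).
        List<String> missing = missingFiles(Paths.get("."));
        if (missing.isEmpty()) {
            System.out.println("All model files found.");
        } else {
            System.out.println("Missing files: " + missing);
        }
    }
}
```

Run it from the project root before starting the segmenter or the NER demo; an empty list means the paths used by the code below should resolve.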
3. Stanford NLP segmenter reference code:
import java.io.File;
import java.io.IOException;
import java.util.List;
import java.util.Properties;

import org.apache.commons.io.FileUtils;

import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;

public class ZH_SegDemo {
    public static CRFClassifier<CoreLabel> segmenter;

    static {
        // Set the initialization parameters
        Properties props = new Properties();
        props.setProperty("sighanCorporaDict", "data");
        props.setProperty("serDictionary", "data/dict-chris6.ser.gz");
        props.setProperty("inputEncoding", "UTF-8");
        props.setProperty("sighanPostProcessing", "true");
        segmenter = new CRFClassifier<CoreLabel>(props);
        segmenter.loadClassifierNoExceptions("data/ctb.gz", props);
        segmenter.flags.setProperties(props);
    }

    public static String doSegment(String sent) {
        // segmentString returns a List<String>; join the tokens with single spaces
        List<String> words = segmenter.segmentString(sent);
        String result = String.join(" ", words);
        System.out.println("segmented res: " + result);
        return result;
    }

    public static void main(String[] args) {
        try {
            // Read the input file as UTF-8 (FileUtils comes from Apache Commons IO)
            String text = FileUtils.readFileToString(new File("a.txt"), "UTF-8");
            String segmented = doSegment(text);
            System.out.println(segmented);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
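One detail worth illustrating when handling the segmenter's result: `segmentString` returns a `List<String>`, and the no-argument `List.toArray()` returns an `Object[]`, so casting that array to `String[]` compiles but fails at runtime. The sketch below uses only the JDK and a hand-built list as a stand-in for real segmenter output; `JoinTokensDemo` and its methods are hypothetical names for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class JoinTokensDemo {
    /** Joins tokens with single spaces, producing whitespace-separated text. */
    public static String joinTokens(List<String> tokens) {
        return String.join(" ", tokens);
    }

    /** Converts to a typed array; passing a String[] makes toArray return String[]. */
    public static String[] toStringArray(List<String> tokens) {
        return tokens.toArray(new String[0]);
    }

    public static void main(String[] args) {
        // Stand-in for the List<String> that segmentString(...) would return.
        List<String> tokens = new ArrayList<>();
        tokens.add("北海");
        tokens.add("已");
        tokens.add("成为");
        tokens.add("中国");

        System.out.println(joinTokens(tokens)); // prints "北海 已 成为 中国"
        System.out.println(toStringArray(tokens).length); // prints 4

        try {
            // The no-argument toArray() yields Object[], so this cast fails at runtime.
            String[] bad = (String[]) tokens.toArray();
            System.out.println(bad.length);
        } catch (ClassCastException e) {
            System.out.println("ClassCastException: use toArray(new String[0]) or String.join");
        }
    }
}
```

`String.join` also avoids the trailing space that appending `s + " "` in a loop leaves at the end of the result.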
4. NER (entity recognition): the text must be segmented first and then passed to the entity recognizer
import edu.stanford.nlp.ie.AbstractSequenceClassifier;
import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;

public class ExtractDemo {
    // Shared by all instances; loaded once, on first construction
    private static AbstractSequenceClassifier<CoreLabel> ner;

    public ExtractDemo() {
        initNer();
    }

    public void initNer() {
        String serializedClassifier = "classifiers/chinese.misc.distsim.crf.ser.gz";
        if (ner == null) {
            ner = CRFClassifier.getClassifierNoExceptions(serializedClassifier);
        }
    }

    public String doNer(String sent) {
        // Returns the input with recognized entities wrapped in inline XML tags
        return ner.classifyWithInlineXML(sent);
    }

    public static void main(String[] args) {
        String str = "北海 已 成为 中国 对外开放 中 升起 的 一 颗 明星"; // already segmented
        ExtractDemo extractDemo = new ExtractDemo();
        System.out.println(extractDemo.doNer(str));
        System.out.println("Complete!");
    }
}
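`classifyWithInlineXML` returns the sentence with each recognized entity wrapped in a tag named after its label, so a caller usually still has to pull the entities out. Here is a minimal sketch of that step using a JDK regular expression. The class `InlineXmlEntities`, the assumption that labels consist of letters and underscores, and the sample tagged string (including the `GPE` label) are all illustrative assumptions, not output captured from the model.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class InlineXmlEntities {
    // Matches <LABEL>text</LABEL>; the backreference \1 requires a matching closing tag.
    // Assumes labels contain only letters and underscores.
    private static final Pattern ENTITY = Pattern.compile("<([A-Za-z_]+)>(.*?)</\\1>");

    /** Returns {label, text} pairs in order of appearance. */
    public static List<String[]> extract(String inlineXml) {
        List<String[]> entities = new ArrayList<>();
        Matcher m = ENTITY.matcher(inlineXml);
        while (m.find()) {
            entities.add(new String[] { m.group(1), m.group(2).trim() });
        }
        return entities;
    }

    public static void main(String[] args) {
        // Hypothetical tagged string for illustration; real labels depend on the model.
        String tagged = "<GPE>北海</GPE> 已 成为 <GPE>中国</GPE> 对外开放 中 升起 的 一 颗 明星";
        for (String[] e : extract(tagged)) {
            System.out.println(e[0] + "\t" + e[1]);
        }
    }
}
```

In a full run, the input to `extract` would be the result of `extractDemo.doNer(ZH_SegDemo.doSegment(text))`, which applies segmentation before entity recognition as this section requires.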
The demo has been uploaded to GitHub.