Japanese NLP

Overview of Japanese tokenizers:

MeCab

Download: http://taku910.github.io/mecab/

http://mecab.sourceforge.net/

MeCab is an open-source tokenizer that can handle various languages (provided you have a dictionary for the language).

http://www.52nlp.cn/%E6%97%A5%E6%96%87%E5%88%86%E8%AF%8D%E5%99%A8-mecab-%E6%96%87%E6%A1%A3

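MeCab is a Japanese tokenizer (morphological analyzer) developed by Taku Kudo (工藤拓) of the Nara Institute of Science and Technology. The author has written several machine learning software packages, the best known of which is CRF++. At the time these notes were written, he was working at Google Japan.

MeCab's basic design principle is to stay independent of any particular language, dictionary, or corpus. It uses a Conditional Random Fields (CRF) model for parameter estimation, which outperforms ChaSen's hidden Markov model. Its average parsing speed is also higher than that of other Japanese morphological analyzers such as ChaSen, Juman, and KAKASI.

Japanese dictionaries currently available for MeCab include ipadic and neologd.

ipadic is MeCab's standard dictionary, but it has seen little maintenance since March 2015, so many new words are not segmented correctly. The neologd dictionary (full name: mecab-ipadic-neologd) contains many new words and can be used together with MeCab. The neologd dictionary can also be converted for use with Juman/Juman++.

MeCab can be integrated into C/C++ programs and called from scripting languages such as Perl and Python.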
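As a rough illustration of calling MeCab from Python, the sketch below pipes text through the `mecab` command-line tool and parses its default output. It assumes the `mecab` binary is on PATH with a UTF-8 ipadic dictionary, so each output line is `surface<TAB>comma-separated features` and each sentence ends with `EOS`. The helper names `parse_mecab_output` and `tokenize` are hypothetical and not part of MeCab.

```python
import shutil
import subprocess


def parse_mecab_output(output):
    """Parse MeCab's default output into (surface, features) pairs."""
    tokens = []
    for line in output.splitlines():
        if line == "EOS" or not line.strip():
            continue  # "EOS" marks the end of a sentence
        surface, _, feature = line.partition("\t")
        tokens.append((surface, feature.split(",")))
    return tokens


def tokenize(text):
    """Run the mecab CLI on text (assumes a UTF-8 dictionary is installed)."""
    if shutil.which("mecab") is None:
        raise RuntimeError("mecab command not found on PATH")
    result = subprocess.run(
        ["mecab"],
        input=text + "\n",
        capture_output=True,
        encoding="utf-8",
        check=True,
    )
    return parse_mecab_output(result.stdout)


# Hand-written sample in the ipadic output format (not produced by running MeCab here).
SAMPLE = (
    "すもも\t名詞,一般,*,*,*,*,すもも,スモモ,スモモ\n"
    "も\t助詞,係助詞,*,*,*,*,も,モ,モ\n"
    "EOS\n"
)

if __name__ == "__main__":
    for surface, features in parse_mecab_output(SAMPLE):
        # The first feature field is the top-level part of speech.
        print(surface, features[0])
```

With MeCab installed, `tokenize("すもももももももものうち")` would return the same kind of list; the exact feature fields depend on the dictionary in use.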

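MeCab installation

% tar zxfv mecab-XX.tar.gz
% cd mecab-XX
% ./configure --prefix=***
% make
% make check
% make install

MeCab ipadic dictionary encoding

Unless otherwise specified, the dictionary uses EUC encoding by default. To use Shift-JIS or UTF-8 instead, set the charset option of the dictionary's configure script and rebuild the dictionary. This produces a Shift-JIS or UTF-8 dictionary.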

% tar zxfv mecab-ipadic-2.7.0-xxxx
% cd mecab-ipadic-2.7.0-xxxx
% ./configure --with-charset=sjis
% make
% make install

% tar zxfv mecab-ipadic-2.7.0-xxxx
% cd mecab-ipadic-2.7.0-xxxx
% ./configure --with-charset=utf8
% make
% make install
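The dictionary charset has to match the encoding of the text fed to MeCab, so it can help to see how the same string differs across the three encodings. The sketch below uses only Python's standard codecs; the function name `reencode` is hypothetical.

```python
def reencode(data, source_charset, target_charset):
    """Convert raw bytes from one charset to another."""
    return data.decode(source_charset).encode(target_charset)


text = "日本語"

# The same three characters map to different byte sequences in each encoding.
euc = text.encode("euc_jp")
sjis = text.encode("shift_jis")
utf8 = text.encode("utf-8")

print(len(euc), len(sjis), len(utf8))  # 6 6 9

# Convert EUC-JP input to UTF-8 before passing it to a UTF-8 dictionary.
converted = reencode(euc, "euc_jp", "utf-8")
print(converted == utf8)  # True
```

Input in one encoding passed to a dictionary built with another typically yields garbled or meaningless tokens, so converting up front is usually the simpler choice.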

Documentation: http://www.flickering.cn/nlp/2014/06/%E6%97%A5%E6%96%87%E5%88%86%E8%AF%8D%E5%99%A8-mecab-%E6%96%87%E6%A1%A3/

Latest version (per the documentation above): MeCab 0.97, updated 2008-02-03

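	MeCab	ChaSen	JUMAN	KAKASI
Analysis model	bi-gram Markov model	variable-length Markov model	bi-gram Markov model	longest match
Cost estimation	learned from a corpus	learned from a corpus	manual	no concept of cost
Learning model	CRF (discriminative model)	HMM (generative model)
Dictionary lookup algorithm	Double Array	Double Array	Patricia Tree	Hash?
Decoding algorithm	Viterbi	Viterbi	Viterbi	deterministic?
Connection table implementation	2D table	automaton	2D table?	no connection table?
POS hierarchy	unlimited levels	unlimited levels	fixed at 2 levels	no concept of POS?
Unknown word handling	by character type (behavior configurable)	by character type (not configurable)	by character type (not configurable)
Constrained parsing	supported	supported since 2.4.0	not supported	not supported
N-best output	supported	not supported	not supported	not supported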
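To make the "bi-gram model + connection table + Viterbi" combination in the table more concrete, here is a minimal sketch of minimum-cost segmentation over a word lattice. The dictionary entries, part-of-speech labels, and all cost values are made up for illustration, and the function name `viterbi_segment` is hypothetical; this is not MeCab's actual implementation or data.

```python
# Toy dictionary: surface -> (part of speech, word cost). All values are made up.
DICTIONARY = {
    "東": ("noun", 5),
    "京都": ("noun", 3),
    "東京": ("noun", 2),
    "都": ("suffix", 2),
}

# Toy 2D connection table: (left POS, right POS) -> connection cost. Made up.
CONNECTION = {
    ("BOS", "noun"): 0,
    ("noun", "noun"): 4,
    ("noun", "suffix"): 1,
    ("noun", "EOS"): 0,
    ("suffix", "EOS"): 0,
}
DEFAULT_CONNECTION_COST = 10  # used for POS pairs missing from the table


def connection_cost(left_pos, right_pos):
    return CONNECTION.get((left_pos, right_pos), DEFAULT_CONNECTION_COST)


def viterbi_segment(text):
    """Return (words, total cost) for the cheapest segmentation of text."""
    n = len(text)
    # best[i][pos] = (cost, start offset, previous POS, surface) of the
    # cheapest path that ends at character offset i with a word tagged pos.
    best = [dict() for _ in range(n + 1)]
    best[0]["BOS"] = (0, None, None, None)

    for start in range(n):
        for prev_pos, entry in list(best[start].items()):
            prev_cost = entry[0]
            for end in range(start + 1, n + 1):
                surface = text[start:end]
                if surface not in DICTIONARY:
                    continue
                pos, word_cost = DICTIONARY[surface]
                cost = prev_cost + word_cost + connection_cost(prev_pos, pos)
                if pos not in best[end] or cost < best[end][pos][0]:
                    best[end][pos] = (cost, start, prev_pos, surface)

    if not best[n]:
        raise ValueError("no segmentation found for: " + text)

    # Pick the best final state, including the connection to end-of-sentence.
    def final_cost(pos):
        return best[n][pos][0] + connection_cost(pos, "EOS")

    last_pos = min(best[n], key=final_cost)
    total = final_cost(last_pos)

    # Follow the back-pointers to recover the word sequence.
    words = []
    offset, pos = n, last_pos
    while offset > 0:
        _, start, prev_pos, surface = best[offset][pos]
        words.append(surface)
        offset, pos = start, prev_pos
    words.reverse()
    return words, total


if __name__ == "__main__":
    print(viterbi_segment("東京都"))  # (['東京', '都'], 5)
```

With these toy costs, the path 東京/都 (cost 2 + 2 + 1) beats 東/京都 (cost 5 + 3 + 4), because both the word costs and the noun-to-suffix connection cost favor it.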

Juman

Juman/Juman++ are developed by the Kurohashi-Kawahara Laboratory at Kyoto University, Japan (a lab focused on natural language processing, http://nlp.ist.i.kyoto-u.ac.jp). Juman/Juman++ provide word segmentation and part-of-speech (POS) tagging.

Juman is a tokenizer developed by the Kurohashi Laboratory, Kyoto University, Japan.

Juman handles ambiguous Japanese writing styles well, and it copes well with newly coined words thanks to its large Web-based dictionary.

Juman also reports semantic information for words.

Juman++

Juman++ is a tokenizer developed by the Kurohashi Laboratory, Kyoto University, Japan.

Juman++ is the successor to Juman. It adopts an RNN model for tokenization.

Juman++ handles ambiguous Japanese writing styles well, and it copes well with newly coined words thanks to its large Web-based dictionary.

Like Juman, Juman++ also reports semantic information for words.

Kytea

Kytea is a tokenizer developed by Graham Neubig.

Kytea uses a different algorithm from MeCab and Juman: it decides each word boundary and tag with pointwise prediction instead of decoding over a word lattice.

http://www.phontron.com/kytea/

The name stands for Kyoto Text Analysis Toolkit.

Chasen

http://chasen-legacy.osdn.jp/

Kuromoji

http://www.atilika.org

How to use: http://rensanning.iteye.com/blog/2008575

Written in Java.

Kuromoji supports standard morphological analysis features such as:

Word segmentation - segmenting text into words (morphemes)
Part-of-speech tagging - assign word-categories (nouns, verbs, particles, adjectives, etc.)
Lemmatization - get dictionary forms for inflected verbs and adjectives
Readings - extract readings for kanji

Gosen

URL: https://github.com/westei/stanbol-gosen

http://code.google.com/p/lucene-gosen/

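Supports sentence splitting, word segmentation, POS tagging, and NER. Whether it is a standalone tool still needs to be confirmed.

Python packages: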

JapaneseTokenizer

A Python wrapper around several tokenizers, including MeCab, Juman, Juman++, and Kytea. Download: https://pypi.org/project/JapaneseTokenizer/1.3.0/

natto-py

https://pypi.org/project/natto-py/

Japanese NLP resources

Stopwords:

https://www.ranks.nl/stopwords/japanese
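A stopword list is typically applied after tokenization. The sketch below filters a pre-tokenized sentence against a small hand-picked set of particles and polite endings; this set is only an illustration and is not copied from the list linked above, and the function name `remove_stopwords` is hypothetical.

```python
# Small illustrative stopword set; in practice, load a full list from a file.
STOPWORDS = {"の", "に", "は", "を", "が", "で", "と", "も", "です", "ます"}


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Keep only the tokens that are not in the stopword set."""
    return [token for token in tokens if token not in stopwords]


# Hand-written tokenization of 私は東京に住んでいます (not tokenizer output).
tokens = ["私", "は", "東京", "に", "住ん", "で", "い", "ます"]
print(remove_stopwords(tokens))  # ['私', '東京', '住ん', 'い']
```

Filtering on surface forms is the simplest approach; filtering on the POS tags that the tokenizer returns is a common alternative.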