Chinese word segmentation with CRF++ (People's Daily corpus)

Once CRF++ is installed (only crf_learn.exe and crf_test.exe are actually used here) and the People's Daily data is downloaded, we can get ready to train with the CRF method.

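The first step is to process the data and generate the files needed for training. We strip unnecessary symbols, spaces and so on, tag each character with its position inside its word (single-character and multi-character words are handled separately), and write out several files. The data is split between training and test sets at a ratio of 9:1.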
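To make the position tagging concrete, here is a minimal sketch of the S/B/M/E scheme: a single-character word gets S, and a longer word gets B on its first character, M on the middle ones and E on the last. The helper name `tag_word` is only an illustration, and the sketch assumes the word is non-empty.

```python
def tag_word(word):
    """Return (character, tag) pairs for one non-empty word (S/B/M/E scheme)."""
    if len(word) == 1:
        # A single-character word is tagged S
        return [(word, 'S')]
    # First character B, middle characters M, last character E
    middle = [(ch, 'M') for ch in word[1:-1]]
    return [(word[0], 'B')] + middle + [(word[-1], 'E')]


# Print one word in the "character<TAB>tag" layout used by the data files
for ch, tag in tag_word('人民日报'):
    print(ch + '\t' + tag)
```

Each printed line has the same "character, tab, tag" layout as the lines written to train.data.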

get_train_data.py

# -*- coding: utf-8 -*-
'''
Process the People's Daily corpus and generate train.data and the other
files needed to train a CRF model. Written for Python 3.
'''
import codecs


def write_tagged(out_file, word):
    # Write one character per line, followed by its position tag
    if len(word) == 1:  # single-character word: S
        out_file.write(word + '\tS\n')
    else:  # multi-character word: first / middle / last -> B / M / E
        out_file.write(word[0] + '\tB\n')
        for w in word[1:-1]:
            out_file.write(w + '\tM\n')
        out_file.write(word[-1] + '\tE\n')


def convertTag():
    src_file = codecs.open('./' + 'guomengfei.txt', 'r', 'utf-8')  # annotated corpus
    # Output files: train.data / test.data / test_real.data
    test_real_file = codecs.open('./' + 'test_real.data', 'w', 'utf-8')
    test_file = codecs.open('./' + 'test.data', 'w', 'utf-8')
    train_file = codecs.open('./' + 'train.data', 'w', 'utf-8')

    i = 0  # number of non-empty lines read
    for line in src_file:
        line = line.strip('\r\n\t')  # strip leading/trailing control characters
        if line == "":  # skip empty lines
            continue
        i += 1
        terms = line.split(' ')
        test = False

        if i % 10 == 0:  # every 10th line goes to the test set (9:1 split)
            test = True
            print(line)

        for term in terms:
            term = term.strip('\t ')
            i1 = term.find('[')  # drop a leading '['
            if i1 >= 0 and len(term) > i1 + 1:
                term = term[i1 + 1:]  # keep what follows '['
            i2 = term.find(']')  # same for ']'
            if i2 >= 0 and len(term) > i2 + 1:
                term = term[:i2]  # keep what precedes ']'
            if '/' not in term:  # nothing left to process
                continue

            word, pos = term.rsplit('/', 1)  # word, part of speech
            if pos == 'm' or not word:  # skip numerals (and empty words)
                continue
            if test:
                # test.data: every character gets the placeholder label B
                for w in word:
                    test_file.write(w + '\tB\n')
                # test_real.data: the correct labels
                write_tagged(test_real_file, word)
            else:
                # train.data
                write_tagged(train_file, word)
        # A blank line separates sentences
        if test:
            test_file.write('\n')
            test_real_file.write('\n')
        else:
            train_file.write('\n')
    print(i)

    for f in (src_file, test_real_file, test_file, train_file):
        f.close()


if __name__ == '__main__':
    convertTag()

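Running get_train_data.py generates several files:

train.data: the training data, including the label (the correct answer)

test.data: the test data; every label is set to B as a placeholder, and the real tags are left for the model to predict

test_real.data: the correct answers for the test data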
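Since test.data and test_real.data should contain exactly the same characters and differ only in the label column, a quick sanity check is to compare their first columns. The sketch below is only an illustration: `first_column` and `is_aligned` are hypothetical helper names, and the sample lines are made up rather than taken from the corpus.

```python
def first_column(lines):
    """Return the character column of a data file; '' marks a sentence break."""
    return [line.rstrip('\r\n').split('\t')[0] for line in lines]


def is_aligned(test_lines, real_lines):
    """True if both files list the same characters in the same order."""
    return first_column(test_lines) == first_column(real_lines)


# Made-up sample in the "character<TAB>label" layout of the generated files
test_sample = ['北\tB', '京\tB', '的\tB', '']
real_sample = ['北\tB', '京\tE', '的\tS', '']
print(is_aligned(test_sample, real_sample))
```

With the real files on disk, the two arguments can be file objects opened with `open('test.data', encoding='utf-8')` and `open('test_real.data', encoding='utf-8')`.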

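Then cd into the directory and run the two commands below from the command line: the first trains the model and the second tests it. My txt file is 199801_utf8, and template is the template file, which you may need to copy over from the crfpp package (look around for it there). Note that crf_test expects its input in the same column format as train.data. The test output (the segmentation result) is written to result.txt.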

crf_learn template train.data model

crf_test -m model 199801_utf8.txt > result.txt
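The result file is still one character per line, so it has to be turned back into words before it reads as segmented text. The following is a minimal sketch, assuming the output is tab-separated with the character in the first column and the predicted tag in the last column, and with a blank line between sentences. `parse_crf_output` and `tags_to_words` are hypothetical helper names, and the sample lines are made up.

```python
def parse_crf_output(lines):
    """Group output lines into sentences of (character, predicted_tag) pairs."""
    sentences, rows = [], []
    for line in lines:
        line = line.rstrip('\r\n')
        if not line:  # a blank line ends the current sentence
            if rows:
                sentences.append(rows)
                rows = []
            continue
        cols = line.split('\t')
        rows.append((cols[0], cols[-1]))  # first column: char, last: prediction
    if rows:
        sentences.append(rows)
    return sentences


def tags_to_words(rows):
    """Rebuild words from (character, tag) pairs using the S/B/M/E scheme."""
    words, current = [], ''
    for ch, tag in rows:
        if tag in ('S', 'B') and current:
            # A new word starts, so flush any unfinished one
            words.append(current)
            current = ''
        current += ch
        if tag in ('S', 'E'):  # the word is complete
            words.append(current)
            current = ''
    if current:  # flush a word left open at the end of the sentence
        words.append(current)
    return words


# Made-up sample: character, placeholder label, predicted tag
sample = ['我\tB\tS', '爱\tB\tS', '北\tB\tB', '京\tB\tE', '',
          '你\tB\tB', '好\tB\tE', '']
segmented = [' '.join(tags_to_words(s)) for s in parse_crf_output(sample)]
print(segmented)
```

To run it on the real output, pass `open('result.txt', encoding='utf-8')` to `parse_crf_output` instead of the sample list.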
