Go into the cpp directory (pwd=.../cpp/). The command directory contains pinyinime_dictbuilder.cpp, and in its source you can find the main function, which is the entry point of the dictionary build. Let's look at it:
25 /**
26 * Build binary dictionary model. Make sure that ___BUILD_MODEL___ is defined
27 * in dictdef.h.
28 */
29 int main(int argc, char* argv[]) {
30 DictTrie* dict_trie = new DictTrie();
31 bool success;
32 if (argc >= 3)
33 success = dict_trie->build_dict(argv[1], argv[2]);
34 else
35 success = dict_trie->build_dict("../data/rawdict_utf16_65105_freq.txt",
36 "../data/valid_utf16.txt");
37
38 if (success) {
39 printf("Build dictionary successfully.\n");
40 } else {
41 printf("Build dictionary unsuccessfully.\n");
42 return -1;
43 }
44
45 success = dict_trie->save_dict("../../res/raw/dict_pinyin.dat");
46
47 if (success) {
48 printf("Save dictionary successfully.\n");
49 } else {
50 printf("Save dictionary unsuccessfully.\n");
51 return -1;
52 }
53
54 return 0;
55 }
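As the code shows, save_dict writes everything to one binary file; as we will see below, the file is produced with plain fwrite calls over a series of in-memory structures. Here is a minimal sketch of that save/load pattern, with a made-up ToyHeader and layout that are NOT the real dict_pinyin.dat format:

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <vector>

// Hypothetical sketch of the "sequence of structs via fwrite" pattern the
// builder uses for dict_pinyin.dat; the real layout lives in
// DictTrie::save_dict and differs from this toy header.
struct ToyHeader {
  uint32_t magic;
  uint32_t item_num;
};

bool save_items(const char* fn, const std::vector<uint16_t>& items) {
  FILE* fp = fopen(fn, "wb");
  if (!fp) return false;
  ToyHeader hdr = {0x50494D45u, (uint32_t)items.size()};
  bool ok = fwrite(&hdr, sizeof(hdr), 1, fp) == 1 &&
            fwrite(items.data(), sizeof(uint16_t), items.size(), fp) ==
                items.size();
  fclose(fp);
  return ok;
}

bool load_items(const char* fn, std::vector<uint16_t>* out) {
  FILE* fp = fopen(fn, "rb");
  if (!fp) return false;
  ToyHeader hdr;
  bool ok = fread(&hdr, sizeof(hdr), 1, fp) == 1 && hdr.magic == 0x50494D45u;
  if (ok) {
    out->resize(hdr.item_num);
    ok = fread(out->data(), sizeof(uint16_t), hdr.item_num, fp) == hdr.item_num;
  }
  fclose(fp);
  return ok;
}
```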
According to the comment, this function builds the dictionary model. By default (fewer than three arguments) line 35 runs, building from the two UTF-16 txt files under data. If the build succeeds, line 45 saves the result; the output path and file name are easy to read off the code. The output is a binary file whose contents are a series of data structures laid out one after another. The core logic of the build is in the build_dict method called on line 35, so let's follow it (./share/dicttrie.cpp):
103 #ifdef ___BUILD_MODEL___
104 bool DictTrie::build_dict(const char* fn_raw, const char* fn_validhzs) {
105 DictBuilder* dict_builder = new DictBuilder();
106
107 free_resource(true);
108
109 return dict_builder->build_dict(fn_raw, fn_validhzs, this);
110 }
Line 105 creates a dict_builder, the object that actually performs the build. It then calls free_resource to release the data structures used during dictionary construction; these structures are exactly the objects that get written out to dict_pinyin.dat via fwrite when the dictionary is saved. With the resources released, the builder's build_dict method is called and the build begins (./share/dictbuilder.cpp):
bool DictBuilder::build_dict(const char *fn_raw,
const char *fn_validhzs,
DictTrie *dict_trie) {
...
lemma_num_ = read_raw_dict(fn_raw, fn_validhzs, 240000);
...
spl_buf = spl_table_->arrange(&spl_item_size, &spl_num);
...
// Organize all valid syllables into a trie via the construct method
if (!spl_trie.construct(spl_buf, spl_item_size, spl_num,
spl_table_->get_score_amplifier(),
spl_table_->get_average_score())) {
free_resource();
return false;
}
printf("spelling tree construct successfully.\n");
// Fill the spl_idx_arr field of every lemma_arr_ element: the spl_id of each hanzi's pinyin
// Convert the spelling string to idxs
for (size_t i = 0; i < lemma_num_; i++) {
for (size_t hz_pos = 0; hz_pos < (size_t)lemma_arr_[i].hz_str_len;
hz_pos++) {
...
int spl_idx_num =
spl_parser_->splstr_to_idxs(lemma_arr_[i].pinyin_str[hz_pos],
strlen(lemma_arr_[i].pinyin_str[hz_pos]),
spl_idxs, spl_start_pos, 2, is_pre);
...
if (spl_trie.is_half_id(spl_idxs[0])) {
uint16 num = spl_trie.half_to_full(spl_idxs[0], spl_idxs);
assert(0 != num);
}
lemma_arr_[i].spl_idx_arr[hz_pos] = spl_idxs[0];
}
}
...
// Sort the lemma items according to the hanzi, and give each unique item a
// id
// Sort by hanzi string, update the idx_by_hz field, and give each lemma a unique id
sort_lemmas_by_hz();
// Build the single-character table into scis_, then use it to update the hanzi_scis_ids field in lemma_arr_
scis_num_ = build_scis();
// Construct the dict list
dict_trie->dict_list_ = new DictList();
bool dl_success = dict_trie->dict_list_->init_list(scis_, scis_num_,
lemma_arr_, lemma_num_);
assert(dl_success);
// Construct the NGram information
NGram& ngram = NGram::get_instance();
ngram.build_unigram(lemma_arr_, lemma_num_,
lemma_arr_[lemma_num_ - 1].idx_by_hz + 1);
...
lma_nds_used_num_le0_ = 1; // The root node
bool dt_success = construct_subset(static_cast<void*>(lma_nodes_le0_),
lemma_arr_, 0, lemma_num_, 0);
...
if (kPrintDebug0) {
printf("Building dict succeds\n");
}
return dt_success;
}
The main logic of the dictionary build lives in this method; the listing keeps only the key calls, so the overall process is easy to follow. First, the two files are read into their data structures: the raw file is loaded into the lemma_arr_ array. Printing lemma_num_ shows that lemma_arr_ actually holds 65101 elements, yet rawdict_utf16_65105_freq.txt has 65105 lines. Why four fewer? Setting a breakpoint on the continue inside the for loop shows the cause is in read_raw_dict:
// The whole line must have been parsed fully, otherwise discard this one.
token = utf16_strtok(to_tokenize, &token_size, &to_tokenize);
if (spelling_not_support || NULL != token) {
i--;
continue;
}
spelling_not_support is true; printing the current index i confirms the file really does contain unsupported pinyin, e.g. line 6557:
哼 2072.17903804 0 hng
and lines 17035, 17036, and 17037:
噷 6.18262663209 1 hm
唔 1126.6237397 0 ng
嗯 31982.2903695 0 ng
which accounts for exactly four elements.
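The discard logic can be condensed into a toy version: entries whose pinyin is not in the valid syllable table are dropped, which is why the in-memory count (65101) is smaller than the raw file's line count (65105). The three-entry valid set below is a stand-in for the real 413-syllable table:

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Toy version of the read_raw_dict filtering: an entry whose pinyin is not a
// supported syllable is discarded, so the in-memory count ends up smaller
// than the line count of the raw file.
struct RawEntry {
  std::string hanzi;
  std::string pinyin;
};

size_t load_filtered(const std::vector<RawEntry>& lines,
                     const std::set<std::string>& valid,
                     std::vector<RawEntry>* out) {
  for (const RawEntry& e : lines) {
    if (valid.count(e.pinyin) == 0) continue;  // spelling_not_support case
    out->push_back(e);
  }
  return out->size();
}
```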
The valid-hanzi file is loaded into the valid_hzs array; printing it gives:
(gdb) ptype valid_hzs
type = unsigned short *
(gdb) p valid_hzs
$1 = (ime_pinyin::char16 *) 0x627190
(gdb) p *valid_hzs@10
$2 = {12295, 19968, 19969, 19971, 19975, 19976, 19977, 19978, 19979, 19980}
(gdb)
valid_hzs is a pointer to an array; only the first ten elements are printed here. The first value, 12295, is the Unicode code point of the hanzi "〇", which is easy to verify. lemma_arr_ was already dumped in the earlier article on the data structures behind the Google LatinIME dictionary build. These two arrays are the foundation of everything that follows. After read_raw_dict() returns, spl_table_->arrange is called; it returns a pointer spl_buf to a sorted array holding every valid Mandarin syllable, 413 in total. Then spl_trie.construct builds a trie over all 413 valid syllables; its arguments are, in order, the syllable buffer spl_buf, the size of each element, the number of elements, an amplifier used to compute each syllable's score, and an average score. The concrete values:
(gdb) p spl_num
$7 = 413
(gdb) p spl_item_size
$8 = 8
(gdb) p spl_table_->get_score_amplifier()
$9 = -14.1073904
(gdb) p spl_table_->get_average_score()
$10 = 100 'd'
(gdb)
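The arrange layout suggested by these values (spl_item_size = 8, spl_num = 413) can be sketched as follows: every syllable occupies a fixed-size slot in one contiguous, sorted buffer, so slot i starts at buf + i * item_size. The slot size and sample syllables below are illustrative only:

```cpp
#include <algorithm>
#include <cassert>
#include <cstring>
#include <string>
#include <vector>

// Sketch of the SpellingTable::arrange layout: each syllable gets a fixed
// item_size-byte slot in one sorted contiguous buffer, so the i-th syllable
// starts at buf[i * item_size]. Toy data, not the real 413-entry table.
std::vector<char> arrange(std::vector<std::string> spellings,
                          size_t item_size) {
  std::sort(spellings.begin(), spellings.end());
  std::vector<char> buf(spellings.size() * item_size, 0);
  for (size_t i = 0; i < spellings.size(); i++)
    strncpy(&buf[i * item_size], spellings[i].c_str(), item_size - 1);
  return buf;
}
```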
Each node is described by the struct SpellingNode:
(gdb) ptype first_son
type = struct ime_pinyin::SpellingNode {
ime_pinyin::SpellingNode *first_son;
ime_pinyin::uint16 spelling_idx : 11;
ime_pinyin::uint16 num_of_son : 5;
char char_this_node;
unsigned char score;
} *
(gdb)
The trie built by spl_trie has two levels: level 0 is root_, and level 1 consists of root_'s son nodes. Printing spl_trie gives:
spl_trie = @0x6160a0: {static kMaxYmNum = 64, static kValidSplCharNum = 26, static kHalfIdShengmuMask = 1, static kHalfIdYunmuMask = 2, static kHalfIdSzmMask = 4,
static kHalfId2Sc_ = "0ABCcDEFGHIJKLMNOPQRSsTUVWXYZz", static char_flags_ = 0x615140 <ime_pinyin::SpellingTrie::char_flags_> "\006\005\005\005\006\005\005\005", static instance_ = 0x6160a0,
spelling_buf_ = 0x616510 "A", spelling_size_ = 8, spelling_num_ = 413, score_amplifier_ = -14.1073904, average_score_ = 100 'd', spl_ym_ids_ = 0x628350 "", ym_buf_ = 0x628eb0 "A", ym_size_ = 6, ym_num_ = 33,
splstr_queried_ = 0x617200 "ZhUO", splstr16_queried_ = 0x617220, root_ = 0x617240, dumb_node_ = 0x617260, splitter_node_ = 0x617280, level1_sons_ = {0x6172a0, 0x6172b0, 0x6172c0, 0x6172d0, 0x6172e0, 0x6172f0,
0x617300, 0x617310, 0x0, 0x617320, 0x617330, 0x617340, 0x617350, 0x617360, 0x617370, 0x617380, 0x617390, 0x6173a0, 0x6173b0, 0x6173c0, 0x0, 0x0, 0x6173d0, 0x6173e0, 0x6173f0, 0x617400}, h2f_start_ = {0, 30,
35, 51, 67, 86, 109, 114, 124, 143, 0, 162, 176, 195, 221, 241, 266, 268, 285, 299, 313, 329, 348, 0, 0, 368, 377, 391, 406, 423}, h2f_num_ = {0, 5, 16, 35, 19, 23, 5, 10, 19, 19, 0, 14, 19, 26, 20, 25, 2,
17, 14, 14, 35, 19, 20, 0, 0, 9, 14, 15, 37, 20}, f2h_ = 0x627fb0, node_num_ = 496}
__PRETTY_FUNCTION__ = "bool ime_pinyin::DictBuilder::build_dict(const char*, const char*, ime_pinyin::DictTrie*)"
Here the address of root_ is 0x617240; root_ is itself a struct of type SpellingNode. Printing its first_son with gdb:
(gdb) p spl_trie->root_.first_son
$58 = (ime_pinyin::SpellingNode *) 0x6172a0
(gdb)
level1_sons_ = {0x6172a0, 0x6172b0, 0x6172c0, 0x6172d0, 0x6172e0, 0x6172f0,
0x617300, 0x617310, 0x0, 0x617320, 0x617330, 0x617340, 0x617350, 0x617360, 0x617370, 0x617380, 0x617390, 0x6173a0, 0x6173b0, 0x6173c0, 0x0, 0x0, 0x6173d0, 0x6173e0, 0x6173f0, 0x617400}
The first_son of root_ is 0x6172a0, which is exactly the address of the first element of level1_sons_: the root points directly at the level-1 array. Level 1 has 26 slots, and their char_this_node values are the uppercase letters A through Z:
(gdb) p spl_trie->level1_sons_[0].char_this_node
$60 = 65 'A'
(gdb) p spl_trie->level1_sons_[1].char_this_node
$61 = 66 'B'
(gdb) p spl_trie->level1_sons_[3].char_this_node
$62 = 68 'D'
(gdb) p spl_trie->level1_sons_[2].char_this_node
$63 = 67 'C'
(gdb) p spl_trie->level1_sons_[4].char_this_node
$64 = 69 'E'
(gdb) p spl_trie->level1_sons_[25].char_this_node
$65 = 90 'Z'
(gdb) p spl_trie->level1_sons_[26].char_this_node
Cannot access memory at address 0x330023001e000a
(gdb)
But not every letter can begin a syllable: the ninth element of level1_sons_ (index 8) would be "I", so its slot holds 0x0. (The two 0x0 slots at indices 20 and 21, "U" and "V", have the same explanation.)
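So the level-1 lookup is a direct index, level1_sons_[c - 'A'], with null pointers in the slots of letters that cannot start a syllable. A minimal sketch with a dummy node type:

```cpp
#include <cassert>

// The level-1 lookup is a plain array index: slot c - 'A'. Letters that
// cannot begin a Mandarin syllable (I, U, V) hold a null pointer, which is
// why those slots printed as 0x0 in the gdb dump.
struct Node {
  char ch;
};

Node* find_level1(Node* const sons[26], char c) {
  if (c < 'A' || c > 'Z') return nullptr;
  return sons[c - 'A'];
}
```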
Next, how many sons does the first element of level1_sons_ have?
(gdb) p spl_trie->level1_sons_[0].num_of_son
$67 = 3
(gdb)
Right, three. Which three? The answer lies in the spelling buffer (spl_buf), and verifying it is easy, just keep digging:
(gdb) p spl_trie->level1_sons_[0].first_son[0].char_this_node
$70 = 73 'I'
(gdb) p spl_trie->level1_sons_[0].first_son[1].char_this_node
$71 = 78 'N'
(gdb) p spl_trie->level1_sons_[0].first_son[2].char_this_node
$72 = 79 'O'
These are AI, AN, and AO, and AN actually continues further down to ANG. At this point the shape of the tree inside spl_trie is clear. Now look back at root_'s first_son pointer: it points at a contiguous array of SpellingNode structs, whose contents are:
{first_son = 0x617420, spelling_idx = 1, num_of_son = 3, char_this_node = 65 'A', score = 86 'V'},
{first_son = 0x617480, spelling_idx = 2, num_of_son = 5, char_this_node = 66 'B', score = 57 '9'},
{first_son = 0x617620, spelling_idx = 3, num_of_son = 6, char_this_node = 67 'C', score = 72 'H'},
{first_son = 0x6179e0, spelling_idx = 5, num_of_son = 5, char_this_node = 68 'D', score = 46 '.'},
{first_son = 0x617c50, spelling_idx = 6, num_of_son = 3, char_this_node = 69 'E', score = 79 'O'},
{first_son = 0x617cb0, spelling_idx = 7, num_of_son = 5, char_this_node = 70 'F', score = 72 'H'},
{first_son = 0x617e00, spelling_idx = 8, num_of_son = 4, char_this_node = 71 'G', score = 62 '>'},
{first_son = 0x617ff0, spelling_idx = 9, num_of_son = 4, char_this_node = 72 'H', score = 64 '@'},
{first_son = 0x6181e0, spelling_idx = 11, num_of_son = 2, char_this_node = 74 'J', score = 59 ';'},
{first_son = 0x618380, spelling_idx = 12, num_of_son = 4, char_this_node = 75 'K', score = 70 'F'},
{first_son = 0x618570, spelling_idx = 13, num_of_son = 6, char_this_node = 76 'L', score = 62 '>'},
{first_son = 0x618810, spelling_idx = 14, num_of_son = 5, char_this_node = 77 'M', score = 68 'D'},
{first_son = 0x6189e0, spelling_idx = 15, num_of_son = 6, char_this_node = 78 'N', score = 66 'B'},
{first_son = 0x618c70, spelling_idx = 16, num_of_son = 1, char_this_node = 79 'O', score = 109 'm'},
{first_son = 0x618c90, spelling_idx = 17, num_of_son = 5, char_this_node = 80 'P', score = 90 'Z'},
{first_son = 0x618e50, spelling_idx = 18, num_of_son = 2, char_this_node = 81 'Q', score = 66 'B'},
{first_son = 0x626ff0, spelling_idx = 19, num_of_son = 5, char_this_node = 82 'R', score = 65 'A'},
{first_son = 0x6271a0, spelling_idx = 20, num_of_son = 6, char_this_node = 83 'S', score = 46 '.'},
{first_son = 0x627540, spelling_idx = 22, num_of_son = 5, char_this_node = 84 'T', score = 70 'F'},
{first_son = 0x6277a0, spelling_idx = 25, num_of_son = 4, char_this_node = 87 'W', score = 61 '='},
{first_son = 0x627890, spelling_idx = 26, num_of_son = 2, char_this_node = 88 'X', score = 68 'D'},
{first_son = 0x627a30, spelling_idx = 27, num_of_son = 5, char_this_node = 89 'Y', score = 51 '3'},
{first_son = 0x627bd0, spelling_idx = 28, num_of_son = 6, char_this_node = 90 'Z', score = 61 '='}
The elements stored behind root_->first_son are the same ones level1_sons_ points at; the addresses differ (they are not the same pointer), but the contents are identical. With that, the structure of the tree built by spl_trie is fully clear.
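The two-level walk can be reproduced with a toy depth-first traversal: each node carries one character and an array of sons, so appending char_this_node along the way re-derives syllables such as A, AI, AN, ANG, AO. The ToyNode type below is a stand-in for the real SpellingNode:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy depth-first walk of the spelling tree: each node holds one character
// and a contiguous list of sons, so collecting char_this_node along each
// path re-derives the full syllable list. Not the real SpellingNode.
struct ToyNode {
  char ch;
  std::vector<ToyNode> sons;
  bool is_syllable;  // marks nodes that terminate a legal syllable
};

void collect(const ToyNode& n, std::string prefix,
             std::vector<std::string>* out) {
  prefix += n.ch;
  if (n.is_syllable) out->push_back(prefix);
  for (const ToyNode& s : n.sons) collect(s, prefix, out);
}
```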
The figure above sketches the tree using only char_this_node; each node is really a full SpellingNode struct. Now on to the for loop:
// Fill the spl_idx_arr field of every lemma_arr_ element: the spl_id of each hanzi's pinyin
// Convert the spelling string to idxs
for (size_t i = 0; i < lemma_num_; i++) {
for (size_t hz_pos = 0; hz_pos < (size_t)lemma_arr_[i].hz_str_len;
hz_pos++) {
...
int spl_idx_num =
spl_parser_->splstr_to_idxs(lemma_arr_[i].pinyin_str[hz_pos],
strlen(lemma_arr_[i].pinyin_str[hz_pos]),
spl_idxs, spl_start_pos, 2, is_pre);
...
if (spl_trie.is_half_id(spl_idxs[0])) {
uint16 num = spl_trie.half_to_full(spl_idxs[0], spl_idxs);
assert(0 != num);
}
lemma_arr_[i].spl_idx_arr[hz_pos] = spl_idxs[0];
}
}
The outer loop runs from 0 to 65101, i.e. over the whole lemma_arr_ array; the inner loop walks the pinyin of each hanzi in the lemma and calls spl_parser_->splstr_to_idxs() to map each pinyin string to its id. The details of that mapping are left to the next post.
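One detail of the loop worth sketching is the half_to_full call. Judging from the h2f_start_ and h2f_num_ arrays in the spl_trie dump above, the full ids of syllables sharing an initial occupy one contiguous range, so two small per-initial tables suffice. A toy version (this interpretation of the tables is my reading of the dump, not verified against SpellingTrie):

```cpp
#include <cassert>
#include <cstdint>

// Assumed half-id -> full-id expansion: for a half id (a bare initial such as
// 'Q' or "Zh"), h2f_start gives the first full syllable id of that initial
// and h2f_num how many full syllables it has. Table values below copy the
// first entries of the gdb dump; the real tables are h2f_start_ / h2f_num_.
uint16_t half_to_full(uint16_t half_id, const uint16_t* h2f_start,
                      const uint16_t* h2f_num, uint16_t* first_full_id) {
  *first_full_id = h2f_start[half_id];
  return h2f_num[half_id];  // number of full syllables for this initial
}
```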
Once the loop finishes, the single-hanzi table is built; before that, lemma_arr_ is sorted by hanzi and each lemma is given a unique id:
// Sort the lemma items according to the hanzi, and give each unique item a
// id
// Sort by hanzi string, update the idx_by_hz field, and give each lemma a unique id
sort_lemmas_by_hz();
// Build the single-character table into scis_, then use it to update the hanzi_scis_ids field in lemma_arr_
scis_num_ = build_scis();
Then the single-hanzi table scis is built, but first the lemma_arr_ array is sorted by the decimal Unicode value of the hanzi. Here are the first ten elements of lemma_arr_:
{
{idx_by_py = 0, idx_by_hz = 1, hanzi_str = {12295, 0, 0, 0, 0, 0, 0, 0, 0}, hanzi_scis_ids = {1, 0, 0, 0, 0, 0, 0, 0}, spl_idx_arr = {210, 0, 0, 0, 0, 0, 0, 0, 0}, pinyin_str = {"LING\000\000",
"\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000"},
hz_str_len = 1 '\001', freq = 248.484543},
{idx_by_py = 0, idx_by_hz = 2, hanzi_str = {19968, 0, 0, 0, 0, 0, 0, 0, 0}, hanzi_scis_ids = {2, 0, 0, 0, 0, 0, 0, 0}, spl_idx_arr = {396, 0, 0, 0, 0, 0, 0, 0, 0},
pinyin_str = {"YI\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000",
"\000\000\000\000\000\000"}, hz_str_len = 1 '\001', freq = 134392.703},
{idx_by_py = 0, idx_by_hz = 3, hanzi_str = {19969, 0, 0, 0, 0, 0, 0, 0, 0}, hanzi_scis_ids = {3, 0, 0, 0, 0, 0, 0, 0}, spl_idx_arr = {
100, 0, 0, 0, 0, 0, 0, 0, 0}, pinyin_str = {"DING\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000",
"\000\000\000\000\000\000", "\000\000\000\000\000\000"}, hz_str_len = 1 '\001', freq = 4011.11377},
{idx_by_py = 0, idx_by_hz = 4, hanzi_str = {19969, 0, 0, 0, 0, 0, 0, 0, 0}, hanzi_scis_ids = {4, 0, 0, 0,
0, 0, 0, 0}, spl_idx_arr = {431, 0, 0, 0, 0, 0, 0, 0, 0}, pinyin_str = {"ZhENG\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000",
"\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000"}, hz_str_len = 1 '\001', freq = 3.37402463},
{idx_by_py = 0, idx_by_hz = 5, hanzi_str = {19971, 0, 0, 0, 0, 0, 0, 0, 0},
hanzi_scis_ids = {5, 0, 0, 0, 0, 0, 0, 0}, spl_idx_arr = {285, 0, 0, 0, 0, 0, 0, 0, 0}, pinyin_str = {"QI\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000",
"\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000"}, hz_str_len = 1 '\001', freq = 6313.39502},
{idx_by_py = 0, idx_by_hz = 6, hanzi_str = {
19975, 0, 0, 0, 0, 0, 0, 0, 0}, hanzi_scis_ids = {6, 0, 0, 0, 0, 0, 0, 0}, spl_idx_arr = {238, 0, 0, 0, 0, 0, 0, 0, 0}, pinyin_str = {"MO\000\000\000\000", "\000\000\000\000\000\000",
"\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000"}, hz_str_len = 1 '\001',
freq = 4.85489225},
{idx_by_py = 0, idx_by_hz = 7, hanzi_str = {19975, 0, 0, 0, 0, 0, 0, 0, 0}, hanzi_scis_ids = {7, 0, 0, 0, 0, 0, 0, 0}, spl_idx_arr = {370, 0, 0, 0, 0, 0, 0, 0, 0}, pinyin_str = {
"WAN\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000",
"\000\000\000\000\000\000"}, hz_str_len = 1 '\001', freq = 25941.043},
{idx_by_py = 0, idx_by_hz = 8, hanzi_str = {19976, 0, 0, 0, 0, 0, 0, 0, 0}, hanzi_scis_ids = {8, 0, 0, 0, 0, 0, 0, 0}, spl_idx_arr = {
426, 0, 0, 0, 0, 0, 0, 0, 0}, pinyin_str = {"ZhANG\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000",
"\000\000\000\000\000\000", "\000\000\000\000\000\000"}, hz_str_len = 1 '\001', freq = 305.971039},
{idx_by_py = 0, idx_by_hz = 9, hanzi_str = {19977, 0, 0, 0, 0, 0, 0, 0, 0}, hanzi_scis_ids = {9, 0, 0, 0,
0, 0, 0, 0}, spl_idx_arr = {315, 0, 0, 0, 0, 0, 0, 0, 0}, pinyin_str = {"SAN\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000",
"\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000"}, hz_str_len = 1 '\001', freq = 26761.9336},
{idx_by_py = 0, idx_by_hz = 10, hanzi_str = {19978, 0, 0, 0, 0, 0, 0, 0, 0},
hanzi_scis_ids = {10, 0, 0, 0, 0, 0, 0, 0}, spl_idx_arr = {332, 0, 0, 0, 0, 0, 0, 0, 0}, pinyin_str = {"ShANG\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000",
"\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000"}, hz_str_len = 1 '\001', freq = 284918.875}
}
With scis built, every hanzi reading has an id, which is used to update the hanzi_scis_ids field in lemma_arr_. The size and contents of the finished scis table (first ten entries only):
(gdb) p scis_num
$141 = 17038
(gdb) p *scis@10
$142 = {{freq = 0, hz = 0, splid = {half_splid = 0, full_splid = 0}},
{freq = 248.484543, hz = 12295, splid = {half_splid = 13, full_splid = 210}},
{freq = 134392.703, hz = 19968, splid = {half_splid = 27, full_splid = 396}},
{freq = 4011.11377, hz = 19969, splid = {half_splid = 5, full_splid = 100}},
{freq = 3.37402463, hz = 19969, splid = {half_splid = 29, full_splid = 431}},
{freq = 6313.39502, hz = 19971, splid = {half_splid = 18, full_splid = 285}},
{freq = 4.85489225, hz = 19975, splid = {half_splid = 14, full_splid = 238}},
{freq = 25941.043, hz = 19975, splid = {half_splid = 25, full_splid = 370}},
{freq = 305.971039, hz = 19976, splid = {half_splid = 29, full_splid = 426}},
{freq = 26761.9336, hz = 19977, splid = {half_splid = 20, full_splid = 315}}}
(gdb)
valid_utf16.txt contains 16466 hanzi in total, while scis holds 17037 entries apart from entry 0. Why more than valid_utf16.txt? Because some characters have several readings: "丨", for instance, can be read "gun", "e", or "shu", so the scis total exceeds the character count of valid_utf16.txt. Next, the dictionary list is initialized from the lemma_arr_ array, which by now is sorted by hanzi with ids assigned starting from 1. The call:
// Construct the dict list
dict_trie->dict_list_ = new DictList();
bool dl_success = dict_trie->dict_list_->init_list(scis_, scis_num_,
lemma_arr_, lemma_num_);
assert(dl_success);
init_list in turn calls fill_scis and fill_list:
#ifdef ___BUILD_MODEL___
bool DictList::init_list(const SingleCharItem *scis, size_t scis_num,
const LemmaEntry *lemma_arr, size_t lemma_num) {
if (NULL == scis || 0 == scis_num || NULL == lemma_arr || 0 == lemma_num)
return false;
initialized_ = false;
if (NULL != buf_)
free(buf_);
// calculate the size
size_t buf_size = calculate_size(lemma_arr, lemma_num);
if (0 == buf_size)
return false;
// allocate resources
if (!alloc_resource(buf_size, scis_num))
return false;
// Fill the scis_hz_ and scis_splid_ arrays from the scis array
fill_scis(scis, scis_num);
// Copy the related content from the array to inner buffer
fill_list(lemma_arr, lemma_num);
initialized_ = true;
return true;
}
fill_scis is just a for loop that copies each hz field of scis into the scis_hz_ array and each splid field into scis_splid_:
void DictList::fill_scis(const SingleCharItem *scis, size_t scis_num) {
assert(scis_num_ == scis_num);
for (size_t pos = 0; pos < scis_num_; pos++) {
scis_hz_[pos] = scis[pos].hz;
scis_splid_[pos] = scis[pos].splid;
}
}
The resulting scis_hz_ array (first ten elements only):
(gdb) p *scis_hz_@10
$152 = {0,
12295,
19968,
19969,
19969,
19971,
19975,
19975,
19976,
19977}
And the scis_splid_ array:
(gdb) p *scis_splid_@10
$154 = {{half_splid = 0, full_splid = 0},
{half_splid = 13, full_splid = 210},
{half_splid = 27, full_splid = 396},
{half_splid = 5, full_splid = 100},
{half_splid = 29, full_splid = 431},
{half_splid = 18, full_splid = 285},
{half_splid = 14, full_splid = 238},
{half_splid = 25, full_splid = 370},
{half_splid = 29, full_splid = 426},
{half_splid = 20, full_splid = 315}}
(gdb)
Its length is simply the length of scis. Right after fill_scis, init_list calls fill_list, which initializes the buf_ array:
void DictList::fill_list(const LemmaEntry* lemma_arr, size_t lemma_num) {
size_t current_pos = 0;
utf16_strncpy(buf_, lemma_arr[0].hanzi_str,
lemma_arr[0].hz_str_len);
current_pos = lemma_arr[0].hz_str_len;
size_t id_num = 1;
for (size_t i = 1; i < lemma_num; i++) {
utf16_strncpy(buf_ + current_pos, lemma_arr[i].hanzi_str,
lemma_arr[i].hz_str_len);
id_num++;
current_pos += lemma_arr[i].hz_str_len;
}
assert(current_pos == start_pos_[kMaxLemmaSize]);
assert(id_num == start_id_[kMaxLemmaSize]);
}
The parameters are the lemma_arr array and its length; as noted, lemma_arr has already been sorted by hanzi. The final buf_ array is 150837 elements long, and its contents are:
(gdb) p *buf_@10
$230 = {12295, 19968, 19969, 19969, 19971, 19975, 19975, 19976, 19977, 19978}
These values correspond to the hanzi in rawdict_utf16_65105_freq.txt: the hanzi_str field of each sorted lemma is laid down in buf_ one after another. hanzi_str may hold one, two, three, or four hanzi, but it makes no difference, they are simply concatenated into buf_; presumably the search code figures out the boundaries, a question we'll leave open for now. This step initializes three arrays, scis_hz_, scis_splid_, and buf_; only when all three initialize successfully (dl_success is true) does the assert pass. The build then moves on to the n-gram model, which mainly supports the later prediction feature; its construction is studied separately in another article. Here let's look at DictBuilder::construct_subset. When called from build_dict it receives item_start=0 and item_end=65101, i.e. it walks the whole lemma_arr_ array, building the dictionary trie from the root node, starting at level 0:
// 1. Scan for how many sons
size_t parent_son_num = 0;
// LemmaNode *son_1st = NULL;
// parent.num_of_son = 0;
LemmaEntry *lma_last_start = lemma_arr + item_start;
uint16 spl_idx_node = lma_last_start->spl_idx_arr[level];
// Scan for how many sons to be allocated
for (size_t i = item_start + 1; i < item_end; i++) {
LemmaEntry *lma_current = lemma_arr + i;
uint16 spl_idx_current = lma_current->spl_idx_arr[level];
if (spl_idx_current != spl_idx_node) {
parent_son_num++;
spl_idx_node = spl_idx_current;
}
}
parent_son_num++;
This for loop walks lemma_arr_ to count how many son nodes are needed. When is a node added? Exactly when spl_idx_current != spl_idx_node holds. If you read the first article on the LatinIME data structures, you'll remember that LemmaEntry has a field spl_idx_arr: the id of each hanzi's pinyin within the spelling table (spl_buf). Whenever two neighbouring lemmas differ there, a new node is needed. Stepping past the loop and printing parent_son_num gives exactly 413, the number of entries in spl_buf, i.e. the total number of valid Mandarin syllables.
The analysis of the dictionary build is not finished yet; for reasons of length it will continue in part two of this series. If there are any errors or omissions in this article, corrections are welcome!