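Almost all of the data structures behind the input method's dictionary are defined in the header dictdef.h, under the cpp directory of the source tree.
To initialize the dictionary, the builder first reads the contents of the txt file into the data structures lemma_arr and valid_hzs. lemma_arr is an array of LemmaEntry, which is defined as follows (cpp/include/dictdef.h):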
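// Each line of rawdict_utf16_65105_freq.txt becomes one LemmaEntry.
// When the pinyin is recorded (in pinyin_str), letters are converted to upper case by default;
// only the h of a two-letter initial (Zh, Ch, Sh) stays lower case.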
struct LemmaEntry {
LemmaIdType idx_by_py;
LemmaIdType idx_by_hz;
char16 hanzi_str[kMaxLemmaSize + 1];
// The SingleCharItem id for each Hanzi.
uint16 hanzi_scis_ids[kMaxLemmaSize];
uint16 spl_idx_arr[kMaxLemmaSize + 1];
char pinyin_str[kMaxLemmaSize][kMaxPinyinSize + 1];
unsigned char hz_str_len;
float freq;
};
First, a look at the contents of rawdict_utf16_65105_freq.txt:
鼥 0.750684002197 1 ba
釛 0.781224156844 1 ba
軷 0.9691786136 1 ba
釟 0.9691786136 1 ba
蚆 1.15534975655 1 ba
......
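The file has 65105 lines, each in the format: hanzi frequency ? pinyin. In the struct, freq is the frequency and hz_str_len is the number of hanzi in the lemma. The two-dimensional array pinyin_str[8][7] holds the pinyin: a lemma is limited to 8 hanzi, and the pinyin of each hanzi to 6 letters plus the terminating '\0' (7 bytes in total). hanzi_str[8+1] holds the Unicode (UTF-16) code of each hanzi; the first character, “鼥”, for example, is encoded as 40741 (U+9F25). Printing the first element of lemma_arr_ in gdb gives: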
{idx_by_py = 0, idx_by_hz = 0, hanzi_str = {40741,
0,
0,
0,
0,
0,
0,
0,
0}, hanzi_scis_ids = {0,
0,
0,
0,
0,
0,
0,
0}, spl_idx_arr = {0,
0,
0,
0,
0,
0,
0,
0,
0}, pinyin_str = { "BA\000\000\000\000",
"\000\000\000\000\000\000",
"\000\000\000\000\000\000",
"\000\000\000\000\000\000",
"\000\000\000\000\000\000",
"\000\000\000\000\000\000",
"\000\000\000\000\000\000",
"\000\000\000\000\000\000"}, hz_str_len = 1 '\001', freq = 0.750684023}
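All pinyin has been converted to upper case, except for the h in two-letter initials. The hanzi_scis_ids field holds, for each hanzi of the lemma, its id in the single-character table scis. For example, the id that “鼥” from the first line has in scis is stored in the first element, hanzi_scis_ids[0]. The last line of the file is “欧洲市场”, so for that lemma the array holds the ids of “欧”, “洲”, “市” and “场” in order: if those ids were 34, 46, 29 and 200, then hanzi_scis_ids[0] = 34, hanzi_scis_ids[1] = 46, hanzi_scis_ids[2] = 29 and hanzi_scis_ids[3] = 200, and the remaining elements keep their initial value 0. The spl_idx_arr field holds the spelling id of each hanzi in the LemmaEntry. After skipping 50000 iterations in gdb, execution stops right at the word “叫声”; printing the relevant entries of lemma_arr_ shows: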
(gdb) p lemma_arr_[8117]
$31 = {idx_by_py = 0, idx_by_hz = 0, hanzi_str = {21483, 0, 0, 0, 0, 0, 0, 0, 0}, hanzi_scis_ids = {0, 0, 0, 0, 0, 0, 0, 0}, spl_idx_arr = {166, 0, 0, 0, 0, 0, 0, 0, 0}, pinyin_str = {"JIAO\000\000",
"\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000"},
hz_str_len = 1 '\001', freq = 55685.8672}
(gdb) p lemma_arr_[12781]
$32 = {idx_by_py = 0, idx_by_hz = 0, hanzi_str = {22768, 0, 0, 0, 0, 0, 0, 0, 0}, hanzi_scis_ids = {0, 0, 0, 0, 0, 0, 0, 0}, spl_idx_arr = {337, 0, 0, 0, 0, 0, 0, 0, 0}, pinyin_str = {"ShENG\000",
"\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000"},
hz_str_len = 1 '\001', freq = 7169.11719}
(gdb) p lemma_arr_[i-1]
$33 = {idx_by_py = 0, idx_by_hz = 0, hanzi_str = {21483, 22768, 0, 0, 0, 0, 0, 0, 0}, hanzi_scis_ids = {0, 0, 0, 0, 0, 0, 0, 0}, spl_idx_arr = {166, 337, 0, 0, 0, 0, 0, 0, 0}, pinyin_str = {
"JIAO\000\000", "ShENG\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000", "\000\000\000\000\000\000"},
hz_str_len = 2 '\002', freq = 564.171448}
(gdb) p i
$34 = 33619
(gdb)
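Index 8117 is where “叫” is stored in lemma_arr_, 12781 is where “声” is stored, and i-1 = 33618. The hanzi_str and spl_idx_arr of “叫声” at lemma_arr_[33618] are simply those of its two single characters put together. The dump also shows that the h in the two-letter initial of “ShENG” was kept lower case. The fields idx_by_py and idx_by_hz hold the ids of the lemma computed from its pinyin id array and from its hanzi id array, respectively.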
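As mentioned above, the hanzi_scis_ids field of LemmaEntry gives, for each hanzi of the lemma (call it a hanzi string), its id in the single-character table. That table, scis, is also an array; its element type is the struct SingleCharItem (cpp/include/dictdef.h):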
#ifdef ___BUILD_MODEL___
struct SingleCharItem {
float freq;
char16 hz;
SpellingId splid;
};
#endif  // ___BUILD_MODEL___
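The splid field is the spelling id of the single hanzi, hz is the hanzi itself, and freq is the frequency of that hanzi. The code that fills these in is in dictbuilder.cpp: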
size_t hz_num = lemma_arr_[pos].hz_str_len;
...
if (1 == hz_num)
scis_[scis_num_].freq = lemma_arr_[pos].freq;
else
scis_[scis_num_].freq = 0.000001;
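That is, when the lemma consists of exactly one hanzi, the item takes the lemma's freq; otherwise freq is set to 0.000001. Next, the type of splid is SpellingId, which is also a struct: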
typedef struct {
uint16 half_splid:5;
uint16 full_splid:11;
} SpellingId, *PSpellingId;
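half_splid and full_splid are declared as bit-fields: half_splid is 5 bits wide and can hold unsigned values up to 31 (binary 11111), while full_splid is 11 bits wide and can hold unsigned values up to 2^11 - 1 = 2047 (the sum of 2^0 through 2^10).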
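Next come the data structures used to build the dictionary trie: LmaNodeLE0 and LmaNodeGE1. They represent nodes on levels less than or equal to 0 (the root and level 0) and nodes on levels greater than or equal to 1, respectively. Their definitions, starting with LmaNodeLE0, are (cpp/include/dictdef.h):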
/**
* We use different node types for different layers
* Statistical data of the building result for a testing dictionary:
* root, level 0, level 1, level 2, level 3
* max son num of one node: 406 280 41 2 -
* max homo num of one node: 0 90 23 2 2
* total node num of a layer: 1 406 31766 13516 993
* total homo num of a layer: 9 5674 44609 12667 995
*
* The node number for root and level 0 won't be larger than 500
* According to the information above, two kinds of nodes can be used; one for
* root and level 0, the other for these layers deeper than 0.
*
* LE = less and equal,
* A node occupies 16 bytes. so, totallly less than 16 * 500 = 8K
*/
struct LmaNodeLE0 {
uint32 son_1st_off;
uint32 homo_idx_buf_off;
uint16 spl_idx;
uint16 num_of_son;
uint16 num_of_homo;
};
/**
* GE = great and equal
* A node occupies 8 bytes.
*/
struct LmaNodeGE1 {
uint16 son_1st_off_l; // Low bits of the son_1st_off
uint16 homo_idx_buf_off_l; // Low bits of the homo_idx_buf_off_1
uint16 spl_idx;
unsigned char num_of_son; // number of son nodes
unsigned char num_of_homo; // number of homo words
unsigned char son_1st_off_h; // high bits of the son_1st_off
unsigned char homo_idx_buf_off_h; // high bits of the homo_idx_buf_off
};
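LmaNodeLE0 and LmaNodeGE1 are mainly used in DictBuilder::construct_subset(...). When it is called from build_dict, the arguments are item_start=0 and item_end=65101, that is, from the first line to the last line of rawdict_utf16_65105_freq.txt; in other words, it walks the lemma_arr_ array to generate the layered trie. construct_subset calls itself recursively to add the nodes on level 0 and level 1. I have not fully worked out the exact layout yet, so a detailed description of these two structs will have to wait until I do.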
// Node used for the trie of spellings
struct SpellingNode {
SpellingNode *first_son;
// The spelling id for each node. If you need more bits to store
// spelling id, please adjust this structure.
uint16 spelling_idx:11;
uint16 num_of_son:5;
char char_this_node;
unsigned char score;
};
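The struct SpellingNode describes one node of the pinyin (spelling) trie and is defined in cpp/include/spellingtrie.h. first_son points to the first element of an array of child nodes, each of type SpellingNode. Every node of the syllable tree built by the construct method in spellingtrie.cpp has this type, and the first_son pointer of the root_ node points to the first element of level1_sons. num_of_son is declared as a 5-bit bit-field, so it can hold values from 0 to 31; it gives the number of child nodes that can follow the character in char_this_node. For example, when char_this_node is 'A' the node has 3 children: ai, an and ao. The score field is the score of this character; the smaller the score, the higher the search priority, and the root_ node has score = 0. The bit-field spelling_idx is the id, in the spelling list, of each letter that can form a syllable: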
{first_son = 0x617420, spelling_idx = 1, num_of_son = 3, char_this_node = 65 'A', score = 86 'V'},
{first_son = 0x617480, spelling_idx = 2, num_of_son = 5, char_this_node = 66 'B', score = 57 '9'},
{first_son = 0x617620, spelling_idx = 3, num_of_son = 6, char_this_node = 67 'C', score = 72 'H'},
{first_son = 0x6179e0, spelling_idx = 5, num_of_son = 5, char_this_node = 68 'D', score = 46 '.'},
{first_son = 0x617c50, spelling_idx = 6, num_of_son = 3, char_this_node = 69 'E', score = 79 'O'},
{first_son = 0x617cb0, spelling_idx = 7, num_of_son = 5, char_this_node = 70 'F', score = 72 'H'},
{first_son = 0x617e00, spelling_idx = 8, num_of_son = 4, char_this_node = 71 'G', score = 62 '>'},
{first_son = 0x617ff0, spelling_idx = 9, num_of_son = 4, char_this_node = 72 'H', score = 64 '@'},
{first_son = 0x6181e0, spelling_idx = 11, num_of_son = 2, char_this_node = 74 'J', score = 59 ';'},
{first_son = 0x618380, spelling_idx = 12, num_of_son = 4, char_this_node = 75 'K', score = 70 'F'},
{first_son = 0x618570, spelling_idx = 13, num_of_son = 6, char_this_node = 76 'L', score = 62 '>'},
{first_son = 0x618810, spelling_idx = 14, num_of_son = 5, char_this_node = 77 'M', score = 68 'D'},
{first_son = 0x6189e0, spelling_idx = 15, num_of_son = 6, char_this_node = 78 'N', score = 66 'B'},
{first_son = 0x618c70, spelling_idx = 16, num_of_son = 1, char_this_node = 79 'O', score = 109 'm'},
{first_son = 0x618c90, spelling_idx = 17, num_of_son = 5, char_this_node = 80 'P', score = 90 'Z'},
{first_son = 0x618e50, spelling_idx = 18, num_of_son = 2, char_this_node = 81 'Q', score = 66 'B'},
{first_son = 0x626ff0, spelling_idx = 19, num_of_son = 5, char_this_node = 82 'R', score = 65 'A'},
{first_son = 0x6271a0, spelling_idx = 20, num_of_son = 6, char_this_node = 83 'S', score = 46 '.'},
{first_son = 0x627540, spelling_idx = 22, num_of_son = 5, char_this_node = 84 'T', score = 70 'F'},
{first_son = 0x6277a0, spelling_idx = 25, num_of_son = 4, char_this_node = 87 'W', score = 61 '='},
{first_son = 0x627890, spelling_idx = 26, num_of_son = 2, char_this_node = 88 'X', score = 68 'D'},
{first_son = 0x627a30, spelling_idx = 27, num_of_son = 5, char_this_node = 89 'Y', score = 51 '3'},
{first_son = 0x627bd0, spelling_idx = 28, num_of_son = 6, char_this_node = 90 'Z', score = 61 '='}
That field, however, is 11 bits wide, so it can hold values up to 2^11 - 1 = 2047.