custom_replace(replace_pattern)
Function: replaces text according to a list of rules.
Parameters:
- replace_pattern: a list of (pattern, replacement) rules; patterns may be regular expressions.
Example:
from torchtext.data.functional import custom_replace
custom_replace_transform = custom_replace([(r'[Se]', '#'), (r'\s+', '_')])
list_a = ["Sentencepiece encode aS pieces", "exampleS to try!"]
print(list(custom_replace_transform(list_a)))
Example output:
['##nt#nc#pi#c#_#ncod#_a#_pi#c#s', '#xampl##_to_try!']
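For intuition, the behavior above can be approximated by applying re.sub rule by rule. The helper below (my_custom_replace is a hypothetical name, not part of torchtext) is a minimal sketch, not the library's implementation:

```python
import re

def my_custom_replace(replace_pattern):
    """Sketch of custom_replace semantics: apply each
    (pattern, replacement) rule in order to every input string."""
    compiled = [(re.compile(pat), repl) for pat, repl in replace_pattern]

    def transform(lines):
        for line in lines:
            for pat, repl in compiled:
                line = pat.sub(repl, line)
            yield line

    return transform

transform = my_custom_replace([(r'[Se]', '#'), (r'\s+', '_')])
print(list(transform(["Sentencepiece encode aS pieces", "exampleS to try!"])))
# → ['##nt#nc#pi#c#_#ncod#_a#_pi#c#s', '#xampl##_to_try!']
```

Note that rules are applied in list order, so a later pattern can match text produced by an earlier replacement.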
simple_space_split(iterator)
Function: splits text on whitespace characters, including spaces, tabs, and newlines.
Parameters:
- iterator: an iterator over the strings to split.
Example:
from torchtext.data.functional import simple_space_split
list_a = ["Sentencepiece\t\t\tencode as\n pieces", "example to try!"]
print(list(simple_space_split(list_a)))
Example output:
[['Sentencepiece', 'encode', 'as', 'pieces'], ['example', 'to', 'try!']]
Note: this is a plain whitespace split, so punctuation stays attached to the adjacent word (e.g. "try!" above).
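The same behavior can be reproduced with Python's built-in str.split(), which with no arguments also splits on any run of whitespace. The sketch below assumes this equivalence; it is not the torchtext source:

```python
def my_simple_space_split(iterator):
    """Sketch of simple_space_split: str.split() with no arguments
    splits on any run of whitespace (spaces, tabs, newlines)."""
    for line in iterator:
        yield line.split()

list_a = ["Sentencepiece\t\t\tencode as\n pieces", "example to try!"]
print(list(my_simple_space_split(list_a)))
# → [['Sentencepiece', 'encode', 'as', 'pieces'], ['example', 'to', 'try!']]
```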
numericalize_tokens_from_iterator(vocab, iterator, removed_tokens=None)
Function: maps tokenized text to an iterator of token-index iterators, using a vocabulary.
Parameters:
- vocab: a dictionary mapping tokens to indices.
- iterator: an iterator over the tokenized text to convert to indices.
- removed_tokens: a list of tokens to ignore; defaults to None. If not None, tokens in this list are dropped during numericalization.
Example:
from torchtext.data.functional import simple_space_split, numericalize_tokens_from_iterator
vocab = {
    "Sentencepiece": 0,
    "encode": 1,
    "as": 2,
    "pieces": 3
}
sentences = [
    "Sentencepiece encode as as as",
    "pieces pieces encode"
]
ids_iter = numericalize_tokens_from_iterator(
    vocab=vocab,
    iterator=simple_space_split(sentences),
)
for ids in ids_iter:
    print([num for num in ids])
ids_iter = numericalize_tokens_from_iterator(
    vocab=vocab,
    iterator=simple_space_split(sentences),
    removed_tokens=["encode"]
)
for ids in ids_iter:
    print([num for num in ids])
Example output:
[0, 1, 2, 2, 2]
[3, 3, 1]
[0, 2, 2, 2]
[3, 3]
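The semantics shown above can be sketched with plain dict lookups. my_numericalize below is a hypothetical stand-in, not the torchtext implementation, and it assumes every remaining token appears in vocab (a missing key would raise KeyError with a plain dict):

```python
def my_numericalize(vocab, iterator, removed_tokens=None):
    """Sketch of numericalize_tokens_from_iterator: map each token
    list to an iterator of indices, dropping removed_tokens."""
    removed = set(removed_tokens or [])
    for tokens in iterator:
        yield (vocab[t] for t in tokens if t not in removed)

vocab = {"Sentencepiece": 0, "encode": 1, "as": 2, "pieces": 3}
sentences = [s.split() for s in
             ["Sentencepiece encode as as as", "pieces pieces encode"]]
for ids in my_numericalize(vocab, sentences, removed_tokens=["encode"]):
    print(list(ids))
# → [0, 2, 2, 2] then [3, 3]
```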