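A Summary of Python Regular Expressions

Preface

Part of this article comes from the official documentation of the re module, and part reflects my own understanding. Please bear with me if anything is inaccurate.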

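1. Matching Process

The regular expression engine first compiles the pattern into a Pattern object, which is then matched against the target text to produce the result.
Note: the types must agree when matching. A str string can only be matched with a str pattern, and a bytes string can only be matched with a bytes pattern.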
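
As a minimal sketch of the flow described above (the sample text and variable names are made up for illustration), the pattern can be compiled explicitly or implicitly, and mixing str and bytes raises a TypeError:

```python
import re

# Compile the pattern once, then reuse the resulting Pattern object.
pattern = re.compile(r"\d+")
match = pattern.search("order 66 shipped")
print(match.group())  # 66

# The module-level function compiles the pattern behind the scenes.
print(re.search(r"\d+", "order 66 shipped").group())  # 66

# A bytes pattern must be matched against bytes.
bytes_pattern = re.compile(rb"\d+")
bytes_match = bytes_pattern.search(b"order 66 shipped")
print(bytes_match.group())  # b'66'

# Mixing a str pattern with bytes input raises TypeError.
try:
    pattern.search(b"order 66 shipped")
    mixed_types_ok = True
except TypeError:
    mixed_types_ok = False
print(mixed_types_ok)  # False
```

Compiling explicitly is mostly a convenience when the same pattern is reused many times.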

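2. Basic Concepts

2.1 Matching Characters

Matching characters are either literal or special. A literal matching character is simply a character or a string that matches itself. The special matching characters are:

Metacharacter   Meaning
.               Matches any character except a newline
^               Matches the start of the string
$               Matches the end of the string, or just before the newline at the end of the string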
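
The three metacharacters above can be tried out with a short sketch like the following (the sample strings are arbitrary and only serve as illustration):

```python
import re

# "." matches any single character except a newline.
dot_hits = re.findall(r"a.c", "abc a\nc a-c")
print(dot_hits)  # ['abc', 'a-c']

# "^" anchors the match at the start of the string.
starts_with_cat = re.search(r"^cat", "cat and dog") is not None
starts_with_dog = re.search(r"^dog", "cat and dog") is not None
print(starts_with_cat, starts_with_dog)  # True False

# "$" matches at the very end, or just before a newline that ends the string.
ends_plain = re.search(r"dog$", "cat and dog") is not None
ends_newline = re.search(r"dog$", "cat and dog\n") is not None
print(ends_plain, ends_newline)  # True True
```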

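There are also special matching characters that begin with \, known as escape sequences. The common ones are:

Escape sequence   Meaning
\number           Backreference: matches the text matched by the group with the same number
\A                Matches only at the start of the string (A is the start of the alphabet)
\Z                Matches only at the end of the string (Z is the end of the alphabet)
\b                Matches the empty string at the start or end of a word (b stands for 'boundary')
\B                Matches the empty string anywhere except at the start or end of a word
\d                Matches any decimal digit (d stands for 'decimal')
\D                Matches any non-digit character
\s                Matches any whitespace character (s stands for 'space')
\S                Matches any non-whitespace character (in ASCII mode, anything other than \t, \r, \n, \f, \v, and space)
\w                Matches any letter, digit, or underscore
\W                Matches any character that is not a letter, digit, or underscore
\character        Matches the metacharacter itself, literally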
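
A few of these escape sequences in action, as a small sketch (the sample sentence is invented for illustration). Note that \w also counts digits, so '555' qualifies as a three-character word here:

```python
import re

text = "Call 555-1234 now, or visit room 42."

digits = re.findall(r"\d+", text)             # runs of decimal digits
short_words = re.findall(r"\b\w{3}\b", text)  # whole words of exactly 3 characters
fields = re.split(r"\s+", "a b\tc\nd")        # split on any whitespace
print(digits)       # ['555', '1234', '42']
print(short_words)  # ['555', 'now']
print(fields)       # ['a', 'b', 'c', 'd']

# \1 matches again whatever group 1 matched: here, a repeated word.
repeated = re.search(r"\b(\w+) \1\b", "it is is fine").group()
print(repeated)  # is is

# \Z matches only at the very end, so a trailing newline prevents the match.
strict_end = re.search(r"end\Z", "end\n")
print(strict_end)  # None

# A backslash before a metacharacter matches that character literally.
dots = re.findall(r"\.", "a.b.c")
print(dots)  # ['.', '.']
```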

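2.2 Quantifiers

The quantifiers in regular expressions are:

Quantifier   Meaning
*            Matches the preceding subpattern zero or more times
+            Matches the preceding subpattern one or more times
?            Matches the preceding subpattern zero times or once
{m}          Matches the preceding subpattern exactly m times
{m,n}        Matches the preceding subpattern from m to n times (n may be omitted, meaning no upper bound)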
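
To compare the quantifiers side by side, the sketch below checks which of a few sample strings each pattern matches in full. The helper accepted and the sample list are hypothetical names introduced only for this illustration:

```python
import re

samples = ["ac", "abc", "abbc", "abbbc"]

def accepted(pattern):
    """Return the samples that the pattern matches in full."""
    return [s for s in samples if re.fullmatch(pattern, s)]

print(accepted(r"ab*c"))      # ['ac', 'abc', 'abbc', 'abbbc']
print(accepted(r"ab+c"))      # ['abc', 'abbc', 'abbbc']
print(accepted(r"ab?c"))      # ['ac', 'abc']
print(accepted(r"ab{2}c"))    # ['abbc']
print(accepted(r"ab{1,2}c"))  # ['abc', 'abbc']
print(accepted(r"ab{2,}c"))   # ['abbc', 'abbbc']
```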

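2.3 Relational Characters

Relational characters express the logical relationships within a pattern.

Relational character   Meaning
[]                     Matches any one of the characters listed inside
(...)                  Treats the enclosed expression as a single unit (a capturing group)
(?aiLmsux)             Sets the corresponding flags for the pattern
(?aiLmsux-imsx:...)    Sets or removes the given flags for the enclosed subpattern only (Python 3.6+); only i, m, s, and x can be removed
(?:...)                Like (...), but the group is not captured
(?P<name>...)          The substring matched by the group can be accessed by name
(?P=name)              Matches the text matched earlier by the group called name
(?<=...)               Positive lookbehind: matches at a position only if the text immediately before it matches .... Its counterpart (?=...) is a positive lookahead, which checks the text that follows. For example, (?<=[a-z])\d+(?=[a-z]) matches digits that are preceded and followed by a lowercase letter, and only the digits are part of the match. For details, see the material on assertions in regular expressions.
-                      Inside [], denotes a range. For example, [a-z] matches any letter from a to z.
^                      At the start of [], denotes the complement of the set.
|                      Matches either the subpattern before it or the one after it, tried from left to right.
?                      After a quantifier, makes that quantifier non-greedy.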
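
The constructs in this table can be sketched as follows. All sample strings and group names are made up for illustration, and the scoped inline flag in the last example assumes Python 3.6 or later:

```python
import re

# Character sets, ranges, and negated sets.
vowels = re.findall(r"[aeiou]", "regex")
ranges = re.findall(r"[a-c]+", "abcdcba")
non_digits = re.findall(r"[^0-9]+", "ab12cd")
print(vowels, ranges, non_digits)  # ['e', 'e'] ['abc', 'cba'] ['ab', 'cd']

# Non-capturing group plus a named capturing group.
m = re.search(r"(?:Mr|Ms)\. (?P<last>\w+)", "Dear Ms. Lovelace,")
print(m.group("last"), m.groups())  # Lovelace ('Lovelace',)

# (?P=name) matches the same text that the named group matched earlier.
doubled = re.search(r"(?P<ch>\w)(?P=ch)", "balloon").group()
print(doubled)  # ll

# Lookbehind and lookahead: only the digits are part of the match.
inner_digits = re.findall(r"(?<=[a-z])\d+(?=[a-z])", "a12b 34 c56")
print(inner_digits)  # ['12']

# Alternation is tried from left to right at each position.
pets = re.findall(r"cat|dog", "dog cat bird")
print(pets)  # ['dog', 'cat']

# Greedy versus non-greedy quantifiers.
greedy = re.findall(r"<.+>", "<a><b>")
lazy = re.findall(r"<.+?>", "<a><b>")
print(greedy, lazy)  # ['<a><b>'] ['<a>', '<b>']

# Inline flags: global (?i) versus a flag scoped to one subpattern.
global_flag = re.findall(r"(?i)python", "Python PYTHON")
scoped_flag = re.findall(r"(?i:p)ython", "Python python PYTHON")
print(global_flag, scoped_flag)  # ['Python', 'PYTHON'] ['Python', 'python']
```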

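3. Common Functions

Function    Meaning
match       Matches the pattern against the start of the string
fullmatch   Matches the pattern against the entire string
search      Scans the string for the first location where the pattern matches
sub         Replaces occurrences of the pattern found in the string
subn        Same as sub, but returns a tuple of the new string and the number of replacements made
split       Splits the string by occurrences of the pattern
findall     Finds all occurrences of the pattern in the string
finditer    Returns an iterator that yields a Match object for each match
compile     Compiles a pattern into a Pattern object
purge       Clears the regular expression cache
escape      Escapes the special characters in a string with backslashes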
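
A short tour of these functions, sketched on an invented string of two dates (the variable names text and date are assumptions of this example, not part of the re module):

```python
import re

text = "2023-01-15, 2024-12-31"
date = r"(\d{4})-(\d{2})-(\d{2})"

# match only looks at the start; search scans the whole string.
at_start = re.match(date, text).group()
not_at_start = re.match(r"2024", text)
found = re.search(r"2024", text).group()
print(at_start, not_at_start, found)  # 2023-01-15 None 2024

# fullmatch succeeds only if the entire string matches.
whole = re.fullmatch(date, text)
single = re.fullmatch(date, "2023-01-15") is not None
print(whole, single)  # None True

# sub with group references, sub with a callable, and subn with a count.
swapped = re.sub(date, r"\3/\2/\1", text)
bumped = re.sub(r"\d{4}", lambda m: str(int(m.group()) + 1), text)
slashed = re.subn(r"-", "/", text)
print(swapped)  # 15/01/2023, 31/12/2024
print(bumped)   # 2024-01-15, 2025-12-31
print(slashed)  # ('2023/01/15, 2024/12/31', 4)

# split, findall (tuples because of the groups), and finditer.
parts = re.split(r",\s*", text)
triples = re.findall(date, text)
years = [m.group(1) for m in re.finditer(date, text)]
print(parts)    # ['2023-01-15', '2024-12-31']
print(triples)  # [('2023', '01', '15'), ('2024', '12', '31')]
print(years)    # ['2023', '2024']
```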

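The returned Match object has the following methods:

Method                 Meaning
group([group1, ...])   Returns the substring matched by the whole pattern (group 0, the default) or by the given group(s)
groups()               Returns a tuple containing the substrings matched by all groups
start([group])         Returns the start index, within the whole string, of the substring matched by the group
end([group])           Returns the end index (one past the last character), within the whole string, of the substring matched by the group
span([group])          Returns (start([group]), end([group]))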
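
These methods can be illustrated with a small sketch (the e-mail-like sample string is invented, and the pattern is deliberately simplistic):

```python
import re

m = re.search(r"(\w+)@(\w+)\.com", "mail alice@example.com today")

print(m.group())           # alice@example.com
print(m.group(1))          # alice
print(m.group(1, 2))       # ('alice', 'example')
print(m.groups())          # ('alice', 'example')
print(m.start(), m.end())  # 5 22
print(m.span(2))           # (11, 18)
```

The end index is exclusive, so the slice text[m.start():m.end()] reproduces m.group().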

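Some functions take an optional flags argument:

Flag           Meaning
A/ASCII        For str patterns, makes \w, \W, \b, \B, \d, \D, \s, and \S match ASCII characters only rather than Unicode characters. (For bytes patterns this is already the default and need not be specified.)
I/IGNORECASE   Performs case-insensitive matching
L/LOCALE       Makes \w, \W, \b, and \B depend on the current locale (bytes patterns only)
M/MULTILINE    Makes ^ match at the start of the string and also at the start of each line after a newline, and makes $ match at the end of the string and also at the end of each line before a newline
S/DOTALL       Makes . match any character, including a newline
X/VERBOSE      Ignores whitespace and comments, so that the regular expression can be laid out more readably
U/UNICODE      Kept for compatibility only: it is the default for str patterns and is not allowed for bytes patterns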
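
The effect of the most common flags can be sketched as follows (the sample strings are arbitrary; the non-ASCII character is written as an escape so the snippet stays ASCII-only):

```python
import re

# IGNORECASE: case-insensitive matching.
any_case = re.findall(r"python", "Python PYTHON", flags=re.IGNORECASE)
print(any_case)  # ['Python', 'PYTHON']

# MULTILINE: ^ and $ also match at line boundaries.
lines = "first\nsecond\nthird"
without_m = re.findall(r"^\w+$", lines)
with_m = re.findall(r"^\w+$", lines, flags=re.M)
print(without_m, with_m)  # [] ['first', 'second', 'third']

# DOTALL: "." also matches a newline.
dot_default = re.search(r"a.b", "a\nb")
dot_all = re.search(r"a.b", "a\nb", flags=re.S) is not None
print(dot_default, dot_all)  # None True

# ASCII: \w no longer matches non-ASCII letters in a str pattern.
word = "caf\u00e9"  # "cafe" with an acute accent on the final letter
unicode_words = re.findall(r"\w+", word)
ascii_words = re.findall(r"\w+", word, flags=re.ASCII)
print(unicode_words == [word], ascii_words)  # True ['caf']

# VERBOSE: whitespace and comments inside the pattern are ignored.
number = re.compile(r"""
    \d+       # integer part
    (\.\d+)?  # optional fractional part
""", re.VERBOSE)
is_number = number.fullmatch("3.14") is not None
print(is_number)  # True
```

Several flags can be combined with the | operator, for example re.I | re.M.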

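Below are the definitions of these functions, as they appear in the source of the re module (the exact code differs between Python versions):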

def match(pattern, string, flags=0):
    """Try to apply the pattern at the start of the string, returning
    a Match object, or None if no match was found."""
    return _compile(pattern, flags).match(string)

def fullmatch(pattern, string, flags=0):
    """Try to apply the pattern to all of the string, returning
    a Match object, or None if no match was found."""
    return _compile(pattern, flags).fullmatch(string)

def search(pattern, string, flags=0):
    """Scan through string looking for a match to the pattern, returning
    a Match object, or None if no match was found."""
    return _compile(pattern, flags).search(string)

def sub(pattern, repl, string, count=0, flags=0):
    """Return the string obtained by replacing the leftmost
    non-overlapping occurrences of the pattern in string by the
    replacement repl.  repl can be either a string or a callable;
    if a string, backslash escapes in it are processed.  If it is
    a callable, it's passed the Match object and must return
    a replacement string to be used."""
    return _compile(pattern, flags).sub(repl, string, count)

def subn(pattern, repl, string, count=0, flags=0):
    """Return a 2-tuple containing (new_string, number).
    new_string is the string obtained by replacing the leftmost
    non-overlapping occurrences of the pattern in the source
    string by the replacement repl.  number is the number of
    substitutions that were made. repl can be either a string or a
    callable; if a string, backslash escapes in it are processed.
    If it is a callable, it's passed the Match object and must
    return a replacement string to be used."""
    return _compile(pattern, flags).subn(repl, string, count)

def split(pattern, string, maxsplit=0, flags=0):
    """Split the source string by the occurrences of the pattern,
    returning a list containing the resulting substrings.  If
    capturing parentheses are used in pattern, then the text of all
    groups in the pattern are also returned as part of the resulting
    list.  If maxsplit is nonzero, at most maxsplit splits occur,
    and the remainder of the string is returned as the final element
    of the list."""
    return _compile(pattern, flags).split(string, maxsplit)

def findall(pattern, string, flags=0):
    """Return a list of all non-overlapping matches in the string.
    If one or more capturing groups are present in the pattern, return
    a list of groups; this will be a list of tuples if the pattern
    has more than one group.
    Empty matches are included in the result."""
    return _compile(pattern, flags).findall(string)

def finditer(pattern, string, flags=0):
    """Return an iterator over all non-overlapping matches in the
    string.  For each match, the iterator returns a Match object.
    Empty matches are included in the result."""
    return _compile(pattern, flags).finditer(string)

def compile(pattern, flags=0):
    "Compile a regular expression pattern, returning a Pattern object."
    return _compile(pattern, flags)

def purge():
    "Clear the regular expression caches"
    _cache.clear()
    _compile_repl.cache_clear()

def template(pattern, flags=0):
    "Compile a template pattern, returning a Pattern object"
    return _compile(pattern, flags|T)

# SPECIAL_CHARS
# closing ')', '}' and ']'
# '-' (a range in character set)
# '&', '~', (extended character set operations)
# '#' (comment) and WHITESPACE (ignored) in verbose mode
_special_chars_map = {i: '\\' + chr(i) for i in b'()[]{}?*+-|^$\\.&~# \t\n\r\v\f'}

def escape(pattern):
    """
    Escape special characters in a string.
    """
    if isinstance(pattern, str):
        return pattern.translate(_special_chars_map)
    else:
        pattern = str(pattern, 'latin1')
        return pattern.translate(_special_chars_map).encode('latin1')

Pattern = type(sre_compile.compile('', 0))
Match = type(sre_compile.compile('', 0).match(''))
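
As a brief sketch of when the escape function listed above is useful: it lets arbitrary text, such as user input, be matched literally. The variable user_input is a made-up example, and the escaped form is not printed because the exact set of escaped characters has changed between Python versions:

```python
import re

user_input = "1+1=2 (really?)"
escaped = re.escape(user_input)

# Used as a pattern, the escaped string matches the original text literally.
literal_ok = re.fullmatch(escaped, user_input) is not None
print(literal_ok)  # True

# Without escaping, "." would match any character; with escaping it does not.
unescaped_hit = re.search(r"a.b", "axb") is not None
escaped_hit = re.search(re.escape("a.b"), "axb") is not None
print(unescaped_hit, escaped_hit)  # True False

# purge() simply clears the module's cache of compiled patterns.
re.purge()
```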

That is all for now. For a more detailed description of regular expressions, see this link.


Reposted from blog.csdn.net/qq_43549984/article/details/85047661