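Python's re module

A regular expression is a single pattern string that describes, and can be matched against, a whole set of strings that follow some syntactic rule.

Commonly used functions in re: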

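re.match(pattern, string)
    pattern: the regular expression you write
    string: the string to match against
  Matching starts at the first character of the string and proceeds left to right. On success it returns a Match object, whose methods such as group() and groups() return the matched parts of the string. Otherwise it returns None (note: not the empty string "").
re.search(pattern, string)
Same parameters as above. Scans the string for the first place where pattern matches and stops there. On success it returns a Match object; otherwise it returns None.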

In[1]: import re
In[2]: s = '123abcnvios321avnie'
In[3]: result = re.search(r'\d+[a-z]', s)
In[4]: result.group()
Out[4]: '123a'
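
The difference between re.match and re.search is easiest to see on a string whose match does not sit at the start. A minimal sketch, where the sample string is made up for illustration:

```python
import re

s = 'abc123def'

# re.match only tries at position 0, so a digits-only pattern fails here.
m = re.match(r'\d+', s)
print(m)            # None

# re.search scans forward and stops at the first match it finds.
r = re.search(r'\d+', s)
print(r.group())    # 123
print(r.span())     # (3, 6)
```

Because a failed match returns None, calling group() on the result without checking first raises AttributeError.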

re.findall(pattern, string)
Same parameters as above. Finds every non-overlapping match of pattern in the string and returns them as a list.

In[1]: s = '123abcnvios321avnie'
In[2]: result = re.findall(r'\d+[a-z]', s)
In[3]: result
Out[3]: ['123a', '321a']

re.sub(pattern, repl, string)
repl: either a string that replaces each match, or a function that receives each match, processes it, and returns the replacement text. re.sub returns a string.

# repl is a string
In[1]: s = 'name=xxx score=66'
# change the score from 66 to 88
In[2]: result = re.sub(r'\d+', '88', s)
In[3]: result
Out[3]: 'name=xxx score=88'

# repl is a function
In[1]: s = 'name=xxx math_score=66 english_score=77'
In[2]: def replace(score):
           print(score.group())
           print(type(score))
           return '0'
# The argument passed in as score is a Match object, and the function runs once for every match.
# (Python 3.6 and earlier print <class '_sre.SRE_Match'> instead of <class 're.Match'>.)
In[3]: result = re.sub(r'\d+', replace, s)
66
<class 're.Match'>
77
<class 're.Match'>
In[4]: result
Out[4]: 'name=xxx math_score=0 english_score=0'

# add 10 points to every score
In[5]: def replace(score):
           return str(int(score.group()) + 10)
In[6]: result = re.sub(r'\d+', replace, s)
In[7]: result
Out[7]: 'name=xxx math_score=76 english_score=87'
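
For a short replacement function, a lambda can stand in for the named function. The sketch below repeats the "add 10" example that way and also uses the optional count argument of re.sub, which the text above does not cover, to limit how many matches are replaced:

```python
import re

s = 'name=xxx math_score=66 english_score=77'

# Same logic as the replace function, written inline as a lambda.
bumped = re.sub(r'\d+', lambda m: str(int(m.group()) + 10), s)
print(bumped)       # name=xxx math_score=76 english_score=87

# count=1 replaces only the first match and leaves the rest alone.
first_only = re.sub(r'\d+', '0', s, count=1)
print(first_only)   # name=xxx math_score=0 english_score=77
```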

re.split(pattern, string)
Splits the string at every match of pattern and returns the pieces as a list.

In[1]: s = 'aaa-bbb:ccc,ddd'
# split on - or : or ,
In[2]: result = re.split(r'-|:|,', s)
In[3]: result
Out[3]: ['aaa', 'bbb', 'ccc', 'ddd']
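
When every separator is a single character, the alternation can also be written as a character set. A small sketch of that equivalent form:

```python
import re

s = 'aaa-bbb:ccc,ddd'

# [-:,] matches any one of the three separator characters.
parts = re.split(r'[-:,]', s)
print(parts)    # ['aaa', 'bbb', 'ccc', 'ddd']
```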

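Matching characters

Pattern	Meaning
.	Matches any single character (except \n)
[ ]	Matches any one of the characters listed inside [ ]
\d	Matches a digit, i.e. 0-9
\D	Matches a non-digit
\s	Matches whitespace, e.g. space, tab, newline
\S	Matches non-whitespace
\w	Matches a word character, i.e. a-z, A-Z, 0-9, _
\W	Matches a non-word character

^ as the first character inside [] negates the set, i.e. it matches any character not listed in [].
\d == [0-9]
\D == [^0-9]
\w == [0-9a-zA-Z_]
Note: for str patterns in Python 3 these equivalences hold exactly only with the re.ASCII flag; by default \d and \w also match Unicode digits and letters.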
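
The character classes above can be tried out with re.findall. This is a small sketch on a made-up sample string; the last two lines show how the re.ASCII flag changes what \d accepts:

```python
import re

s = 'Room 42, floor_3!'

print(re.findall(r'\d', s))             # ['4', '2', '3']
print(re.findall(r'[^0-9]', 'a1b2'))    # ['a', 'b']
print(re.findall(r'\w+', s))            # ['Room', '42', 'floor_3']
print(re.findall(r'\s', s))             # [' ', ' ']

# Full-width digits are Unicode digits but are not in [0-9].
fullwidth = '１２３'
unicode_ok = re.match(r'\d+', fullwidth) is not None
ascii_ok = re.match(r'\d+', fullwidth, re.ASCII) is not None
print(unicode_ok, ascii_ok)             # True False
```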

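Matching quantities

Pattern	Meaning
*	Matches the preceding character zero or more times, i.e. it is optional and may repeat
+	Matches the preceding character one or more times, i.e. at least once
?	Matches the preceding character once or not at all
{m}	Matches the preceding character exactly m times
{m,}	Matches the preceding character at least m times
{m,n}	Matches the preceding character from m to n times

{1,} == +
{0,} == *
{0,1} == ?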
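
The three equivalences can be checked mechanically. The sketch below compares each short form with its brace form on a handful of made-up inputs, using re.fullmatch so the whole string has to match:

```python
import re

samples = ['', 'a', 'aa', 'aaa', 'b']
pairs = [(r'a+', r'a{1,}'), (r'a*', r'a{0,}'), (r'a?', r'a{0,1}')]

# For every sample, the short form and the brace form must agree.
agree = all(
    (re.fullmatch(short, text) is None) == (re.fullmatch(braced, text) is None)
    for short, braced in pairs
    for text in samples
)
print(agree)                                    # True

print(bool(re.fullmatch(r'a{2}', 'aa')))        # True
print(bool(re.fullmatch(r'a{2,3}', 'aaaa')))    # False
```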

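Matching boundaries

Pattern	Meaning
^	Matches the start of the string
$	Matches the end of the string
\b	Matches a word boundary
\B	Matches a position that is not a word boundary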

String boundaries:

In[1]: re.match(r'.+w$', 'windows') # no match (returns None): the character w is not at the end of the string.
In[2]: result = re.match(r'.+s$', 'windows')
In[3]: result.group()
Out[3]: 'windows'

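Word boundaries:
For a character to sit at the end of a word, it must either be followed by a non-word character such as a space (end of the word) or be the last character of the string (end of the string).

In[1]: result = re.match(r'^\w+w\b', 'windows') # no match: w is neither at the end of a word nor at the end of the string.
# w at the end of a word
In[2]: result = re.match(r'^\w+ow\b', 'window sssss')
In[3]: result.group()
Out[3]: 'window'
# \B is the opposite: the position after w must not be a word boundary
In[4]: result = re.match(r'^\w+ow\B', 'windows')
In[5]: result.group()
Out[5]: 'window'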
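
A common use of \b is finding a word only where it stands alone. A minimal sketch on a made-up sentence:

```python
import re

text = 'cat concat cats cat.'

# \bcat\b: "cat" with a word boundary on both sides.
whole_words = re.findall(r'\bcat\b', text)
print(whole_words)      # ['cat', 'cat']

# cat\B: "cat" followed by another word character, as in "cats".
inside_words = re.findall(r'cat\B', text)
print(inside_words)     # ['cat']
```

The first call skips "concat" and "cats"; the second finds only the "cat" inside "cats".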

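Grouping

Pattern	Meaning
|	Matches either the expression on its left or the one on its right
(ab)	Treats the characters in the parentheses as a group
\num	Refers back to the string matched by group number num
(?P<name>)	Gives a group a name
(?P=name)	Refers back to the string matched by the group named name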

Groups:

In[1]: s = '<html><p>windows</p></html>'
# each pair of parentheses forms a group
In[2]: result = re.match(r'<(.+)><(.+)>(.+)</.+></.+>', s)
# show the groups with groups()
In[3]: result.groups()
Out[3]: ('html', 'p', 'windows')
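
Besides groups(), individual groups can be read by number with group(n), where group(0) is the whole match. A short sketch reusing the same string and pattern:

```python
import re

s = '<html><p>windows</p></html>'
result = re.match(r'<(.+)><(.+)>(.+)</.+></.+>', s)

print(result.group(0))      # <html><p>windows</p></html>
print(result.group(1))      # html
print(result.group(3))      # windows
print(result.group(1, 2))   # ('html', 'p')
print(result.span(3))       # (9, 16)
```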

Backreferences:

In[1]: s = '<html><p>windows</html></p>'
# Even though the tags in s are mismatched, this pattern still matches, which is a problem in real use.
In[2]: result = re.match(r'<.+><.+>(.+)</.+></.+>', s)
In[3]: result.group()
Out[3]: '<html><p>windows</html></p>'

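# Fix this with backreferences
In[1]: s = '<html><p>windows</html></p>'
# With backreferences there is no match (result is None) because the tags are mismatched: \2 refers to p, but s has html at that position; \1 refers to html, but s has p there.
In[2]: result = re.match(r'<(.+)><(.+)>(.+)</\2></\1>', s)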

# When the tags in s are properly paired, the match succeeds
In[1]: s = '<html><p>windows</p></html>'
In[2]: result = re.match(r'<(.+)><(.+)>(.+)</\2></\1>', s)
In[3]: result.group()
Out[3]: '<html><p>windows</p></html>'
In[4]: result.groups()
Out[4]: ('html', 'p', 'windows')

Named groups:
These work like backreferences. A matched group is given a name, and the pattern then refers to it by that name with (?P=name) instead of by number with \x.

In[1]: s = '<html><p>windows</p></html>'
In[2]: result = re.match(r'<(?P<name1>.+)><(?P<name2>.+)>(.+)</(?P=name2)></(?P=name1)>', s)
In[3]: result.group()
Out[3]: '<html><p>windows</p></html>'
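
Named groups can also be read back by name after matching. The sketch below reuses the pattern above and shows group('name') and groupdict(), plus the mismatched-tag string for comparison:

```python
import re

pattern = r'<(?P<name1>.+)><(?P<name2>.+)>(.+)</(?P=name2)></(?P=name1)>'

good = re.match(pattern, '<html><p>windows</p></html>')
print(good.group('name1'))  # html
print(good.groupdict())     # {'name1': 'html', 'name2': 'p'}
print(good.groups())        # ('html', 'p', 'windows')

# Mismatched closing tags: the named backreferences reject the string.
bad = re.match(pattern, '<html><p>windows</html></p>')
print(bad)                  # None
```

groupdict() contains only the named groups, while groups() still returns every group in order.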

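Greedy and non-greedy:
Quantifiers in Python regular expressions are greedy by default, that is, they always try to match as many characters as possible. Non-greedy quantifiers do the opposite and try to match as few characters as possible.
For example, when matching a run of digits, some of the digits can end up missing from the result.

In[1]: s = 'this is a telephone number: 0311-12345678'
# we want to extract 0311-12345678
In[2]: result = re.match(r'^.+(\d+-\d+)$', s)
In[3]: result.group(1)
Out[3]: '1-12345678'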

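In the example above, matching is greedy by default, so ".+" consumes as many characters as it can. It therefore also swallows the first three digits "031", and because "\d+" needs only one character to match, it is left with just the digit "1". The pattern can be changed in either of these ways:

Add a count after \d: re.match(r'^.+(\d{4}-\d+)$', s)
Switch to non-greedy mode: re.match(r'^.+?(\d+-\d+)$', s)
To make a quantifier non-greedy, put the non-greedy operator "?" after "*", "+", "?" or "{m,n}"; the regex then matches as little as possible.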
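
The three variants can be run side by side to compare what the group captures. This sketch uses the same string as above; the last two lines add a made-up tag example to show the same effect with findall:

```python
import re

s = 'this is a telephone number: 0311-12345678'

greedy = re.match(r'^.+(\d+-\d+)$', s).group(1)
counted = re.match(r'^.+(\d{4}-\d+)$', s).group(1)
lazy = re.match(r'^.+?(\d+-\d+)$', s).group(1)

print(greedy)   # 1-12345678
print(counted)  # 0311-12345678
print(lazy)     # 0311-12345678

# Greedy .+ runs to the last '>', non-greedy .+? stops at the first one.
print(re.findall(r'<.+>', '<a><b>'))    # ['<a><b>']
print(re.findall(r'<.+?>', '<a><b>'))   # ['<a>', '<b>']
```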

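Working out a pattern for mobile phone numbers:
1. Total length of 11 digits: pattern = '\d\d\d\d\d\d\d\d\d\d\d'
(This still has many problems, e.g. the first digit may not be 1, or the string s may be too long.)
2. The first digit is 1: pattern = '1\d\d\d\d\d\d\d\d\d\d'
3. The second digit is 3, 4, 5, 7 or 8: pattern = '1[34578]\d\d\d\d\d\d\d\d\d'
4. The remaining nine digits: pattern = '1[34578]\d{9}'
5. Start and end anchors: pattern = '^1[34578]\d{9}$'
Note: this analysis only illustrates the process and is not a real-world phone number pattern. No constraints are placed on the other digits.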
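
The final pattern from the steps above can be wrapped in a small checker. This is a sketch only; MOBILE_RE and is_mobile are hypothetical names chosen for illustration, and the sample numbers are made up:

```python
import re

# Final pattern from step 5, compiled once for reuse.
MOBILE_RE = re.compile(r'^1[34578]\d{9}$')

def is_mobile(number):
    """Return True if number fits the simplified 11-digit pattern."""
    return MOBILE_RE.match(number) is not None

print(is_mobile('13812345678'))     # True
print(is_mobile('23812345678'))     # False, first digit is not 1
print(is_mobile('12812345678'))     # False, second digit not in [34578]
print(is_mobile('138123456789'))    # False, 12 digits
```

One caveat: $ also matches just before a trailing newline, so a stricter check could use re.fullmatch instead.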

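Matching any integer from 0 to 100
Analysis:
One digit: 0 1 2 …
Two digits: 23 81 39 …
Three digits: 100
When a number has two or more digits its first digit cannot be 0, so 0 is a special case.
1. Match the first digit: pattern = '[1-9]'
2. Match the second digit: pattern = '[1-9]\d?', which now matches any one- or two-digit number except 0.
3. Special cases: pattern = '[1-9]\d?|0|100'
4. Add boundaries: pattern = '[1-9]\d?$|0$|100$' (used with re.match, which already anchors the start of the string)
5. A variant with an optional first digit: pattern = '[1-9]?\d?$|0$|100$' (because both parts are optional, this one also matches the empty string)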
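
The pattern with boundaries can be checked exhaustively over a range of integers. A minimal sketch, assuming the pattern '[1-9]\d?$|0$|100$' used with re.match; NUM_RE and in_range are hypothetical names for illustration:

```python
import re

# re.match anchors the start; each alternative carries its own $.
NUM_RE = re.compile(r'[1-9]\d?$|0$|100$')

def in_range(text):
    """Return True if text is an integer from 0 to 100 without leading zeros."""
    return NUM_RE.match(text) is not None

accepted = [n for n in range(-5, 200) if in_range(str(n))]
print(accepted == list(range(0, 101)))  # True

print(in_range('007'))  # False, leading zeros are rejected
print(in_range(''))     # False
```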
