Python regular expressions

测试积点老人 posted on 2018-12-28 14:24:38

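Special characters
[*] \b \B
\b matches at a word boundary: \bthe matches any word that starts with "the", and \bthe\b matches only the standalone word "the".
\B matches where there is no word boundary: \Bthe matches "the" only inside a word, not at its start.
[*] [A-Za-z]\w*: the first character is a letter; any characters after it are letters, digits, or underscores.
[*] \d{3}-\d{3}-\d{4}: matches a US phone number with the area code first, such as 800-555-1212.
[*] \w+@\w+\.com: matches a simple email address in a .com domain.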
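
The patterns in this list can be tried out directly. The following is a minimal sketch; the sample strings are made up for illustration and are not from the original post.

```python
import re

text = 'the theory of bathe'

starts_word = re.findall(r'\bthe\w*', text)   # words that begin with "the"
whole_word = re.findall(r'\bthe\b', text)     # only the standalone word "the"
inside_word = re.findall(r'\Bthe', text)      # "the" that does not start a word

# Letter first, then any number of letters, digits, or underscores
identifier = re.match(r'[A-Za-z]\w*', 'x2_value')
not_identifier = re.match(r'[A-Za-z]\w*', '2x')

phone = re.search(r'\d{3}-\d{3}-\d{4}', 'call 800-555-1212 now')
email = re.search(r'\w+@\w+\.com', 'write to foo@example.com')

print(starts_word)         # ['the', 'theory']
print(whole_word)          # ['the']
print(inside_word)         # ['the']  (the one inside "bathe")
print(identifier.group())  # x2_value
print(not_identifier)      # None
print(phone.group())       # 800-555-1212
print(email.group())       # foo@example.com
```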

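Extension notations
[*] (?:\w+\.)* zero or more strings that each end with a period, such as "google." or "twitter."; the group is not saved.
[*] (?=.com) matches only if the text that follows is ".com" (the unescaped . matches any character; write \. for a literal period).
[*] (?!.net) matches only if the text that follows is not ".net".
[*] (?<=800-) matches only if the text just before is "800-".
[*] (?<!192\.168\.) matches only if the text just before is not "192.168."; useful for filtering out a group of class C IP addresses.
[*] (?(1)y|x) if group 1 matched, match y; otherwise match x.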
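
Each of these notations is easier to follow inside a complete pattern. The sketch below embeds them in small patterns of its own; the surrounding pattern text and the sample data are assumptions chosen for illustration, not part of the original post.

```python
import re

# Positive lookahead: a name counts only when ".com" follows it
com_names = re.findall(r'\w+(?=\.com)', 'google.com twitter.net python.com')

# Negative lookahead: keep host names whose suffix is not ".net"
hosts = ['google.com', 'twitter.net', 'python.org']
not_net = [h for h in hosts if re.search(r'^\w+(?!\.net)\.\w+$', h)]

# Positive lookbehind: a number counts only when "800-" comes right before it
toll_free = re.findall(r'(?<=800-)\d{3}-\d{4}', '800-555-1212 900-555-3434')

# Negative lookbehind: keep addresses whose last two parts do not follow "192.168."
ips = ['192.168.1.10', '10.0.0.5', '192.168.11.10']
external = [ip for ip in ips if re.search(r'(?<!192\.168\.)\b\d+\.\d+$', ip)]

# Conditional: if group 1 (x) matched, the next character must be y, otherwise x
pairs = [s for s in ['xy', 'yx', 'xx', 'yy']
         if re.search(r'^(?:(x)|y)(?(1)y|x)$', s)]

print(com_names)   # ['google', 'python']
print(not_net)     # ['google.com', 'python.org']
print(toll_free)   # ['555-1212']
print(external)    # ['10.0.0.5']
print(pairs)       # ['xy', 'yx']
```

None of the lookahead or lookbehind assertions consume characters, which is why the matched text never includes ".com" or "800-".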

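The re.match and re.search functions
match tries to match from the very start of the string; if the pattern does not match there, it fails and the result is None.
search scans the string from left to right and returns the first position where the pattern matches, wherever that is.
Examples:
>>> import re
1. >>> m = re.match('foo','food')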
>>> m.group()
'foo'
>>>
2. >>> m = re.match('foo','seefood')
>>> m.group()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'
The match fails because 'foo' is not at the start of 'seefood'.
3. >>> m = re.search('foo','seefood')
>>> m.group()
'foo'
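
Since both functions return None on failure, calling group() directly can raise the AttributeError shown in example 2. One way to guard against that is sketched below; the helper name matched_text is hypothetical and exists only for this illustration.

```python
import re

def matched_text(func, pattern, text):
    """Return the matched text, or None when func finds no match."""
    m = func(pattern, text)
    return m.group() if m is not None else None

at_start = matched_text(re.match, 'foo', 'food')       # pattern is at the start
not_start = matched_text(re.match, 'foo', 'seefood')   # match() gives up here
anywhere = matched_text(re.search, 'foo', 'seefood')   # search() keeps scanning

print(at_start)    # foo
print(not_start)   # None
print(anywhere)    # foo
```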
Matching an email address:
>>> import re
>>> patt = r'(\w+@\w+\.\w+)'
>>> m = re.match(patt,'178919347@qq.com')
>>> m.group()
'178919347@qq.com'

Groups:
>>> import re
>>> patt = r'(\w+)@(\w+\.\w+)'
>>> m = re.match(patt,'178919347@qq.com')
>>> m.groups()
('178919347', 'qq.com')
>>> m.group(1)
'178919347'
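
Beyond groups() and group(1), a match object offers a few other ways to read the saved groups. A short sketch using the same pattern and address as above:

```python
import re

m = re.match(r'(\w+)@(\w+\.\w+)', '178919347@qq.com')

whole = m.group(0)           # the entire match, same as m.group()
user, domain = m.groups()    # one entry per capturing group
both = m.group(1, 2)         # several groups at once, as a tuple
domain_span = m.span(2)      # (start, end) indices of group 2 in the string

print(whole)        # 178919347@qq.com
print(user)         # 178919347
print(domain)       # qq.com
print(both)         # ('178919347', 'qq.com')
print(domain_span)  # (10, 16)
```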
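Matching the start and end of a string, and word boundaries
These patterns are normally used with re.search rather than re.match, because match() always starts matching at the beginning of the string.
>>> m = re.search('^The','The end.')  # matches at the start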
>>> if m is not None:m.group()
...
'The'

>>> m = re.search('^The','end. The')  # not at the start, so no match
>>> if m is not None:m.group()
...

>>> m = re.search(r'\bthe','bite the dog')  # "the" sits at a word boundary
>>> if m is not None:m.group()
...
'the'

>>> m = re.search(r'\bthe','bitethe dog')  # no boundary before "the", so no match
>>> if m is not None:m.group()
...

>>> m = re.search(r'\Bthe','bitethe dog')  # \B matches because there is no boundary
>>> if m is not None:m.group()
...
'the'
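
The examples above cover the start anchor and word boundaries but not the end of the string. A minimal sketch of the $ anchor on the same sample sentence (the patterns are illustrative additions):

```python
import re

at_end = re.search(r'end\.$', 'The end.')           # "end." is the last thing in the string
not_at_end = re.search(r'The$', 'The end.')         # "The" is not at the end
whole_string = re.search(r'^The end\.$', 'The end.')  # both anchors together

print(at_end.group())        # end.
print(not_at_end)            # None
print(whole_string.group())  # The end.
```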
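Finding every occurrence with findall()
findall() returns all non-overlapping occurrences of a regular expression pattern in a string. It is similar to search(), but it always returns a list; if nothing matches, the list is empty.
>>> re.findall(r'car','carry the barcardi to the car')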
['car', 'car', 'car']
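
findall() returns the matched text but not where it was found. When the positions matter, re.finditer() yields match objects instead. A small sketch on the same sentence; the 'bus' pattern is added only to show the empty-list case.

```python
import re

text = 'carry the barcardi to the car'

matches = re.findall(r'car', text)                      # the matched strings
spans = [m.span() for m in re.finditer(r'car', text)]   # (start, end) of each match
nothing = re.findall(r'bus', text)                      # no match gives an empty list

print(matches)  # ['car', 'car', 'car']
print(spans)    # [(0, 3), (13, 16), (26, 29)]
print(nothing)  # []
```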
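Extension notations
The (?iLmsux) family of flags
i. (?i) ignores case
>>> re.findall(r'(?i)the','The biggest one is the lion!')
['The', 'the']
ii. (?m) turns on multiline mode, so ^ and $ match at the start and end of every line
>>> re.findall(r'(?im)(^th[\w ]+)',"""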
... This line is the first,
... another line,
... that line, it's the best
... """)
['This line is the first', 'that line']
iii. (?s) lets the dot (.) match \n as well (by default it matches every character except \n)
>>> re.findall(r'th.+',"""
... The first line
... the second line
... the third line
... """)
['the second line', 'the third line']

>>> re.findall(r'(?s)th.+',"""
... The first line
... the second line
... the third line
... """)
['the second line\nthe third line\n']
iv. (?:...) groups part of a regular expression without saving the match
>>> re.findall(r'http://(?:\w+\.)*(\w+\.com)','http://google.com http://www.google.com http://code.google.com')
['google.com', 'google.com', 'google.com']
>>> re.search(r'\((?P<areacode>\d{3})\) (?P<prefix>\d{3})-(?:\d{4})','(800) 555-1212').groupdict()
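{'areacode': '800', 'prefix': '555'}
(?P<name>...) and (?P=name): the former saves a match under a name instead of a number that counts up from 1 to N, and the latter matches the same text again by that name. Matches saved by number can be retrieved with \1, \2 ... \N.
v. (?=...) and (?!...) implement lookahead: the former is a positive lookahead assertion and the latter is a negative lookahead assertion.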
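
The named and numbered backreferences described above have no example in the post, so here is a minimal sketch. The sample sentence and the phone-number swap are illustrative assumptions.

```python
import re

# (?P=word) must repeat exactly what the group named "word" matched
doubled = re.search(r'(?P<word>\b\w+) (?P=word)\b', 'it is is fine')

# The same idea with a numbered group and \1
doubled_numbered = re.search(r'(\b\w+) \1\b', 'it is is fine')

# Numbered groups can also be reused in a replacement string
swapped = re.sub(r'(\d{3})-(\d{4})', r'\2-\1', '555-1212')

print(doubled.group())            # is is
print(doubled.group('word'))      # is
print(doubled_numbered.group(1))  # is
print(swapped)                    # 1212-555
```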

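Matching an IP address
An IP address has the form XXX.XXX.XXX.XXX, where each XXX ranges from 0 to 255. The first three segments, each followed by a period, repeat three times and are then joined with the last segment to make the full address. The regular expression is therefore:
((25[0-5]|2[0-4]\d|((1\d{2})|([1-9]?\d)))\.){3}(25[0-5]|2[0-4]\d|((1\d{2})|([1-9]?\d)))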
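
The "three dotted segments plus a final segment" structure can be checked with a short script. This is a sketch under stated assumptions: each 0-255 segment is written as 25[0-5]|2[0-4]\d|1\d{2}|[1-9]?\d, non-capturing groups are used, and the names OCTET, IP_PATTERN, and is_ipv4 are hypothetical.

```python
import re

# One segment in the range 0-255 (no leading zeros)
OCTET = r'(?:25[0-5]|2[0-4]\d|1\d{2}|[1-9]?\d)'

# Three "segment + period" repetitions followed by a final segment
IP_PATTERN = re.compile(r'(?:' + OCTET + r'\.){3}' + OCTET)

def is_ipv4(text):
    """Return True when the whole string is a dotted IPv4 address."""
    return IP_PATTERN.fullmatch(text) is not None

print(is_ipv4('192.168.1.10'))     # True
print(is_ipv4('255.255.255.255'))  # True
print(is_ipv4('256.1.1.1'))        # False, 256 is out of range
print(is_ipv4('1.2.3'))            # False, only three segments
```

fullmatch() is used so that extra text before or after the address is rejected; with search() the same pattern would also find an address embedded in longer text.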

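Match the show names and magnet links on the front page of the oabt004 US TV series site, and return them in a dict
#!/usr/bin/env python
# coding:utf-8
import requests
import re

class GetMag(object):
    def __init__(self):
        self.result = dict()
        self.res = None
        self.mag = None
        self.name = None

    def getText(self, url):
        # Download the page and keep its HTML as text
        self.res = requests.get(url).text

    def getMag(self):
        # Show name: the text after class="name"> up to the first hyphen
        name_pat = re.compile(r'''class="name">(.*?)-''')
        # Magnet link: the value of the data-magnet attribute
        mag_pat = re.compile(r'''data-magnet="(.*?)"''')
        self.mag = re.findall(mag_pat, self.res)
        self.name = re.findall(name_pat, self.res)
        # Pair each name with the link at the same position
        # (assumes both lists come out in the same page order)
        for n, m in zip(self.name, self.mag):
            self.result.update({n: m})
        print(self.result)
        return self.result

def main():
    getmag = GetMag()
    getmag.getText(url='http://oabt004.com/index/index?cid=1')
    getmag.getMag()

if __name__ == "__main__":
    main()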
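
The two patterns in the script can be tried without a network request. The sketch below runs them on a hypothetical HTML fragment that was written only to fit the patterns and does not reflect the real page; pairing names and links by position with zip() is likewise an assumption about how the page is laid out.

```python
import re

# Hypothetical markup, shaped only to fit the two patterns from the script
SAMPLE_HTML = '''
<a class="name">Show One - S01E01</a><a data-magnet="magnet:?xt=urn:btih:aaa">link</a>
<a class="name">Show Two - S02E03</a><a data-magnet="magnet:?xt=urn:btih:bbb">link</a>
'''

name_pat = re.compile(r'class="name">(.*?)-')
mag_pat = re.compile(r'data-magnet="(.*?)"')

# The lazy (.*?) stops at the first hyphen, so strip the trailing space
names = [n.strip() for n in name_pat.findall(SAMPLE_HTML)]
magnets = mag_pat.findall(SAMPLE_HTML)

# Pair the n-th name with the n-th magnet link
result = dict(zip(names, magnets))

print(result)
# {'Show One': 'magnet:?xt=urn:btih:aaa', 'Show Two': 'magnet:?xt=urn:btih:bbb'}
```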

Miss_love posted on 2021-1-5 14:46:19

Thanks for sharing, bumping in support.


51Testing软件测试论坛 's Archiver
