A question about Python async coroutines, looking for help

测试积点老人 · posted on 2021-10-14 14:32:41

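Here is the situation: I am writing a small crawler for a novel site.
The rough structure is:
1. Start from the catalog page and find the address of each chapter sub-page.
2. Fetch each sub-page by its address and extract the body text with a regex.
3. Change step 2, the body-text extraction, to use async coroutines so that the chapters are fetched concurrently.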
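
Step 3 can be sketched roughly as follows. This is only a minimal illustration of the fan-out idea: `fetch_chapter` and `crawl` are hypothetical names, the URLs are placeholders, and the "download" is simulated with `asyncio.sleep` so the snippet runs without touching the network.

```python
import asyncio


async def fetch_chapter(url):
    # Hypothetical stand-in for a real HTTP request: sleeps instead of using the network
    await asyncio.sleep(0.01)
    return f"body of {url}"


async def crawl(chapter_urls):
    # Schedule every chapter download at once and wait for all of them;
    # gather() returns the results in the same order as the input
    return await asyncio.gather(*(fetch_chapter(u) for u in chapter_urls))


if __name__ == "__main__":
    urls = [f"https://example.com/chapter/{i}" for i in range(3)]
    print(asyncio.run(crawl(urls)))
```

Because every coroutine waits at the same time, the total run time stays close to that of the slowest single chapter rather than the sum of all of them.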

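Steps 1 and 2 are done and run correctly in synchronous mode. After I rewrote the code to use async coroutines, following what I learned online, it raises an error and will not run. I would appreciate pointers from more experienced people.
This is the novel I am practising on: https://book.qidian.com/info/1030136856/#Catalog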

The code that raises the error is below:

import asyncio
import os
import re

import aiofiles
import aiohttp
import requests

# Regex that extracts the catalog block holding the chapter links
obj1 = re.compile(r'<a class="subscri" href="//read.qidian.com/hankread/1030136856/94755936/" target="_blank"></a>.*?<ul class="cf">(?P<all>.*?) <div class="book-content-wrap cf">', re.S)
# Regex that extracts each chapter's URL and title; the URL capture stops at the closing quote
obj2 = re.compile(r'<li data-rid=.*?><a href="(?P<url>[^"]*)".*?'
                  r'title=".*?">(?P<name>.*?)</a>', re.S)
# Regex that extracts the body text of a chapter
obj3 = re.compile(r'<div class="read-content j_readContent" id="">(?P<main>.*?)</div>.*?<div class="admire-wrap">', re.S)
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36 OPR/78.0.4093.184'
}
# The extracted chapter links are scheme-relative, so "https:" has to be prepended
child_supplement = 'https:'


# Fetch the catalog page and start one download coroutine per chapter
async def getintroduction(url, headers):
    # requests is blocking, but this is a single call made before any download starts
    resp = requests.get(url, headers=headers)
    chapter = resp.text
    resp.close()
    os.makedirs('4.8-book', exist_ok=True)  # aiofiles.open does not create the directory
    tasks = []  # one coroutine per chapter
    # First pass: extract the catalog block from the whole page
    for block in obj1.finditer(chapter):
        table = block.group('all')
        # Second pass: extract the URL and title of each chapter
        for it in obj2.finditer(table):
            name = it.group('name')
            child_url = child_supplement + it.group('url')  # build the full link
            tasks.append(aiodownload(child_url, name))
    if not tasks:
        print('No chapter links matched; check obj1 and obj2 against the page source')
        return
    # asyncio.wait() no longer accepts bare coroutines (Python 3.11+); gather() does
    await asyncio.gather(*tasks)


# Download one chapter page and save its body text
async def aiodownload(url, name):
    async with aiohttp.ClientSession(headers=headers) as session:
        async with session.get(url) as resp:  # fetch the chapter sub-page
            # The page is HTML, so read it with text(); json() raises ContentTypeError here
            content = await resp.text()
    for it in obj3.finditer(content):
        book = it.group('main')
        # Text mode accepts encoding; binary mode ('wb') does not
        async with aiofiles.open('4.8-book/' + name, mode='w', encoding='UTF-8') as f:
            await f.write(book)
    print('Done')


if __name__ == '__main__':
    b_id = '1030136856'
    url = 'https://book.qidian.com/info/' + b_id + '/#Catalog'
    asyncio.run(getintroduction(url, headers=headers))
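
The chapter regex is the easiest part to get wrong, so a small offline check may help before going async. The sketch below runs a pattern in the spirit of `obj2` against a made-up catalog fragment; `chapter_re`, `sample`, and the `read.example.com` links are assumptions for illustration and are not copied from the real site.

```python
import re

# Pattern in the spirit of obj2: the URL capture stops at the closing quote,
# so attributes between href and title are skipped rather than captured
chapter_re = re.compile(
    r'<li data-rid=.*?><a href="(?P<url>[^"]*)".*?title=".*?">(?P<name>.*?)</a>',
    re.S,
)

# Hypothetical catalog fragment, not taken from the real page
sample = (
    '<li data-rid="1"><a href="//read.example.com/chapter/1" target="_blank" '
    'title="first">Chapter 1</a></li>'
    '<li data-rid="2"><a href="//read.example.com/chapter/2" target="_blank" '
    'title="second">Chapter 2</a></li>'
)

chapters = [(m.group("url"), m.group("name")) for m in chapter_re.finditer(sample)]
print(chapters)
```

If the captured URL still contains a quote or extra attributes, the pattern is matching too much, and prepending `https:` will give an address that cannot be requested.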


qqq911 · posted on 2021-10-15 10:46:57

Async probably isn't as handy as real-time, is it?

litingting0214 · posted on 2021-10-15 15:15:55

Why use async at all?

jingzizx · posted on 2021-10-15 16:15:24

There is a problem with the exception handling.
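
If this reply means that one failing chapter should not abort the others, a possible approach is `asyncio.gather` with `return_exceptions=True`. The sketch below is only an illustration under that reading: `download` and `download_all` are hypothetical stand-ins for the real download coroutine, and the failure is raised on purpose.

```python
import asyncio


async def download(name):
    # Hypothetical stand-in for the real download; one chapter fails on purpose
    await asyncio.sleep(0)
    if name == "chapter-2":
        raise ValueError(f"could not parse {name}")
    return name


async def download_all(names):
    # return_exceptions=True places each exception in the result list
    # instead of raising the first one out of gather()
    results = await asyncio.gather(
        *(download(n) for n in names), return_exceptions=True
    )
    ok = [r for r in results if not isinstance(r, Exception)]
    failed = [r for r in results if isinstance(r, Exception)]
    return ok, failed


if __name__ == "__main__":
    ok, failed = asyncio.run(download_all(["chapter-1", "chapter-2", "chapter-3"]))
    print("saved:", ok)
    print("failed:", failed)
```

The failed entries can then be logged or retried without losing the chapters that did download.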
