[PySpider] Architecture and Practical Issues

测试积点老人 · Posted 2018-12-6 14:40:26

Last edited by 测试积点老人 on 2018-12-6 14:42

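Architecture design:

pyspider is built on a crawl-loop model driven by Python scripts.
Python scripts extract structured information and control link following and crawl scheduling, which gives maximum flexibility.
Scripts are written and debugged in a web-based environment, and the web UI also displays scheduling status.
The crawl-loop model is mature and stable. The modules are independent of one another and connected through message queues, so a deployment can scale flexibly from a single process to a multi-machine distributed setup.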

Detailed workflow

Distributed deployment

##### Master
phantomjs --ssl-protocol=any --disk-cache=true /home/video/.jumbo/lib/python2.7/site-packages/pyspider/fetcher/phantomjs_fetcher.js 25555 &
supervise -p "var/run/status/pyspider_js/" -f "bin/pyspider -c etc/pyspider/config.json phantomjs"
supervise -p "var/run/status/pyspider_ui/" -f "bin/pyspider -c etc/pyspider/config.json webui"
supervise -p "var/run/status/pyspider_sc/" -f "bin/pyspider -c etc/pyspider/config.json scheduler"
supervise -p "var/run/status/pyspider_pr/" -f "bin/pyspider -c etc/pyspider/config.json processor"
supervise -p "var/run/status/pyspider_pr/" -f "bin/pyspider -c etc/pyspider/config.json fetcher"
supervise -p "var/run/status/pyspider_fe/" -f "bin/pyspider -c etc/pyspider/config.json --phantomjs-proxy='localhost:25555' fetcher "
supervise -p "var/run/status/pyspider_re/" -f "bin/pyspider -c etc/pyspider/config.json result_worker"
##### Slave
nohup phantomjs --ssl-protocol=any --disk-cache=true /home/video/.jumbo/lib/python2.7/site-packages/pyspider/fetcher/phantomjs_fetcher.js 25555&
/home/video/.jumbo/bin/python bin/pyspider -c etc/pyspider/config.json processor &
/home/video/.jumbo/bin/python bin/pyspider -c etc/pyspider/config.json fetcher &
/home/video/.jumbo/bin/python bin/pyspider -c etc/pyspider/config.json --phantomjs-proxy="localhost:25555" fetcher &
/home/video/.jumbo/bin/python bin/pyspider -c etc/pyspider/config.json result_worker &


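webui

Web-based visual task monitoring
Script editing and step-by-step debugging in the browser
Captures exceptions, logs, print output, and so on
Result viewer and exporter
Every 30 s the front end automatically requests GET /counter and GET /queues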

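scheduler

Task priorities
Periodic (scheduled) tasks
Traffic control: limits the crawl rate toward the fetcher with a token bucket algorithm
Re-crawl scheduling based on a time period or on a marker taken from the referring page, the itag (for example, itag = last-update time)
Only one scheduler is allowed
Five threads
Decides whether each incoming task is a new task or a re-crawl task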
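The token bucket idea mentioned in the list above can be sketched in a few lines. This is only an illustration of the algorithm, not pyspider's own code: the class name TokenBucket, the method try_consume, and the injectable clock are all assumptions made for the example.

```python
import time


class TokenBucket:
    """Token bucket: `rate` tokens are added per second, up to `burst` tokens."""

    def __init__(self, rate, burst, now=time.monotonic):
        self.rate = float(rate)
        self.burst = float(burst)
        self._now = now              # injectable clock (hypothetical, eases testing)
        self.tokens = self.burst     # start full so an idle project can burst
        self.last = self._now()

    def _refill(self):
        # Add the tokens earned since the last call, capped at the bucket size.
        current = self._now()
        elapsed = current - self.last
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = current

    def try_consume(self, n=1):
        """Take n tokens if available; return True when the fetch may proceed."""
        self._refill()
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False
```

With rate=1.0 and burst=3.0, three fetches can go out back to back, after which the caller has to wait roughly one second per additional fetch.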

fetcher

Crawl controls such as method, header, cookie, proxy, etag, last_modified, and timeout
Can render pages through an adapted WebKit engine such as phantomjs
Fetches web pages, then sends the results to the processor
Phantomjs Fetcher: fetches and renders pages with JavaScript enabled
Multiple instances can be deployed across machines

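processor

Built-in pyquery for parsing pages with jQuery-style selectors
Full control over every crawl and scheduling parameter from within the script
Can pass information on to follow-up requests
Captures exceptions and logs
Runs the script written by the user
Sends status (task track) and new tasks to the scheduler
Sends results to the Result Worker
Multiple instances can be deployed across machines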

result worker (optional)

Receives results from the processor
Override it to handle results according to your needs

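Architecture diagram
pyspider's architecture consists mainly of the scheduler, the fetcher, and the processor (which executes scripts). The components are connected by message queues. Apart from the scheduler, which is a single point (though it can be modified independently), both the fetcher and the processor can be deployed as multiple distributed instances. The scheduler dispatches tasks, the fetcher downloads page content, and the processor runs the pre-written Python script, which either outputs results or generates new follow-link tasks that are sent back to the scheduler, closing the loop. Each script is treated as a project, and a taskid (by default the MD5 of the URL) uniquely identifies a task. Different page types are parsed by assigning different callback functions.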
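To make the closed loop concrete, here is a single-process toy version. It is a sketch under loud assumptions: FAKE_PAGES, fetch, process, and run are hypothetical stand-ins for the real components, a deque replaces the message queues, and only the MD5-of-URL taskid comes from the description above.

```python
import hashlib
from collections import deque

# Hypothetical in-memory "web": URL -> links found on that page.
FAKE_PAGES = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b"],
    "http://example.com/b": [],
}


def make_taskid(url):
    """By default the taskid is the MD5 of the URL."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


def fetch(url):
    """Stand-in for the fetcher: returns the 'page' (here just its links)."""
    return FAKE_PAGES.get(url, [])


def process(url, links):
    """Stand-in for the processor: emits one result plus follow-up URLs."""
    return {"url": url, "link_count": len(links)}, links


def run(start_url):
    scheduler_queue = deque([start_url])   # scheduler -> fetcher
    seen = set()                           # taskids already scheduled
    results = []
    while scheduler_queue:
        url = scheduler_queue.popleft()
        taskid = make_taskid(url)
        if taskid in seen:                 # same URL -> same taskid -> skipped
            continue
        seen.add(taskid)
        result, new_urls = process(url, fetch(url))
        results.append(result)
        scheduler_queue.extend(new_urls)   # processor -> scheduler closes the loop
    return results
```

Calling run("http://example.com/") visits each page once, even though /b is linked from two places, because both links hash to the same taskid.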

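Practical issues

1 If 100 projects are already deployed on pyspider and you want some of them to run first, what should you do?
Answer: there are three options.
1. Raise the scheduling priority of pages within the project. The default is 0; set it with @config(priority=2). Tasks are scheduled by priority, so the higher the priority, the sooner a task is processed.
2. Adjust the project's rate/burst, which defaults to 1.0/3.0. rate is the number of fetches per second and burst is the concurrency, that is, the size of the token bucket. Set rate/burst higher for the projects that should run first.
3. Pass a priority argument to self.crawl inside the project.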
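The effect of a higher priority can be illustrated with a small priority queue. This is not pyspider's scheduler, only a sketch of "higher priority is served first"; the class name PriorityTaskQueue and the task labels are made up for the example.

```python
import heapq
import itertools


class PriorityTaskQueue:
    """Pops the highest-priority task first; ties keep insertion order."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()   # tie-breaker for equal priorities

    def put(self, task, priority=0):
        # heapq is a min-heap, so the priority is negated.
        heapq.heappush(self._heap, (-priority, next(self._counter), task))

    def get(self):
        _, _, task = heapq.heappop(self._heap)
        return task

    def __len__(self):
        return len(self._heap)


q = PriorityTaskQueue()
q.put("list page of project A")                  # default priority 0
q.put("detail page of project B", priority=2)    # like @config(priority=2)
q.put("list page of project C")
order = [q.get() for _ in range(len(q))]
```

Here the project B task is served first although it was queued second; the two priority-0 tasks follow in the order they arrived.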

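2 Where is pyspider's bottleneck? What do you do when it crashes?
Answer: when a huge number of tasks (new taskids) is generated in a short time, memory may run out, mainly the memory used by redis. redis holds a newtask_queue that stores all new taskids. Its length is limited to fewer than 100 entries, and each entry is an array limited to at most 1000 items, so the queue can hold at most 100 × 1000 = 100000 tasks at once. If more than 100000 tasks are generated in a short time, pyspider may crash. To restart, first kill all pyspider processes, then stop some projects to reduce the inflow of new tasks, and then start pyspider again. Because all data is stored in mysql, a crash and restart does not affect the running of any project.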
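The capacity arithmetic above can be checked with a short sketch. It assumes the figures quoted in the answer (batches of up to 1000 tasks, about 100 batches in the queue) and uses the standard-library queue.Queue as a stand-in for the redis-backed newtask_queue; chunk and enqueue_new_tasks are hypothetical helpers.

```python
import queue

BATCH_SIZE = 1000      # tasks per queue message, per the figures above
QUEUE_MAXSIZE = 100    # messages the queue holds, per the figures above


def chunk(tasks, size=BATCH_SIZE):
    """Split a list of tasks into batches of at most `size`."""
    return [tasks[i:i + size] for i in range(0, len(tasks), size)]


def enqueue_new_tasks(q, tasks):
    """Put batches on the queue; return how many tasks did not fit."""
    rejected = 0
    for batch in chunk(tasks):
        try:
            q.put_nowait(batch)
        except queue.Full:
            rejected += len(batch)
    return rejected


newtask_queue = queue.Queue(maxsize=QUEUE_MAXSIZE)
first_wave = enqueue_new_tasks(newtask_queue, list(range(100000)))
second_wave = enqueue_new_tasks(newtask_queue, list(range(2500)))
```

The first 100000 tasks fill the queue exactly; every task in the second wave is rejected until a consumer drains some batches. In a real deployment the producer would block or back off rather than count rejections.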

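3 pyspider hangs and the logs stop rolling. What do you do? There are usually several possible causes:

The processor module exited abnormally. You will find the scheduler2fetcher and fetcher2processor queues full. Start the processor again and things return to normal.
The scheduler module failed to connect to mysql.

How to prevent this: monitor the processor logs.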
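One simple way to watch the processor log is to check how recently the file was written. The sketch below is an assumption about how such a check could look, not something pyspider ships: the function name log_is_stale, the path log/processor.log, and the 5-minute threshold are all hypothetical.

```python
import os
import time


def log_is_stale(path, max_idle_seconds, now=None):
    """Return True if the log file is missing or has not been written recently."""
    now = time.time() if now is None else now
    try:
        last_write = os.path.getmtime(path)
    except OSError:
        return True          # a missing log file also counts as a problem
    return (now - last_write) > max_idle_seconds


if __name__ == "__main__":
    # Hypothetical path and threshold; adjust to the actual deployment.
    if log_is_stale("log/processor.log", max_idle_seconds=300):
        print("processor log has not moved for 5 minutes; check the process")
```

Run from cron or a supervisor, a check like this can raise an alert or restart the processor before the queues fill up.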

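4 pyspider hangs on restart. Why, and what do you do?
There are too many tasks per project in pyspider's taskdb. On startup pyspider reads through every task of every project, which leads to inexplicable problems.
If detail-page URLs that a project has already crawled need to be crawled again, clear the project's table in taskdb in on_start. This greatly reduces the amount of data in taskdb, and restarts go smoothly.

5 How do you monitor for parsing templates that have stopped working? Load balancing: how are crawl tasks divided evenly among the fetchers? Deploy fetchers on multiple machines to avoid being banned and to control the load on the target site.
Scheduling suits a single site
Scheduling strategy: round-robin, random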
