This post was last edited by 测试积点老人 on 2018-11-29 13:47.
Usage Tutorial
4. Screenshot of the pyspider web UI, Case 2
The example code is as follows:
```python
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# Created on 2018-08-13 13:22:16
# Project: repospider

from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    crawl_config = {
    }

    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('http://www.reeoo.com', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        for each in response.doc('div[class="thumb"]').items():
            detail_url = each('a').attr('href')
            print(detail_url)
            self.crawl(detail_url, callback=self.detail_page)

    @config(priority=2)
    def detail_page(self, response):
        header = response.doc('body > article > section > header')
        title = header('h1').text()
        tags = []
        for each in header.items('a'):
            tags.append(each.text())

        content = response.doc('div[id="post_content"]')
        description = content('blockquote > p').text()
        website_url = content('a').attr.href

        image_url_list = []
        for each in content.items('img[data-src]'):
            image_url_list.append(each.attr('data-src'))

        return {
            "title": title,
            "tags": tags,
            "description": description,
            "image_url_list": image_url_list,
            "website_url": website_url,
        }
```
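The selector logic in `index_page` above (find each `div.thumb`, take the `href` of its `<a>`) can be mimicked offline with only the standard library, for readers who want to check the extraction idea without running pyspider. `SAMPLE_HTML` and `ThumbLinkParser` are hypothetical names made up for this sketch, and the markup is a simplified assumption about the page structure:

```python
from html.parser import HTMLParser

# Made-up sample markup mirroring the structure index_page expects.
SAMPLE_HTML = """
<div class="thumb"><a href="http://reeoo.com/site-one">Site One</a></div>
<div class="thumb"><a href="http://reeoo.com/site-two">Site Two</a></div>
"""


class ThumbLinkParser(HTMLParser):
    """Collect href values of <a> tags nested inside <div class="thumb">."""

    def __init__(self):
        super().__init__()
        self.in_thumb = 0
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and attrs.get("class") == "thumb":
            self.in_thumb += 1
        elif tag == "a" and self.in_thumb and "href" in attrs:
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "div" and self.in_thumb:
            self.in_thumb -= 1


parser = ThumbLinkParser()
parser.feed(SAMPLE_HTML)
print(parser.links)  # the detail URLs that index_page would queue for crawling
```

In the real handler, pyspider's `response.doc` gives the same result far more concisely via PyQuery CSS selectors; the sketch just makes the traversal explicit.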
Problem Log
Problem 1
The error message:
```
Exception: HTTP 599: Unable to communicate securely with peer: requested domain name does not match the server's certificate.
```
Solution: change the `https://` addresses in the code to `http://`; the actual URL is still served over https.
Alternative: pass the parameter `validate_cert=False` to `crawl()`.

Using an nginx proxy with pyspider
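If the URL has to stay on https, `validate_cert=False` can also be set once for the whole handler via `crawl_config`, which pyspider applies to every `self.crawl()` request. A minimal sketch of that configuration fragment (the rest of the handler stays unchanged from the example above):

```python
class Handler(BaseHandler):
    crawl_config = {
        # Skip TLS certificate verification for all requests issued by
        # this handler -- works around the HTTP 599 certificate error.
        'validate_cert': False,
    }
```

Note that disabling certificate validation trades away protection against man-in-the-middle attacks, so it is best kept to scraping targets where that risk is acceptable.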
Map the Nginx port directly to the web service's port:
```shell
vim /etc/nginx/nginx.conf
```
Modify the file contents as follows:
```nginx
server {
    listen       8080;
    server_name  localhost;
    root         /usr/share/nginx/html;

    location / {
        proxy_pass http://127.0.0.1:5000;
    }

    error_page 404 /404.html;
    location = /40x.html {
    }

    error_page 500 502 503 504 /50x.html;
    location = /50x.html {
    }
}
```

This maps port 8080 on the nginx proxy host to the internal port 5000, where the pyspider web UI listens by default.
Screenshot of the pyspider web UI, Case 1
```python
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# Created on 2018-08-13 14:05:23
# Project: test

from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    crawl_config = {
    }

    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('http://reeoo.com/', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    @config(priority=2)
    def detail_page(self, response):
        return {
            "url": response.url,
            "title": response.doc('title').text(),
        }
```
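The `detail_page` above returns only the page URL and its `<title>` text. That extraction can also be checked offline with the standard library; `SAMPLE_HTML` and `TitleParser` are hypothetical names invented for this sketch, mirroring what `response.doc('title').text()` produces:

```python
from html.parser import HTMLParser

# Made-up sample page standing in for a fetched response body.
SAMPLE_HTML = (
    "<html><head><title>Reeoo - web design inspiration</title></head>"
    "<body></body></html>"
)


class TitleParser(HTMLParser):
    """Accumulate the text content of the <title> tag."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


parser = TitleParser()
parser.feed(SAMPLE_HTML)
# The same shape of record that detail_page returns to pyspider's resultdb.
result = {"url": "http://reeoo.com/", "title": parser.title}
print(result)
```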