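In my opinion, PySpider is a very convenient and powerful crawler framework. It supports multi-threaded crawling and rendering of JavaScript-driven pages, and it provides a web UI, retry on failure, scheduled crawling, and more, which makes it very pleasant to use.
Reference articles online: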
- http://www.jianshu.com/p/8eb248697475
- http://cuiqingcai.com/2652.html
- https://yq.aliyun.com/articles/75518
1. Set up the environment:
Python version: 3.6.3
OS: CentOS 7.3
1.1. Set up the Python 3 environment:
# Install dependencies
- yum install -y ncurses-devel openssl openssl-devel zlib-devel gcc make glibc-devel libffi-devel glibc-static glibc-utils sqlite-devel readline-devel tk-devel gdbm-devel db4-devel libpcap-devel xz-devel libcurl-devel
# Download Python
- wget https://www.python.org/ftp/python/3.6.3/Python-3.6.3.tgz
# Extract
- tar -xzf Python-3.6.3.tgz
- cd Python-3.6.3
# Compile and install
- ./configure --prefix=/usr/local/python3.6 --enable-shared
- make && make install
# Create the symlink and register the shared library directory
- ln -s /usr/local/python3.6/bin/python3 /usr/bin/python3
- echo "/usr/local/python3.6/lib" > /etc/ld.so.conf.d/python3.6.conf
- ldconfig
# Verify python3
- [root@ceph-host-01 local]# python3
- Python 3.6.3 (default, Oct 9 2017, 04:01:24)
- [GCC 4.8.5 20150623 (Red Hat 4.8.5-16)] on linux
- Type "help", "copyright", "credits" or "license" for more information.
- >>>
# Upgrade pip and symlink it
- /usr/local/python3.6/bin/pip3 install --upgrade pip
- ln -s /usr/local/python3.6/bin/pip /usr/bin/pip
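To double-check the build beyond the banner, a small script like the one below can be run with the new `python3`. This is only a sketch: the helper name `describe_interpreter` is my own, and it simply reports what the standard `sys` and `sysconfig` modules say about the running interpreter, including whether it was configured with `--enable-shared`.

```python
import sys
import sysconfig


def describe_interpreter():
    """Collect a few facts about the interpreter running this script."""
    return {
        "version": "{}.{}.{}".format(*sys.version_info[:3]),
        "prefix": sys.prefix,
        # Truthy when Python was configured with --enable-shared.
        "shared": bool(sysconfig.get_config_var("Py_ENABLE_SHARED")),
        # Library directory of the build; on a shared build this is the
        # path that needs to be known to ldconfig.
        "libdir": sysconfig.get_config_var("LIBDIR"),
    }


if __name__ == "__main__":
    for key, value in describe_interpreter().items():
        print("{:8} {}".format(key, value))
```

If the steps above worked, `prefix` should point at /usr/local/python3.6 and `shared` should be true.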
1.2. Install pyspider
- pip install pyspider
Importing the pycurl module in Python failed with the following error:
- ImportError: pycurl: libcurl link-time ssl backend (nss) is different from compile-time ssl backend (none/other)
Solution: rebuild pycurl against the same SSL backend that libcurl is linked with (nss, according to the error message):
- pip uninstall pycurl
- export PYCURL_SSL_LIBRARY=nss
- pip install pycurl
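The value to export comes straight from the error message, so it can be picked out programmatically. The following is a minimal sketch, not part of pycurl: the function name `backend_from_error` and the list of accepted backends are my own assumptions (openssl, gnutls and nss are the values I am sure `PYCURL_SSL_LIBRARY` accepts; newer pycurl releases may accept more).

```python
import re

# Backends assumed to be valid PYCURL_SSL_LIBRARY values (assumption: not exhaustive).
KNOWN_BACKENDS = ("openssl", "gnutls", "nss")

# Matches the "link-time ssl backend (xxx)" part of the ImportError text.
_PATTERN = re.compile(r"link-time ssl backend \((?P<link>[^)]+)\)", re.IGNORECASE)


def backend_from_error(message):
    """Return the SSL backend libcurl was linked against, or None if unrecognised."""
    match = _PATTERN.search(message)
    if match is None:
        return None
    backend = match.group("link").strip().lower()
    return backend if backend in KNOWN_BACKENDS else None


if __name__ == "__main__":
    error = ("ImportError: pycurl: libcurl link-time ssl backend (nss) is different "
             "from compile-time ssl backend (none/other)")
    backend = backend_from_error(error)
    if backend:
        # Print the line to run before reinstalling pycurl.
        print("export PYCURL_SSL_LIBRARY={}".format(backend))
```

On a machine where libcurl is built against OpenSSL instead, the same message would name openssl, and that is the value to export.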
1.3. Install PhantomJS
Download from the official site: http://phantomjs.org/download.html
- wget https://bitbucket.org/ariya/phantomjs/downloads/phantomjs-2.1.1-linux-x86_64.tar.bz2
Extract (run from /usr/local, which the symlink in the last step assumes):
- yum -y install bzip2
- bzip2 -d phantomjs-2.1.1-linux-x86_64.tar.bz2
- tar -xf phantomjs-2.1.1-linux-x86_64.tar
- mv phantomjs-2.1.1-linux-x86_64 phantomjs
- ln -sv /usr/local/phantomjs/bin/phantomjs /usr/bin/phantomjs
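Before starting pyspider it is worth confirming that the symlink actually puts `phantomjs` on the PATH. A rough sketch using only the standard library follows; the helper name `find_executable_version` is hypothetical, and it relies on the binary answering to `--version`, which PhantomJS does.

```python
import shutil
import subprocess


def find_executable_version(name="phantomjs", timeout=10):
    """Return (path, version output) for an executable on PATH, or None if missing."""
    path = shutil.which(name)
    if path is None:
        return None
    result = subprocess.run(
        [path, "--version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,  # decode output as text (works on Python 3.6)
        timeout=timeout,
    )
    return path, result.stdout.strip()


if __name__ == "__main__":
    found = find_executable_version()
    if found is None:
        print("phantomjs is not on PATH; check the symlink in /usr/bin")
    else:
        print("found {} (version {})".format(*found))
```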
1.4. Start pyspider
Because the server is exposed to the public internet, I wrote a configuration file, config.json, to require a login for the web UI.
- [root@ceph-host-01 local]# vim config.json
- {
-   "webui": {
-     "port": "5000",
-     "username": "abc",
-     "password": "123456",
-     "need-auth": true
-   }
- }
# Start the process
- nohup pyspider --config config.json &
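Typing the JSON by hand is easy to get wrong, and a world-readable file with a password in it is not ideal on a public host. As a sketch only, the file can also be generated from Python; `build_config` and `write_config` are hypothetical helper names, and the structure just mirrors the config.json shown above (the port is kept as a string to match it).

```python
import json
import os


def build_config(username, password, port="5000"):
    """Build the same structure as the config.json above."""
    if not username or not password:
        raise ValueError("username and password must be non-empty")
    return {
        "webui": {
            "port": port,
            "username": username,
            "password": password,
            "need-auth": True,
        }
    }


def write_config(config, path="config.json"):
    """Write the config as JSON.

    A newly created file gets owner-only permissions (0600);
    an existing file keeps whatever mode it already has.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        json.dump(config, handle, indent=2)
        handle.write("\n")
    return path
```

For example, `write_config(build_config("abc", os.environ["PYSPIDER_PASSWORD"]))` keeps the password out of the script itself; `PYSPIDER_PASSWORD` is just an illustrative variable name, and something stronger than 123456 is advisable.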