Scrapy Basic Usage Explained (Beginner's Guide)

The site we are going to scrape is quotes.toscrape.com.
Workflow:
1. Fetch the first page.
2. Extract the quote content and the link to the next page.
3. Save the scraped results.
4. Keep paging: request the next page, parse its content, and extract the link to the page after that.

On the command line:
Create the project: scrapy startproject quote
Generate the spider file: scrapy genspider quotes quotes.toscrape.com
Then open the project in PyCharm.
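After these two commands the generated project should look roughly like this (middlewares.py is created by the template but is not edited in this tutorial):

quote/
├── scrapy.cfg              # deploy configuration
└── quote/
    ├── __init__.py
    ├── items.py            # item definitions
    ├── middlewares.py      # spider/downloader middlewares
    ├── pipelines.py        # item pipelines
    ├── settings.py         # project settings
    └── spiders/
        ├── __init__.py
        └── quotes.py       # the spider generated by genspider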

quotes.py
# -*- coding: utf-8 -*-
import scrapy
from quote.items import QuoteItem


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    # name identifies this spider
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # parse is the default callback: when the crawl starts, Scrapy requests
        # every URL in start_urls and calls parse with each response
        quotes = response.css('.quote')
        # select every quote block on the page
        for quote in quotes:  # loop over each quote block
            item = QuoteItem()
            # '.text::text' selects the text content of elements with class="text"
            text = quote.css('.text::text').extract_first()
            author = quote.css('.author::text').extract_first()
            tags = quote.css('.tags .tag::text').extract()
            item['text'] = text
            item['author'] = author
            item['tags'] = tags
            yield item

        # the href attribute gives only a relative URL
        next = response.css('.pager .next a::attr(href)').extract_first()
        # urljoin builds the absolute URL from it
        url = response.urljoin(next)
        # generate a request for the next page, handled by the same callback
        yield scrapy.Request(url=url, callback=self.parse)

        # Save the results from the command line: scrapy crawl quotes -o quotes.json
        # Other supported export formats: 'json', 'jsonlines', 'jl', 'csv', 'xml', 'marshal', 'pickle'
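Before (or while) writing parse, the CSS selectors can be checked interactively in the Scrapy shell; a quick sketch, with illustrative output:

scrapy shell 'http://quotes.toscrape.com/'
>>> response.css('.quote .text::text').extract_first()
'"The world as we have created it is a process of our thinking. ..."'
>>> response.css('.quote .author::text').extract_first()
'Albert Einstein'
>>> response.css('.pager .next a::attr(href)').extract_first()
'/page/2/'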

settings.py
# -*- coding: utf-8 -*-

# Scrapy settings for quote project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'quote'

SPIDER_MODULES = ['quote.spiders']
NEWSPIDER_MODULE = 'quote.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'quote (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'quote.middlewares.QuoteSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'quote.middlewares.QuoteDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'quote.pipelines.QuotePipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
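Note: the TextPipeline defined in pipelines.py below only runs if it is registered in ITEM_PIPELINES. One way to do that in settings.py (the priority value 300 is just an example):

ITEM_PIPELINES = {
    'quote.pipelines.TextPipeline': 300,
}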

items.py
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class QuoteItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()
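A scrapy.Item behaves like a dict restricted to its declared fields, which is why the spider assigns values by key; a minimal illustration (session output is illustrative):

>>> from quote.items import QuoteItem
>>> item = QuoteItem()
>>> item['author'] = 'Albert Einstein'
>>> dict(item)
{'author': 'Albert Einstein'}
>>> item['year'] = 1950   # raises KeyError: 'year' is not a declared field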

pipelines.py
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

# Process each item here; pipelines can also be used to save items to a database
from scrapy.exceptions import DropItem


class TextPipeline(object):
    def __init__(self):
        self.limit = 50

    def process_item(self, item, spider):
        if item['text']:
            if len(item['text']) > self.limit:
                item['text'] = item['text'][0:self.limit].rstrip() + '...'
            return item
        else:
            # DropItem is an exception: raise it to discard the item
            raise DropItem('Missing text')
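The comment above mentions saving items to a database. Below is a minimal sketch of a second pipeline that writes each item to MongoDB; it assumes pymongo is installed and that MONGO_URI and MONGO_DB have been added to settings.py (neither appears in the original project). Register it in ITEM_PIPELINES with a priority after TextPipeline, e.g. 400.

import pymongo


class MongoPipeline(object):
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # read the connection settings from settings.py
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB'),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        # insert the item as a plain dict into a collection named after the item class
        self.db[item.__class__.__name__].insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()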

