This is the spider file. It is the only spider in the project:

    import scrapy


    class FirstSpiderSpider(scrapy.Spider):
        name = 'first_spider'
        allowed_domains = ['movie.douban.com']
        start_urls = ['https://read.douban.com/?dcm=original-nav']

        def parse(self, response):
            # Print the text of the first node matched by this XPath
            title = response.xpath('//*[@id="react-root"]/div/div/div[3]/div/div[2]/div/div/div[2]/div/div[1]/div['
                                   '2]/h4/a/span/text()').extract_first()
            print(title)

There is also one downloader middleware, and it is enabled:

    import time

    from scrapy.http import HtmlResponse
    from selenium import webdriver
    from selenium.webdriver.common.by import By


    class LolDownloaderMiddleware:
        def process_request(self, request, spider):
            url = request.url
            # Start Selenium
            driver = webdriver.PhantomJS(executable_path=r'D:\tool\phantomjs-2.1.1-windows\bin\phantomjs.exe')
            driver.get(url)
            c = driver.find_element(By.XPATH, '//*[@id="react-root"]/div/div/div[3]/a[1]')
            c.click()
            time.sleep(1)
            data = driver.page_source  # get the page source
            driver.close()  # close Selenium
            return HtmlResponse(url=url, body=data, encoding='utf-8', request=request)

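My question is about the execution flow. When the spider first starts, and I do mean at the very beginning, does the single URL in start_urls go through the downloader middleware first, then on to the downloader, and only then come back to the spider for processing?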
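
To make the question concrete, here is a toy model of the flow as I understand it from the Scrapy documentation. It is not Scrapy's real code. It leaves out the scheduler and the `process_response` hooks, and every name in it (`ToyResponse`, `RenderingMiddleware`, `PassThroughMiddleware`, `crawl`, `run`) is made up for illustration. It relies on one documented rule: when a downloader middleware's `process_request` returns a Response object, the downloader itself is skipped, and when it returns `None` the request continues on to the downloader.

```python
# Toy model of the request path; hypothetical names, not Scrapy's real code.

trace = []  # records the order in which each stage runs


class ToyResponse:
    def __init__(self, url, body):
        self.url = url
        self.body = body


class RenderingMiddleware:
    """Stands in for LolDownloaderMiddleware: it builds the response itself."""

    def process_request(self, url):
        trace.append("middleware.process_request")
        return ToyResponse(url, "<html>rendered by browser</html>")


class PassThroughMiddleware:
    """Returns None, so the request continues on to the downloader."""

    def process_request(self, url):
        trace.append("middleware.process_request")
        return None


def downloader(url):
    trace.append("downloader")
    return ToyResponse(url, "<html>raw download</html>")


def spider_parse(response):
    trace.append("spider.parse")
    return response.body


def crawl(start_url, middleware):
    # 1. The start URL reaches the downloader middleware first.
    response = middleware.process_request(start_url)
    # 2. The downloader runs only if the middleware returned None.
    if response is None:
        response = downloader(start_url)
    # 3. The response is then handed to the spider callback.
    return spider_parse(response)


def run(middleware):
    trace.clear()
    body = crawl("https://read.douban.com/?dcm=original-nav", middleware)
    return list(trace), body


if __name__ == "__main__":
    for mw in (RenderingMiddleware(), PassThroughMiddleware()):
        order, _ = run(mw)
        print(type(mw).__name__, "->", " -> ".join(order))
```

If this model is right, then with a middleware like mine the start URL would go through the middleware and straight back to the spider, and the downloader would never fetch it.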