Question about using Selenium in a Scrapy crawler
This is the spider file; it contains just this one spider:

    import scrapy


    class FirstSpiderSpider(scrapy.Spider):
        name = 'first_spider'
        # Note: allowed_domains does not match the domain of the start URL
        allowed_domains = ['movie.douban.com']
        start_urls = ['https://read.douban.com/?dcm=original-nav']

        def parse(self, response):
            title = response.xpath(
                '//*[@id="react-root"]/div/div/div/div/div/div/div/div/div/div/div[2]/h4/a/span/text()'
            ).extract_first()
            print(title)
There is also a downloader middleware, and it is enabled:
    import time

    from scrapy.http import HtmlResponse
    from selenium import webdriver
    from selenium.webdriver.common.by import By


    class LolDownloaderMiddleware:
        def process_request(self, request, spider):
            url = request.url
            # Start Selenium (PhantomJS is deprecated; recent Selenium
            # versions require a headless Chrome/Firefox driver instead)
            driver = webdriver.PhantomJS(
                executable_path=r'D:\tool\phantomjs-2.1.1-windows\bin\phantomjs.exe'
            )
            driver.get(url)
            c = driver.find_element(By.XPATH, '//*[@id="react-root"]/div/div/div/a')
            c.click()
            time.sleep(1)
            data = driver.page_source  # grab the rendered page source
            driver.quit()  # shut down Selenium (quit() ends the whole session; close() only closes the window)
            return HtmlResponse(url=url, body=data, encoding='utf-8', request=request)

My question is about the execution flow. At the very start of the crawl (I mean the very first moment), does the single URL in start_urls first pass through the downloader middleware, then reach the downloader, and then get returned to the spider for processing?
Replies:
Waiting for an expert on this one.
Through the middleware first, yes, that's right.
Just run it and see.