51Testing Software Testing Forum

Title: Simulating login with Selenium + scraping data

Author: 测试积点老人    Time: 2022-6-16 10:00
from time import sleep

import requests
from lxml import etree
from selenium import webdriver
from selenium.webdriver.common.by import By

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36'}

driver = webdriver.Chrome()
driver.get('https://www.timeshighereducation.com/')
sleep(5)

# Open the "User account" menu, then the login dialog
driver.find_element(By.XPATH, '//div[@class="col-sm-12"]/div[@class="navbar-button__wrapper"]/ul/li/a[@title="User account"]').click()
sleep(5)
driver.find_element(By.XPATH, '//div[@class="region region-secondary-navigation"]/section/div/ul/li/a[@id="modal-login"]').click()
sleep(5)

sleep(5)
# The login form is inside an iframe, so switch into it first
driver.switch_to.frame(driver.find_element(By.XPATH, '//div[@id="modal-content"]/iframe'))

driver.find_element(By.XPATH, '//input[@placeholder="Username or email"]').send_keys("123456")
driver.find_element(By.XPATH, '//input[@placeholder="Password"]').send_keys("123456")
# driver.refresh()
driver.find_element(By.XPATH, '//form[@class="user-login-form"]/div/input[@value="Log in"]').click()
sleep(5)
# parent_frame is a method; without the parentheses it is never called
driver.switch_to.parent_frame()
sleep(5)
Parse the HTML string and extract the required information
def parse_html(html):
    tree = etree.HTML(html)
    node_list = tree.xpath('//tbody/tr[@class="odd row-1 js-row"]')
    # print(node_list)
    for row in node_list:
        try:
            # Rank. The leading "./" keeps the query relative to this row;
            # a leading "/" would start from the document root and match nothing.
            rank = row.xpath('./td[@class="rank sorting_1 sorting_2"]/text()')
            # Name
            name = row.xpath('./td[@class=" name namesearch"]/a/text()')
            # Region
            region = row.xpath('./td/div/div[@class="location"]/span/a/text()')
            # Ratio
            # ratio = row.xpath('')
            # Collect the fields for this row in a dict
            items = {
                "rank": rank,
                "name": name,
                "region/country": region
            }
            print(items)
        except Exception as exc:
            # Report the failure instead of silently swallowing it
            print('failed to parse row:', exc)


def main():
    # Fetch and parse the page source for pages 0 through 15
    for page in range(0, 16):
        # URL of each page
        url = 'https://www.timeshighereducation.com/world-university-rankings/2022#!/page/' + str(page) + '/length/25/sort_by/rank/sort_order/asc/cols/stats'
        # Download the page source
        html = requests.get(url, headers=headers).text
        # Parse the page content
        parse_html(html)


Program entry point
if __name__ == '__main__':
    main()
Why can't I scrape any data? Could someone experienced give me some pointers? This is my first time working with this.
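
One possible cause worth checking, offered as a sketch rather than a confirmed diagnosis: everything after `#` in a URL is a fragment, which the client keeps to itself and never sends to the server. If that applies here, all 16 `requests.get` calls ask for the same resource, and any paging would be done by JavaScript in the browser, which `requests` does not run. The snippet below uses only the standard library to show that the 16 URLs built in `main()` differ solely in their fragment. The helper names `build_url` and `request_target` are made up for illustration.

```python
from urllib.parse import urlsplit

BASE = 'https://www.timeshighereducation.com/world-university-rankings/2022'


def build_url(page):
    """Rebuild the URL the same way main() does in the post."""
    return BASE + '#!/page/' + str(page) + '/length/25/sort_by/rank/sort_order/asc/cols/stats'


def request_target(url):
    """Return the host and path+query that an HTTP request would carry.

    Hypothetical helper: it drops the fragment, mirroring what HTTP clients do.
    """
    parts = urlsplit(url)
    target = parts.path or '/'
    if parts.query:
        target += '?' + parts.query
    return parts.netloc, target


urls = [build_url(page) for page in range(0, 16)]
targets = {request_target(u) for u in urls}
fragments = {urlsplit(u).fragment for u in urls}

print(len(targets))    # number of distinct resources requested
print(len(fragments))  # number of distinct fragments
```

If only one distinct target shows up, one option to try is reading `driver.page_source` from the Selenium session that is already open, once the table has rendered, and passing that to `parse_html` instead of downloading with `requests`.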


Author: qqq911    Time: 2022-6-17 10:29
Set a breakpoint and debug it.
Author: jingzizx    Time: 2022-6-17 13:16
Step through it in the debugger.
Author: 郭小贱    Time: 2022-6-17 15:45
Debug it and see what the problem is.
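
To make the debugging suggestions above a little more concrete, here is a minimal sketch, assuming that a bare `except: pass` around the per-row parsing may be hiding the real error. It is not the poster's code: `extract_fields`, `parse_rows`, and the sample rows are hypothetical stand-ins. It shows how logging the exception and flagging an empty result surface problems that a silent `pass` would swallow.

```python
import logging

logging.basicConfig(level=logging.INFO, format='%(levelname)s %(message)s')
log = logging.getLogger('scraper')


def extract_fields(row):
    """Hypothetical stand-in for the per-row XPath lookups in parse_html()."""
    return {'rank': row['rank'], 'name': row['name']}


def parse_rows(rows):
    """Parse every row, logging failures instead of silently skipping them."""
    parsed, failed = [], 0
    for index, row in enumerate(rows):
        try:
            parsed.append(extract_fields(row))
        except Exception:
            failed += 1
            # log.exception records the message together with the traceback
            log.exception('row %d could not be parsed: %r', index, row)
    if not parsed:
        log.warning('no rows parsed; check the selector and the downloaded HTML')
    return parsed, failed


sample_rows = [
    {'rank': '1', 'name': 'Example University'},
    {'rank': '2'},  # missing 'name' raises KeyError inside extract_fields
]
parsed, failed = parse_rows(sample_rows)
print(parsed, failed)
```

Printing `len(node_list)` and the first few hundred characters of `html` right after the download is another quick check, and a breakpoint on that line lets you inspect both.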



